general:
- reverted a matrix-partitioning optimization from 0.3.30 that could cause race conditions and, as a result, invalid results in GEMM
- added the bfloat16 extensions BGEMM and BGEMV
- added a BLAS interface for the ?GEMM_BATCH extensions
- added the BLAS extensions ?GEMM_BATCH_STRIDED and their CBLAS interface
- added the basic infrastructure for half-precision float (FP16...