Compute-intensive graph codegen
NV GPU
On the CUDA platform, we now support fusing a GEMM with the element-wise ops that follow it (e.g. GELU, transpose) and generating code for the fused kernel with CUTLASS. In our experiments, this feature achieves up to a 1.1x speedup on the BERT model.
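To make the idea of fusing a GEMM with a following element-wise op concrete, here is a minimal plain-Python sketch. It is only an illustration of the concept and not how the compiler or CUTLASS implements it; the function names (`gemm_then_gelu_unfused`, `gemm_gelu_fused`) are hypothetical. The unfused version writes the full GEMM result and then makes a second pass for GELU, while the fused version applies GELU to each accumulator right before it is stored.

```python
import math


def gelu(x):
    # Exact GELU, written with the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dot(a, b, i, j):
    # Inner product of row i of a with column j of b.
    acc = 0.0
    for p in range(len(b)):
        acc += a[i][p] * b[p][j]
    return acc


def gemm_then_gelu_unfused(a, b):
    # Pass 1: materialize the whole GEMM result.
    m, n = len(a), len(b[0])
    c = [[dot(a, b, i, j) for j in range(n)] for i in range(m)]
    # Pass 2: re-read the intermediate result and apply GELU.
    return [[gelu(v) for v in row] for row in c]


def gemm_gelu_fused(a, b):
    # Single pass: GELU runs as an epilogue on each accumulator,
    # so the intermediate GEMM result is never stored.
    m, n = len(a), len(b[0])
    return [[gelu(dot(a, b, i, j)) for j in range(n)] for i in range(m)]
```

Both functions compute the same values; the fused form simply avoids the extra write and read of the intermediate matrix, which is where a fused GPU kernel would be expected to save memory traffic.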
To try this feature, set DISC_ENABLE_COMPUTE_INTENSIVE_FUSE=true.
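As a rough sketch, the flag could be set from a Python script as shown here. This assumes the setting is read from the process environment and therefore has to be in place before the compiler is loaded; that timing is an assumption, not something stated in this note.

```python
import os

# Assumption: the flag is read from the process environment, so set it
# before anything that loads the compiler is imported in this process.
os.environ["DISC_ENABLE_COMPUTE_INTENSIVE_FUSE"] = "true"

# Import and compile the model only after this point. The actual entry
# point is not described in this note, so it is left out here.
```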
AArch64
We introduced MLIR Transf...