v0.2.6
What's Changed
- fix: choose cuda architectures based on cuda version by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/463
- kernel: add grouped gemm support for moe by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/458
- kernel: added oob handling for grouped gemm kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/465
- refactor: add _1 into stride for contiguous dim by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/466
- ci: set cuda arch to native for ci workflows by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/467
- refactor: move TileShape into launch_mha_kernel_sm80 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/468
- refactor: split attention kernel into collective mainloop, collective epilogue and kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/469
- fix: skip failed unittests for blackwell gpus by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/472
- feat: added single tile scheduler for attn kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/473
- feat: add tile scheduler for grouped gemm and refactor gemm kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/474
- refactor: split mla kernels into collective_mla and collective_epilogue by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/475
- feat: use global residue_mnk for oob handling by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/476
- feat: simplify mask logic to avoid manual index computation by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/477
- feat: added static persistent tile scheduler with swizzle and rasterize by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/478
- feat: added gtest_main with filters based on compute_capabilities by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/479
- ci: upgrade cutlass to v4.1 and switch to forked repo by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/481
- feat: add tma copy for paged kv by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/480
- feat: added gather tma copy to control smem box size by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/482
- feat: use aggressive compress-mode for fatbin by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/484
- feat: added fast StaticPersistentTileScheduler for 1d tma multicast by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/485
- feat: [1/n] added sm120 fmha using collective async copy by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/483
- feat: [2/n] added warp specialization kernel for sm120 fmha by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/486
- refactor: move kernel code into different folders by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/487
- feat: added KV multi-stages support for attn sm120 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/489
- refactor: simplify mha block tiling logic by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/488
- feat: added smem and gmem layout selector for attn kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/490
- feat: added args and params for attn kernels by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/491
- feat: added universal fmha runner by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/492
- feat: added kernel builder for attn by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/493
- refactor: change stride for Q/K/V to MNKL by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/494
- upgrade torch to 12.8 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/496
- ci: fix nccl related build error by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/497
Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.2.5...v0.2.6