v0.2.6
What's Changed
- fix: choose cuda architectures based on cuda version by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/463
- kernel: add grouped gemm support for moe by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/458
- kernel: added oob handling for grouped gemm kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/465
- refactor: add _1 into stride for contiguous dim by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/466
- ci: set cuda arch to native for ci workflows by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/467
- refactor: move TileShape into launch_mha_kernel_sm80 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/468
- refactor: split attention kernel into collective mainloop, collective epilogue and kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/469
- fix: skip failed unittests for blackwell gpus by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/472
- feat: added single tile scheduler for attn kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/473
- feat: add tile scheduler for grouped gemm and refactor gemm kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/474
- refactor: split mla kernels into collective_mla and collective_epilogue by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/475
- feat: use global residue_mnk for oob handling by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/476
- feat: simplify mask logic to avoid manual index computation by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/477
- feat: added static persistent tile scheduler with swizzle and rasterize by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/478
- feat: added gtest_main with filters based on compute_capabilities by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/479
- ci: upgrade cutlass to v4.1 and switch to forked repo by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/481
- feat: add tma copy for paged kv by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/480
- feat: added gather tma copy to control smem box size by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/482
- feat: use aggressive compress-mode for fatbin by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/484
- feat: added fast StaticPersistentTileScheduler for 1d tma multicast by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/485
- feat: [1/n] added sm120 fmha using collective async copy by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/483
- feat: [2/n] added warp specialization kernel for sm120 fmha by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/486
- refactor: move kernel code into different folders by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/487
- feat: added KV multi-stages support for attn sm120 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/489
- refactor: simplify mha block tiling logic by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/488
- feat: added smem and gmem layout selector for attn kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/490
- feat: added args and params for attn kernels by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/491
- feat: added universal fmha runner by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/492
- feat: added kernel builder for attn by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/493
- refactor: change stride for Q/K/V to MNKL by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/494
- upgrade torch to 12.8 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/496
- ci: fix nccl related build error by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/497
Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.2.5...v0.2.6