Blackwell kernel optimizations and MoE runner backend refactor
Overlap spec and prefill CUDA graph support for more models
What's Changed
[8/n] decouple quantization impl from vllm dependency - gguf srt by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11964
lang: support direct video inference by @mickqian in https://github.com/sgl-project/sglang/pull/9936
Enable Llama 4 + TRTLLM MHA by @b8zhong in https://github.com/sgl-project/sglang/pull/12003
Refactor Triton-kernel MoE runner integration by @Jonahcb in https://github.com/sgl-project/sglang/pull/11795
use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk by @b8zhong in https://github.com/sgl-project/sglang/pull/11816
Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 by @thelongestusernameofall in https://github.com/sgl-project/sglang/pull/11909
Revert "lang: support direct video inference" by @merrymercy in https://github.com/sgl-project/sglang/pull/12038
support more models in piecewise cuda graph by @narutolhy in https://github.com/sgl-project/sglang/pull/11745
[Fix] Fix lint to pass CI by @Fridge003 in https://github.com/sgl-project/sglang/pull/12037
Revert "[Fix] Fix lint to pass CI" by @Fridge003 in https://github.com/sgl-project/sglang/pull/12042
fix: fix MMMU loading issue by @ZailiWang in https://github.com/sgl-project/sglang/pull/11759
Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once by @xu-yfei in https://github.com/sgl-project/sglang/pull/10953
Add gguf dependency for cpu/xpu by @ZailiWang in https://github.com/sgl-project/sglang/pull/12041
fix: the hardcode hf repo name comparison for deepseek-ocr by @rainj-me in https://github.com/sgl-project/sglang/pull/12031
Install numactl in Dockerfile for GH200/GB200/GB300 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11853
[router] Add mTLS Support for Router-to-Worker Communication by @slin1237 in https://github.com/sgl-project/sglang/pull/12019
Tiny cleanup send_single by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12056
Refactoring GLM-4.5 and GLM-4.5V related implementations by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/11800
[Fix] fix missing ipc_name of __getitem__ in some IO structs by @whybeyoung in https://github.com/sgl-project/sglang/pull/12053
fix: bench_serving ITL calculation when using spec-decoding by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/12064
Fix dpsk-r1-fp4 launching crash by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/12063
Revise POINTSV15Chat model by @yuan-luo in https://github.com/sgl-project/sglang/pull/12049
Add 'gguf' to project dependencies by @Muqi1029 in https://github.com/sgl-project/sglang/pull/12046
[Profiler] expand '~' by @Muqi1029 in https://github.com/sgl-project/sglang/pull/11999
[b200] fix piecewise cuda graph launch bug by @BBuf in https://github.com/sgl-project/sglang/pull/12067
Fix multi processing serializer bug by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11958
[Fix]: HiCache hasher failed when EAGLE mode enabled by @leavelet in https://github.com/sgl-project/sglang/pull/12025
adjust dynamic vs static outputs comparison in test_lora_update.py by @glenliu21 in https://github.com/sgl-project/sglang/pull/11884
[router] implement response api get input item function and refactor input/output store by @key4ng in https://github.com/sgl-project/sglang/pull/11924
fix(compile_utils, ep_moe): update environment variable and dtype check by @ishandhanani in https://github.com/sgl-project/sglang/pull/12034
[router] fix ut router config init to use build pattern by @slin1237 in https://github.com/sgl-project/sglang/pull/12084
docs(server-arguments): add allowed options for each argument by @Jonahcb in https://github.com/sgl-project/sglang/pull/11560
[router] migrate app context to builder pattern 1/n by @slin1237 in https://github.com/sgl-project/sglang/pull/12086
[router] migrate app context to builder pattern 2/n by @slin1237 in https://github.com/sgl-project/sglang/pull/12089
[router][grpc] Remove gpt_oss parsers and remove _parser suffix in tool parser files by @CatherineSue in https://github.com/sgl-project/sglang/pull/12091
[1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU by @zminglei in https://github.com/sgl-project/sglang/pull/12000
Fix: Update blog link by @LucaLow in https://github.com/sgl-project/sglang/pull/12071
perf: trtllm_mla attention backend spec decoding speedup w/ cuda graph by @cicirori in https://github.com/sgl-project/sglang/pull/12093
[2/N]Support DeepSeek-R1 w4a8 low latency deepep by @ayrnb in https://github.com/sgl-project/sglang/pull/8464
Enhance tests in deterministic kernels by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12070
[Doc] Add documentation for DeepSeek V3.2 by @Fridge003 in https://github.com/sgl-project/sglang/pull/11877
[10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE by @ch-wan in https://github.com/sgl-project/sglang/pull/12054
Support true on-policy by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12058
[Docs] update sgl-kernel readme by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11379
Fix 'KeyError' for per_token expert distribution recorder by @vipwangerxiao in https://github.com/sgl-project/sglang/pull/9501
Fix kernel version bump file by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/12087
[Fix] Set global args in cpu test by @Fridge003 in https://github.com/sgl-project/sglang/pull/12105
chore: bump sgl-kernel version to 0.3.16.post4 by @sglang-bot in https://github.com/sgl-project/sglang/pull/12103
[Auto Sync] Update test_deterministic.py, test_deterministi... (20251024) by @merrymercy in https://github.com/sgl-project/sglang/pull/12083
[router] Refactor data connector architecture with unified storage modules by @key4ng in https://github.com/sgl-project/sglang/pull/12096
fix: release workflow should work on both archs by @ishandhanani in https://github.com/sgl-project/sglang/pull/12110
[bugs] docker file name should be .Dockerfile so it can properly render by @slin1237 in https://github.com/sgl-project/sglang/pull/11869
Clean up server args & Add CI scripts by @merrymercy in https://github.com/sgl-project/sglang/pull/12124
[Misc] Improve the error message of failed import by @DarkSharpness in https://github.com/sgl-project/sglang/pull/12119
[CI] Add ci monitor balance workflow by @BBuf in https://github.com/sgl-project/sglang/pull/11962
Skip TestLlama4LoRA in CI by @lifuhuang in https://github.com/sgl-project/sglang/pull/12098
clean up github tokens by @merrymercy in https://github.com/sgl-project/sglang/pull/12126
Fix Illegal Instruction/IMA errors when using DP attention -- num_tokens_for_logprob calculation by @YAMY1234 in https://github.com/sgl-project/sglang/pull/12115
Fix token for CI monitor by @merrymercy in https://github.com/sgl-project/sglang/pull/12127
Reenable b200 tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/11814
Update document index for DeepSeek-v32 docs by @Fridge003 in https://github.com/sgl-project/sglang/pull/12101
Update sgl-kernel version to 0.3.16.post4 by @Fridge003 in https://github.com/sgl-project/sglang/pull/12125
[Doc] Fix format for deepseek v3.2 document by @Fridge003 in https://github.com/sgl-project/sglang/pull/12130
Accelerate deepseek fp4 b200 ci by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11993
Clean up server launch code and multi tokenizer by @merrymercy in https://github.com/sgl-project/sglang/pull/12132
[Test] Add dsv3.2 nsa backend testing by @Johnsonms in https://github.com/sgl-project/sglang/pull/11936
[docs] upd docker files names everywhere by @vincentzed in https://github.com/sgl-project/sglang/pull/12133
Make bmm batch invariant injection optional by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12118
[Doc] Small update of DeepSeek v3.2 document by @Fridge003 in https://github.com/sgl-project/sglang/pull/12138
docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/12139
[router] MCP Manager - Support Connection Pooling, Tool Inventory and Proxy by @slin1237 in https://github.com/sgl-project/sglang/pull/12097
[NVIDIA] Change default quant method for model_opt by @kaixih in https://github.com/sgl-project/sglang/pull/11991
[router] update smg code owners for each component by @slin1237 in https://github.com/sgl-project/sglang/pull/12141
[router] cleaned up all the redundant comments in the config module by @CatherineSue in https://github.com/sgl-project/sglang/pull/12147
Clean up attention backend selection code & Other minor rename by @merrymercy in https://github.com/sgl-project/sglang/pull/12136
[log] Make forward iter count optional by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12116
[misc] dependencies & environment flag by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12113
[quantization] AWQ Marlin doesn't work when dtype is bfloat16 by @kevin85421 in https://github.com/sgl-project/sglang/pull/11494
[HiCache]Page head layout IO kernel by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/11615
Do not use MagicMock to mock server_args in tests by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12154
[router][grpc] Fix tool call id in parse_json_schema_response by @CatherineSue in https://github.com/sgl-project/sglang/pull/12152
[router] centralize mcp tool args handling by @slin1237 in https://github.com/sgl-project/sglang/pull/12155
Fix ITL metrics when using openai endpoint with spec by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12156
[Fix] fix allreduce bug in Piecewise Graph by @zyksir in https://github.com/sgl-project/sglang/pull/12106
Support DeepGEMM for deterministic inference by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12142
model: support NVILA and NVILA Lite by @futrime in https://github.com/sgl-project/sglang/pull/10399
Avoid using flashinfer_allreduce_fusion when dp attention is enabled. by @elfiegg in https://github.com/sgl-project/sglang/pull/11632
transfer mrope_position_delta to device when first running by @ash-sigh in https://github.com/sgl-project/sglang/pull/11047
add gitignore for claude code and serena mcp by @slin1237 in https://github.com/sgl-project/sglang/pull/12166
Support MiniMax M2 model by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/12129
[misc][grpc] Remove duplicate log by @CatherineSue in https://github.com/sgl-project/sglang/pull/12168
[router][grpc] Add ResponsesContext and fix error propagation in responses api by @CatherineSue in https://github.com/sgl-project/sglang/pull/12164
[router] Remove SharedXxxStorage type aliases to make Arc explicit by @CatherineSue in https://github.com/sgl-project/sglang/pull/12171
Remove deprecated --enable-beta-spec argument and fix b200 test by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/12167
fix broken deepep/flashmla install in container by adding --no-build-isolation by @ishandhanani in https://github.com/sgl-project/sglang/pull/12170
Remove description for --enable-beta-spec argument by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/12177
chore: bump SGLang version to 0.5.4.post1 by @sglang-bot in https://github.com/sgl-project/sglang/pull/12169
[doc] add example of using w4fp8 for Deepseek by @Kevin-XiongC in https://github.com/sgl-project/sglang/pull/12057
[sgl-route] Optimize the use of constant slices and retain to simplif… by @lengrongfu in https://github.com/sgl-project/sglang/pull/12159
[Fix] Fix cu130 sgl-kernel wheel renaming by @Fridge003 in https://github.com/sgl-project/sglang/pull/12173
docs: update contact by @zhyncs in https://github.com/sgl-project/sglang/pull/12192
[sgl-kernel] feat: Support sm120 cutlass fp8 gemm kernel by @kaln27 in https://github.com/sgl-project/sglang/pull/9403
[sgl-kernel][4/N]Support Expert Specialization Grouped GEMM by @HydraQYH in https://github.com/sgl-project/sglang/pull/12080
GLM-4-0414 and GLM-4.1V Code Refactor by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/12117
Add support for AutoRound quantized models by @WeiweiZhang1 in https://github.com/sgl-project/sglang/pull/10153
Optimize triton_mrope with torch compile by @yuan-luo in https://github.com/sgl-project/sglang/pull/12112
Fix crash after flush cache by @cctry in https://github.com/sgl-project/sglang/pull/12107
[Detokenizer Manager] Cleanup state when reqs are finished by @Muqi1029 in https://github.com/sgl-project/sglang/pull/12205
fix(metrics): double times add_latency for DECODE_BOOTSTRAP by @jinmingyi1998 in https://github.com/sgl-project/sglang/pull/12209
improve minimax-m2 rmsnorm precision by @haichao592 in https://github.com/sgl-project/sglang/pull/12186
check_offload_progress more frequently by @pansicheng in https://github.com/sgl-project/sglang/pull/11656
[Feature] PD-Multiplexing Context and Scheduler. by @ykcombat in https://github.com/sgl-project/sglang/pull/11592
rope xpu: fix missing argument 'fused_set_kv_buffer_arg' and replace native with sgl_kernel_xpu impl by @chunyuan-w in https://github.com/sgl-project/sglang/pull/12006
Add support for Matryoshka embeddings (#126) by @satyamk7054 in https://github.com/sgl-project/sglang/pull/11142
fix: AttributeError: 'NixlKVManager' object has no attribute 'prefill_tp_size_table' by @gongwei-130 in https://github.com/sgl-project/sglang/pull/12234
Compiling rope while preserving true on policy by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12161
[Auto Sync] Update scheduler.py, spec_info.py, run_suite.py... (20251027) by @zhyncs in https://github.com/sgl-project/sglang/pull/12235
Support running FP4 Deepseek on SM120. by @weireweire in https://github.com/sgl-project/sglang/pull/11708
Add env var to control custom Triton kernel cache and set CSGMV as default backend. by @lifuhuang in https://github.com/sgl-project/sglang/pull/12176
Use explicit uint64 dtype for Tensor data_ptr() to avoid overflow by @jianan-gu in https://github.com/sgl-project/sglang/pull/11994
Update openai package version to 2.6.1 by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/12222
[2/2] Use moe_sum_reduce cuda kernel by @yuan-luo in https://github.com/sgl-project/sglang/pull/10654
docker: add CUDA13 support in dockerfile and update GDRCopy/NVSHMEM for blackwell support by @ishandhanani in https://github.com/sgl-project/sglang/pull/11517
[router] remove code duplication by @slin1237 in https://github.com/sgl-project/sglang/pull/12245
[DeepseekV32] Enable flashmla_prefill kernel with fp8 kvcache by @hlu1 in https://github.com/sgl-project/sglang/pull/11655
Add per-request retraction count by @scottjlee in https://github.com/sgl-project/sglang/pull/11177
Opt fused triton moe: add tma for down proj kernel by @xu-yfei in https://github.com/sgl-project/sglang/pull/10567
Support releasing CUDA graph memory when paused by @fzyzcjy in https://github.com/sgl-project/sglang/pull/7873
[router] use mcp struct from sdk and clean up code across codebase by @slin1237 in https://github.com/sgl-project/sglang/pull/12249
[router] configure workflow retries and timeout based on routerConfig by @slin1237 in https://github.com/sgl-project/sglang/pull/12252
Feature/Add GET endpoint to query loaded LoRA adapters by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/12229
[hotfix] Incorrect CombineOverlapArgs in SBO by @ch-wan in https://github.com/sgl-project/sglang/pull/12230
[Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 by @sufeng-buaa in https://github.com/sgl-project/sglang/pull/10804
[Bug fix] [PP] fix wrong dtype for quantized model by @XucSh in https://github.com/sgl-project/sglang/pull/12247
Fix potential eos bug on decode instance when PD is enabled by @ShangmingCai in https://github.com/sgl-project/sglang/pull/12206
Revert "[Feature] PD-Multiplexing Context and Scheduler." by @zhyncs in https://github.com/sgl-project/sglang/pull/12267
chore: cleanup quant deps by @zhyncs in https://github.com/sgl-project/sglang/pull/12268
[router] Fix type mismatch during validation by @key4ng in https://github.com/sgl-project/sglang/pull/12257
Modify rocm.Dockerfile by @sogalin in https://github.com/sgl-project/sglang/pull/12274
[router] upgrade grpc dependency and py 3.13 3.14 support by @slin1237 in https://github.com/sgl-project/sglang/pull/12284
Fix 'BypassedTopKOutput' object has no attribute 'topk_weights' for DeepEP by @trevor-m in https://github.com/sgl-project/sglang/pull/12231
Tiny fix sgl-kernel related CI installing the wrong binary by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12283
doc for logit_bias by @whybeyoung in https://github.com/sgl-project/sglang/pull/12188
Use Flashinfer TRT-LLM as Llama 4 compatible MoE backend by @b8zhong in https://github.com/sgl-project/sglang/pull/11928
[rust][ci] Add end-to-end tests for Oracle history backend by @key4ng in https://github.com/sgl-project/sglang/pull/12233
[router] support arm, windows, mac, linux, reduce wheel size and number by @slin1237 in https://github.com/sgl-project/sglang/pull/12285
fix seqlen bug for trtllm_mla's draft_extend by @bmac3 in https://github.com/sgl-project/sglang/pull/12295
Update deepseek_v32.md by @hlu1 in https://github.com/sgl-project/sglang/pull/12296
Super tiny fix expert distribution dump error by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12271
[router][grpc] Fix inconsistent behavior of conversation_id not found by @CatherineSue in https://github.com/sgl-project/sglang/pull/12299
fix: Llama 4 BF16 load on Blackwell by @b8zhong in https://github.com/sgl-project/sglang/pull/12308
Add continuous_usage_stats support for streaming responses by @BBuf in https://github.com/sgl-project/sglang/pull/12241
[hotfix] missing w13_weight_fp8 and w2_weight_fp8 in UE8M0 requantization by @ch-wan in https://github.com/sgl-project/sglang/pull/12259
[hotfix] Fix pytest not found in CI by @Fridge003 in https://github.com/sgl-project/sglang/pull/12311
a tiny fix for support deepseek bf16 weights by @Gao016 in https://github.com/sgl-project/sglang/pull/12313
[metrics][EPLB]: Support selected count of physical experts on each GPU by @acelyc111 in https://github.com/sgl-project/sglang/pull/9825
doc: improve modelopt error description by @lianakoleva in https://github.com/sgl-project/sglang/pull/12269
EPLB: prefer to use physical experts in the same gpu or node by @acelyc111 in https://github.com/sgl-project/sglang/pull/10874
Add Batch-Invariant RMSNorm by @zyzshishui in https://github.com/sgl-project/sglang/pull/12144
followup fix for llama 4 trtllm flashinfer backend by @b8zhong in https://github.com/sgl-project/sglang/pull/12314
[Deepseek V3.2] Enable flashmla_auto with MTP by @hlu1 in https://github.com/sgl-project/sglang/pull/12294
feat: preview filename from tuning_fused_moe_triton.py by @lianakoleva in https://github.com/sgl-project/sglang/pull/12276
[ci] Try fixing broken CIs by @Fridge003 in https://github.com/sgl-project/sglang/pull/12317
Refactor abortion in event loop by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12312
[Test] Fix session control test by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12336
Eagle3 DP attention for Qwen3 MoE by @qhsc in https://github.com/sgl-project/sglang/pull/12002
feat: return partial generation results when aborting requests in waiting queue by @guoyuhong in https://github.com/sgl-project/sglang/pull/11673
[Bug fix] trace: fix import error in mini_lb if sgl-router image does not install sglang by @sufeng-buaa in https://github.com/sgl-project/sglang/pull/12338
[router] fix router release workflow and add build test in PR by @CatherineSue in https://github.com/sgl-project/sglang/pull/12315
Triton fused_moe_kernel support ep moe tuning by @BBuf in https://github.com/sgl-project/sglang/pull/12343
[Fix] fix type issue of env flag value MODELOPT_MAX_TOKENS_PER_EXPERT by @zejunchen-zejun in https://github.com/sgl-project/sglang/pull/11709
[bug] fix router pypi license file by @slin1237 in https://github.com/sgl-project/sglang/pull/12345
fix: llama 4 + trtllm gen + fp8 kv cache incompatibility by @b8zhong in https://github.com/sgl-project/sglang/pull/12347
[2/2] Deepseek deterministic: support deepseek v3 deterministic inference on 8 x H200 by @zminglei in https://github.com/sgl-project/sglang/pull/12095
Fix Flashinfer Backend for SM120 Usage by @weireweire in https://github.com/sgl-project/sglang/pull/12325
[router] refactor mcp to use LRU and fix pooling bug by @CatherineSue in https://github.com/sgl-project/sglang/pull/12346
support cutlass fp4 kernel in sm120 by @AichenF in https://github.com/sgl-project/sglang/pull/11737
[bug] fix router installation to include additional dependency by @slin1237 in https://github.com/sgl-project/sglang/pull/12348
[router] update router docker to use maturin and build from local by @CatherineSue in https://github.com/sgl-project/sglang/pull/12350
Fix Duplicate Classmethod in spec_info.py by @hebiao064 in https://github.com/sgl-project/sglang/pull/12354
[CI] Add Llama 3.1 8B FP4 to B200 CI by @b8zhong in https://github.com/sgl-project/sglang/pull/12182
Fuse wk and weight_proj in Indexer for DeepSeekV3.2-FP4 by @trevor-m in https://github.com/sgl-project/sglang/pull/12094
[router] Harmony Pipeline: Chat Completion & Responses API with MCP Support by @slin1237 in https://github.com/sgl-project/sglang/pull/12153
[bugfix] fix deepseekvl2 and deepseek_ocr model type conflict by @leihuang-sketch in https://github.com/sgl-project/sglang/pull/12050
[Ckpt Engine] feat: new sglang entrypoint support for update by @stmatengss in https://github.com/sgl-project/sglang/pull/12216
[Perf] Optimize multimodal mm_inputs process in scheduler by @yuan-luo in https://github.com/sgl-project/sglang/pull/11910
[NPU] fix pp_size>1 by @Makcum888e in https://github.com/sgl-project/sglang/pull/12195
Super tiny add tag for benchmark scripts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12340
Allow benchmarking tool to handle empty response by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/12174
Super tiny fix AMD ci by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12378
Import flash_mla from sgl-kernel by @Fridge003 in https://github.com/sgl-project/sglang/pull/12135
[Bug fix][PP] fix deadlock with tie_word_embeddings by @XucSh in https://github.com/sgl-project/sglang/pull/12362
[fix] added image token as prefix for deepseek-ocr by @Tushar-ml in https://github.com/sgl-project/sglang/pull/12358
Fix DeepSeek chat templates to handle tool call arguments type checking (#11700) by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/12123
[Feature] Initial eagle3 support for Deepseek-like models by @JensenFire in https://github.com/sgl-project/sglang/pull/12319
Enable fast silu-and-mul-and-quant fused kernel by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11806
[Test] Enhance radix cache test for spec cases by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12394
[NPU] bugfix for Qwen3-Next and performance update by @iforgetmyname in https://github.com/sgl-project/sglang/pull/11969
[Feature] Support DeepSeek MTP on NPU by @iforgetmyname in https://github.com/sgl-project/sglang/pull/11897
Revert "Triton fused_moe_kernel support ep moe tuning" by @BBuf in https://github.com/sgl-project/sglang/pull/12377
[sgl-kernel] upd deepgemm hash to rebased commit by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11960
[router] harmony responses api streaming support by @slin1237 in https://github.com/sgl-project/sglang/pull/12395
[docker] clean up main dockerfile for router and dev configurations by @CatherineSue in https://github.com/sgl-project/sglang/pull/12364
feat: add EP support in tuning by @Chen-0210 in https://github.com/sgl-project/sglang/pull/12012
[router] use safety_identifier replace user on chat history storage by @lengrongfu in https://github.com/sgl-project/sglang/pull/12185
[CI Monitor] Fix ci_monitor perf analyzer bug by @BBuf in https://github.com/sgl-project/sglang/pull/12281
[router] Fix safety_identifier missing by @key4ng in https://github.com/sgl-project/sglang/pull/12404
[ci] Fix ci_install_deepep by @Fridge003 in https://github.com/sgl-project/sglang/pull/12375
Update news section in README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/12409
[router] Function call support for openai router Responses API by @key4ng in https://github.com/sgl-project/sglang/pull/12386
minor code sync by @merrymercy in https://github.com/sgl-project/sglang/pull/12403
[Bug fix][PD Disaggregation] fix prefill hanging issue with PP and DP Attention by @popsiclexu in https://github.com/sgl-project/sglang/pull/12368
[NVIDIA] Add CI workloads for GB200 by @kaixih in https://github.com/sgl-project/sglang/pull/12242
[router] web_search_preview tool basic implementation by @key4ng in https://github.com/sgl-project/sglang/pull/12290
[router] 0.2.2 release by @slin1237 in https://github.com/sgl-project/sglang/pull/12399
enable cudaProfilerApi for one batch benchmarking by @lpc0220 in https://github.com/sgl-project/sglang/pull/11116
[Refactor] tuning_fused_moe for MLLM and small refactor by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11224
[DeepSeekV32] Bug fix to ensure page_table and result in same type by @Johnsonms in https://github.com/sgl-project/sglang/pull/12300
[CI] fix tests' time estimation by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12401
Reserved abortion API when retracting by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12425
Fix the shared expert & routed expert overlap in Llama 4 by @b8zhong in https://github.com/sgl-project/sglang/pull/12405
feat: Add Non-intrusive Tensor Dumping for Model Inference by @guoyuhong in https://github.com/sgl-project/sglang/pull/10566
feat: support trtllm_mha FP8 query attention kernel by @elvischenv in https://github.com/sgl-project/sglang/pull/12307
[Bugfix]: distinguish processors for deepseek_vl2 and deepseek_ocr to p… by @bppps in https://github.com/sgl-project/sglang/pull/12384
[ci] install released version router by @key4ng in https://github.com/sgl-project/sglang/pull/12410
Revert "fix llama4 kv cache layout" by @b8zhong in https://github.com/sgl-project/sglang/pull/12437
Add trait for BasePrefixCache by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12436
[CI] Add more bins for 1-gpu CI test by @Fridge003 in https://github.com/sgl-project/sglang/pull/12422
[bugfix] set is_prefill_only=false when mixed_chunk by @Bruce-x-1997 in https://github.com/sgl-project/sglang/pull/10889
Clean up sgl kernel by @merrymercy in https://github.com/sgl-project/sglang/pull/12413
[CI] fix possible port conflicts. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12452
Fix ci install to allow prerelease by @merrymercy in https://github.com/sgl-project/sglang/pull/12449
fix: Add default value for backend in sample_mmmu_requests by @ZailiWang in https://github.com/sgl-project/sglang/pull/12256
Enable bailing_moe to support TP=16 by @guoyuhong in https://github.com/sgl-project/sglang/pull/12369
fix: watchdog thread exception by @Kindyaa in https://github.com/sgl-project/sglang/pull/12328
Simplify watchdog by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12463
[Bug fix] Fix severe memory waste issue with torch.empty pin_memory by @sjtushenhai in https://github.com/sgl-project/sglang/pull/12266
Feat: deepseek-ocr logits processor by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/12415
Fix lint in deepseek-ocr by @ispobock in https://github.com/sgl-project/sglang/pull/12470
[Test] Add Functional Tests for Penalty Parameters by @neelabhsinha in https://github.com/sgl-project/sglang/pull/11931
[Bug] OOM (Out-of-Memory) errors for extreme testing scenarios (min_tokens=2) by @LuYanFCP in https://github.com/sgl-project/sglang/pull/11757
[Feature] PD-Multiplexing Context and Scheduler, lazy import spatial. by @ykcombat in https://github.com/sgl-project/sglang/pull/12275
[VLM] Optimize async mm data process mechanism by @yuan-luo in https://github.com/sgl-project/sglang/pull/12066
fix default env var for mooncake store by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/12429
add served model name in bench serving by @carolove in https://github.com/sgl-project/sglang/pull/12428
Tiny assert no running requests when releasing memory to avoid IMA by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12341
fix: dummy health check server not accessible on non-zero rank nodes by @ishandhanani in https://github.com/sgl-project/sglang/pull/12297
Fix run benchmark by @ispobock in https://github.com/sgl-project/sglang/pull/12473
Add env var to disable FA4 warmup by @cicirori in https://github.com/sgl-project/sglang/pull/12430
Try to allow NCCL cumem for multi node nvlink case by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11987
Support Kimi Linear by @ispobock in https://github.com/sgl-project/sglang/pull/12469
[CI] Fix kernel installation on aarch runners by @Fridge003 in https://github.com/sgl-project/sglang/pull/12475
fa3 & trtllm_mha spec overlap by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11874
chore: bump SGLang version to 0.5.4.post2 by @sglang-bot in https://github.com/sgl-project/sglang/pull/12439
Tiny fix eos handling for PD disaggregation by @ShangmingCai in https://github.com/sgl-project/sglang/pull/12334
Forward unknown tool calls instead of dropping by @Surya-Gunukula in https://github.com/sgl-project/sglang/pull/12226
Use sgl fp4 quant kernel by default by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/12482
[hot fix] Remove from python.sglang.xxx by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12483
perf: trtllm mla performance minor improvements by @cicirori in https://github.com/sgl-project/sglang/pull/12435
Filter tokenizer warning for kimi models by @ispobock in https://github.com/sgl-project/sglang/pull/12485
[CI] Build aarch64 kernels for sgl-kernel test by @Fridge003 in https://github.com/sgl-project/sglang/pull/12480
[Hotfix] Remove extra comment in sgl-kernel README by @Fridge003 in https://github.com/sgl-project/sglang/pull/12500
[feat] Add SGLANG_TOOL_STRICT_LEVEL for tool-call behavior control by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/12423
Reduce docker image size. mount cache when use pip/cargo build by @whybeyoung in https://github.com/sgl-project/sglang/pull/12238
[HICache / PD]: Support offloading incremental KV cache in decode side. by @hzh0425 in https://github.com/sgl-project/sglang/pull/11966
[Deterministic] add deepseek v3 deterministic inference CI test by @zminglei in https://github.com/sgl-project/sglang/pull/12412
[Bug] test_flashattn_mla_backend errors in Hopper #12487 by @Johnsonms in https://github.com/sgl-project/sglang/pull/12488
Update Mooncake EP's a2a interface by @UNIDY2002 in https://github.com/sgl-project/sglang/pull/12391
[CI][NPU] remove pypi mirror site that hangs ci dependency installation by @iforgetmyname in https://github.com/sgl-project/sglang/pull/12499
[Ascend] Add Ascend NPU support for sglang.check_env & rework proposal by @Alexhaoge in https://github.com/sgl-project/sglang/pull/11052
[Feature] Qwen3-Next & FLA: Support MTP topk>1; Up to 6% faster by @byjiang1996 in https://github.com/sgl-project/sglang/pull/11133
[CI] Move some Lora/Deterministic CI tests to nightly by @Fridge003 in https://github.com/sgl-project/sglang/pull/12507
Migrate weak_ref_tensor to sgl-kernel by @BBuf in https://github.com/sgl-project/sglang/pull/12505
feat: Add FP4 (E2M1) KV Cache Support with Quantization Utilities for MLA by @JackChuang in https://github.com/sgl-project/sglang/pull/10078
chore: bump sgl-kernel version to 0.3.16.post5 by @sglang-bot in https://github.com/sgl-project/sglang/pull/12511
[FEAT] Shared mem pool based cuda ipc for multi-modal data transport by @kousakawang in https://github.com/sgl-project/sglang/pull/11917
Add prefix for torch symm mem by @yuan-luo in https://github.com/sgl-project/sglang/pull/12506
[ServerArgs] allow --mamba-ssm-dtype extend by @hanming-lu in https://github.com/sgl-project/sglang/pull/12481
[Fix] concat_mla_absorb_q_kernel fails for long inputs by @bingps in https://github.com/sgl-project/sglang/pull/12453
Super tiny fix naming in bench serving scripts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12515
move all get_stream in sgl_kernel to c++ to reduce the launch overhead by @merrymercy in https://github.com/sgl-project/sglang/pull/12521
[Refact] Remove hardcoded KV cache dimension in MLATokenToKVPool by @Johnsonms in https://github.com/sgl-project/sglang/pull/12502
[Bug] Fix Intern-S1 model accuracy and support /generate interface with input_ids by @hhaAndroid in https://github.com/sgl-project/sglang/pull/12367
chore: upgrade flashinfer 0.5.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/12523
[hotfix] Remove flashinfer-jit-cache from pyproject by @Fridge003 in https://github.com/sgl-project/sglang/pull/12530
fix: move dummy format loader check before quantization checks by @cicirori in https://github.com/sgl-project/sglang/pull/12532
chore: upgrade mooncake 0.3.7.post1 by @ShangmingCai in https://github.com/sgl-project/sglang/pull/12541
fix: Fix KTransformers hybrid inference with int8 quantization and format by @Atream in https://github.com/sgl-project/sglang/pull/12536
Conditionally recapture cuda graph after model weight update from disk by @harrisonlimh in https://github.com/sgl-project/sglang/pull/12060
[spec v2] Fix output repetition by speculative sampling error by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12561
[hot-fix] Fix broken CI by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12564
fix: fix the bug which leads qwen2_5_vl to crash with mixed_chunk by @PanJason in https://github.com/sgl-project/sglang/pull/11330
Fix error when calling quantization by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12548
[Test] Add parameters to SRTRunner by @Jonahcb in https://github.com/sgl-project/sglang/pull/12227
[ROCm] Update Mooncake to v0.3.7.post1 and add -DUSE_HIP=ON to rocm.Dockerfile by @yeahdongcn in https://github.com/sgl-project/sglang/pull/12560
Reduce the overhead of nccl symmetric memory by @merrymercy in https://github.com/sgl-project/sglang/pull/12524
tiny optimize for bench serving by @yizhang2077 in https://github.com/sgl-project/sglang/pull/12553
Super tiny allow profile activities in bench_serving by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12549
Super tiny dump server info such as args in bench for post analysis by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12550
update usage of trtllm_fp8_per_tensor_scale_moe by @b8zhong in https://github.com/sgl-project/sglang/pull/12569
[router][grpc] Consolidate error messages build in error.rs by @CatherineSue in https://github.com/sgl-project/sglang/pull/12301
Remove the dependency of nccl.h in symmetric memory by @merrymercy in https://github.com/sgl-project/sglang/pull/12571
[chore] Fix update_kernel_whl_index script for multiple cuda version by @Fridge003 in https://github.com/sgl-project/sglang/pull/12519
Enable mixed type LayerNorm kernel for NSA indexer by @akhilg-nv in https://github.com/sgl-project/sglang/pull/12044
Super tiny add UT for copy_to_gpu_no_ce by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12270
[Doc] fix missing index for production request trace by @stmatengss in https://github.com/sgl-project/sglang/pull/12547
[GDN/SWA] mamba and swa radix cache edge case fix by @hanming-lu in https://github.com/sgl-project/sglang/pull/12111
[Qwen3 VL] Add LoRA support for Qwen 3 VL by @Jonahcb in https://github.com/sgl-project/sglang/pull/12165
test: support return logprobs in bench_offline_throughput test by @aftersnow in https://github.com/sgl-project/sglang/pull/12462
Tiny fix ExpertDistributionReq error by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11760
fix: respect --ignore-eos in PD case for benchmarking by @ishandhanani in https://github.com/sgl-project/sglang/pull/12597
Improve the metrics for PD by @merrymercy in https://github.com/sgl-project/sglang/pull/12580
Enable memory saver for hybrid model by @ocss884 in https://github.com/sgl-project/sglang/pull/11974
Restore torch defaults between sgl-kernel tests by @benbarsdell in https://github.com/sgl-project/sglang/pull/11131
feat: limit peak memory usage when computing logprobs by @aftersnow in https://github.com/sgl-project/sglang/pull/6318
[router][grpc] Restructure modules and code clean up by @CatherineSue in https://github.com/sgl-project/sglang/pull/12598
Add --speculative-moe-runner-backend server arg by @trevor-m in https://github.com/sgl-project/sglang/pull/10183
[Deterministic] Optimize bmm_batch_invariant op by @zminglei in https://github.com/sgl-project/sglang/pull/12522
chore: bump mooncake version to 0.3.7.post2 by @ShangmingCai in https://github.com/sgl-project/sglang/pull/12599
[spec-v2] Fix incompatibility with constrained decoding by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12615
Support aggregating engine metrics in sgl-router by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11456
Ensure GPU work is finished when release memory occupation call is finished by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12592
Add sanity checks when a test file is not added to CI (reland) by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12594
[router][grpc] Fix model validation, tool call check, streaming logic and misc in responses by @CatherineSue in https://github.com/sgl-project/sglang/pull/12616
[HotFix] Disable torch dynamo for mrope_triton kernel by @yuan-luo in https://github.com/sgl-project/sglang/pull/12593
Fix skip layer in get_quant_method by @ispobock in https://github.com/sgl-project/sglang/pull/12632
[Test] Merge all constrained decoding tests. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12633
Add io struct naming check back by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12634
Fix output_ids inconsistency by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12628
fix: Lazy import mooncake-ep to fix extra gpu contexts being created by @trevor-m in https://github.com/sgl-project/sglang/pull/12641
[hotfix] Fix deepep w4a8 bug by @Fridge003 in https://github.com/sgl-project/sglang/pull/12642
[Auto Sync] Update scheduler_metrics_mixin.py, collector.py (20251104) by @merrymercy in https://github.com/sgl-project/sglang/pull/12647
[Bug] Fix NSA Backend KV-Buffer Shape Mismatch in DeepSeek-V3.2 by @Johnsonms in https://github.com/sgl-project/sglang/pull/12645
[NVIDIA] Fix wrong symmetric sizes for fp4 cases by @kaixih in https://github.com/sgl-project/sglang/pull/12640
[router][grpc] Fix index issues in reasoning content and missing streaming events by @CatherineSue in https://github.com/sgl-project/sglang/pull/12650
Revert "Enable memory saver for hybrid model" by @Fridge003 in https://github.com/sgl-project/sglang/pull/12648
Add multi-GPU configurations to nightly-test.yml by @alisonshao in https://github.com/sgl-project/sglang/pull/12585
[fix] Handle escaped characters in GLM tool call parser to prevent double serialization by @soaringk in https://github.com/sgl-project/sglang/pull/12456
[router][grpc] Emit OutputItemDone event and store output item array by @CatherineSue in https://github.com/sgl-project/sglang/pull/12656
Register allgather/reducescatter buffers with symm memory by @nvcastet in https://github.com/sgl-project/sglang/pull/12572
chore: bump SGLang version to 0.5.4.post3 by @sglang-bot in https://github.com/sgl-project/sglang/pull/12639
[NVIDIA] Fix cutedsl backend of MoE by @kaixih in https://github.com/sgl-project/sglang/pull/12353
[PD-Disagg] Check finish after pop transferred by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12638
fix typo of args description in sglang.profiler by @ai-easy-cpu in https://github.com/sgl-project/sglang/pull/12486
[Dockerfile] Speed up docker image building by @acelyc111 in https://github.com/sgl-project/sglang/pull/8784
Fix VLLM dependency test by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/12670
[Feature] add --lora-request-distribution arg to bench_serving.py and support skewed and distinct workloads by @glenliu21 in https://github.com/sgl-project/sglang/pull/12175
[router][grpc] Implement tool_choice support for Responses API by @CatherineSue in https://github.com/sgl-project/sglang/pull/12668
Expand and update test coverage for AMD CI by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/10044
fix: add seed bench_serving to cache key, remove redundant function definition. by @cicirori in https://github.com/sgl-project/sglang/pull/12680
[Profiler] Add SGLANG_PROFILE_RECORD_SHAPES for recording shapes when profiling by @zejunchen-zejun in https://github.com/sgl-project/sglang/pull/11641
fix trtllm_mla attention backend when disabling cuda graph. by @cicirori in https://github.com/sgl-project/sglang/pull/12687
Refactor --debug-tensor-dump-layers to list by @guoyuhong in https://github.com/sgl-project/sglang/pull/12691
[Grammar Fix] GLM-4-MOE self.first_k_dense_replace is undefined. by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/12455
add Kimi k2 reasoning parser by @MoyanZitto in https://github.com/sgl-project/sglang/pull/12702
Commented out b200 tests due to runner shortage by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/12609
[CI] Fix qwen3-vl lora nightly ci by @Fridge003 in https://github.com/sgl-project/sglang/pull/12708
Fix server args for gpt oss so users can override the moe runner backend by @merrymercy in https://github.com/sgl-project/sglang/pull/12696
[router][grpc] Support streaming parsing with Tool Choice in chat completions API by @CatherineSue in https://github.com/sgl-project/sglang/pull/12677
feat: initial multimodal-gen support by @mickqian in https://github.com/sgl-project/sglang/pull/12484
Enable Aiter Attention for VL model by @Yuechguo in https://github.com/sgl-project/sglang/pull/12699
[router] fix: validate HTTP status codes in health check by @wyx-0203 in https://github.com/sgl-project/sglang/pull/12631
Support Expert Deferral Mechanism in KTransformers by @Atream in https://github.com/sgl-project/sglang/pull/12586
Add mm_fp4 trtllm backend by @wenscarl in https://github.com/sgl-project/sglang/pull/12406
[NVIDIA] Fix unit test of MoE and add it to nightly ci by @kaixih in https://github.com/sgl-project/sglang/pull/12709
[misc] Add labeler for automatic labeling by @CatherineSue in https://github.com/sgl-project/sglang/pull/12710
[router][ci] speed up python binding to 1.5 min by @key4ng in https://github.com/sgl-project/sglang/pull/12673
Fix CI and style by @merrymercy in https://github.com/sgl-project/sglang/pull/12658
Revert "Commented out b200 tests due to runner shortage (#12609)" by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/12712
[misc] Change sync-labels to false by @CatherineSue in https://github.com/sgl-project/sglang/pull/12714
[router][grpc] Make harmony parser checks recipient first before channel by @CatherineSue in https://github.com/sgl-project/sglang/pull/12713
[router][quick fix] Add minimal option for reasoning effort in spec by @key4ng in https://github.com/sgl-project/sglang/pull/12711
[router] add basic ci tests for gpt-oss model support by @key4ng in https://github.com/sgl-project/sglang/pull/12651
fix labeler by @key4ng in https://github.com/sgl-project/sglang/pull/12718
[ci] fix permission by @key4ng in https://github.com/sgl-project/sglang/pull/12729
[chore]Remove dockerfile from target file of bump kernel version by @Fridge003 in https://github.com/sgl-project/sglang/pull/12728
[CPU] Upgrade default PT version to 2.9 by @ZailiWang in https://github.com/sgl-project/sglang/pull/12611
Revert "[ci] fix permission" by @key4ng in https://github.com/sgl-project/sglang/pull/12732
Revert "[router] web_search_preview tool basic implementation" by @key4ng in https://github.com/sgl-project/sglang/pull/12716
fix sgl-kernel version by @gongwei-130 in https://github.com/sgl-project/sglang/pull/12723
[chore] SGLang tag management in Dockerfile by @Fridge003 in https://github.com/sgl-project/sglang/pull/12734
Add nightly test multi gpu configs by @alisonshao in https://github.com/sgl-project/sglang/pull/12721
DeepSeek-V3.2: Add Adaptive MHA Attention Pathway for Short-Sequence Prefill by @YAMY1234 in https://github.com/sgl-project/sglang/pull/11892
Temporarily fix missing routed_scaling_factor for CompressedTensorsWNA16MoEMethod by @Atream in https://github.com/sgl-project/sglang/pull/12738
[chore] Fix triton installation for cu13 image by @Fridge003 in https://github.com/sgl-project/sglang/pull/12742
keep attention backend document up to date by @b8zhong in https://github.com/sgl-project/sglang/pull/12741
[Fix]Tiny fix in Dockerfile by @Fridge003 in https://github.com/sgl-project/sglang/pull/12748
[router][grpc] Support mixin tool calls in Responses API by @CatherineSue in https://github.com/sgl-project/sglang/pull/12736
fix: tiny fix cli by @mickqian in https://github.com/sgl-project/sglang/pull/12744
[router][ci] Disable cache by @key4ng in https://github.com/sgl-project/sglang/pull/12752
fix mamba prefix cache leak caused by abort by @yizhang2077 in https://github.com/sgl-project/sglang/pull/12693
[BUGFIX] fix output_ids in abort by @yizhang2077 in https://github.com/sgl-project/sglang/pull/12737
[GDN] Fuse b.sigmoid(), fused_gdn_gating and unsqueeze into one kernel: up to 0.85% e2e speedup by @byjiang1996 in https://github.com/sgl-project/sglang/pull/12508
[VLM] Optimize qwen_vl preprocess_video by @yuan-luo in https://github.com/sgl-project/sglang/pull/12240
Add timing metrics for requests by @cicirori in https://github.com/sgl-project/sglang/pull/12646
fix qwen3-omni audio length < 30s by @jiapingW in https://github.com/sgl-project/sglang/pull/12674
docs: document video-capable multimodal models by @WazupSteve in https://github.com/sgl-project/sglang/pull/12565
fix ci by @key4ng in https://github.com/sgl-project/sglang/pull/12760
[Refactor] Refactor fused_moe_triton tuning tools: extract shared utils, add EP/MLLM support, reduce overhead by @BBuf in https://github.com/sgl-project/sglang/pull/12440
Update dsv3 quantization auto setting for sm100 by @ispobock in https://github.com/sgl-project/sglang/pull/12778
chore: bump SGLang version to 0.5.5 by @sglang-bot in https://github.com/sgl-project/sglang/pull/12739
New Contributors
@thelongestusernameofall made their first contribution in https://github.com/sgl-project/sglang/pull/11909
@LucaLow made their first contribution in https://github.com/sgl-project/sglang/pull/12071
@vipwangerxiao made their first contribution in https://github.com/sgl-project/sglang/pull/9501
@Johnsonms made their first contribution in https://github.com/sgl-project/sglang/pull/11936
@ash-sigh made their first contribution in https://github.com/sgl-project/sglang/pull/11047
@Kevin-XiongC made their first contribution in https://github.com/sgl-project/sglang/pull/12057
@kaln27 made their first contribution in https://github.com/sgl-project/sglang/pull/9403
@haichao592 made their first contribution in https://github.com/sgl-project/sglang/pull/12186
@satyamk7054 made their first contribution in https://github.com/sgl-project/sglang/pull/11142
@weireweire made their first contribution in https://github.com/sgl-project/sglang/pull/11708
@bmac3 made their first contribution in https://github.com/sgl-project/sglang/pull/12295
@Gao016 made their first contribution in https://github.com/sgl-project/sglang/pull/12313
@lianakoleva made their first contribution in https://github.com/sgl-project/sglang/pull/12269
@zyzshishui made their first contribution in https://github.com/sgl-project/sglang/pull/12144
@zejunchen-zejun made their first contribution in https://github.com/sgl-project/sglang/pull/11709
@AichenF made their first contribution in https://github.com/sgl-project/sglang/pull/11737
@JensenFire made their first contribution in https://github.com/sgl-project/sglang/pull/12319
@Chen-0210 made their first contribution in https://github.com/sgl-project/sglang/pull/12012
@popsiclexu made their first contribution in https://github.com/sgl-project/sglang/pull/12368
@lpc0220 made their first contribution in https://github.com/sgl-project/sglang/pull/11116
@elvischenv made their first contribution in https://github.com/sgl-project/sglang/pull/12307
@sjtushenhai made their first contribution in https://github.com/sgl-project/sglang/pull/12266
@LuYanFCP made their first contribution in https://github.com/sgl-project/sglang/pull/11757
@carolove made their first contribution in https://github.com/sgl-project/sglang/pull/12428
@Surya-Gunukula made their first contribution in https://github.com/sgl-project/sglang/pull/12226
@Alexhaoge made their first contribution in https://github.com/sgl-project/sglang/pull/11052
@JackChuang made their first contribution in https://github.com/sgl-project/sglang/pull/10078
@bingps made their first contribution in https://github.com/sgl-project/sglang/pull/12453
@hhaAndroid made their first contribution in https://github.com/sgl-project/sglang/pull/12367
@yeahdongcn made their first contribution in https://github.com/sgl-project/sglang/pull/12560
@akhilg-nv made their first contribution in https://github.com/sgl-project/sglang/pull/12044
@alisonshao made their first contribution in https://github.com/sgl-project/sglang/pull/12585
@soaringk made their first contribution in https://github.com/sgl-project/sglang/pull/12456
@ai-easy-cpu made their first contribution in https://github.com/sgl-project/sglang/pull/12486
@MoyanZitto made their first contribution in https://github.com/sgl-project/sglang/pull/12702
@wyx-0203 made their first contribution in https://github.com/sgl-project/sglang/pull/12631
@WazupSteve made their first contribution in https://github.com/sgl-project/sglang/pull/12565
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.4...v0.5.5