Full set of optimizations for DeepSeek-V3.2 (MTP, PD-Disagg, Function Calling) (https://docs.sglang.ai/basic_usage/deepseek_v32.html, https://github.com/sgl-project/sglang/issues/11989)
New model support: Nemotron, DeepSeek OCR, Qwen3-Omni, Olmo 3
Native ModelOpt quantization support
What's Changed
[router] add ipv6 support across all components by @slin1237 in https://github.com/sgl-project/sglang/pull/11219
Remove env var warnings for release by @merrymercy in https://github.com/sgl-project/sglang/pull/11262
Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/7149
[router][tool call] Clean up redundant detect_format and has_tool_markers by @CatherineSue in https://github.com/sgl-project/sglang/pull/11270
disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in https://github.com/sgl-project/sglang/pull/11274
docker: add manifest to versioned docker releases by @ishandhanani in https://github.com/sgl-project/sglang/pull/11268
[Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in https://github.com/sgl-project/sglang/pull/11182
[router][grpc] Refine streaming processes by @CatherineSue in https://github.com/sgl-project/sglang/pull/11277
Fix code sync scripts by @merrymercy in https://github.com/sgl-project/sglang/pull/11276
[Auto Sync] Update test_utils.py (20251006) by @merrymercy in https://github.com/sgl-project/sglang/pull/11280
Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in https://github.com/sgl-project/sglang/pull/11279
Revert the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in https://github.com/sgl-project/sglang/pull/11238
Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/11261
fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11282
docs: update sgl-kernel README by @zhyncs in https://github.com/sgl-project/sglang/pull/11286
chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11281
[router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in https://github.com/sgl-project/sglang/pull/11283
convert test_deterministic into unit tests by @skyzh in https://github.com/sgl-project/sglang/pull/11095
Feature/longbench v2 evaluation utils by @alhridoy in https://github.com/sgl-project/sglang/pull/10949
[ci] fix pp test by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11294
EAGLE cache fix for SWARadixCache by @ispobock in https://github.com/sgl-project/sglang/pull/11231
Remove overlap thread by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11210
[router] add reasoning and tool parser argument in router by @slin1237 in https://github.com/sgl-project/sglang/pull/11290
Remove sampling info events and overlap thread file by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11300
Introduce future indices by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11301
[sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in https://github.com/sgl-project/sglang/pull/11068
[Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in https://github.com/sgl-project/sglang/pull/11302
[router] add get server info and get model info in grpc server by @slin1237 in https://github.com/sgl-project/sglang/pull/11303
[router][grpc] Refactor chat template content format detection by @CatherineSue in https://github.com/sgl-project/sglang/pull/11288
[Doc] HiCache Design Documents by @ykwd in https://github.com/sgl-project/sglang/pull/11027
[Doc]: Best Practice for HICache by @hzh0425 in https://github.com/sgl-project/sglang/pull/11001
[router] fix grpc connection conversion and add optimization by @slin1237 in https://github.com/sgl-project/sglang/pull/11305
[router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in https://github.com/sgl-project/sglang/pull/11306
Update tool parser and related documentation by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11223
[router][grpc] Fix error message format in grpc chat handler by @CatherineSue in https://github.com/sgl-project/sglang/pull/11307
[quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in https://github.com/sgl-project/sglang/pull/11205
[router] support Openai router conversation API CRUD by @key4ng in https://github.com/sgl-project/sglang/pull/11297
[router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in https://github.com/sgl-project/sglang/pull/11311
[router] cleanup worker health check to return early by @slin1237 in https://github.com/sgl-project/sglang/pull/11310
[oai serving chat] Add argument --sampling-defaults and fix ChatCompletionRequest defaults by @CatherineSue in https://github.com/sgl-project/sglang/pull/11304
Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in https://github.com/sgl-project/sglang/pull/11200
ci: unify the model launch method of nightly ci by @mickqian in https://github.com/sgl-project/sglang/pull/11230
[Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in https://github.com/sgl-project/sglang/pull/10710
update sampling_params documentation with defaults by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11315
Optimize copy_kv_cache for spec decoding by @YAMY1234 in https://github.com/sgl-project/sglang/pull/11126
Rename ngram_utils -> ngram_info by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11316
[router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in https://github.com/sgl-project/sglang/pull/11314
[Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in https://github.com/sgl-project/sglang/pull/9545
[8/N] MoE Refactor: deprecate EPMoE by @ch-wan in https://github.com/sgl-project/sglang/pull/11211
Skip weight loading in deepgemm compilation by @ch-wan in https://github.com/sgl-project/sglang/pull/11312
[2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in https://github.com/sgl-project/sglang/pull/10937
[Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11321
fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in https://github.com/sgl-project/sglang/pull/11007
Support LoRA in bench_serving oai interface by @lifuhuang in https://github.com/sgl-project/sglang/pull/11318
benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in https://github.com/sgl-project/sglang/pull/9812
[CI] improve disaggregation CI. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11264
model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in https://github.com/sgl-project/sglang/pull/10909
[router] refactor generate to use new pipeline arch by @slin1237 in https://github.com/sgl-project/sglang/pull/11323
[router] improve reasoning parser lock and reduce req cloning by @slin1237 in https://github.com/sgl-project/sglang/pull/11336
[router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in https://github.com/sgl-project/sglang/pull/11340
[router] Fix all unused_qualifications by @CatherineSue in https://github.com/sgl-project/sglang/pull/11341
[router] Support history management using conversation by @key4ng in https://github.com/sgl-project/sglang/pull/11339
[router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in https://github.com/sgl-project/sglang/pull/11342
fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in https://github.com/sgl-project/sglang/pull/11327
[Auto Sync] Update scheduler.py (20251009) by @zhyncs in https://github.com/sgl-project/sglang/pull/11350
[Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in https://github.com/sgl-project/sglang/pull/10979
[router][grpc] disable health check generation and increase timeout by @slin1237 in https://github.com/sgl-project/sglang/pull/11353
[router] Refactor OpenAI router: split monolithic file and move location by @key4ng in https://github.com/sgl-project/sglang/pull/11359
[router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in https://github.com/sgl-project/sglang/pull/11366
[DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in https://github.com/sgl-project/sglang/pull/11309
[router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in https://github.com/sgl-project/sglang/pull/11373
add code pp support for nixl by @shaharmor98 in https://github.com/sgl-project/sglang/pull/11375
fix bench_serving mishandling of internal states by @shaharmor98 in https://github.com/sgl-project/sglang/pull/11376
[router][grpc] Replace fake health check with correct ones by @CatherineSue in https://github.com/sgl-project/sglang/pull/11387
[router] change grpc client from mutable to clone by @slin1237 in https://github.com/sgl-project/sglang/pull/11394
chore: upgrade flashinfer 0.4.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/11364
[router] conversation item API: create, retrieve and delete by @key4ng in https://github.com/sgl-project/sglang/pull/11369
chore: bump SGLang version to 0.5.3.post1 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11324
move more files under srt/utils by @merrymercy in https://github.com/sgl-project/sglang/pull/11285
[grammar] Avoid server crash when grammar backend is None by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11401
fix: fix gpu-proc affinity set incorrectly when pp_size > 1 by @acelyc111 in https://github.com/sgl-project/sglang/pull/11389
[Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded by @glenliu21 in https://github.com/sgl-project/sglang/pull/11365
[CI] Refactor PD disaggregation test suite by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11363
Replace pad with cat for better performance by @yuan-luo in https://github.com/sgl-project/sglang/pull/11388
fix: reinstall torch in deps install by @zhyncs in https://github.com/sgl-project/sglang/pull/11414
feat(hicache): Support passing prefix keys for l3 store. by @hzh0425 in https://github.com/sgl-project/sglang/pull/9045
fix file and object naming scheme in HiCacheNixl to avoid data corruption by @ziruiliu in https://github.com/sgl-project/sglang/pull/10969
Dedicated toml files for CPU/XPU by @ZailiWang in https://github.com/sgl-project/sglang/pull/10734
Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in https://github.com/sgl-project/sglang/pull/11144
chore: update pyproject by @zhyncs in https://github.com/sgl-project/sglang/pull/11420
fix: fix video input for qwen3-vl by @mickqian in https://github.com/sgl-project/sglang/pull/11361
perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in https://github.com/sgl-project/sglang/pull/11381
[HiCache] feat: add multi tenant with prefix tag by @stmatengss in https://github.com/sgl-project/sglang/pull/9256
[CI] Merge build-dev into workflow matrix by @csahithi in https://github.com/sgl-project/sglang/pull/11345
Revert "perf: optimize qwen-vl with symm mem allreduce" by @ch-wan in https://github.com/sgl-project/sglang/pull/11436
Revert "fix: fix video input for qwen3-vl" by @merrymercy in https://github.com/sgl-project/sglang/pull/11437
Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" by @scottjlee in https://github.com/sgl-project/sglang/pull/11433
[router] Fix ci nvcc not found error by @key4ng in https://github.com/sgl-project/sglang/pull/11411
feat(mooncake): support GB suffix for global_segment_size by @xiaguan in https://github.com/sgl-project/sglang/pull/10745
Separate allocation logic from scheduler by @cctry in https://github.com/sgl-project/sglang/pull/11313
[router] disable rate limiter by default by @slin1237 in https://github.com/sgl-project/sglang/pull/11435
[router] leverage RAII to actively cancel request during client disconnect by @slin1237 in https://github.com/sgl-project/sglang/pull/11399
[router][grpc] Consolidate parser checks for chat completions by @CatherineSue in https://github.com/sgl-project/sglang/pull/11439
Reorder PD disagg CI tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11438
fix: Change dsv32 hack temporary path to use system temp directory by @wxsms in https://github.com/sgl-project/sglang/pull/11445
Fix batch invariant ops by @hebiao064 in https://github.com/sgl-project/sglang/pull/11368
[BugFix] test_mla_fp8.py fails on Cublas 12.9 by @Liu-congo in https://github.com/sgl-project/sglang/pull/11360
[DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton by @byjiang1996 in https://github.com/sgl-project/sglang/pull/11450
Remove tilelang dependency in Dockerfile by @Fridge003 in https://github.com/sgl-project/sglang/pull/11455
Enable native ModelOpt quantization support (2/3) by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/9991
Reland [1/2] Optimizations and refactors about quant kernel by @fzyzcjy in https://github.com/sgl-project/sglang/pull/10312
Super tiny delete unused openai router in sgl-router by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11448
Adjust logits metadata init for target verify by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11467
[Documentation][Configuration] Server args and documentation of PD-Multiplexing. by @ykcombat in https://github.com/sgl-project/sglang/pull/11427
Fix enable_v2 in int8 quant by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11470
[Fix] Fix split prefill with fa3. by @ykcombat in https://github.com/sgl-project/sglang/pull/11428
fix stop when stream by @whybeyoung in https://github.com/sgl-project/sglang/pull/11462
Add option to disable any_whitespace for xgrammar and llguidance backends. by @lulor in https://github.com/sgl-project/sglang/pull/8919
[7/n] decouple quantization impl from vllm dependency - gguf kernel by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11019
fix Xeon CI by @ZailiWang in https://github.com/sgl-project/sglang/pull/11454
[CI] Add nightly builds to dockerhub by @csahithi in https://github.com/sgl-project/sglang/pull/9804
[Feature] support regex strings as a stopping condition by @glenliu21 in https://github.com/sgl-project/sglang/pull/10635
Beta spec-overlap for EAGLE by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11398
Piecewise CUDA Graph Support & Torch Compile Backend by @Oasis-Git in https://github.com/sgl-project/sglang/pull/10062
[Router]: Small Typo in a comment within tree.rs by @xuwenyihust in https://github.com/sgl-project/sglang/pull/11489
chore: bump sgl-kernel version to 0.3.16 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11476
[smol] [perf] Qwen3-VL in place op. by @vincentzed in https://github.com/sgl-project/sglang/pull/11481
[chore][1/N] Avoid using default mutable parameters by @kevin85421 in https://github.com/sgl-project/sglang/pull/11478
[bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends by @MahmoudAshraf97 in https://github.com/sgl-project/sglang/pull/10172
[perf] Replace json -> orjson in hot path by @vincentzed in https://github.com/sgl-project/sglang/pull/11221
[chore][2/N] Avoid using default mutable parameters by @kevin85421 in https://github.com/sgl-project/sglang/pull/11479
Fix the GPT function calling regex to allow dash in the name by @antoine-roux in https://github.com/sgl-project/sglang/pull/10577
bailingMoE: Fix Key error of deepep_mode by @QiuMike in https://github.com/sgl-project/sglang/pull/11465
Fix CI break by express-laned PRs. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11499
Move args from global_config to environ by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11332
move fla env check position by @yizhang2077 in https://github.com/sgl-project/sglang/pull/11500
Temporarily remove b200 tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11501
Fix port conflicts in CI by @merrymercy in https://github.com/sgl-project/sglang/pull/11497
temporarily remove b200 tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11502
Fix unit tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11503
Bugfix: Fix Type consistency for KV indices in SWARadixCache by @hzh0425 in https://github.com/sgl-project/sglang/pull/11452
doc: add doc for adding new models into nightly-ci by @mickqian in https://github.com/sgl-project/sglang/pull/11443
[CI] fix lint by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11509
Deprecate global_server_args_dict by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11331
chore: remove flashinfer cleanup cache by @zhyncs in https://github.com/sgl-project/sglang/pull/11514
fix: revert temporarily remove b200 tests by @zhyncs in https://github.com/sgl-project/sglang/pull/11515
[Fix] Improve longbench prompt and other logics by @byjiang1996 in https://github.com/sgl-project/sglang/pull/11474
Sync changes on io_struct.py and deterministic ops by @merrymercy in https://github.com/sgl-project/sglang/pull/11498
[lint] Fix the lint issue by @ch-wan in https://github.com/sgl-project/sglang/pull/11516
Revert "Deprecate global_server_args_dict" by @ch-wan in https://github.com/sgl-project/sglang/pull/11520
Improve dp attention port assignment scheme by @jokerwyt in https://github.com/sgl-project/sglang/pull/5889
[router] openai router: support grok model by @key4ng in https://github.com/sgl-project/sglang/pull/11511
docs(router): add token-bucket rate limiting to the docs by @Jonahcb in https://github.com/sgl-project/sglang/pull/11485
[sgl-kernel][1/N]Support Expert Specialization Grouped GEMM by @HydraQYH in https://github.com/sgl-project/sglang/pull/11432
Update DeepSeek-R1-FP4 default config on blackwell by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11512
[Fix]: add missing device attribute to ChunkCache by @leavelet in https://github.com/sgl-project/sglang/pull/11493
[Feature] Support mamba radix cache v0 by @yizhang2077 in https://github.com/sgl-project/sglang/pull/11214
ci: improve nightly-ci by @mickqian in https://github.com/sgl-project/sglang/pull/11385
[CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering by @BBuf in https://github.com/sgl-project/sglang/pull/11505
[HICache]: Support 3FS-Store with page_first_direct layout by @hzh0425 in https://github.com/sgl-project/sglang/pull/11460
Tiny fix test run estimated time by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11544
[Reland] perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in https://github.com/sgl-project/sglang/pull/11457
Deprecate global_server_args_dict by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11528
[Fix] Add per_channel_quant parameter to MoE config functions by @mmangkad in https://github.com/sgl-project/sglang/pull/11201
[router][ci] Add Nightly Release Workflow for SGLang Router by @slin1237 in https://github.com/sgl-project/sglang/pull/11527
[router] add tokenizer path to be dir by @slin1237 in https://github.com/sgl-project/sglang/pull/11530
Remove tp_worker.worker by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11548
fix: fix video input for qwen3-vl by @mickqian in https://github.com/sgl-project/sglang/pull/11442
[NVIDIA] BUMP FA3 by @johnnynunez in https://github.com/sgl-project/sglang/pull/11444
[Fix] Include grpc reflection runtime dependency by @ai-jz in https://github.com/sgl-project/sglang/pull/11419
Adjust overlap event loop by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11507
Move deep gemm related arguments to sglang.srt.environ by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11547
[router][grpc] Further delegate non-stream processing to processing.rs by @CatherineSue in https://github.com/sgl-project/sglang/pull/11553
[router] allow user to specify chat template path by @slin1237 in https://github.com/sgl-project/sglang/pull/11549
Minor: improve sampler & remove unused fields from model_config.py by @merrymercy in https://github.com/sgl-project/sglang/pull/11531
[router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter by @Jonahcb in https://github.com/sgl-project/sglang/pull/11483
Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in https://github.com/sgl-project/sglang/pull/11441
Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) by @trevor-m in https://github.com/sgl-project/sglang/pull/11557
[CI] Add Basic Test for DeepSeek V3.2 by @Fridge003 in https://github.com/sgl-project/sglang/pull/11308
[router][grpc] Add error handling to generate_tool_constraints by @CatherineSue in https://github.com/sgl-project/sglang/pull/11562
[NVIDIA] update pyproject.toml to support cu130 option by @johnnynunez in https://github.com/sgl-project/sglang/pull/11521
[CI Monitor] CI monitor only deals with main branch by default by @BBuf in https://github.com/sgl-project/sglang/pull/11538
Tiny cleanup fp4 gemm calls by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11537
[router][grpc] Add serve_grpc to launch_server and log id for HealthCheck by @CatherineSue in https://github.com/sgl-project/sglang/pull/11564
[router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/11571
[sgl-kernel][2/N]Support Expert Specialization Grouped GEMM by @HydraQYH in https://github.com/sgl-project/sglang/pull/11534
chore: bump sgl-kernel version to 0.3.16.post1 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11573
Fix accept rate in speculative decoding metrics by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11572
Compilation Folder Reset by @Oasis-Git in https://github.com/sgl-project/sglang/pull/11539
[FEATURE] Add Profile Trace Merger for Distributed Traces by @neelabhsinha in https://github.com/sgl-project/sglang/pull/11413
[DSv32] Use torch.compile for _get_logits_head_gate by @trevor-m in https://github.com/sgl-project/sglang/pull/11565
Make DeepEP combine recv not overlap by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11535
bench_serving support PD Disaggregation by @BBuf in https://github.com/sgl-project/sglang/pull/11542
Implement LRU eviction policy for LoRA adapters by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/11041
Revert "[NVIDIA] BUMP FA3 (#11444)" by @zhyncs in https://github.com/sgl-project/sglang/pull/11582
chore: bump sgl-kernel version to 0.3.16.post2 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11583
[Auto Sync] Update model_config.py (20251014) by @merrymercy in https://github.com/sgl-project/sglang/pull/11580
Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11587
[router][protocols] Add Axum validate extractor and use it for /v1/chat/completions endpoint by @CatherineSue in https://github.com/sgl-project/sglang/pull/11588
[router] update generate spec to align with sgl io struct by @slin1237 in https://github.com/sgl-project/sglang/pull/11591
[router] change worker api to async instead of sync by @slin1237 in https://github.com/sgl-project/sglang/pull/11566
Update news section in README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/11598
[router] delete useless table content comment in spec by @slin1237 in https://github.com/sgl-project/sglang/pull/11597
[router] allow router launch server to use grpc mode by @slin1237 in https://github.com/sgl-project/sglang/pull/11600
[Docs] [Router]: Update sgl-router doc on circuit breaker by @xuwenyihust in https://github.com/sgl-project/sglang/pull/11449
[router] when given both local tokenizer and chat template, log all by @slin1237 in https://github.com/sgl-project/sglang/pull/11601
[AMD CI] Add image and weights caching. by @saienduri in https://github.com/sgl-project/sglang/pull/11593
Update release-docker-dev.yml by @sglang-bot in https://github.com/sgl-project/sglang/pull/11603
Optimize Triton Draft Backend by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11556
Refactor spec decoding metrics calculation into separate TokenizerManager utility function by @scottjlee in https://github.com/sgl-project/sglang/pull/11586
make radix cache deterministic by @skyzh in https://github.com/sgl-project/sglang/pull/10721
move eagle draft post process to cuda graph by @cicirori in https://github.com/sgl-project/sglang/pull/11434
Reduce one step decode for draft model. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11561
[router] add py binding and readme for openai router and history backend by @key4ng in https://github.com/sgl-project/sglang/pull/11453
[router] cleanup app context and move to startup by @slin1237 in https://github.com/sgl-project/sglang/pull/11617
[router] add chang and keyang to sgl router author by @slin1237 in https://github.com/sgl-project/sglang/pull/11620
use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. by @strgrb in https://github.com/sgl-project/sglang/pull/11605
[router] update router readme to latest features by @slin1237 in https://github.com/sgl-project/sglang/pull/11619
Fix log for chunked prefix cache by @Fridge003 in https://github.com/sgl-project/sglang/pull/11624
[Auto Sync] Update scheduler.py, server_args.py (20251014) by @merrymercy in https://github.com/sgl-project/sglang/pull/11623
[Auto Sync] Update collector.py (20251014) by @merrymercy in https://github.com/sgl-project/sglang/pull/11625
[Minor] Update xgrammar dependency by @DarkSharpness in https://github.com/sgl-project/sglang/pull/11622
Update install.md by @merrymercy in https://github.com/sgl-project/sglang/pull/11631
fix: Update SGL_KERNEL_VERSION to 0.3.15 by @zhyncs in https://github.com/sgl-project/sglang/pull/11633
[router][grpc] add warm up to grpc server by @slin1237 in https://github.com/sgl-project/sglang/pull/11627
Refactor kv cache free by @cctry in https://github.com/sgl-project/sglang/pull/11351
[router] update router doc to latest features by @slin1237 in https://github.com/sgl-project/sglang/pull/11639
fix: upgrade transformers to 4.57.1 by @csahithi in https://github.com/sgl-project/sglang/pull/11628
[router] add worker self discovery for metadata by @slin1237 in https://github.com/sgl-project/sglang/pull/11638
[router] upgrade to 0.2.0 by @slin1237 in https://github.com/sgl-project/sglang/pull/11642
[1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP by @UNIDY2002 in https://github.com/sgl-project/sglang/pull/10423
[1/N]Support DeepSeek-R1 w4a8 normal deepep by @ayrnb in https://github.com/sgl-project/sglang/pull/8247
[Fix] Fix accuracy bug in CSGMV kernel caching key. by @lifuhuang in https://github.com/sgl-project/sglang/pull/11579
feat: add add_chunked_prefix_cache_attention_backend by @zhyncs in https://github.com/sgl-project/sglang/pull/11636
Super tiny improve FA3 import error message by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11590
[BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl by @ZhengWG in https://github.com/sgl-project/sglang/pull/11458
[Doc] Update support matrix for attn and hybrid attn by @b8zhong in https://github.com/sgl-project/sglang/pull/11293
Clean up some Qwen3-Next and deterministic code by @hebiao064 in https://github.com/sgl-project/sglang/pull/11585
docs: update sglang installation guide by @zhyncs in https://github.com/sgl-project/sglang/pull/11659
Tiny cleanup some eagle unused codes by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11660
Fix 1-step draft model forward by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11653
[tool call] Fix prev_tool_call_arr management in base_format_detector.py by @CatherineSue in https://github.com/sgl-project/sglang/pull/11367
[router] Fix response api related spec by @key4ng in https://github.com/sgl-project/sglang/pull/11621
Fix missing json imports in serving_responses.py by @CatherineSue in https://github.com/sgl-project/sglang/pull/11681
[sgl-kernel][3/N]Support Expert Specialization Grouped GEMM by @HydraQYH in https://github.com/sgl-project/sglang/pull/11674
[sgl-kernel] Optimize gguf test by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11667
[router][grpc] Simplify model_id determination by @CatherineSue in https://github.com/sgl-project/sglang/pull/11684
[router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding by @slin1237 in https://github.com/sgl-project/sglang/pull/11676
chore: bump SGLang version to 0.5.3.post2 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11680
[CI][XPU] enable sglang CI on Intel XPU by @DiweiSun in https://github.com/sgl-project/sglang/pull/9493
enable rmsnorm on XPU by @huaiyuzh in https://github.com/sgl-project/sglang/pull/10248
Sync code and test CI; rename some env vars by @merrymercy in https://github.com/sgl-project/sglang/pull/11686
docs: Add Contributor Covenant Code of Conduct by @zhyncs in https://github.com/sgl-project/sglang/pull/11689
[Mamba] Increase default mamba_full_memory_ratio to 0.9 by @hanming-lu in https://github.com/sgl-project/sglang/pull/11679
[PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) by @ShangmingCai in https://github.com/sgl-project/sglang/pull/10912
[sgl-kernel] support hadamard by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11663
Fix missing a2a backend init of GLM4.5 MoE Block by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11692
Split test_intel_amx_attention_backend.py to avoid CI timeout by @yanbing-j in https://github.com/sgl-project/sglang/pull/11370
Set csgmv as default lora backend. by @lifuhuang in https://github.com/sgl-project/sglang/pull/11488
[Bugfix] Fix Qwen3/DSV3/DSV3.2 model support by @iforgetmyname in https://github.com/sgl-project/sglang/pull/11510
[CI] Add GLM4MoE model test by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11706
[router] fix get_models endpoint for openai router by @key4ng in https://github.com/sgl-project/sglang/pull/11687
[ci] use H20 to run disaggregation test by @HanHan009527 in https://github.com/sgl-project/sglang/pull/11543
chore: bump SGLang version to 0.5.3.post3 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11693
model: qwen3-omni (thinker-only) by @mickqian in https://github.com/sgl-project/sglang/pull/10911
[Router] Refactor protocol definitions: split spec.rs into modular files by @key4ng in https://github.com/sgl-project/sglang/pull/11677
[router] fix p and d worker filtering and bootstrap port handling by @slin1237 in https://github.com/sgl-project/sglang/pull/11729
[router][grpc] add disagg info to warm up in grpc server by @slin1237 in https://github.com/sgl-project/sglang/pull/11727
[router] Fix tool_choice normalization in ChatCompletionRequest and fix ut by @CatherineSue in https://github.com/sgl-project/sglang/pull/11731
Revert "make radix cache deterministic" by @Fridge003 in https://github.com/sgl-project/sglang/pull/11728
Reduce the image processing latency in VLM by @zhooooong in https://github.com/sgl-project/sglang/pull/11541
[router] add spec.rs to enable tests under spec folder by @key4ng in https://github.com/sgl-project/sglang/pull/11734
[router] Add rustfmt and set group imports by default by @CatherineSue in https://github.com/sgl-project/sglang/pull/11732
Revert "[router] fix get_models endpoint for openai router (#11687)" by @key4ng in https://github.com/sgl-project/sglang/pull/11740
[router][CI] Clean up deprecated fields in pr-test-pd-router.yml by @CatherineSue in https://github.com/sgl-project/sglang/pull/11739
[CI] Fix broken event loop creation by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11746
[overlap-spec] Make plan stream an option by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11724
ci: reduce and refactor vlm ut and combine test files by @mickqian in https://github.com/sgl-project/sglang/pull/11062
Abstraction for spec worker and code cleanup by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11643
add tuned fuse moe kernel for qwen3 235b fp8 on h200 by @pdasgup in https://github.com/sgl-project/sglang/pull/11730
Revert "Set csgmv as default lora backend. (#11488)" by @zhyncs in https://github.com/sgl-project/sglang/pull/11735
[router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder by @slin1237 in https://github.com/sgl-project/sglang/pull/11766
[router] fix grpc client time out to 1h by @slin1237 in https://github.com/sgl-project/sglang/pull/11768
[doc] update router document by @key4ng in https://github.com/sgl-project/sglang/pull/11767
[Feature] Reuse flashinfer workspace for PD-Multiplexing. by @ykcombat in https://github.com/sgl-project/sglang/pull/11540
Turn on shm_allreduce and shm_allgather for fp16 by @chunyuan-w in https://github.com/sgl-project/sglang/pull/10725
[Auto Sync] Update scheduler.py (20251017) by @zhyncs in https://github.com/sgl-project/sglang/pull/11738
[router][grpc] Remove timeout for connections and remove max_tokens deprecation warning log by @CatherineSue in https://github.com/sgl-project/sglang/pull/11775
Cleaning indexer for DeepSeek V3.2 by @Fridge003 in https://github.com/sgl-project/sglang/pull/11682
[minor] sync code on python/sglang/test/test_deterministic.py and improve ci tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11777
[Auto Sync] Update common.py (20251017) by @merrymercy in https://github.com/sgl-project/sglang/pull/11782
[Fix] Skip visual layers when applying LoRA to Qwen2VL modules by @anvdn in https://github.com/sgl-project/sglang/pull/11519
[Lint] Add python/sglang to ruff F401 checks and remove unused imports in files by @CatherineSue in https://github.com/sgl-project/sglang/pull/11685
Super tiny fix missing input throughput by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11607
Support shared experts overlap in cutlass moe by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11611
Support casting bf16 NextN moe to fp8 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11613
Manually flip deepep_mode for cuda_graph by @zhuzilin in https://github.com/sgl-project/sglang/pull/11666
Set CUDA_VISIBLE_DEVICES to achieve one GPU per process by @merrymercy in https://github.com/sgl-project/sglang/pull/9170
Super tiny fix CI by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11788
Make single-batch overlap compatible with offloading by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11614
completely remove mixed mode deterministic test as prefix mode could cover it by @zminglei in https://github.com/sgl-project/sglang/pull/11783
[Refactor] move deep_gemm_wrapper out of quantization by @ch-wan in https://github.com/sgl-project/sglang/pull/11784
Enable lint on main by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11794
[router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client by @CatherineSue in https://github.com/sgl-project/sglang/pull/11798
Try add back no-commit-to-branch by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11799
fix(glm45): disable reduce scatter by @jinmingyi1998 in https://github.com/sgl-project/sglang/pull/11665
fix command line usage of profiling by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11793
[RL] support weight update with DP attention by @zhuzilin in https://github.com/sgl-project/sglang/pull/11669
[RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded by @zhuzilin in https://github.com/sgl-project/sglang/pull/10152
set default attention backend for deterministic inference by @zminglei in https://github.com/sgl-project/sglang/pull/11801
Eager Compiler for Torch Compile by @Oasis-Git in https://github.com/sgl-project/sglang/pull/11803
Fix install instructions and pyproject.tomls by @merrymercy in https://github.com/sgl-project/sglang/pull/11781
Bump torch_memory_saver to avoid installing pre-release versions by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11797
[HiCache] feat: add more eviction policy by @stmatengss in https://github.com/sgl-project/sglang/pull/11506
[overlap-spec] support page size > 1 by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11772
support server arg override KV cache to bf16 to avoid slow cases by @b8zhong in https://github.com/sgl-project/sglang/pull/11749
feat(example/fastapi): support --startup-timeout using Qwen3-Next-80B-A3B-Instruct as example by @Kindyaa in https://github.com/sgl-project/sglang/pull/11710
ci: update lmms-eval to speed up multimodal CI by @b8zhong in https://github.com/sgl-project/sglang/pull/11000
Use cutlass fp4 gemm by default by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11813
Fix Dockerfile not installing correct version of DeepEP for arm build by @kyleliang-nv in https://github.com/sgl-project/sglang/pull/11773
[router] Add Configurable L0 and L1 Tokenizer Caching by @slin1237 in https://github.com/sgl-project/sglang/pull/11688
[2/2] [feature] support openai like classification api in router by @whybeyoung in https://github.com/sgl-project/sglang/pull/11670
[1/2][feature] support openai like classification api by @whybeyoung in https://github.com/sgl-project/sglang/pull/11618
make sure logit bias is applied during eagle spec decoding verification by @petricevich in https://github.com/sgl-project/sglang/pull/11555
fix: do not wrap invalid grammar objects during constrained generation by @tazjin in https://github.com/sgl-project/sglang/pull/11328
Improve send_one script by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11817
Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads by @YAMY1234 in https://github.com/sgl-project/sglang/pull/10788
Update CODEOWNERS for layer quantization path by @merrymercy in https://github.com/sgl-project/sglang/pull/11818
support tokenized batch request by @narutolhy in https://github.com/sgl-project/sglang/pull/11091
Tiny add hints when users send requests to wrong place by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11808
Make single-batch overlap compatible with NextN by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11804
Support not officially supported high sgl-kernel version with low srt version by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11786
Avoid generation hanging when the user specifies multiple event loops by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5162
Change bf16 to fp8 for some gemms in attention for DeepSeek ckpt v2 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11805
Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11827
[overlap-spec] fix stop condition and trimming by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11819
[Spec Decoding] Support MTP for dsv3.2 by @Paiiiiiiiiiiiiii in https://github.com/sgl-project/sglang/pull/11652
[CI] always print back trace in retry() by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11834
[Test] Add basic matched stop for beta eagle by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11833
Deterministic Mode: Add 1-stage triton kernel for prefill by @hebiao064 in https://github.com/sgl-project/sglang/pull/11147
[logprobs] Enable local deterministic logprobs testing with strict threshold by @PrinsYin in https://github.com/sgl-project/sglang/pull/10994
[CI] Add CI test for DeepSeek V3.2 MTP by @Fridge003 in https://github.com/sgl-project/sglang/pull/11835
[NVIDIA] FA3/FA4 Fix by @johnnynunez in https://github.com/sgl-project/sglang/pull/11606
[DeepseekV32] Add fast_topk_transform_ragged_fused kernel by @hlu1 in https://github.com/sgl-project/sglang/pull/11815
Fix triton_kernels import error on some hardwares by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11831
Tiny bump DeepEP version in ARM blackwell by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11810
[BugFix] replace the input_to_float8 used in dsv2 by @Liu-congo in https://github.com/sgl-project/sglang/pull/11612
[Doc] Update documents for FA4 by @Fridge003 in https://github.com/sgl-project/sglang/pull/11778
fix(ci): Fix CI Monitor limit parameter and add CI Analysis to summary by @BBuf in https://github.com/sgl-project/sglang/pull/11832
Fix version bump script to handle TOML files with outdated versions by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/11787
Improve Kernel Build Time by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/11508
check master server for mooncake store by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/10510
chore: bump sgl-kernel version to 0.3.16.post3 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11733
Recapture cuda graph after model weight update to resolve IMA error by @harrisonlimh in https://github.com/sgl-project/sglang/pull/11780
[Feature] Use current greenctx stream to communicate in PD-Multiplexing. by @ykcombat in https://github.com/sgl-project/sglang/pull/11594
Support mrope triton kernel and add unit test by @yuan-luo in https://github.com/sgl-project/sglang/pull/11722
[PD] Improve eagle acceptance rate by transferring draft model hidden states by @ZeldaHuang in https://github.com/sgl-project/sglang/pull/10801
Tiny clean up for PD module and doc by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11747
Revert "[CI Monitor] Ci monitor only deal with main branch in default" by @BBuf in https://github.com/sgl-project/sglang/pull/11846
[Model] Add Olmo 3 model support by @2015aroras in https://github.com/sgl-project/sglang/pull/11396
Update amd gpu install docs. by @saienduri in https://github.com/sgl-project/sglang/pull/11849
[AMD CI] Populate image cache in nightly docker release. by @saienduri in https://github.com/sgl-project/sglang/pull/11822
fix(server_args): handle tokenizer init conflicts by @ishandhanani in https://github.com/sgl-project/sglang/pull/11776
[Feature] New structural tag support by @DarkSharpness in https://github.com/sgl-project/sglang/pull/10691
Tiny fix main lint by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11862
[9/N] MoE Refactor: cleanup dispatcher interfaces by @ch-wan in https://github.com/sgl-project/sglang/pull/11847
Fix acc len and gen throughput metrics when enabling overlap-spec by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11823
Replace function call with set literal by @penguin-wwy in https://github.com/sgl-project/sglang/pull/11867
Support mixing cutedsl and deepgemm backend by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11807
[router] Worker Management Workflow Engine by @slin1237 in https://github.com/sgl-project/sglang/pull/11868
[router] remove encoding header for oai router by @slin1237 in https://github.com/sgl-project/sglang/pull/11881
[Auto Sync] Update scheduler.py, server_args.py (20251020) by @merrymercy in https://github.com/sgl-project/sglang/pull/11875
[router][grpc] Remove continue_final_message in ChatTemplateParams and add minijinja-contrib by @CatherineSue in https://github.com/sgl-project/sglang/pull/11882
fix(sgl-router): fix conflict port in test by @htiennv in https://github.com/sgl-project/sglang/pull/11826
[router] clean up workflow logs to debug for implementation details logs by @slin1237 in https://github.com/sgl-project/sglang/pull/11886
[code move] move pp into a separate mixin by @merrymercy in https://github.com/sgl-project/sglang/pull/11838
[router][grpc] Fix warm-up random token ids for small models by @CatherineSue in https://github.com/sgl-project/sglang/pull/11887
Revise MRotaryEmbedding's forward by @yuan-luo in https://github.com/sgl-project/sglang/pull/11859
piecewise cuda graph support qwen3-moe by @BBuf in https://github.com/sgl-project/sglang/pull/11845
Fix RotaryEmbedding for fp32 input by @zhangdonghao-zdh in https://github.com/sgl-project/sglang/pull/11843
Init attention backend for Intel XPU by @airMeng in https://github.com/sgl-project/sglang/pull/10656
Use trtllm_mla decode kernel for draft extend in speculative decoding by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11664
[router] release router 0.2.1 by @slin1237 in https://github.com/sgl-project/sglang/pull/11885
[AMD] Update wave-lang to 3.8.0 by @xintin in https://github.com/sgl-project/sglang/pull/11878
init support for KTransformers Heterogeneous Computing by @Atream in https://github.com/sgl-project/sglang/pull/11487
[FEATURE] Add OpenAI-Compatible LoRA Adapter Selection by @neelabhsinha in https://github.com/sgl-project/sglang/pull/11570
[fix] fix ci uv install dependency by @HanHan009527 in https://github.com/sgl-project/sglang/pull/11895
Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572] by @whybeyoung in https://github.com/sgl-project/sglang/pull/11416
Simplify multi-tokenizer by @zhengkezhou1 in https://github.com/sgl-project/sglang/pull/11295
[CI] disable glm4.1v and fix the flashinfer installation by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11902
vlm: enforce pybase64 for image and str encode/decode by @b8zhong in https://github.com/sgl-project/sglang/pull/10700
[smol] [perf] Inverse perm improvement by @vincentzed in https://github.com/sgl-project/sglang/pull/11482
[quantization][MoE] fix the check for tp_size / moe_ep_size / moe_intermediate_size / weight_block_size_n by @kevin85421 in https://github.com/sgl-project/sglang/pull/11702
[CI] Fix b200 flashinfer installation by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11915
Fix flush cache API for spec v2 by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11918
[NVIDIA] Add new SMs support for Spark & Thor by @Kh4L in https://github.com/sgl-project/sglang/pull/11287
Update sgl-kernel and remove fast hadamard dependency by @Fridge003 in https://github.com/sgl-project/sglang/pull/11844
Rename flashmla kernel options of nsa backend for better readability by @Fridge003 in https://github.com/sgl-project/sglang/pull/11876
chore: upgrade flashinfer 0.4.1 by @zhyncs in https://github.com/sgl-project/sglang/pull/11933
[BugFix][Qwen3-VL]: add metadata for video in qwen3-vl by @ZhengWG in https://github.com/sgl-project/sglang/pull/11377
[Auto Sync] Update forward_batch_info.py (20251021) by @zhyncs in https://github.com/sgl-project/sglang/pull/11934
Fix openai input_text type compatibility by @key4ng in https://github.com/sgl-project/sglang/pull/11935
fix: resolve flashinfer 0.4.1 import by @zhyncs in https://github.com/sgl-project/sglang/pull/11940
[router][grpc] Support v1/responses API by @CatherineSue in https://github.com/sgl-project/sglang/pull/11926
[router] Add gRPC E2E test suite by @key4ng in https://github.com/sgl-project/sglang/pull/11790
[router][grpc] Fix background tasks stored with wrong id by @CatherineSue in https://github.com/sgl-project/sglang/pull/11945
[lint] improve ruff check by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11922
[sgl-kernel] support flashmla libtorch by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11717
[NVIDIA] upstream FA4 and fix cccl path by @johnnynunez in https://github.com/sgl-project/sglang/pull/11929
Enable native ModelOpt quantization support (3/3) by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/10154
Fix mooncake dispatcher by @UNIDY2002 in https://github.com/sgl-project/sglang/pull/11908
[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank by @HanHan009527 in https://github.com/sgl-project/sglang/pull/10606
[model] Support POINTSV15Chat model by @josephydu in https://github.com/sgl-project/sglang/pull/9651
Fix flaky hicache test with mooncake backend by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11953
[Fix] Remove unused import from triton_kernels_moe.py by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11967
[router] Support multiple worker URLs for OpenAI router by @key4ng in https://github.com/sgl-project/sglang/pull/11723
[Documentation] add doc for deterministic inference by @zminglei in https://github.com/sgl-project/sglang/pull/11956
[6/n]decouple quantization implementation from vLLM dependency by @Hongbosherlock in https://github.com/sgl-project/sglang/pull/10750
[BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… by @yuho8818 in https://github.com/sgl-project/sglang/pull/11977
Revert "Recapture cuda graph after model weight update to resolve IMA error " by @merrymercy in https://github.com/sgl-project/sglang/pull/11980
[NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel by @jiahanc in https://github.com/sgl-project/sglang/pull/11563
[router] create worker removal step and clean up worker manager by @slin1237 in https://github.com/sgl-project/sglang/pull/11921
Implement BGE-M3 Sparse Embeddings in SGLang by @approximated-intelligence in https://github.com/sgl-project/sglang/pull/10869
[Doc] Update deterministic inference flag in server_arguments.md by @Fridge003 in https://github.com/sgl-project/sglang/pull/11978
[grpc] Support gRPC standard health check by @CatherineSue in https://github.com/sgl-project/sglang/pull/11955
[AMD] Support a new flag to disable quant on parallelLinear layer if required by @yichiche in https://github.com/sgl-project/sglang/pull/11811
[ROCm] Remove vLLM rope dependency & use AITER impl by @b8zhong in https://github.com/sgl-project/sglang/pull/11322
[NVIDIA] Build CUDA 13 by @johnnynunez in https://github.com/sgl-project/sglang/pull/11299
Bump grace blackwell DeepEP version by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11990
[CPU] misc updates by @ZailiWang in https://github.com/sgl-project/sglang/pull/11906
fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 by @zheng1 in https://github.com/sgl-project/sglang/pull/11965
[CPU] Optimize FP16 decode_attention_cpu by @blzheng in https://github.com/sgl-project/sglang/pull/10652
Allow to disable batch decoding. by @LorrinWWW in https://github.com/sgl-project/sglang/pull/11944
Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend by @cicirori in https://github.com/sgl-project/sglang/pull/11985
aiter update to v0.1.6.post1 by @HaiShaw in https://github.com/sgl-project/sglang/pull/12004
Support overlap-spec-v2 with trtllm_mla attention backend by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11821
Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 by @netanel-haber in https://github.com/sgl-project/sglang/pull/11866
[router] Add comprehensive E2E tests for Response API by @key4ng in https://github.com/sgl-project/sglang/pull/11988
[Router] Consolidate ConnectionMode enum to core module by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/11937
Move memory runtime checker to mixin class by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12014
Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12015
[Fix] memory leak by overlap + retract by @cctry in https://github.com/sgl-project/sglang/pull/11981
[Feature] Support loading weights from ckpt engine worker by @stmatengss in https://github.com/sgl-project/sglang/pull/11755
[router] change ci names and update log level in ci by @slin1237 in https://github.com/sgl-project/sglang/pull/12021
Feature/nano v2 offline modelopt fp8 and nvfp4 by @netanel-haber in https://github.com/sgl-project/sglang/pull/12018
[Auto Sync] Update test_deterministic_utils.py (20251023) by @merrymercy in https://github.com/sgl-project/sglang/pull/12022
ci: fix night-ci with push retry mechanism by @mickqian in https://github.com/sgl-project/sglang/pull/11765
[router][CI] Clean up imports and prints statements in sgl-router/py_test by @CatherineSue in https://github.com/sgl-project/sglang/pull/12024
Add AWQ quantization support for NPU. by @ErvinXie in https://github.com/sgl-project/sglang/pull/10158
model: support deepseek-ocr by @mickqian in https://github.com/sgl-project/sglang/pull/11891
Log iteration # for prefill and decode by @nvcastet in https://github.com/sgl-project/sglang/pull/9366
Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" by @b8zhong in https://github.com/sgl-project/sglang/pull/12028
Fix mamba radix cache eviction logic in alloc_req_slots by @rogeryoungh in https://github.com/sgl-project/sglang/pull/11616
Update Github action title for kernel build by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/12029
[router] Add builder pattern for RouterConfig with zero duplication by @slin1237 in https://github.com/sgl-project/sglang/pull/12030
Fixed aarch64 flash-mla by @nvjullin in https://github.com/sgl-project/sglang/pull/12009
chore: bump SGLang version to 0.5.4 by @sglang-bot in https://github.com/sgl-project/sglang/pull/12027
New Contributors
@xuwenyihust made their first contribution in https://github.com/sgl-project/sglang/pull/11302
@ziruiliu made their first contribution in https://github.com/sgl-project/sglang/pull/10969
@scottjlee made their first contribution in https://github.com/sgl-project/sglang/pull/11144
@Liu-congo made their first contribution in https://github.com/sgl-project/sglang/pull/11360
@lulor made their first contribution in https://github.com/sgl-project/sglang/pull/8919
@antoine-roux made their first contribution in https://github.com/sgl-project/sglang/pull/10577
@QiuMike made their first contribution in https://github.com/sgl-project/sglang/pull/11465
@ai-jz made their first contribution in https://github.com/sgl-project/sglang/pull/11419
@neelabhsinha made their first contribution in https://github.com/sgl-project/sglang/pull/11413
@UNIDY2002 made their first contribution in https://github.com/sgl-project/sglang/pull/10423
@zhooooong made their first contribution in https://github.com/sgl-project/sglang/pull/11541
@pdasgup made their first contribution in https://github.com/sgl-project/sglang/pull/11730
@anvdn made their first contribution in https://github.com/sgl-project/sglang/pull/11519
@Kindyaa made their first contribution in https://github.com/sgl-project/sglang/pull/11710
@petricevich made their first contribution in https://github.com/sgl-project/sglang/pull/11555
@tazjin made their first contribution in https://github.com/sgl-project/sglang/pull/11328
@Paiiiiiiiiiiiiii made their first contribution in https://github.com/sgl-project/sglang/pull/11652
@2015aroras made their first contribution in https://github.com/sgl-project/sglang/pull/11396
@zhangdonghao-zdh made their first contribution in https://github.com/sgl-project/sglang/pull/11843
@xintin made their first contribution in https://github.com/sgl-project/sglang/pull/11878
@zhengkezhou1 made their first contribution in https://github.com/sgl-project/sglang/pull/11295
@Kh4L made their first contribution in https://github.com/sgl-project/sglang/pull/11287
@yuho8818 made their first contribution in https://github.com/sgl-project/sglang/pull/11977
@jiahanc made their first contribution in https://github.com/sgl-project/sglang/pull/11563
@approximated-intelligence made their first contribution in https://github.com/sgl-project/sglang/pull/10869
@zheng1 made their first contribution in https://github.com/sgl-project/sglang/pull/11965
@ErvinXie made their first contribution in https://github.com/sgl-project/sglang/pull/10158
@rogeryoungh made their first contribution in https://github.com/sgl-project/sglang/pull/11616
@nvjullin made their first contribution in https://github.com/sgl-project/sglang/pull/12009
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.3...v0.5.4