Changelog

sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

sgl-project/sglang · 25k · 5k · Python · Apache-2.0

attention · blackwell · cuda · deepseek · diffusion · glm · +12

Last updated about 12 hours ago

October 26, 2025

Release v0.5.4

Highlights

AMD AI Dev Day 2025 SGLang (slide), PyTorch Conference 2025 SGLang (slide)
Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html
[beta] Overlap scheduler for speculative decoding: https://github.com/sgl-project/sglang/issues/11762
[beta] Piecewise CUDA graph for prefill: https://github.com/sgl-project/sglang/issues/11490
Prefix cache for qwen3 next and GDN/mamba models: https://github.com/sgl-project/sglang/pull/11214
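
The prefix-cache highlight builds on a general idea: when a new request shares a leading run of tokens with an earlier one, the server can reuse the work already done for that shared prefix. The toy trie below is only a sketch of that lookup, assuming plain integer token IDs. The class and method names are hypothetical, and it does not reflect how the linked PR handles Mamba/GDN state or how SGLang's radix cache is implemented.

```python
class ToyPrefixCache:
    """Toy token-prefix trie; illustrative only, not SGLang's implementation."""

    def __init__(self):
        # Each node is a dict mapping a token ID to its child node.
        self.root = {}

    def insert(self, tokens):
        """Record a token sequence so later requests can match its prefixes."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_prefix(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = ToyPrefixCache()
cache.insert([1, 2, 3, 4])

# The first three tokens are shared with the cached sequence,
# so only the remaining tokens would need fresh computation.
reused = cache.match_prefix([1, 2, 3, 9, 9])
print(reused)
```

A real serving cache also has to store the per-token state, evict entries under memory pressure, and handle concurrent requests, none of which this sketch attempts.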

[router] add ipv6 support across all components by @slin1237 in https://github.com/sgl-project/sglang/pull/11219
Remove env var warnings for release by @merrymercy in https://github.com/sgl-project/sglang/pull/11262
Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/7149
[router][tool call] Clean up redundant detect_format and has_tool_markers by @CatherineSue in https://github.com/sgl-project/sglang/pull/11270
disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in https://github.com/sgl-project/sglang/pull/11274
docker: add manifest to versioned docker releases by @ishandhanani in https://github.com/sgl-project/sglang/pull/11268
[Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in https://github.com/sgl-project/sglang/pull/11182
[router][grpc] Refine streaming processes by @CatherineSue in https://github.com/sgl-project/sglang/pull/11277
Fix code sync scripts by @merrymercy in https://github.com/sgl-project/sglang/pull/11276
[Auto Sync] Update test_utils.py (20251006) by @merrymercy in https://github.com/sgl-project/sglang/pull/11280
Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in https://github.com/sgl-project/sglang/pull/11279
Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in https://github.com/sgl-project/sglang/pull/11238
Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/11261
fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11282
docs: update sgl-kernel README by @zhyncs in https://github.com/sgl-project/sglang/pull/11286
chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11281
[router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in https://github.com/sgl-project/sglang/pull/11283
convert test_deterministic into unit tests by @skyzh in https://github.com/sgl-project/sglang/pull/11095
Feature/longbench v2 evaluation utils by @alhridoy in https://github.com/sgl-project/sglang/pull/10949
[ci] fix pp test by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11294
EAGLE cache fix for SWARadixCache by @ispobock in https://github.com/sgl-project/sglang/pull/11231
Remove overlap thread by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11210
[router] add reasoning and tool parser argument in router by @slin1237 in https://github.com/sgl-project/sglang/pull/11290
Remove sampling info events and overlap thread file by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11300
Introduce future indices by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11301
[sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in https://github.com/sgl-project/sglang/pull/11068
[Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in https://github.com/sgl-project/sglang/pull/11302
[router] add get server info and get model info in grpc server by @slin1237 in https://github.com/sgl-project/sglang/pull/11303
[router][grpc] Refactor chat template content format detection by @CatherineSue in https://github.com/sgl-project/sglang/pull/11288
[Doc] HiCache Design Documents by @ykwd in https://github.com/sgl-project/sglang/pull/11027
[Doc]: Best Practice for HICache by @hzh0425 in https://github.com/sgl-project/sglang/pull/11001
[router] fix grpc connection conversion and add optimization by @slin1237 in https://github.com/sgl-project/sglang/pull/11305
[router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in https://github.com/sgl-project/sglang/pull/11306
Update tool parser and related documentation by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11223
[router][grpc] Fix error message format in grpc chat handler by @CatherineSue in https://github.com/sgl-project/sglang/pull/11307
[quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in https://github.com/sgl-project/sglang/pull/11205
[router] support Openai router conversation API CRUD by @key4ng in https://github.com/sgl-project/sglang/pull/11297
[router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in https://github.com/sgl-project/sglang/pull/11311
[router] cleanup worker health check to return early by @slin1237 in https://github.com/sgl-project/sglang/pull/11310
[oai serving chat] Add argument --sampling-defaults and fix ChatCompletionRequest defaults by @CatherineSue in https://github.com/sgl-project/sglang/pull/11304
Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in https://github.com/sgl-project/sglang/pull/11200
ci: unify the model launch method of nightly ci by @mickqian in https://github.com/sgl-project/sglang/pull/11230
[Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in https://github.com/sgl-project/sglang/pull/10710
update sampling_params documentation with defaults by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11315
Optimize copy_kv_cache for spec decoding by @YAMY1234 in https://github.com/sgl-project/sglang/pull/11126
Rename ngram_utils -> ngram_info by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11316
[router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in https://github.com/sgl-project/sglang/pull/11314
[Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in https://github.com/sgl-project/sglang/pull/9545
[8/N] MoE Refactor: deprecate EPMoE by @ch-wan in https://github.com/sgl-project/sglang/pull/11211
Skip weight loading in deepgemm compilation by @ch-wan in https://github.com/sgl-project/sglang/pull/11312
[2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in https://github.com/sgl-project/sglang/pull/10937
[Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11321
fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in https://github.com/sgl-project/sglang/pull/11007
Support LoRA in bench_serving oai interface by @lifuhuang in https://github.com/sgl-project/sglang/pull/11318
benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in https://github.com/sgl-project/sglang/pull/9812
[CI] improve disaggregation CI. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11264
model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in https://github.com/sgl-project/sglang/pull/10909
[router] refactor generate to use new pipeline arch by @slin1237 in https://github.com/sgl-project/sglang/pull/11323
[router] improve reasoning parser lock and reduce req cloning by @slin1237 in https://github.com/sgl-project/sglang/pull/11336
[router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in https://github.com/sgl-project/sglang/pull/11340
[router] Fix all unused_qualifications by @CatherineSue in https://github.com/sgl-project/sglang/pull/11341
[router] Support history management using conversation by @key4ng in https://github.com/sgl-project/sglang/pull/11339
[router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in https://github.com/sgl-project/sglang/pull/11342
fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in https://github.com/sgl-project/sglang/pull/11327
[Auto Sync] Update scheduler.py (20251009) by @zhyncs in https://github.com/sgl-project/sglang/pull/11350
[Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in https://github.com/sgl-project/sglang/pull/10979
[router][grpc] disable health check generation and increase timeout by @slin1237 in https://github.com/sgl-project/sglang/pull/11353
[router] Refactor OpenAI router: split monolithic file and move location by @key4ng in https://github.com/sgl-project/sglang/pull/11359
[router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in https://github.com/sgl-project/sglang/pull/11366
[DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in https://github.com/sgl-project/sglang/pull/11309
[router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in https://github.com/sgl-project/sglang/pull/11373
add code pp support for nixl by @shaharmor98 in https://github.com/sgl-project/sglang/pull/11375
fix bench_serving mishandling of internal states by @shaharmor98 in https://github.com/sgl-project/sglang/pull/11376
[router][grpc] Replace fake health check with correct ones by @CatherineSue in https://github.com/sgl-project/sglang/pull/11387
[router] change grpc client from mutable to clone by @slin1237 in https://github.com/sgl-project/sglang/pull/11394
chore: upgrade flashinfer 0.4.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/11364
[router] conversation item API: create, retrieve and delete by @key4ng in https://github.com/sgl-project/sglang/pull/11369
chore: bump SGLang version to 0.5.3.post1 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11324
move more files under srt/utils by @merrymercy in https://github.com/sgl-project/sglang/pull/11285
[grammar] Avoid server crash when grammar backend is None by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11401
fix: fix gpu-proc affinity set incorrectly when pp_size > 1 by @acelyc111 in https://github.com/sgl-project/sglang/pull/11389
[Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded by @glenliu21 in https://github.com/sgl-project/sglang/pull/11365
[CI] Refactor PD disaggregation test suite by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11363
Replace pad with cat for better performance by @yuan-luo in https://github.com/sgl-project/sglang/pull/11388
fix: reinstall torch in deps install by @zhyncs in https://github.com/sgl-project/sglang/pull/11414
feat(hicache): Support passing prefix keys for l3 store. by @hzh0425 in https://github.com/sgl-project/sglang/pull/9045
fix file and object naming scheme in HiCacheNixl to avoid data corruption by @ziruiliu in https://github.com/sgl-project/sglang/pull/10969
Dedicated toml files for CPU/XPU by @ZailiWang in https://github.com/sgl-project/sglang/pull/10734
Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in https://github.com/sgl-project/sglang/pull/11144
chore: update pyproject by @zhyncs in https://github.com/sgl-project/sglang/pull/11420
fix: fix video input for qwen3-vl by @mickqian in https://github.com/sgl-project/sglang/pull/11361
perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in https://github.com/sgl-project/sglang/pull/11381
[HiCache] feat: add multi tenant with prefix tag by @stmatengss in https://github.com/sgl-project/sglang/pull/9256
[CI] Merge build-dev into workflow matrix by @csahithi in https://github.com/sgl-project/sglang/pull/11345
Revert "perf: optimize qwen-vl with symm mem allreduce" by @ch-wan in https://github.com/sgl-project/sglang/pull/11436
Revert "fix: fix video input for qwen3-vl" by @merrymercy in https://github.com/sgl-project/sglang/pull/11437
Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" by @scottjlee in https://github.com/sgl-project/sglang/pull/11433
[router] Fix ci nvcc not found error by @key4ng in https://github.com/sgl-project/sglang/pull/11411
feat(mooncake): support GB suffix for global_segment_size by @xiaguan in https://github.com/sgl-project/sglang/pull/10745
Separate allocation logic from scheduler by @cctry in https://github.com/sgl-project/sglang/pull/11313
[router] disable rate limiter by default by @slin1237 in https://github.com/sgl-project/sglang/pull/11435
[router] leverage RAII to actively cancel request during client disconnect by @slin1237 in https://github.com/sgl-project/sglang/pull/11399
[router][grpc] Consolidate parser checks for chat completions by @CatherineSue in https://github.com/sgl-project/sglang/pull/11439
Reorder PD disagg CI tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11438
fix: Change dsv32 hack temporary path to use system temp directory by @wxsms in https://github.com/sgl-project/sglang/pull/11445
Fix batch invariant ops by @hebiao064 in https://github.com/sgl-project/sglang/pull/11368
[BugFix] test_mla_fp8.py fails on Cublas 12.9 by @Liu-congo in https://github.com/sgl-project/sglang/pull/11360
[DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton by @byjiang1996 in https://github.com/sgl-project/sglang/pull/11450
Remove tilelang dependency in Dockerfile by @Fridge003 in https://github.com/sgl-project/sglang/pull/11455
Enable native ModelOpt quantization support (2/3) by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/9991
Reland [1/2] Optimizations and refactors about quant kernel by @fzyzcjy in https://github.com/sgl-project/sglang/pull/10312
Super tiny delete unused openai router in sgl-router by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11448
Adjust logits metadata init for target verify by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11467
[Documentation][Configuration] Server args and documentation of PD-Multiplexing. by @ykcombat in https://github.com/sgl-project/sglang/pull/11427
Fix enable_v2 in int8 quant by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11470
[Fix] Fix split prefill with fa3. by @ykcombat in https://github.com/sgl-project/sglang/pull/11428
fix stop when stream by @whybeyoung in https://github.com/sgl-project/sglang/pull/11462
Add option to disable any_whitespace for xgrammar and llguidance backends. by @lulor in https://github.com/sgl-project/sglang/pull/8919
[7/n] decouple quantization impl from vllm dependency - gguf kernel by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11019
fix Xeon CI by @ZailiWang in https://github.com/sgl-project/sglang/pull/11454
[CI] Add nightly builds to dockerhub by @csahithi in https://github.com/sgl-project/sglang/pull/9804
[Feature] support regex strings as a stopping condition by @glenliu21 in https://github.com/sgl-project/sglang/pull/10635
Beta spec-overlap for EAGLE by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11398
Piecewise CUDA Graph Support & Torch Compile Backend by @Oasis-Git in https://github.com/sgl-project/sglang/pull/10062
[Router]: Small Typo in a comment within tree.rs by @xuwenyihust in https://github.com/sgl-project/sglang/pull/11489
chore: bump sgl-kernel version to 0.3.16 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11476
[smol] [perf] Qwen3-VL in place op. by @vincentzed in https://github.com/sgl-project/sglang/pull/11481
[chore][1/N] Avoid using default mutable parameters by @kevin85421 in https://github.com/sgl-project/sglang/pull/11478
[bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends by @MahmoudAshraf97 in https://github.com/sgl-project/sglang/pull/10172
[perf] Replace json -> orjson in hot path by @vincentzed in https://github.com/sgl-project/sglang/pull/11221
[chore][2/N] Avoid using default mutable parameters by @kevin85421 in https://github.com/sgl-project/sglang/pull/11479
Fix the GPT function calling regex to allow dash in the name by @antoine-roux in https://github.com/sgl-project/sglang/pull/10577
bailingMoE: Fix Key error of deepep_mode by @QiuMike in https://github.com/sgl-project/sglang/pull/11465
Fix CI break by express-laned PRs. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11499
Move args from global_config to environ by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11332
move fla env check position by @yizhang2077 in https://github.com/sgl-project/sglang/pull/11500
Temporarily remove b200 tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11501
Fix port conflicts in CI by @merrymercy in https://github.com/sgl-project/sglang/pull/11497
temporarily remove b200 tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11502
Fix unit tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11503
Bugfix: Fix Type consistency for KV indices in SWARadixCache by @hzh0425 in https://github.com/sgl-project/sglang/pull/11452
doc: add doc for adding new models into nightly-ci by @mickqian in https://github.com/sgl-project/sglang/pull/11443
[CI] fix lint by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11509
Deprecate global_server_args_dict by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11331
chore: remove flashinfer cleanup cache by @zhyncs in https://github.com/sgl-project/sglang/pull/11514
fix: revert temporarily remove b200 tests by @zhyncs in https://github.com/sgl-project/sglang/pull/11515
[Fix] Improve longbench prompt and other logics by @byjiang1996 in https://github.com/sgl-project/sglang/pull/11474
Sync changes on io_struct.py and deterministic ops by @merrymercy in https://github.com/sgl-project/sglang/pull/11498
[lint] Fix the lint issue by @ch-wan in https://github.com/sgl-project/sglang/pull/11516
Revert "Deprecate global_server_args_dict" by @ch-wan in https://github.com/sgl-project/sglang/pull/11520
Improve dp attention port assignment scheme by @jokerwyt in https://github.com/sgl-project/sglang/pull/5889
[router] openai router: support grok model by @key4ng in https://github.com/sgl-project/sglang/pull/11511
docs(router): add token-bucket rate limiting to the docs by @Jonahcb in https://github.com/sgl-project/sglang/pull/11485
[sgl-kernel][1/N]Support Expert Specialization Grouped GEMM by @HydraQYH in https://github.com/sgl-project/sglang/pull/11432
Update DeepSeek-R1-FP4 default config on blackwell by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11512
[Fix]: add missing device attribute to ChunkCache by @leavelet in https://github.com/sgl-project/sglang/pull/11493
[Feature] Support mamba radix cache v0 by @yizhang2077 in https://github.com/sgl-project/sglang/pull/11214
ci: improve nightly-ci by @mickqian in https://github.com/sgl-project/sglang/pull/11385
[CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering by @BBuf in https://github.com/sgl-project/sglang/pull/11505
[HICache]: Support 3FS-Store with page_first_direct layout by @hzh0425 in https://github.com/sgl-project/sglang/pull/11460
Tiny fix test run estimated time by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11544
[Reland] perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in https://github.com/sgl-project/sglang/pull/11457
Deprecate global_server_args_dict by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11528
[Fix] Add per_channel_quant parameter to MoE config functions by @mmangkad in https://github.com/sgl-project/sglang/pull/11201
[router][ci] Add Nightly Release Workflow for SGLang Router by @slin1237 in https://github.com/sgl-project/sglang/pull/11527
[router] add tokenizer path to be dir by @slin1237 in https://github.com/sgl-project/sglang/pull/11530
Remove tp_worker.worker by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11548
fix: fix video input for qwen3-vl by @mickqian in https://github.com/sgl-project/sglang/pull/11442
[NVIDIA] BUMP FA3 by @johnnynunez in https://github.com/sgl-project/sglang/pull/11444
[Fix] Include grpc reflection runtime dependency by @ai-jz in https://github.com/sgl-project/sglang/pull/11419
Adjust overlap event loop by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11507
Move deep gemm related arguments to sglang.srt.environ by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11547
[router][grpc] Further delegate non-stream processing to processing.rs by @CatherineSue in https://github.com/sgl-project/sglang/pull/11553
[router] allow user to specify chat template path by @slin1237 in https://github.com/sgl-project/sglang/pull/11549
Minor: improve sampler & remove unused fields from model_config.py by @merrymercy in https://github.com/sgl-project/sglang/pull/11531
[router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter by @Jonahcb in https://github.com/sgl-project/sglang/pull/11483
Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in https://github.com/sgl-project/sglang/pull/11441
Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) by @trevor-m in https://github.com/sgl-project/sglang/pull/11557
[CI] Add Basic Test for DeepSeek V3.2 by @Fridge003 in https://github.com/sgl-project/sglang/pull/11308
[router][grpc] Add error handling to generate_tool_constraints by @CatherineSue in https://github.com/sgl-project/sglang/pull/11562
[NVIDIA] update pyproject.toml to support cu130 option by @johnnynunez in https://github.com/sgl-project/sglang/pull/11521
[CI Monitor] Ci monitor only deal with main branch in default by @BBuf in https://github.com/sgl-project/sglang/pull/11538
Tiny cleanup fp4 gemm calls by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11537
[router][grpc] Add serve_grpc to launch_server and log id for HealthCheck by @CatherineSue in https://github.com/sgl-project/sglang/pull/11564
[router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/11571
[sgl-kernel][2/N]Support Expert Specialization Grouped GEMM by @HydraQYH in https://github.com/sgl-project/sglang/pull/11534
chore: bump sgl-kernel version to 0.3.16.post1 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11573
Fix accept rate in speculative decoding metrics by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11572
Compilation Folder Reset by @Oasis-Git in https://github.com/sgl-project/sglang/pull/11539
[FEATURE] Add Profile Trace Merger for Distributed Traces by @neelabhsinha in https://github.com/sgl-project/sglang/pull/11413
[DSv32] Use torch.compile for _get_logits_head_gate by @trevor-m in https://github.com/sgl-project/sglang/pull/11565
Make DeepEP combine recv do not overlap by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11535
bench_serving support PD Disaggregation by @BBuf in https://github.com/sgl-project/sglang/pull/11542
Implement LRU eviction policy for LoRA adapters by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/11041
Revert "[NVIDIA] BUMP FA3 (#11444)" by @zhyncs in https://github.com/sgl-project/sglang/pull/11582
chore: bump sgl-kernel version to 0.3.16.post2 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11583
[Auto Sync] Update model_config.py (20251014) by @merrymercy in https://github.com/sgl-project/sglang/pull/11580
Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11587
[router][protocols] Add Axum validate extractor and use it for /v1/chat/completions endpoint by @CatherineSue in https://github.com/sgl-project/sglang/pull/11588
[router] update generate spec to align with sgl io struct by @slin1237 in https://github.com/sgl-project/sglang/pull/11591
[router] change worker api to async instead of sync by @slin1237 in https://github.com/sgl-project/sglang/pull/11566
Update news section in README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/11598
[router] delete useless table content comment in spec by @slin1237 in https://github.com/sgl-project/sglang/pull/11597
[router] allow router launch server to use grpc mode by @slin1237 in https://github.com/sgl-project/sglang/pull/11600
[Docs] [Router]: Update sg-router doc on circuit breaker by @xuwenyihust in https://github.com/sgl-project/sglang/pull/11449
[router] when given both local tokenizer and chat template, log all by @slin1237 in https://github.com/sgl-project/sglang/pull/11601
[AMD CI] Add image and weights caching. by @saienduri in https://github.com/sgl-project/sglang/pull/11593
Update release-docker-dev.yml by @sglang-bot in https://github.com/sgl-project/sglang/pull/11603
Optimize Triton Draft Backend by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11556
Refactor spec decoding metrics calculation into separate TokenizerManager utility function by @scottjlee in https://github.com/sgl-project/sglang/pull/11586
make radix cache deterministic by @skyzh in https://github.com/sgl-project/sglang/pull/10721
move eagle draft post process to cuda graph by @cicirori in https://github.com/sgl-project/sglang/pull/11434
Reduce one step decode for draft model. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11561
[router] add py binding and readme for openai router and history backend by @key4ng in https://github.com/sgl-project/sglang/pull/11453
[router] cleanup app context and move to startup by @slin1237 in https://github.com/sgl-project/sglang/pull/11617
[router] add chang and keyang to sgl router author by @slin1237 in https://github.com/sgl-project/sglang/pull/11620
use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. by @strgrb in https://github.com/sgl-project/sglang/pull/11605
[router] update router readme to latest features by @slin1237 in https://github.com/sgl-project/sglang/pull/11619
Fix log for chunked prefix cache by @Fridge003 in https://github.com/sgl-project/sglang/pull/11624
[Auto Sync] Update scheduler.py, server_args.py (20251014) by @merrymercy in https://github.com/sgl-project/sglang/pull/11623
[Auto Sync] Update collector.py (20251014) by @merrymercy in https://github.com/sgl-project/sglang/pull/11625
[Minor] Update xgrammar dependency by @DarkSharpness in https://github.com/sgl-project/sglang/pull/11622
Update install.md by @merrymercy in https://github.com/sgl-project/sglang/pull/11631
fix: Update SGL_KERNEL_VERSION to 0.3.15 by @zhyncs in https://github.com/sgl-project/sglang/pull/11633
[router][grpc] add warm up to grpc server by @slin1237 in https://github.com/sgl-project/sglang/pull/11627
Refactor kv cache free by @cctry in https://github.com/sgl-project/sglang/pull/11351
[router] update router doc to latest features by @slin1237 in https://github.com/sgl-project/sglang/pull/11639
fix: upgrade transformers to 4.57.1 by @csahithi in https://github.com/sgl-project/sglang/pull/11628
[router] add worker self discovery for metadata by @slin1237 in https://github.com/sgl-project/sglang/pull/11638
[router] upgrade to 0.2.0 by @slin1237 in https://github.com/sgl-project/sglang/pull/11642
[1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP by @UNIDY2002 in https://github.com/sgl-project/sglang/pull/10423
[1/N]Support DeepSeek-R1 w4a8 normal deepep by @ayrnb in https://github.com/sgl-project/sglang/pull/8247
[Fix] Fix accuracy bug in CSGMV kernel caching key. by @lifuhuang in https://github.com/sgl-project/sglang/pull/11579
feat: add add_chunked_prefix_cache_attention_backend by @zhyncs in https://github.com/sgl-project/sglang/pull/11636
Super tiny improve FA3 import error message by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11590
[BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl by @ZhengWG in https://github.com/sgl-project/sglang/pull/11458
[Doc] Update support matrix for attn and hybrid attn by @b8zhong in https://github.com/sgl-project/sglang/pull/11293
Clean up some Qwen3-Next and deterministic code by @hebiao064 in https://github.com/sgl-project/sglang/pull/11585
docs: update sglang installation guide by @zhyncs in https://github.com/sgl-project/sglang/pull/11659
Tiny cleanup some eagle unused codes by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11660
Fix 1-step draft model forward by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11653
[tool call] Fix prev_tool_call_arr management in base_format_detector.py by @CatherineSue in https://github.com/sgl-project/sglang/pull/11367
[router] Fix response api related spec by @key4ng in https://github.com/sgl-project/sglang/pull/11621
Fix missing json imports in serving_responses.py by @CatherineSue in https://github.com/sgl-project/sglang/pull/11681
[sgl-kernel][3/N]Support Expert Specialization Grouped GEMM by @HydraQYH in https://github.com/sgl-project/sglang/pull/11674
[sgl-kernel] Optimize gguf test by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11667
[router][grpc] Simplify model_id determination by @CatherineSue in https://github.com/sgl-project/sglang/pull/11684
[router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding by @slin1237 in https://github.com/sgl-project/sglang/pull/11676
chore: bump SGLang version to 0.5.3.post2 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11680
[CI][XPU]enable sglang CI on Intel XPU by @DiweiSun in https://github.com/sgl-project/sglang/pull/9493
enable rmsnorm on XPU by @huaiyuzh in https://github.com/sgl-project/sglang/pull/10248
Sync code and test CI; rename some env vars by @merrymercy in https://github.com/sgl-project/sglang/pull/11686
docs: Add Contributor Covenant Code of Conduct by @zhyncs in https://github.com/sgl-project/sglang/pull/11689
[Mamba] Increase default mamba_full_memory_ratio to 0.9 by @hanming-lu in https://github.com/sgl-project/sglang/pull/11679
[PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) by @ShangmingCai in https://github.com/sgl-project/sglang/pull/10912
[sgl-kernel] support hadamard by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11663
Fix missing a2a backend init of GLM4.5 MoE Block by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11692
Split test_intel_amx_attention_backend.py to pass CI of timeout by @yanbing-j in https://github.com/sgl-project/sglang/pull/11370
Set csgmv as default lora backend. by @lifuhuang in https://github.com/sgl-project/sglang/pull/11488
[Bugfix] Fix Qwen3/DSV3/DSV3.2 model support by @iforgetmyname in https://github.com/sgl-project/sglang/pull/11510
[CI] Add GLM4MoE model test by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11706
[router] fix get_models endpoint for openai router by @key4ng in https://github.com/sgl-project/sglang/pull/11687
[ci]use H20 to run disaggregation test by @HanHan009527 in https://github.com/sgl-project/sglang/pull/11543
chore: bump SGLang version to 0.5.3.post3 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11693
model: qwen3-omni (thinker-only) by @mickqian in https://github.com/sgl-project/sglang/pull/10911
[Router] Refactor protocol definitions: split spec.rs into modular files by @key4ng in https://github.com/sgl-project/sglang/pull/11677
[router] fix p and d worker filtering and bootstrap port handling by @slin1237 in https://github.com/sgl-project/sglang/pull/11729
[router][grpc] add disagg info to warm up in grpc server by @slin1237 in https://github.com/sgl-project/sglang/pull/11727
[router] Fix tool_choice normalization in ChatCompletionRequest and fix ut by @CatherineSue in https://github.com/sgl-project/sglang/pull/11731
Revert "make radix cache deterministic" by @Fridge003 in https://github.com/sgl-project/sglang/pull/11728
Reduce the image processing latency in VLM by @zhooooong in https://github.com/sgl-project/sglang/pull/11541
[router] add spec.rs to enable tests under spec folder by @key4ng in https://github.com/sgl-project/sglang/pull/11734
[router] Add rustfmt and set group imports by default by @CatherineSue in https://github.com/sgl-project/sglang/pull/11732
Revert "[router] fix get_models endpoint for openai router (#11687)" by @key4ng in https://github.com/sgl-project/sglang/pull/11740
[router][CI] Clean up deprecated fields in pr-test-pd-router.yml by @CatherineSue in https://github.com/sgl-project/sglang/pull/11739
[CI] Fix broken event loop creation by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11746
[overlap-spec] Make plan stream an option by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11724
ci: reduce and refactor vlm ut and combine test files by @mickqian in https://github.com/sgl-project/sglang/pull/11062
Abstraction for spec worker and code cleanup by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11643
add tuned fuse moe kernel for qwen3 235b fp8 on h200 by @pdasgup in https://github.com/sgl-project/sglang/pull/11730
Revert "Set csgmv as default lora backend. (#11488)" by @zhyncs in https://github.com/sgl-project/sglang/pull/11735
[router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder by @slin1237 in https://github.com/sgl-project/sglang/pull/11766
[router] fix grpc client time out to 1h by @slin1237 in https://github.com/sgl-project/sglang/pull/11768
[doc] update router document by @key4ng in https://github.com/sgl-project/sglang/pull/11767
[Feature] Reuse flashinfer workspace for PD-Multiplexing. by @ykcombat in https://github.com/sgl-project/sglang/pull/11540
Turn on shm_allreduce and shm_allgather for fp16 by @chunyuan-w in https://github.com/sgl-project/sglang/pull/10725
[Auto Sync] Update scheduler.py (20251017) by @zhyncs in https://github.com/sgl-project/sglang/pull/11738
[router][grpc] Remove timeout for connections and remove max_tokens deprecation warning log by @CatherineSue in https://github.com/sgl-project/sglang/pull/11775
Cleaning indexer for DeepSeek V3.2 by @Fridge003 in https://github.com/sgl-project/sglang/pull/11682
[minor] sync code on python/sglang/test/test_deterministic.py and improve ci tests by @merrymercy in https://github.com/sgl-project/sglang/pull/11777
[Auto Sync] Update common.py (20251017) by @merrymercy in https://github.com/sgl-project/sglang/pull/11782
[Fix] Skip visual layers when applying LoRA to Qwen2VL modules by @anvdn in https://github.com/sgl-project/sglang/pull/11519
[Lint] Add python/sglang to ruff F401 checks and remove unused imports in files by @CatherineSue in https://github.com/sgl-project/sglang/pull/11685
Super tiny fix missing input throughput by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11607
Support shared experts overlap in cutlass moe by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11611
Support casting bf16 NextN moe to fp8 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11613
Manually flip deepep_mode for cuda_graph by @zhuzilin in https://github.com/sgl-project/sglang/pull/11666
Set CUDA_VISIBLE_DEVICES to achieve one GPU per process by @merrymercy in https://github.com/sgl-project/sglang/pull/9170
Super tiny fix CI by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11788
Make single-batch overlap compatible with offloading by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11614
completely remove mixed mode deterministic test as prefix mode could cover it by @zminglei in https://github.com/sgl-project/sglang/pull/11783
[Refactor] move deep_gemm_wrapper out of quantization by @ch-wan in https://github.com/sgl-project/sglang/pull/11784
Enable lint on main by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11794
[router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client by @CatherineSue in https://github.com/sgl-project/sglang/pull/11798
Try add back no-commit-to-branch by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11799
fix(glm45): disable reduce scatter by @jinmingyi1998 in https://github.com/sgl-project/sglang/pull/11665
fix command line usage of profiling by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11793
[RL] support weight update with DP attention by @zhuzilin in https://github.com/sgl-project/sglang/pull/11669
[RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded by @zhuzilin in https://github.com/sgl-project/sglang/pull/10152
set default attention backend for deterministic inference by @zminglei in https://github.com/sgl-project/sglang/pull/11801
Eager Compiler for Torch Compile by @Oasis-Git in https://github.com/sgl-project/sglang/pull/11803
Fix install instructions and pyproject.tomls by @merrymercy in https://github.com/sgl-project/sglang/pull/11781
Bump torch_memory_saver to avoid installing pre-release versions by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11797
[HiCache] feat: add more eviction policy by @stmatengss in https://github.com/sgl-project/sglang/pull/11506
[overlap-spec] support page size > 1 by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11772
support server arg override KV cache to bf16 to avoid slow cases by @b8zhong in https://github.com/sgl-project/sglang/pull/11749
feat(example/fastapi): support --startup-timeout using Qwen3-Next-80B-A3B-Instruct as example by @Kindyaa in https://github.com/sgl-project/sglang/pull/11710
ci: update lmms-eval to speed up multimodal CI by @b8zhong in https://github.com/sgl-project/sglang/pull/11000
Use cutlass fp4 gemm by default by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11813
Fix Dockerfile not installing correct version of DeepEP for arm build by @kyleliang-nv in https://github.com/sgl-project/sglang/pull/11773
[router] Add Configurable L0 and L1 Tokenizer Caching by @slin1237 in https://github.com/sgl-project/sglang/pull/11688
[2/2] [feature] support openai like classification api in router by @whybeyoung in https://github.com/sgl-project/sglang/pull/11670
[1/2][feature] support openai like classification api by @whybeyoung in https://github.com/sgl-project/sglang/pull/11618
make sure logit bias is applied during eagle spec decoding verification by @petricevich in https://github.com/sgl-project/sglang/pull/11555
fix: do not wrap invalid grammar objects during constrained generation by @tazjin in https://github.com/sgl-project/sglang/pull/11328
Improve send_sone script by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11817
Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads by @YAMY1234 in https://github.com/sgl-project/sglang/pull/10788
Update CODEOWNERS for layer quantization path by @merrymercy in https://github.com/sgl-project/sglang/pull/11818
support tokenized batch request by @narutolhy in https://github.com/sgl-project/sglang/pull/11091
Tiny add hints when users send requests to wrong place by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11808
Make single-batch overlap compatible with NextN by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11804
Support not officially supported high sgl-kernel version with low srt version by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11786
Avoid generation gets hanging when user specifies multiple event loops by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5162
Change bf16 to fp8 for some gemms in attention for DeepSeek ckpt v2 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11805
Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11827
[overlap-spec] fix stop condition and trimming by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11819
[Spec Decoding] Support MTP for dsv3.2 by @Paiiiiiiiiiiiiii in https://github.com/sgl-project/sglang/pull/11652
[CI] always print back trace in retry() by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11834
[Test] Add basic matched stop for beta eagle by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11833
Deterministic Mode: Add 1-stage triton kernel for prefill by @hebiao064 in https://github.com/sgl-project/sglang/pull/11147
[logprobs] Enable local deterministic logprobs testing with strict threshold by @PrinsYin in https://github.com/sgl-project/sglang/pull/10994
[CI] Add CI test for DeepSeek V3.2 MTP by @Fridge003 in https://github.com/sgl-project/sglang/pull/11835
[NVIDIA] FA3/FA4 Fix by @johnnynunez in https://github.com/sgl-project/sglang/pull/11606
[DeepseekV32] Add fast_topk_transform_ragged_fused kernel by @hlu1 in https://github.com/sgl-project/sglang/pull/11815
Fix triton_kernels import error on some hardwares by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11831
Tiny bump DeepEP version in ARM blackwell by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11810
[BugFix] replace the input_to_float8 used in dsv2 by @Liu-congo in https://github.com/sgl-project/sglang/pull/11612
[Doc] Update documents for FA4 by @Fridge003 in https://github.com/sgl-project/sglang/pull/11778
fix(ci): Fix CI Monitor limit parameter and add CI Analysis to summary by @BBuf in https://github.com/sgl-project/sglang/pull/11832
Fix version bump script to handle TOML files with outdated versions by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/11787
Improve Kernel Build Time by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/11508
check master server for mooncake store by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/10510
chore: bump sgl-kernel version to 0.3.16.post3 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11733
Recapture cuda graph after model weight update to resolve IMA error by @harrisonlimh in https://github.com/sgl-project/sglang/pull/11780
[Feature] Use current greenctx stream to communicate in PD-Multiplexing. by @ykcombat in https://github.com/sgl-project/sglang/pull/11594
Support mrope triton kernel and add unit test by @yuan-luo in https://github.com/sgl-project/sglang/pull/11722
[PD] Improve eagle acceptance rate by transferring draft model hidden states by @ZeldaHuang in https://github.com/sgl-project/sglang/pull/10801
Tiny clean up for PD module and doc by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11747
Revert "[CI Monitor] Ci monitor only deal with main branch in default" by @BBuf in https://github.com/sgl-project/sglang/pull/11846
[Model] Add Olmo 3 model support by @2015aroras in https://github.com/sgl-project/sglang/pull/11396
Update amd gpu install docs. by @saienduri in https://github.com/sgl-project/sglang/pull/11849
[AMD CI] Populate image cache in nightly docker release. by @saienduri in https://github.com/sgl-project/sglang/pull/11822
fix(server_args): handle tokenizer init conflicts by @ishandhanani in https://github.com/sgl-project/sglang/pull/11776
[Feature] New structural tag support by @DarkSharpness in https://github.com/sgl-project/sglang/pull/10691
Tiny fix main lint by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11862
[9/N] MoE Refactor: cleanup dispatcher interfaces by @ch-wan in https://github.com/sgl-project/sglang/pull/11847
Fix acc len and gen throughput metrics when enabling overlap-spec by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11823
Replace function call with set literal by @penguin-wwy in https://github.com/sgl-project/sglang/pull/11867
Support mixing cutedsl and deepgemm backend by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11807
[router] Worker Management Workflow Engine by @slin1237 in https://github.com/sgl-project/sglang/pull/11868
[router] remove encoding header for oai router by @slin1237 in https://github.com/sgl-project/sglang/pull/11881
[Auto Sync] Update scheduler.py, server_args.py (20251020) by @merrymercy in https://github.com/sgl-project/sglang/pull/11875
[router][grpc] Remove continue_final_message in ChatTemplateParams and add minijinja-contrib by @CatherineSue in https://github.com/sgl-project/sglang/pull/11882
fix(sql-router): fix conflict port in test by @htiennv in https://github.com/sgl-project/sglang/pull/11826
[router] clean up workflow logs to debug for implementation details logs by @slin1237 in https://github.com/sgl-project/sglang/pull/11886
[code move] move pp into a separate mixin by @merrymercy in https://github.com/sgl-project/sglang/pull/11838
[router][grpc] Fix warm-up random token ids for small models by @CatherineSue in https://github.com/sgl-project/sglang/pull/11887
Revise MRotaryEmbedding's forward by @yuan-luo in https://github.com/sgl-project/sglang/pull/11859
piecewise cuda graph support qwen3-moe by @BBuf in https://github.com/sgl-project/sglang/pull/11845
Fix RotaryEmbedding for fp32 input by @zhangdonghao-zdh in https://github.com/sgl-project/sglang/pull/11843
Init attention backend for Intel XPU by @airMeng in https://github.com/sgl-project/sglang/pull/10656
Use trtllm_mla decode kernel for draft extend in speculative decoding by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11664
[router] release router 0.2.1 by @slin1237 in https://github.com/sgl-project/sglang/pull/11885
[AMD] Update wave-lang to 3.8.0 by @xintin in https://github.com/sgl-project/sglang/pull/11878
init support for KTransformers Heterogeneous Computing by @Atream in https://github.com/sgl-project/sglang/pull/11487
[FEATURE] Add OpenAI-Compatible LoRA Adapter Selection by @neelabhsinha in https://github.com/sgl-project/sglang/pull/11570
[fix] fix ci uv install dependency by @HanHan009527 in https://github.com/sgl-project/sglang/pull/11895
Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572] by @whybeyoung in https://github.com/sgl-project/sglang/pull/11416
Simplify multi-tokenizer by @zhengkezhou1 in https://github.com/sgl-project/sglang/pull/11295
[CI] disable glm4.1v and fix the flashinfer installation by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11902
vlm: enforce pybase64 for image and str encode/decode by @b8zhong in https://github.com/sgl-project/sglang/pull/10700
[smol] [perf] Inverse perm improvement by @vincentzed in https://github.com/sgl-project/sglang/pull/11482
[quantization][MoE] fix the check for tp_size / moe_ep_size / moe_intermediate_size / weight_block_size_n by @kevin85421 in https://github.com/sgl-project/sglang/pull/11702
[CI] Fix b200 flashinfer installation by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11915
Fix flush cache API for spec v2 by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11918
[NVIDIA] Add new SMs support for Spark & Thor by @Kh4L in https://github.com/sgl-project/sglang/pull/11287
Update sgl-kernel and remove fast hadamard dependency by @Fridge003 in https://github.com/sgl-project/sglang/pull/11844
Rename flashmla kernel options of nsa backend for better readability by @Fridge003 in https://github.com/sgl-project/sglang/pull/11876
chore: upgrade flashinfer 0.4.1 by @zhyncs in https://github.com/sgl-project/sglang/pull/11933
[BugFix][Qwen3-VL]: add metadata for video in qwen3-vl by @ZhengWG in https://github.com/sgl-project/sglang/pull/11377
[Auto Sync] Update forward_batch_info.py (20251021) by @zhyncs in https://github.com/sgl-project/sglang/pull/11934
Fix openai input_text type compatibility by @key4ng in https://github.com/sgl-project/sglang/pull/11935
fix: resolve flashinfer 0.4.1 import by @zhyncs in https://github.com/sgl-project/sglang/pull/11940
[router][grpc] Support v1/responses API by @CatherineSue in https://github.com/sgl-project/sglang/pull/11926
[router] Add gRPC E2E test suite by @key4ng in https://github.com/sgl-project/sglang/pull/11790
[router][grpc] Fix background tasks stored with wrong id by @CatherineSue in https://github.com/sgl-project/sglang/pull/11945
[lint] improve ruff check by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11922
[sgl-kernel] support flashmla libtorch by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11717
[NVIDIA] upstream FA4 and fix cccl path by @johnnynunez in https://github.com/sgl-project/sglang/pull/11929
Enable native ModelOpt quantization support (3/3) by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/10154
Fix mooncake dispatcher by @UNIDY2002 in https://github.com/sgl-project/sglang/pull/11908
[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank by @HanHan009527 in https://github.com/sgl-project/sglang/pull/10606
[model] Support POINTSV15Chat model by @josephydu in https://github.com/sgl-project/sglang/pull/9651
Fix flaky hicache test with mooncake backend by @ShangmingCai in https://github.com/sgl-project/sglang/pull/11953
[Fix] Remove unused import from triton_kernels_moe.py by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11967
[router] Support multiple worker URLs for OpenAI router by @key4ng in https://github.com/sgl-project/sglang/pull/11723
[Documentation] add doc for deterministic inference by @zminglei in https://github.com/sgl-project/sglang/pull/11956
[6/n]decouple quantization implementation from vLLM dependency by @Hongbosherlock in https://github.com/sgl-project/sglang/pull/10750
[BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… by @yuho8818 in https://github.com/sgl-project/sglang/pull/11977
Revert "Recapture cuda graph after model weight update to resolve IMA error " by @merrymercy in https://github.com/sgl-project/sglang/pull/11980
[NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel by @jiahanc in https://github.com/sgl-project/sglang/pull/11563
[router] create worker removal step and clean up worker manager by @slin1237 in https://github.com/sgl-project/sglang/pull/11921
Implement BGE-M3 Sparse Embeddings in SGLang by @approximated-intelligence in https://github.com/sgl-project/sglang/pull/10869
[Doc] Update deterministic inference flag in server_arguments.md by @Fridge003 in https://github.com/sgl-project/sglang/pull/11978
[grpc] Support gRPC standard health check by @CatherineSue in https://github.com/sgl-project/sglang/pull/11955
[AMD] Support a new flag to disable quant on parallelLinear layer if required by @yichiche in https://github.com/sgl-project/sglang/pull/11811
[ROCm] Remove vLLM rope dependency & use AITER impl by @b8zhong in https://github.com/sgl-project/sglang/pull/11322
[NVIDIA] Build CUDA 13 by @johnnynunez in https://github.com/sgl-project/sglang/pull/11299
Bump grace blackwell DeepEP version by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11990
[CPU] misc updates by @ZailiWang in https://github.com/sgl-project/sglang/pull/11906
fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 by @zheng1 in https://github.com/sgl-project/sglang/pull/11965
[CPU] Optimize FP16 decode_attention_cpu by @blzheng in https://github.com/sgl-project/sglang/pull/10652
Allow to disable batch decoding. by @LorrinWWW in https://github.com/sgl-project/sglang/pull/11944
Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend by @cicirori in https://github.com/sgl-project/sglang/pull/11985
aiter update to v0.1.6.post1 by @HaiShaw in https://github.com/sgl-project/sglang/pull/12004
Support overlap-spec-v2 with trtllm_mla attention backend by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/11821
Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 by @netanel-haber in https://github.com/sgl-project/sglang/pull/11866
[router] Add comprehensive E2E tests for Response API by @key4ng in https://github.com/sgl-project/sglang/pull/11988
[Router] Consolidate ConnectionMode enum to core module by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/11937
Move memory runtime checker to mixin class by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12014
Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/12015
[Fix] memory leak by overlap + retract by @cctry in https://github.com/sgl-project/sglang/pull/11981
[Feature] Support loading weights from ckpt engine worker by @stmatengss in https://github.com/sgl-project/sglang/pull/11755
[router] change ci names and update log level in ci by @slin1237 in https://github.com/sgl-project/sglang/pull/12021
Feature/nano v2 offline modelopt fp8 and nvfp4 by @netanel-haber in https://github.com/sgl-project/sglang/pull/12018
[Auto Sync] Update test_deterministic_utils.py (20251023) by @merrymercy in https://github.com/sgl-project/sglang/pull/12022
ci: fix night-ci with push retry mechanism by @mickqian in https://github.com/sgl-project/sglang/pull/11765
[router][CI] Clean up imports and prints statements in sgl-router/py_test by @CatherineSue in https://github.com/sgl-project/sglang/pull/12024
Add AWQ quantization support for NPU. by @ErvinXie in https://github.com/sgl-project/sglang/pull/10158
model: support deepseek-ocr by @mickqian in https://github.com/sgl-project/sglang/pull/11891
Log iteration # for prefill and decode by @nvcastet in https://github.com/sgl-project/sglang/pull/9366
Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" by @b8zhong in https://github.com/sgl-project/sglang/pull/12028
Fix mamba radix cache eviction logic in alloc_req_slots by @rogeryoungh in https://github.com/sgl-project/sglang/pull/11616
Update Github action title for kernel build by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/12029
[router] Add builder pattern for RouterConfig with zero duplication by @slin1237 in https://github.com/sgl-project/sglang/pull/12030
Fixed aarch64 flash-mla by @nvjullin in https://github.com/sgl-project/sglang/pull/12009
chore: bump SGLang version to 0.5.4 by @sglang-bot in https://github.com/sgl-project/sglang/pull/12027

New Contributors

@xuwenyihust made their first contribution in https://github.com/sgl-project/sglang/pull/11302
@ziruiliu made their first contribution in https://github.com/sgl-project/sglang/pull/10969
@scottjlee made their first contribution in https://github.com/sgl-project/sglang/pull/11144
@Liu-congo made their first contribution in https://github.com/sgl-project/sglang/pull/11360
@lulor made their first contribution in https://github.com/sgl-project/sglang/pull/8919
@antoine-roux made their first contribution in https://github.com/sgl-project/sglang/pull/10577
@QiuMike made their first contribution in https://github.com/sgl-project/sglang/pull/11465
@ai-jz made their first contribution in https://github.com/sgl-project/sglang/pull/11419
@neelabhsinha made their first contribution in https://github.com/sgl-project/sglang/pull/11413
@UNIDY2002 made their first contribution in https://github.com/sgl-project/sglang/pull/10423
@zhooooong made their first contribution in https://github.com/sgl-project/sglang/pull/11541
@pdasgup made their first contribution in https://github.com/sgl-project/sglang/pull/11730
@anvdn made their first contribution in https://github.com/sgl-project/sglang/pull/11519
@Kindyaa made their first contribution in https://github.com/sgl-project/sglang/pull/11710
@petricevich made their first contribution in https://github.com/sgl-project/sglang/pull/11555
@tazjin made their first contribution in https://github.com/sgl-project/sglang/pull/11328
@Paiiiiiiiiiiiiii made their first contribution in https://github.com/sgl-project/sglang/pull/11652
@2015aroras made their first contribution in https://github.com/sgl-project/sglang/pull/11396
@zhangdonghao-zdh made their first contribution in https://github.com/sgl-project/sglang/pull/11843
@xintin made their first contribution in https://github.com/sgl-project/sglang/pull/11878
@zhengkezhou1 made their first contribution in https://github.com/sgl-project/sglang/pull/11295
@Kh4L made their first contribution in https://github.com/sgl-project/sglang/pull/11287
@yuho8818 made their first contribution in https://github.com/sgl-project/sglang/pull/11977
@jiahanc made their first contribution in https://github.com/sgl-project/sglang/pull/11563
@approximated-intelligence made their first contribution in https://github.com/sgl-project/sglang/pull/10869
@zheng1 made their first contribution in https://github.com/sgl-project/sglang/pull/11965
@ErvinXie made their first contribution in https://github.com/sgl-project/sglang/pull/10158
@rogeryoungh made their first contribution in https://github.com/sgl-project/sglang/pull/11616
@nvjullin made their first contribution in https://github.com/sgl-project/sglang/pull/12009
