v0.14.0
Highlights
This release features approximately 660 commits from 251 contributors (86 new contributors).
Breaking Changes:
- Async scheduling is now enabled by default; users who experience issues can disable it with `--no-async-scheduling` (see the sketch after this list). Excludes some not-yet-supported configurations: pipeline parallelism, the CPU backend, and non-MTP/Eagle speculative decoding.
- PyTorch 2.9.1 is now required, and the default wheel is compiled against CUDA 12.9 (`cu129`).
- Deprecated quantization schemes have been removed (#31688, #31285).
- When using speculative decoding, requests with unsupported sampling parameters now fail with an error rather than being silently ignored (#31982).
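For anyone hit by the new default, here is a minimal sketch of opting back out. The `--no-async-scheduling` flag is the documented path for `vllm serve`; the `async_scheduling` keyword shown for offline inference is an assumption based on vLLM's usual CLI-flag-to-engine-arg mapping, and the model name is a placeholder.

```python
# Server: vllm serve <model> --no-async-scheduling  (documented above)
#
# Offline inference: `async_scheduling=False` is an *assumed* engine-arg
# mirror of the CLI flag, following vLLM's usual naming convention.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder model
    async_scheduling=False,
)
print(llm.generate(["Hello, world"])[0].outputs[0].text)
```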
Key Improvements:
- Async scheduling enabled by default (#27614): Overlaps engine core scheduling with GPU execution, improving throughput without user configuration. Now also works with speculative decoding (#31998) and structured outputs (#29821).
- gRPC server entrypoint (#30190): An alternative to the REST API, using a binary protocol with HTTP/2 multiplexing.
- `--max-model-len auto` (#29431): Automatically fits the context length to available GPU memory, eliminating OOM startup failures.
- Model inspection view (#29450): View the modules, attention backends, and quantization of your model in vLLM by setting `VLLM_LOG_MODEL_INSPECTION=1` or by simply printing the `LLM` object (see the sketch after this list).
- Model Runner V2 enhancements: UVA block tables (#31965), M-RoPE (#32143), `logit_bias`/`allowed_token_ids`/`min_tokens` support (#32163). Please note that Model Runner V2 is still experimental and disabled by default.
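The inspection view needs no code changes beyond the environment variable. A minimal sketch; the model name is a placeholder, and setting the variable before engine construction is an assumption about when it is read:

```python
import os

# Enable the model inspection view before the engine is constructed;
# the variable name comes from the release notes above.
os.environ["VLLM_LOG_MODEL_INSPECTION"] = "1"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # placeholder model

# Per the notes, printing the LLM object also renders the modules,
# attention backends, and quantization of the loaded model.
print(llm)
```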
Model Support
New Model Architectures:
- Grok-2 with tiktoken tokenizer (#31847)
- LFM2-VL vision-language model (#31758)
- MiMo-V2-Flash (#30836)
- openPangu MoE (#28775)
- IQuestCoder (#31575)
- Nemotron Parse 1.1 (#30864)
- GLM-ASR audio (#31436)
- Isaac vision model v0.1/v0.2 (#28367, #31550)
- Kanana-1.5-v-3b-instruct (#29384)
- K-EXAONE-236B-A23B MoE (#31621)
LoRA Support Expansion:
- Multimodal tower/connector LoRA (#26674): LLaVA (#31513), BLIP2 (#31620), PaliGemma (#31656), Pixtral (#31724), DotsOCR (#31825), GLM4-V (#31652)
- DeepSeek-OCR (#31569), Qwen3-Next (#31719), NemotronH (#31539), PLaMo 2/3 (#31322)
- Vision LoRA mm_processor_cache support (#31927)
- MoE expert base_layer loading (#31104)
Model Enhancements:
- Qwen3-VL as reranker (#31890)
- DeepSeek v3.2 chat prefix completion (#31147)
- GLM-4.5/GLM-4.7 `enable_thinking: false` (#31788)
- Ernie4.5-VL video timestamps (#31274)
- Score template expansion (#31335)
- LLaMa4 vision encoder compilation (#30709)
- NemotronH quantized attention (#31898)
Engine Core
- Async scheduling default with spec decode (#27614, #31998) and structured outputs (#29821)
- Hybrid allocator + KV connector (#30166) with multiple KV cache groups (#31707)
- Triton attention: encoder-only/cross attention (#31406), cross-layer blocks (#30687)
- Mamba2 prefix cache optimization (#28047)
- Batch invariant LoRA (#30097)
- LoRA name in BlockStored for KV-cache reconstruction (#27577)
- Request ID collision prevention (#27987)
- Dense model DP without overhead (#30739)
- Async + spec decode penalties/bad_words (#30495)
Hardware & Performance
CUTLASS MoE Optimizations:
- 2.9% throughput gain and 10.8% TTFT improvement via a `fill(0)` optimization (#31754)
- 5.3% throughput gain and 2.2% TTFT improvement via problem-size calculation (#31830)
- Fused SiLU+Mul+Quant for NVFP4 (#31832)
- NVFP4 stride fusion (#31837)
Other Performance:
- GDN attention decode speedup (Qwen3-Next) (#31722)
- Fused RoPE + MLA KV-cache write (#25774)
- Sliding window attention optimization (#31984)
- FlashInfer DeepGEMM swapAB SM90 (#29213)
- Unpermute-aware fused MoE + small-batch fallback (#29354)
- GDN Attention blocking copy removal (#31167)
- FusedMoE LoRA small rank performance (#32019)
- EPLB numpy optimization (#29499)
- FlashInfer rotary for DeepSeek (#30729)
- Vectorized activations (#29512)
- NUMA interleaved memory (#30800)
- Async spec decode logprobs (#31336)
Hardware Configs:
- SM103 support (#30705, #31150)
- B300 Blackwell MoE configs (#30629)
- Qwen3-Next FP8 CUTLASS configs (#29553)
- Qwen3Moe B200 Triton configs (#31448)
- GLM-4.5/4.6 RTX Pro 6000 kernels (#31407)
- MiniMax-M2/M2.1 QKNorm (#31493)
- NVFP4 small batch tuning (#30897)
Platform:
- ROCm: AITER RMSNorm fusion (#26575), MTP for AITER MLA (#28624), moriio connector (#29304), xgrammar upstream (#31327)
- XPU: FP8 streaming quant (#30944), custom workers (#30935)
- CPU: Head sizes 80/112 (#31968), async disabled by default (#31525), LoRA MoE CPU pinning (#31317)
- TPU: tpu-inference path (#30808), Sophgo docs (#30949)
Large Scale Serving
- XBO (Extended Dual-Batch Overlap) (#30120)
- NIXL asymmetric TP (prefill tensor-parallel size > decode) (#27274)
- NIXL heterogeneous BlockSize/kv_layout (#30275)
- Cross-layers KV layout for MultiConnector (#30761)
- Mooncake protocol expansion (#30133)
- LMCache KV cache registration (#31397)
- EPLB default all2all backend (#30559)
Quantization
- Marlin for Turing (sm75) (#29901, #31000)
- Quark int4-fp8 w4a8 MoE (#30071)
- MXFP4 W4A16 dense models (#31926)
- ModelOpt FP8 variants (FP8_PER_CHANNEL_PER_TOKEN, FP8_PB_WO) (#30957)
- ModelOpt KV cache quantization update (#31895)
- NVFP4 Marlin for NVFP4A16 MoEs (#30881)
- Static quant all group shapes (#30833)
- Default MXFP4 LoRA backend: Marlin (#30598)
- compressed-tensors 0.13.0 (#30799)
API & Frontend
New Features:
- gRPC server (#30190)
- `--max-model-len auto` (#29431)
- Model inspection view (#29450)
- Offline FastAPI docs (#30184)
- `attention_config` in `LLM()` (#30710)
- MFU metrics (#30738)
- Iteration logging + NVTX (#31193)
- `reasoning_effort` parameter (#31956), sketched below
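A hedged sketch of the new `reasoning_effort` parameter through vLLM's OpenAI-compatible server; this assumes it is accepted on chat completions the way the OpenAI SDK defines it, and the base URL and model name are placeholders.

```python
from openai import OpenAI

# Point the standard OpenAI client at a local vLLM server (placeholder URL).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # placeholder reasoning model
    messages=[{"role": "user", "content": "How many primes are below 30?"}],
    reasoning_effort="low",  # assumed values: "low" | "medium" | "high"
)
print(resp.choices[0].message.content)
```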
Tool Calling:
- FunctionGemma parser (#31218)
- GLM-4.7 parser (#30876)
- Kimi K2 update (#31207)
CLI:
- `-ep` for `--enable-expert-parallel` (#30890)
- Complete help messages (#31226)
- Bench serve auto-discovery + `--input-len` (#30816)
- Spec decode acceptance stats (#31739)
- `--enable-log-deltas` (renamed) (#32020)
- `--default-chat-template-kwargs` (#31343)
API:
- `/server_info` env info (#31899)
- MCP streaming in Responses API (#31761)
- `/embeddings` `continue_final_message` (#31497)
- Reranking score templates (#30550)
- Chat template warmup (#30700)
- Configurable handshake timeout (#27444)
- Better 500 errors (#20610)
- Worker init logging (#29493)
- Bench error reporting (#31808)
- Corrupted video recovery (#29197)
- Spec-decode param validation (#31982)
- Validation error metadata (#30134)
Security
- Prevent token leaks in crash logs (#30751)
- `weights_only=True` in `torch.load` (#32045)
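For context, `weights_only=True` restricts `torch.load`'s unpickler to tensors and other allow-listed types, so a tampered checkpoint cannot execute arbitrary code during deserialization. A minimal illustration; the file path is a placeholder:

```python
import torch

# weights_only=True confines unpickling to tensors and allow-listed
# types, blocking arbitrary code execution from a malicious checkpoint.
state_dict = torch.load("checkpoint.pt", weights_only=True)  # placeholder path
```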
Dependencies
- PyTorch 2.9.1 (#28495)
- compressed-tensors 0.13.0 (#30799)
- CUDA 13 LMCache/NIXL in Docker (#30913)
- Configurable NVSHMEM version (#30732)
Bug Fixes (User-Facing)
- Invalid UTF-8 tokens (#28874)
- CPU RoPE gibberish with `--enforce-eager` (#31643)
- Tool call streaming finish chunk (#31438)
- Encoder cache leak that left CPU scheduling stuck (#31857)
- Engine crash: tools + response_format (#32127)
- Voxtral transcription API (#31388)
- Safetensors download optimization (#30537)
Deprecations
- Deprecated quantization schemes removed (#31688, #31285)
- `seed_everything` deprecated (#31659)
Documentation
- vllm-metal plugin docs (#31174)
- Claude Code example (#31188)
- CustomOp developer guide (#30886)
New Contributors
- @penfree made their first contribution in https://github.com/vllm-project/vllm/pull/30237
- @jiangkuaixue123 made their first contribution in https://github.com/vllm-project/vllm/pull/30120
- @jr-shen made their first contribution in https://github.com/vllm-project/vllm/pull/29663
- @grzegorz-k-karch made their first contribution in https://github.com/vllm-project/vllm/pull/30795
- @shanjiaz made their first contribution in https://github.com/vllm-project/vllm/pull/30799
- @Somoku made their first contribution in https://github.com/vllm-project/vllm/pull/29569
- @baoqian426 made their first contribution in https://github.com/vllm-project/vllm/pull/30841
- @SongDI911 made their first contribution in https://github.com/vllm-project/vllm/pull/30852
- @www-spam made their first contribution in https://github.com/vllm-project/vllm/pull/30827
- @Xunzhuo made their first contribution in https://github.com/vllm-project/vllm/pull/30844
- @TheCodeWrangler made their first contribution in https://github.com/vllm-project/vllm/pull/30700
- @SungMinCho made their first contribution in https://github.com/vllm-project/vllm/pull/30738
- @sarathc-cerebras made their first contribution in https://github.com/vllm-project/vllm/pull/30188
- @wzyrrr made their first contribution in https://github.com/vllm-project/vllm/pull/30949
- @navmarri14 made their first contribution in https://github.com/vllm-project/vllm/pull/30629
- @HaloWorld made their first contribution in https://github.com/vllm-project/vllm/pull/30867
- @jeffreywang-anyscale made their first contribution in https://github.com/vllm-project/vllm/pull/31013
- @AmeenP made their first contribution in https://github.com/vllm-project/vllm/pull/31093
- @westers made their first contribution in https://github.com/vllm-project/vllm/pull/31071
- @CedricHwong made their first contribution in https://github.com/vllm-project/vllm/pull/30957
- @c0de128 made their first contribution in https://github.com/vllm-project/vllm/pull/31114
- @Bounty-hunter made their first contribution in https://github.com/vllm-project/vllm/pull/30242
- @jzakrzew made their first contribution in https://github.com/vllm-project/vllm/pull/30550
- @1643661061leo made their first contribution in https://github.com/vllm-project/vllm/pull/30760
- @NickCao made their first contribution in https://github.com/vllm-project/vllm/pull/30070
- @amithkk made their first contribution in https://github.com/vllm-project/vllm/pull/31212
- @gateremark made their first contribution in https://github.com/vllm-project/vllm/pull/31218
- @Tiiiktak made their first contribution in https://github.com/vllm-project/vllm/pull/31274
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.13.0...v0.14.0