v0.10.2

Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully.

aarch64 support: This release features native support for aarch64 allowing usage of vLLM on GB200 platform. The docker image vllm/vllm-openai should already be multiplatform. To install the wheels, you can download the wheels from this release artifact or install via

uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto

Model Support

New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster LoRA startup performance (#23777).
Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and GLM4.5-V video frame decoding (#24161).

Engine Core

V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
Backend expansion: Terratorch backend integration (#23513) enabling non-language model tasks like semantic segmentation and geospatial applications with --model-impl terratorch support.
Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
Performance core improvements: --safetensors-load-strategy for NFS based file loading acceleration (#24469), critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and tensor core usage enforcement for FlashInfer decode (#23214).
Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).
Distributed: Support Decode Context Parallel (DCP) for MLA (#23734)

Hardware & Performance

NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs (#24266, #24330).

Quantization

New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
Breaking change: Removed original Marlin quantization format (#23204).

API & Frontend

OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and return_token_ids parameter (#22587).
Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
Frontend optimizations: Error stack traces with --log-error-stack (#22960), collective RPC endpoint (#23075), beam search concurrency optimization (#23599), unnecessary detokenization skipping (#24236), and custom media UUIDs (#23449).
Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).

Dependencies

Major updates: PyTorch 2.8.0 upgrade (#20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#24086), and FlashInfer 0.2.14.post1 maintenance update (#23537).
Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.

V0 Deprecation

Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).

Breaking Changes

PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
Quantizations - Removed quantized Mixtral hack implementation, and original Marlin format.
Metrics renaming - TPOT deprecated in favor of ITL

What's Changed

[Misc] Minor code cleanup for _get_prompt_logprobs_dict by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23064
[Misc] enhance static type hint by @andyxning in https://github.com/vllm-project/vllm/pull/23059
[Bugfix] fix Qwen2.5-Omni processor output mapping by @DoubleVII in https://github.com/vllm-project/vllm/pull/23058
[Bugfix][CI] Machete kernels: deterministic ordering for more cache hits by @andylolu2 in https://github.com/vllm-project/vllm/pull/23055
[Misc] refactor function name by @andyxning in https://github.com/vllm-project/vllm/pull/23029
[Misc] Fix backward compatibility from #23030 by @ywang96 in https://github.com/vllm-project/vllm/pull/23070
[XPU] Fix compile size for xpu by @jikunshang in https://github.com/vllm-project/vllm/pull/23069
[XPU][CI]add xpu env vars in CI scripts by @jikunshang in https://github.com/vllm-project/vllm/pull/22946
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23053
[Bugfix] fix IntermediateTensors equal method by @andyxning in https://github.com/vllm-project/vllm/pull/23027
[Refactor] Get prompt updates earlier by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23097
chore: remove unnecessary patch_padding_side for the chatglm model by @carlory in https://github.com/vllm-project/vllm/pull/23090
[Bugfix] Support compile for Transformers multimodal by @zucchini-nlp in https://github.com/vllm-project/vllm/pull/23095
[CI Bugfix] Pin openai<1.100 to unblock CI by @mgoin in https://github.com/vllm-project/vllm/pull/23118
fix: OpenAI SDK compat (ResponseTextConfig) by @h-brenoskuk in https://github.com/vllm-project/vllm/pull/23126
Use Blackwell FlashInfer MXFP4 MoE by default if available by @mgoin in https://github.com/vllm-project/vllm/pull/23008
Install tpu_info==0.4.0 to fix core dump for TPU by @xiangxu-google in https://github.com/vllm-project/vllm/pull/23135
[Misc] Minor refactoring for prepare_inputs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23116
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23041
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code by @tdoublep in https://github.com/vllm-project/vllm/pull/23122
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/22871
[V0 Deprecation] Remove V0 FlashInfer attention backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/22776
chore: disable enable_cpp_symbolic_shape_guards by @xiszishu in https://github.com/vllm-project/vllm/pull/23048

New Contributors

@DoubleVII made their first contribution in https://github.com/vllm-project/vllm/pull/23058
@carlory made their first contribution in https://github.com/vllm-project/vllm/pull/23090
@nikheal2 made their first contribution in https://github.com/vllm-project/vllm/pull/22725
@Tialo made their first contribution in https://github.com/vllm-project/vllm/pull/23172
@myselvess made their first contribution in https://github.com/vllm-project/vllm/pull/23084
@yiz-liu made their first contribution in https://github.com/vllm-project/vllm/pull/23169
@ultmaster made their first contribution in https://github.com/vllm-project/vllm/pull/22587
@KilJaeeun made their first contribution in https://github.com/vllm-project/vllm/pull/22790
@wzshiming made their first contribution in https://github.com/vllm-project/vllm/pull/23242
@misrasaurabh1 made their first contribution in https://github.com/vllm-project/vllm/pull/20413
@yannqi made their first contribution in https://github.com/vllm-project/vllm/pull/23246
@jaredoconnell made their first contribution in https://github.com/vllm-project/vllm/pull/23306
@paulpak58 made their first contribution in https://github.com/vllm-project/vllm/pull/22845
@zhuangqh made their first contribution in https://github.com/vllm-project/vllm/pull/23309
@tvalentyn made their first contribution in https://github.com/vllm-project/vllm/pull/23270
@arjunbreddy22 made their first contribution in https://github.com/vllm-project/vllm/pull/22495
@philipchung made their first contribution in https://github.com/vllm-project/vllm/pull/17149
@FoolPlayer made their first contribution in https://github.com/vllm-project/vllm/pull/23241
@namanlalitnyu made their first contribution in https://github.com/vllm-project/vllm/pull/23375
@hickeyma made their first contribution in https://github.com/vllm-project/vllm/pull/23353
@PapaGoose made their first contribution in https://github.com/vllm-project/vllm/pull/23337
@bppps made their first contribution in https://github.com/vllm-project/vllm/pull/23366
@AzizCode92 made their first contribution in https://github.com/vllm-project/vllm/pull/23416
@fengli1702 made their first contribution in https://github.com/vllm-project/vllm/pull/22527
@FFFfff1FFFfff made their first contribution in https://github.com/vllm-project/vllm/pull/23408
@ayushsatyam146 made their first contribution in https://github.com/vllm-project/vllm/pull/23171
@patemotter made their first contribution in https://github.com/vllm-project/vllm/pull/23574
@Terrencezzj made their first contribution in https://github.com/vllm-project/vllm/pull/23584

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.10.1.1...v0.10.2rc3

Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully.

uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto

Model Support

New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster LoRA startup performance (#23777).
Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and GLM4.5-V video frame decoding (#24161).

Engine Core

V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
Backend expansion: Terratorch backend integration (#23513) enabling non-language model tasks like semantic segmentation and geospatial applications with --model-impl terratorch support.
Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
Performance core improvements: --safetensors-load-strategy for NFS based file loading acceleration (#24469), critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and tensor core usage enforcement for FlashInfer decode (#23214).
Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).
Distributed: Support Decode Context Parallel (DCP) for MLA (#23734)

Hardware & Performance

NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs (#24266, #24330).

Quantization

New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
Breaking change: Removed original Marlin quantization format (#23204).

API & Frontend

OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and return_token_ids parameter (#22587).
Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
Frontend optimizations: Error stack traces with --log-error-stack (#22960), collective RPC endpoint (#23075), beam search concurrency optimization (#23599), unnecessary detokenization skipping (#24236), and custom media UUIDs (#23449).
Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).

Dependencies

Major updates: PyTorch 2.8.0 upgrade (#20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#24086), and FlashInfer 0.2.14.post1 maintenance update (#23537).
Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.

V0 Deprecation

Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).

Breaking Changes

PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
Quantizations - Removed quantized Mixtral hack implementation, and original Marlin format.
Metrics renaming - TPOT deprecated in favor of ITL

What's Changed

[Misc] Minor code cleanup for _get_prompt_logprobs_dict by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23064
[Misc] enhance static type hint by @andyxning in https://github.com/vllm-project/vllm/pull/23059
[Bugfix] fix Qwen2.5-Omni processor output mapping by @DoubleVII in https://github.com/vllm-project/vllm/pull/23058
[Bugfix][CI] Machete kernels: deterministic ordering for more cache hits by @andylolu2 in https://github.com/vllm-project/vllm/pull/23055
[Misc] refactor function name by @andyxning in https://github.com/vllm-project/vllm/pull/23029
[Misc] Fix backward compatibility from #23030 by @ywang96 in https://github.com/vllm-project/vllm/pull/23070
[XPU] Fix compile size for xpu by @jikunshang in https://github.com/vllm-project/vllm/pull/23069
[XPU][CI]add xpu env vars in CI scripts by @jikunshang in https://github.com/vllm-project/vllm/pull/22946
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23053
[Bugfix] fix IntermediateTensors equal method by @andyxning in https://github.com/vllm-project/vllm/pull/23027
[Refactor] Get prompt updates earlier by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23097
chore: remove unnecessary patch_padding_side for the chatglm model by @carlory in https://github.com/vllm-project/vllm/pull/23090
[Bugfix] Support compile for Transformers multimodal by @zucchini-nlp in https://github.com/vllm-project/vllm/pull/23095
[CI Bugfix] Pin openai<1.100 to unblock CI by @mgoin in https://github.com/vllm-project/vllm/pull/23118
fix: OpenAI SDK compat (ResponseTextConfig) by @h-brenoskuk in https://github.com/vllm-project/vllm/pull/23126
Use Blackwell FlashInfer MXFP4 MoE by default if available by @mgoin in https://github.com/vllm-project/vllm/pull/23008
Install tpu_info==0.4.0 to fix core dump for TPU by @xiangxu-google in https://github.com/vllm-project/vllm/pull/23135
[Misc] Minor refactoring for prepare_inputs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23116
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23041
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code by @tdoublep in https://github.com/vllm-project/vllm/pull/23122
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/22871
[V0 Deprecation] Remove V0 FlashInfer attention backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/22776
chore: disable enable_cpp_symbolic_shape_guards by @xiszishu in https://github.com/vllm-project/vllm/pull/23048

New Contributors

@DoubleVII made their first contribution in https://github.com/vllm-project/vllm/pull/23058
@carlory made their first contribution in https://github.com/vllm-project/vllm/pull/23090
@nikheal2 made their first contribution in https://github.com/vllm-project/vllm/pull/22725
@Tialo made their first contribution in https://github.com/vllm-project/vllm/pull/23172
@myselvess made their first contribution in https://github.com/vllm-project/vllm/pull/23084
@yiz-liu made their first contribution in https://github.com/vllm-project/vllm/pull/23169
@ultmaster made their first contribution in https://github.com/vllm-project/vllm/pull/22587
@KilJaeeun made their first contribution in https://github.com/vllm-project/vllm/pull/22790
@wzshiming made their first contribution in https://github.com/vllm-project/vllm/pull/23242
@misrasaurabh1 made their first contribution in https://github.com/vllm-project/vllm/pull/20413
@yannqi made their first contribution in https://github.com/vllm-project/vllm/pull/23246
@jaredoconnell made their first contribution in https://github.com/vllm-project/vllm/pull/23306
@paulpak58 made their first contribution in https://github.com/vllm-project/vllm/pull/22845
@zhuangqh made their first contribution in https://github.com/vllm-project/vllm/pull/23309
@tvalentyn made their first contribution in https://github.com/vllm-project/vllm/pull/23270
@arjunbreddy22 made their first contribution in https://github.com/vllm-project/vllm/pull/22495
@philipchung made their first contribution in https://github.com/vllm-project/vllm/pull/17149
@FoolPlayer made their first contribution in https://github.com/vllm-project/vllm/pull/23241
@namanlalitnyu made their first contribution in https://github.com/vllm-project/vllm/pull/23375
@hickeyma made their first contribution in https://github.com/vllm-project/vllm/pull/23353
@PapaGoose made their first contribution in https://github.com/vllm-project/vllm/pull/23337
@bppps made their first contribution in https://github.com/vllm-project/vllm/pull/23366
@AzizCode92 made their first contribution in https://github.com/vllm-project/vllm/pull/23416
@fengli1702 made their first contribution in https://github.com/vllm-project/vllm/pull/22527
@FFFfff1FFFfff made their first contribution in https://github.com/vllm-project/vllm/pull/23408
@ayushsatyam146 made their first contribution in https://github.com/vllm-project/vllm/pull/23171
@patemotter made their first contribution in https://github.com/vllm-project/vllm/pull/23574
@Terrencezzj made their first contribution in https://github.com/vllm-project/vllm/pull/23584

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.10.1.1...v0.10.2rc3

vllm

Highlights

Model Support

Engine Core

Hardware & Performance

Quantization

API & Frontend

Dependencies

V0 Deprecation

Breaking Changes

What's Changed

New Contributors

More Python Projects

AutoGPT

stable-diffusion-webui

transformers

yt-dlp

More Python Projects

AutoGPT

stable-diffusion-webui

transformers

yt-dlp

v0.10.2

Highlights

Model Support

Engine Core

Hardware & Performance

Quantization

API & Frontend

Dependencies

V0 Deprecation

Breaking Changes

What's Changed

New Contributors