Dynamo 0.8.0 continues the journey toward production-grade LLM serving with a Kubernetes-native architecture, expanded multimodal and agentic support, and enterprise-ready observability. This release reduces infrastructure complexity while providing a seamless experience regardless of which LLM framework you choose:
SGLang
TRT-LLM
vLLM
Kubernetes-Native Infrastructure
To address the scaling limits imposed by etcd and NATS, Dynamo 0.8.0 makes both optional for the discovery and request planes: Kubernetes-native service discovery via EndpointSlices replaces etcd, and a transport-agnostic request plane with TCP as the default replaces NATS. Validation webhooks catch CRD errors at submission time, and the operator manages health checks and scaling directly. These changes leverage Kubernetes primitives rather than working around them.
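Under the hood, discovery consumes the standard Kubernetes EndpointSlice API. A sketch of the kind of object the discovery plane watches (names, labels, and the port are illustrative, not Dynamo's actual schema):

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: vllm-worker-abc12                    # generated per Service
  labels:
    kubernetes.io/service-name: vllm-worker  # ties the slice to its Service
addressType: IPv4
endpoints:
  - addresses: ["10.0.3.17"]
    conditions:
      ready: true                            # only ready endpoints receive traffic
ports:
  - name: dynamo-rpc                         # illustrative port name
    port: 9345
    protocol: TCP
```

Because EndpointSlices are maintained by Kubernetes itself, worker registration and removal track pod lifecycle without a separate etcd registry.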
Multimodal Support
Dynamo 0.8.0 expands multimodal support across all backends. Audio inference for vLLM enables models like Qwen2-Audio, and a new frontend video decoder handles video input with configurable frame sampling for SGLang, TRT-LLM, and vLLM. Llama4 multimodal now works in disaggregated prefill/decode mode, and KV-aware routing supports multimodal requests end-to-end for TensorRT-LLM. Security controls allow operators to restrict multimodal content sources in production environments.
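Configurable frame sampling in the frontend video decoder can be pictured with a minimal sketch like the following (function and parameter names are hypothetical, not Dynamo's API): decimate the source frame rate down to a target rate, then cap the result at a frame budget.

```python
def sample_frames(total_frames: int, target_fps: float,
                  source_fps: float, max_frames: int) -> list[int]:
    """Pick frame indices: decimate source_fps toward target_fps, cap at max_frames."""
    step = max(1, round(source_fps / target_fps))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Re-sample uniformly across the already-decimated set.
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices
```

For a 10-second, 30 fps clip sampled at 1 fps with an 8-frame budget, this yields 8 indices spread across the video rather than the first 8 seconds.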
Agentic Workflows
As AI applications evolve from single-turn inference toward autonomous agents that reason, plan, and take action, Dynamo is building the infrastructure to support these workflows. Tool calling is now available for DeepSeek V3/R1/V3.2, Qwen3 Coder, and Jamba model families—the models powering today's most capable agents. Named and Required tool choice modes give explicit control over tool selection, and schema-aware type conversion ensures parameter values match their declared types. The nvext extension field provides worker_id, TTFT, and per-request timing for debugging multi-step agent pipelines.
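Schema-aware type conversion for tool-call arguments can be sketched as follows (a minimal illustration, not Dynamo's actual parser): string values produced by the model are coerced to the types declared in the tool's JSON-schema parameters.

```python
def coerce_arguments(args: dict, schema: dict) -> dict:
    """Convert string-valued tool-call arguments to their declared schema types."""
    converters = {
        "integer": int,
        "number": float,
        "boolean": lambda v: str(v).lower() in ("true", "1"),
    }
    out = {}
    for key, value in args.items():
        declared = schema.get("properties", {}).get(key, {}).get("type")
        convert = converters.get(declared)
        if convert is not None and isinstance(value, str):
            try:
                value = convert(value)
            except ValueError:
                pass  # leave unparseable values as-is rather than failing the call
        out[key] = value
    return out
```

A model that emits `{"n": "3"}` for a tool declaring `n` as an integer thus hands the application `{"n": 3}`.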
Disaggregated Serving Performance
Prefill/decode disaggregation at scale requires efficient coordination. Local KV indexers for SGLang and TRT-LLM reduce coordination overhead with the central indexer, while dynamic rejection thresholds and early rejection protect decode workers from overload. Request cancellation now propagates cleanly during prefill-to-decode transitions, and frontend-based prefill routing for SGLang simplifies the deployment topology. Non-blocking radix snapshots and async-first NIXL APIs improve transfer throughput across workers.
Production Observability and Resilience
Operating LLM infrastructure requires visibility and fault tolerance. Unified distributed tracing now propagates context from frontend through SGLang and vLLM backends, enabling end-to-end request debugging. A new Planner Grafana dashboard provides real-time SLA monitoring, and per-request metrics include prefill timing and KV cache hit rates. A complete CUDA fault injection framework enables GPU resilience testing in Kubernetes, allowing teams to validate recovery behavior before failures occur in production.
Multi-LoRA Serving
Dynamo 0.8.0 introduces comprehensive multi-LoRA serving for vLLM backends. Deterministic adapter ID generation enables consistent routing across replicas, while new management APIs and a local registry simplify adapter lifecycle. KV-aware routing now extends to LoRA requests, allowing the router to consider both prompt prefix and adapter state when selecting workers. Ready-to-use Kubernetes examples with MinIO sync demonstrate production-grade LoRA deployment patterns.
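Deterministic adapter IDs are easy to picture: derive the ID from stable adapter metadata so every replica maps the same adapter to the same ID without coordination. A hedged sketch (the hash choice and fields are illustrative, not Dynamo's actual scheme):

```python
import hashlib

def adapter_id(name: str, source_uri: str) -> int:
    """Derive a stable integer ID from the adapter's name and artifact location."""
    digest = hashlib.sha256(f"{name}@{source_uri}".encode()).digest()
    # Keep the ID in a positive 31-bit range, a common constraint for engine slot IDs.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF
```

Because the ID is a pure function of the adapter's identity, a router can dispatch LoRA requests consistently across replicas without a shared ID allocator.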
First-Time Contributors
@yuekaizhang contributed a PR that adds vLLM multimodal audio support for Qwen2-Audio models (#2760)!
@Dilu-Bilu contributed a PR that adds a guide for Speculative Decoding in vLLM using Eagle3 (#3895)!
@sozercan contributed a PR that updates the AKS deployment guide (#3651)!
@nv-oviya contributed a PR that adds the CUDA fault injection library foundation (#4038)!
@flpanbin contributed a PR that adds dynamic default max_tokens support for vLLM backend (#4156)!
@AryanBagade contributed a PR that adds output token counter to frontend metrics (#4202)!
@Spycsh contributed a PR that enables Intel Gaudi accelerators on Dynamo (#4209)!
@tangcy98 contributed a PR that adds tool call parser support for DeepSeek V3 and R1 (#4253)!
@chandlj contributed a PR that allows users to set --kv-transfer-config for vLLM (#4317)!
@2ez4bz contributed a PR that enables autodeploy as a backend for TRT-LLM (#4347)!
@zhongxuanwang-nv contributed a PR that adds the nvext extension field to OpenAI APIs with worker_id reporting (#4372)!
@vladnosiv contributed a PR that fixes KV events config in aggregated router SGLang example (#4391)!
@dmitrygx contributed a PR that fixes IPv6 support for SGLang ZMQ endpoint (#4403)!
@nancya-nv contributed a PR that fixes model registration for SGLang multimodal workers (#4512)!
@Monokaix contributed a PR that makes hostnames more descriptive and simplifies DNS check commands (#4551)!
@c-fteixeira contributed a PR that disables etcd PodDisruptionBudget by default in Helm (#4602)!
@hypdeb contributed a PR that fixes vLLM deprecation of disable_log_requests (#4659)!
@esoba contributed a PR that adds logprobs support to TRT-LLM backend (#4759)!
@gtbai contributed a PR that installs Run:ai model streamer for vLLM (#4848)!
@MatejKosec contributed a PR that fixes vLLM multi-node support for TP and DP modes (#5006)!
Major Features & Improvements
Infrastructure Modernization
etcd Dependency Removal
Kubernetes-native service discovery replaces etcd dependency for simpler K8s deployments.
Kubernetes-Native Service Discovery: Introduced pluggable discovery system (#4070) with Kubernetes-native implementation via EndpointSlices (#4136), made K8s discovery the default (#5024), and added instance unregistration for clean scaling (#4459).
etcd-free Router and Operator: Updated Operator for etcd-less operation (#4214), made Router use the discovery pattern instead of etcd (#4244, #4597), removed etcd client dependency (#4489), and disabled etcd PodDisruptionBudget by default (#4602).
FileStore Auto-Expiring Leases: File-based key-value store entries now support automatic expiration with configurable TTL for local development without etcd. (#4301)
Remove Static Mode: Removed static endpoint functionality. Deployments using static endpoints must migrate to discovery-based endpoints. (#4235)
Namespace Computation: Normalized Dynamo namespace computation for consistent service discovery. (#5231)
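The auto-expiring FileStore leases (#4301) behave like a TTL key-value store. A minimal in-memory sketch of the semantics only (not the actual FileStore API):

```python
import time

class LeaseStore:
    """Entries carry a deadline; reads lazily drop entries whose TTL has elapsed."""
    def __init__(self):
        self._entries = {}  # key -> (value, deadline)

    def put(self, key, value, ttl_s):
        self._entries[key] = (value, time.monotonic() + ttl_s)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, deadline = entry
        if time.monotonic() >= deadline:
            del self._entries[key]  # lease expired: purge on read
            return None
        return value
```

A worker that stops refreshing its lease simply disappears from reads after the TTL, which is how expiring leases stand in for etcd's lease mechanism in local development.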
NATS Dependency Removal
Dynamo 0.8.0 introduces a transport-agnostic request plane, enabling deployments without NATS for simpler infrastructure.
Transport-Agnostic Request Plane: Introduced transport-agnostic request plane (#4246), added --request-plane CLI flag for tcp, http, or nats selection (#4365), and made TCP the default transport (#4845).
NATS Infrastructure Cleanup: Made NATS metrics conditional on NATS usage (#4442), removed legacy stats handler (#4680), added Helm option to fully disable NATS deployment (#5035), and cleaned up internal NATS code (#4513, #4591).
Decentralized Router with NATS Core: Added support for NATS Core event routing mode as an alternative to JetStream. (#4921)
Multimodality Support
Expanded multimodal support across all backends with video, audio, and improved media handling.
vLLM Multimodal Audio: Added audio support for multimodal inference with Qwen2-Audio models (#2760), enabled efficient decoded media transfer via NIXL (#3988), and added security controls for multimodal requests (#4556).
Frontend Video Decoder: Added video decoder in the frontend preprocessor (#4719) with runtime-configurable settings for frame sampling and memory limits (#5011). Supports all multimodal backends (SGLang, TRT-LLM, vLLM).
Llama4 Multimodal Disaggregated Support: Migrated Llama4 multimodal support to disaggregated serving architecture. (#4213)
KV-Aware Routing Multimodal Support: Added multimodal support to KV-aware routing with standalone TRT-LLM example. (#4577)
OpenAI API
Enhanced OpenAI-compatible API with tool calling support for popular models.
DeepSeek Tool Calling: Added tool call parser for DeepSeek V3 and R1 (#4253), chat template support for V3.2 (#4797), and V3.2 tool calling support (#4822).
Qwen3 Coder Tool Parser: Added support for the Qwen3Coder tool-call format with detection and parsing of tool calls. (#4415)
Tool Choice Support: Added support for Named and Required tool choice modes, enabling explicit control over which tools the model uses. (#4722)
Tool Definitions to Parsers: Tool definitions with parameter metadata can now be supplied to improve parsing accuracy. Parameter values are automatically converted to correct types based on schemas. (#4948)
prompt_tokens_details Support: Added prompt_tokens_details field in usage response for detailed token accounting. (#4239)
nvext Extension Field: Added nvext extension field to OpenAI APIs with worker_id reporting (#4372), and added TTFT and total request time (#4880).
include_stop_str_in_output Support: Added support for include_stop_str_in_output field in completions. (#4924)
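As an illustration of the Named tool choice mode above, a request body following the standard OpenAI schema (the model name and tool are examples only):

```json
{
  "model": "deepseek-ai/DeepSeek-V3",
  "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}}
      }
    }
  }],
  "tool_choice": {"type": "function", "function": {"name": "get_weather"}}
}
```

With a named `tool_choice`, the model must call the specified tool; `"tool_choice": "required"` forces a call to some tool without naming one.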
Version Upgrades
SGLang 0.5.6.post2 + Upstream Runtime Container: Updated SGLang to 0.5.6.post2 and switched to upstream SGLang runtime container, reducing divergence from upstream and improving long-term maintainability. (#4227, #4716, #4762)
TensorRT-LLM 1.2.0rc6.post1: Updated TensorRT-LLM to 1.2.0rc6.post1 with CUDA 13 support. (#4405, #4645, #4836, #5138, #5356)
vLLM 0.12.0: Updated vLLM to 0.12.0 with CUDA 13 support, bringing upstream performance improvements and expanded GPU compatibility. (#4476, #4736, #4997)
NIXL 0.8.0 + UCX 1.20: Updated NIXL to 0.8.0 and UCX to 1.20, enabling KVBM to work with both CUDA 12 and CUDA 13, with build fixes for dual CUDA version support. (#4281, #5000, #5007)
AIConfigurator 0.5.0: Updated AIConfigurator to version 0.5.0 to match Dynamo 0.8.0. (#5039)
mistral.rs CUDA 13: Upgraded mistral.rs to support CUDA 13. (#4474)
minijinja 2.14.0: Updated minijinja templating library to version 2.14.0. (#4949)
CUDA 13 Container Builds: Added CUDA 13 builds for vLLM and SGLang containers. (#5041)
Performance, Framework, & Multi-Hardware Support
Intel Gaudi Support: Enabled Intel Gaudi accelerators on Dynamo with example configuration for multi-worker setup with KV-aware routing. (#4209)
Dynamic max_tokens: Added support for dynamic default max_tokens in vLLM backend, computed from model capacity and input length. (#4156)
TRT-LLM Autodeploy Backend: Enabled autodeploy as a backend option for TRT-LLM with automatic configuration validation. (#4347)
Logprobs Support: Added logprobs support to both vLLM (#4697) and TRT-LLM (#4759) backends for token-level probability information.
libfabric Transport: Added libfabric transport support for high-performance networking. (#4407)
Run:ai Model Streamer: Installed Run:ai model streamer for vLLM to load models from local paths. (#4848)
Mistral-3-Large Support: Added support for Mistral-3-Large model with automatic format detection. (#4885)
Fault Tolerance & Observability
Unified Tracing: Implemented distributed trace propagation from Dynamo to SGLang (#4248) and vLLM (#4918), enabling end-to-end request tracing across the stack.
OTEL Flag for SGLang: Added --enable-otel flag to SGLang launch scripts for OpenTelemetry support. (#4243)
Planner Grafana Dashboard: Added Grafana dashboard for real-time monitoring of SLA Planner performance metrics. (#4815)
Planner Metrics: Added more metrics to Planner for better observability. (#4710)
Cached Tokens Metric: Added cached_tokens Prometheus metric for prefix cache hit tracking. (#4534)
KVBM Cache Hit Rate: Added reporting for KVBM cache hit rate metrics. (#4333)
Request Migration Metrics: Added request migration metrics tracking by model and migration type. (#5029)
Unified Per-Request Metrics: Added prefill_wait_time_ms, prefill_time_ms, and kv_hit_rate to the nvext timing response. (#5004)
Canary Health Check Default: Enabled health checks by default in the runtime configuration. (#4368)
Health Check Auto-Enable in K8s: Health checks are now disabled by default outside Kubernetes and auto-enabled in K8s via the operator. (#4804)
Graceful Exit Handling: Improved graceful exit handling; health check returns 503 during shutdown. (#4914)
Kubernetes Deployment
Fault Injection Framework
CUDA Fault Injection Framework: Added complete GPU fault injection framework for resilience testing, including core library, testing utilities, API service, GPU fault injector agent, Kubernetes manifests, and runtime toggling without pod restarts. (#4038, #4040, #4041, #4042, #4043, #4044, #4679)
Operator & CRD Improvements
Validation Webhooks: Added validating webhooks for CRD validation at submission time. (#4416)
Custom GPU Type in CRD: Added custom GPU type field to CRD for specifying GPU types like Intel Xe. (#4408)
DGD Scaling Adapter: Added scaling adapter for DynamoGraphDeployment integrated with Planner for dynamic scaling, disabled by default for stability. (#4699, #4825, #5180)
maxSurge/maxUnavailable: Enabled maxSurge/maxUnavailable configuration for Deployment-backed components via annotations. (#4990)
DGD Service Status Replicas: Added replica information to DGD service status for monitoring deployment health. (#4863)
useMocker Field: Added useMocker field to DGDR for testing and validation without GPU backends. (#4813)
KServe Endpoints: Added metrics endpoint to KServe gRPC service and readiness endpoints (ServerLive, ServerReady, ModelReady). (#4400, #4708)
PVC Mounting for DGDR: Added optional PVC mounting option to DGDR for profiling output. (#4503)
KV Block Manager
TRT-LLM KV Consolidator: Extended KV Event Consolidator to support TensorRT-LLM for event deduplication. (#4533)
Worker-Local KvIndexer: Added worker-local KvIndexer in KvEventPublisher with configurable buffering. (#4519)
Local Indexers for SGLang/TRT-LLM: Enabled local indexers for SGLang and TRT-LLM via --enable-local-indexer flag. (#4932)
Non-Blocking Radix Snapshot: Made radix snapshot upload non-blocking for improved performance. (#4839)
NIXL Concurrency: Improved NIXL concurrency support with async-first APIs. (#4433)
NIXL Descriptor Serializable: Made NIXL descriptor serializable for KV cache transfer. (#4222)
Scheduling
Router
Max Tree Size Pruning: Added max tree size based pruning for router decisions with configurable thresholds. (#4057)
Route to Available Instances: KV-aware routing now routes to available instances only. (#4225)
Request Cancellation P→D: Added request cancellation when transitioning from Prefill to Decode at KV Prefill Router. (#4449)
Dynamic Rejection Thresholds: Added dynamic setting of thresholds for rejection via HTTP endpoints. (#4673)
Early Rejection: Added early rejection based on active prefill tokens. (#4837)
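Early rejection by active prefill tokens can be sketched as a simple admission gate (illustrative only; in Dynamo the thresholds are set dynamically via HTTP endpoints, #4673):

```python
class PrefillAdmission:
    """Track tokens currently in prefill and reject new work past a budget."""
    def __init__(self, budget_tokens: int):
        self.budget_tokens = budget_tokens
        self.active_tokens = 0

    def try_admit(self, prompt_tokens: int) -> bool:
        if self.active_tokens + prompt_tokens > self.budget_tokens:
            return False  # reject early, before the request queues behind others
        self.active_tokens += prompt_tokens
        return True

    def complete(self, prompt_tokens: int) -> None:
        self.active_tokens -= prompt_tokens  # prefill finished; free the budget
```

Rejecting at admission time lets the client retry another worker immediately instead of waiting behind a saturated prefill queue.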
Planner
Mocker-Planner Integration: Integrated mocker with planner profiler data, enabling realistic performance simulation for capacity planning without running inference workloads. (#4370, #4422, #4590, #4651)
Profiler WebUI: Added interactive web interface for profiler configuration with GPU cost input for TCO analysis, improved UX, and explanatory visualizations. (#4544, #4935, #4968, #4998)
MQA + MoE Support: Added support for MQA + MoE (Qwen3 MoE) TEP/DEP in Planner Profiler. (#4612)
AIConfigurator 0.4 Support: Dynamo Planner profiler now uses hf_id for AIConfigurator 0.4 compatibility. (#4167)
Multi-LoRA Support
LoRA Management: Added comprehensive LoRA adapter support with deterministic ID generation, management APIs and local registry, Kubernetes deployment example, and ready-to-use public LoRA model examples. (#4457, #4464, #4644, #4714, #4807)
KV-Aware LoRA Routing: Added KV-aware LoRA request routing for vLLM with deployment scripts. (#4810)
Recipes
Qwen3-235B-A22B-FP8 Recipes: Added Qwen3-235B-A22B-FP8 recipes for aggregated and disaggregated modes with subsequent updates. (#4179, #5255)
GAIE Recipes: Updated GAIE (Gateway API Inference Extension) recipes with disaggregated serving support. (#4756, #4761)
Aggregated vs Disaggregated Comparison: Added recipe comparing aggregated round-robin routing and disaggregated KV-aware routing with results video. (#5021, #5022)
Bug Fixes
Top bug fixes organized by area.
Infrastructure Modernization
KeyValueStore etcd impl: Fixed to use internal etcd client correctly. (#4212)
K8s Native Discovery Fixes: Multiple fixes for Kubernetes-native service discovery, made K8s-backed discovery the default, fixed discovery diff bug, and added support for dynamic metadata updates. (#4305, #4505, #4988, #5024, #5341)
TCP Ingress Performance: Added zero-copy decoder for handling high concurrency and request bursts on TCP transport. (#5376)
Health Check Warnings: Fixed spurious warning messages during health check operations. (#4793)
Graceful Worker Shutdown: Workers now complete in-flight requests before shutting down. (#4838)
Kubernetes Deployment
DYN_SYSTEM_ENABLED Injection: Fixed environment variable injection for system components. (#4282)
Operator Reconciliation: Fixed reconciliation for user edits, namespace handling, cluster-wide operator, pod normalization, and added EndpointSlices RBAC for Planner. (#4470, #4574, #4853, #4868, #5213)
GAIE Recipe Fixes: Multiple fixes for Gateway API Inference Extension (GAIE) recipes including failure threshold and multi-GPU configurations. (#4260, #4313, #4324, #4445, #4472, #4475, #4525, #4557, #4643, #5027)
vLLM Multi-Node TP/DP: Fixed vLLM multi-node deployments for both tensor/pipeline parallel and data parallel modes. (#5006)
Tolerations Apply Everywhere: Fixed Helm chart to ensure tolerations apply to all pods. (#5094)
Recipe Namespace/PVC Fixes: Multiple fixes for recipe namespace, PVC configurations, DGD YAML, and model cache paths. (#4252, #4289, #4290, #4302, #4410, #4471, #4481, #4517, #4566, #5125, #5296)
EPP etcd-less RBAC: Created RBAC structure for EndpointPicker (EPP) etcd-less deployment. (#5373, #5385)
DeepEP Kernel Path: Fixed incorrect path for deep_ep kernel. (#4205)
DSR1 DEP Memory: Increased GPU memory utilization to 0.95 for DeepSeek R1 DEP. (#5199)
KV Block Manager
Logging Init Defer: Fixed startup ordering when OTEL export is enabled. (#4123)
Port Collision: Fixed port collision in KVBM connector. (#4411)
KVBM Known Issues
KVBM in Dynamo 0.8.0 has several known limitations:
vLLM Disaggregated Mode - TTFT Degradation Under Load
In vLLM disaggregated mode, TTFT degrades under load in disk offload scenarios. A fix is in progress.
TensorRT-LLM - Performance Degradation Due to Unconstrained NUMA Domains
When using KVBM with TensorRT-LLM, performance degrades compared to running without KVBM because some processes start in NUMA domains other than those closest to the devices running the model. This does not affect single-CPU-socket systems or systems with a single NUMA domain. If observed, the issue can be mitigated by setting the environment variable TLLM_NUMA_AWARE_WORKER_AFFINITY=1. See #8805 for more details.
CPU Offload Corruption Under High Concurrency with vLLM KVBM connector
When using KVBM CPU offload (DYN_KVBM_CPU_CACHE_GB) with vLLM 0.12.0 under high concurrency, KV cache blocks restored from CPU memory may contain corrupted or incorrect data, causing the model to generate wrong or garbled responses. This appears to be a race condition or synchronization issue during cache restoration. Workaround: Disable CPU offload by leaving DYN_KVBM_CPU_CACHE_GB unset and use only disk offload (DYN_KVBM_DISK_CACHE_GB).
Planner Known Issues
Planner Scaling Test Flaky Under High Load
Under high request load, scaling operations (e.g., 1P1D → 2P1D) may fail to reach expected replica counts. The Dynamo runtime may drop connections before streams are fully established, resulting in "Stream disconnected... recreating stream" errors.
EndpointPicker (EPP) Known Issues
Helm Chart EPP Installation Fails to Build
When building the Dynamo custom EPP using the Helm chart installation path, the build script fails with "No patch file found" error. Fix planned for 0.8.1.
LMCache Known Issues
LMCache Incompatible with CUDA 13
LMCache examples fail with "No module named lmcache" when running on CUDA 13 builds. Workaround: Use CUDA 12.x builds if LMCache functionality is required.
LMCache B200 Corrupted Output
On B200 GPUs, LMCache KV cache restoration may return corrupted output. This is an upstream vLLM/LMCache issue, not a Dynamo bug. Workaround: Use KVBM instead of LMCache on B200, or use H100/A100 GPUs with LMCache.
SGLang Known Issues
SGLang WideEP NIXL Backend Initialization Failure
SGLang WideEP deployments (e.g., sglang/ds1-wideep) may fail to launch the decoder worker with NIXL_ERR_BACKEND errors. This is an upstream SGLang issue, not a Dynamo bug. Workaround: Use --disaggregation-transfer-backend mooncake instead of nixl.
SGLang DeepSeek R1 Disaggregated 16-GPU Recipe JIT Compilation Failure
The DeepSeek R1 disaggregated 16-GPU recipe (dsr1-disagg-16gpu) fails with an "unknown variant 'abort'" error during JIT compilation. Workaround: Set the environment variable SGLANG_JIT_DEEPGEMM_PRECOMPILE=false and redeploy.
vLLM Known Issues
vLLM DeepSeek R1 Hopper 16-GPU Gibberish Output
The DeepSeek R1 vLLM Hopper 16-GPU recipe may return gibberish output. This is an upstream vLLM issue that occurs without Dynamo in vLLM v0.12.0 and v0.13.0 containers. Tracked in vllm#32190.
GPT-OSS-120B Fails on H100/H200 with vLLM Backend
The GPT-OSS-120B model fails on H100/H200 GPUs due to an upstream vLLM v0.12.0 regression affecting MXFP4 quantization on Hopper GPUs. Tracked in vllm#28894. Workaround: Set export VLLM_MXFP4_USE_MARLIN=1 before launching. This fix is included in the next vLLM release (vllm#30528). For the latest guidance, refer to the vLLM GPT-OSS documentation.
TRT-LLM Known Issues
Llama4 Multimodal with Precomputed Embeddings Fails
Llama4 models (Llama-4-Scout-17B-16E, Llama-4-Maverick-17B-128E) cannot run multimodal inference using precomputed embeddings with the TRT-LLM backend. This is an upstream TRT-LLM issue affecting TRT-LLM v1.2.0rc3 and later.
TRT-LLM Wide EP GB200 Recipe Does Not Run
The trtllm/disagg/wide_ep/gb200 recipe for DeepSeek R1 returns an empty model list from the frontend. The frontend service lacks the model-cache PVC mount that the prefill/decode workers have, so it cannot access model config files. Fix planned for 0.8.1.
TRT-LLM RuntimeError on AWS GB200 Clusters
The GPT-OSS-120B TRT-LLM recipe fails with RuntimeError:800 on certain AWS GB200 clusters. The recipe works correctly on OCI K8s and standard AWS K8s environments. The root cause is under investigation; it may be related to GPU NVLink connectivity in specific AWS configurations.
What's Next
As we look ahead to Dynamo v0.9.0, we are focused on completing our Kubernetes-native infrastructure transition and expanding performance benchmarks. Planned highlights include:
Dynamo v0.9.0 (Release Date: February 11, 2026):
Complete NATS Removal: Eliminate NATS dependency for the KV events plane, making Dynamo fully deployable without external message brokers.
Separation of Frontend and Router: Decouple the frontend processor from the router for more flexible deployment topologies.
KVBM Object Store Support: Add object store backend for distributed KV cache persistence across nodes.
Multimodal KV-Aware Routing: Extend KV-aware routing with hash-based support for multimodal requests in TRT-LLM.
Topology-Aware Serving: Full support for topology-aware serving including GB200 NVL72 configurations.
Consolidated Benchmarks: Published benchmarks for Qwen3-32B with disaggregated serving, KV-aware routing, and KVBM.
In H1 2026: Looking toward Dynamo 1.0 at GTC '26, we will expand capabilities across five key areas:
Performance: AIConfigurator improvements for all backends, fully composed recipes combining KV-aware routing, disaggregated serving, and KV cache offloading.
Production Grade Serving: Hierarchical Planner for heterogeneous worker pools, request rejection for overload protection, fast recovery with continuous availability, and WideEP fault tolerance.
Kubernetes Platform: Grove topology-aware orchestration with GB200 automation, ModelExpress performance optimization for model loading.
Agentic Workflows: Predictive routing with proactive load balancing, intelligent KV cache retention for high-reuse sessions, and KV cache offloading/prefetching for tool calls.
Multimodality and Diffusion: Multimodal hash router support for vLLM and SGLang, E/P/D disaggregation optimization, and support for SGLang Diffusion/Omni and vLLM Omni.