Dynamo 0.8.0 continues the journey toward production-grade LLM serving with a Kubernetes-native architecture, expanded multimodal and agentic support, and enterprise-ready observability. This release reduces infrastructure complexity while providing a seamless experience regardless of which LLM framework you choose:
SGLang
TRT-LLM
vLLM
Kubernetes-Native Infrastructure
To address the scaling limits imposed by etcd and NATS, Dynamo 0.8.0 makes both optional for the discovery and request planes: Kubernetes-native service discovery via EndpointSlices replaces etcd, and a transport-agnostic request plane with TCP as the default replaces NATS. Validation webhooks catch CRD errors at submission time, and the operator manages health checks and scaling directly. These changes leverage Kubernetes primitives rather than working around them.
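Under the hood, discovery consumes the standard Kubernetes EndpointSlice API. A sketch of the kind of object the discovery plane watches (names, labels, and the port are illustrative, not Dynamo's actual schema):

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: vllm-worker-abc12                    # generated per Service
  labels:
    kubernetes.io/service-name: vllm-worker  # ties the slice to its Service
addressType: IPv4
endpoints:
  - addresses: ["10.0.3.17"]
    conditions:
      ready: true                            # only ready endpoints receive traffic
ports:
  - name: dynamo-rpc                         # illustrative port name
    port: 9345
    protocol: TCP
```

Because EndpointSlices are maintained by Kubernetes itself, worker registration and removal track pod lifecycle without a separate etcd registry.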
Multimodal Support
Dynamo 0.8.0 expands multimodal support across all backends. Audio inference for vLLM enables models like Qwen2-Audio, and a new frontend video decoder handles video input with configurable frame sampling for SGLang, TRT-LLM, and vLLM. Llama4 multimodal now works in disaggregated prefill/decode mode, and KV-aware routing supports multimodal requests end-to-end for TensorRT-LLM. Security controls allow operators to restrict multimodal content sources in production environments.
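Configurable frame sampling in the frontend video decoder can be pictured with a minimal sketch like the following (function and parameter names are hypothetical, not Dynamo's API): decimate the source frame rate down to a target rate, then cap the result at a frame budget.

```python
def sample_frames(total_frames: int, target_fps: float,
                  source_fps: float, max_frames: int) -> list[int]:
    """Pick frame indices: decimate source_fps toward target_fps, cap at max_frames."""
    step = max(1, round(source_fps / target_fps))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Re-sample uniformly across the already-decimated set.
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices
```

For a 10-second, 30 fps clip sampled at 1 fps with an 8-frame budget, this yields 8 indices spread across the video rather than the first 8 seconds.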
Agentic Workflows
As AI applications evolve from single-turn inference toward autonomous agents that reason, plan, and take action, Dynamo is building the infrastructure to support these workflows. Tool calling is now available for DeepSeek V3/R1/V3.2, Qwen3 Coder, and Jamba model families—the models powering today's most capable agents. Named and Required tool choice modes give explicit control over tool selection, and schema-aware type conversion ensures parameter values match their declared types. The nvext extension field provides worker_id, TTFT, and per-request timing for debugging multi-step agent pipelines.
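Schema-aware type conversion for tool-call arguments can be sketched as follows (a minimal illustration, not Dynamo's actual parser): string values produced by the model are coerced to the types declared in the tool's JSON-schema parameters.

```python
def coerce_arguments(args: dict, schema: dict) -> dict:
    """Convert string-valued tool-call arguments to their declared schema types."""
    converters = {
        "integer": int,
        "number": float,
        "boolean": lambda v: str(v).lower() in ("true", "1"),
    }
    out = {}
    for key, value in args.items():
        declared = schema.get("properties", {}).get(key, {}).get("type")
        convert = converters.get(declared)
        if convert is not None and isinstance(value, str):
            try:
                value = convert(value)
            except ValueError:
                pass  # leave unparseable values as-is rather than failing the call
        out[key] = value
    return out
```

A model that emits `{"n": "3"}` for a tool declaring `n` as an integer thus hands the application `{"n": 3}`.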
Disaggregated Serving Performance
Prefill/decode disaggregation at scale requires efficient coordination. Local KV indexers for SGLang and TRT-LLM reduce coordination overhead with the central indexer, while dynamic rejection thresholds and early rejection protect decode workers from overload. Request cancellation now propagates cleanly during prefill-to-decode transitions, and frontend-based prefill routing for SGLang simplifies the deployment topology. Non-blocking radix snapshots and async-first NIXL APIs improve transfer throughput across workers.
Production Observability and Resilience
Operating LLM infrastructure requires visibility and fault tolerance. Unified distributed tracing now propagates context from frontend through SGLang and vLLM backends, enabling end-to-end request debugging. A new Planner Grafana dashboard provides real-time SLA monitoring, and per-request metrics include prefill timing and KV cache hit rates. A complete CUDA fault injection framework enables GPU resilience testing in Kubernetes, allowing teams to validate recovery behavior before failures occur in production.
Multi-LoRA Serving
Dynamo 0.8.0 introduces comprehensive multi-LoRA serving for vLLM backends. Deterministic adapter ID generation enables consistent routing across replicas, while new management APIs and a local registry simplify adapter lifecycle. KV-aware routing now extends to LoRA requests, allowing the router to consider both prompt prefix and adapter state when selecting workers. Ready-to-use Kubernetes examples with MinIO sync demonstrate production-grade LoRA deployment patterns.
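Deterministic adapter IDs are easy to picture: derive the ID from stable adapter metadata so every replica maps the same adapter to the same ID without coordination. A hedged sketch (the hash choice and fields are illustrative, not Dynamo's actual scheme):

```python
import hashlib

def adapter_id(name: str, source_uri: str) -> int:
    """Derive a stable integer ID from the adapter's name and artifact location."""
    digest = hashlib.sha256(f"{name}@{source_uri}".encode()).digest()
    # Keep the ID in a positive 31-bit range, a common constraint for engine slot IDs.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF
```

Because the ID is a pure function of the adapter's identity, a router can dispatch LoRA requests consistently across replicas without a shared ID allocator.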
First-Time Contributors
@yuekaizhang contributed a PR that adds vLLM multimodal audio support for Qwen2-Audio models (#2760)!
@Dilu-Bilu contributed a PR that adds a guide for Speculative Decoding in vLLM using Eagle3 (#3895)!
@sozercan contributed a PR that updates the AKS deployment guide (#3651)!
@nv-oviya contributed a PR that adds the CUDA fault injection library foundation (#4038)!
@flpanbin contributed a PR that adds dynamic default max_tokens support for vLLM backend (#4156)!
@AryanBagade contributed a PR that adds output token counter to frontend metrics (#4202)!
@Spycsh contributed a PR that enables Intel Gaudi accelerators on Dynamo (#4209)!
@tangcy98 contributed a PR that adds tool call parser support for DeepSeek V3 and R1 (#4253)!
@chandlj contributed a PR that allows users to set --kv-transfer-config for vLLM (#4317)!
@2ez4bz contributed a PR that enables autodeploy as a backend for TRT-LLM (#4347)!
@zhongxuanwang-nv contributed a PR that adds the nvext extension field to OpenAI APIs with worker_id reporting (#4372)!
@vladnosiv contributed a PR that fixes KV events config in aggregated router SGLang example (#4391)!
@dmitrygx contributed a PR that fixes IPv6 support for SGLang ZMQ endpoint (#4403)!
@nancya-nv contributed a PR that fixes model registration for SGLang multimodal workers (#4512)!
@Monokaix contributed a PR that makes hostnames more descriptive and simplifies DNS check commands (#4551)!
@c-fteixeira contributed a PR that disables etcd PodDisruptionBudget by default in Helm (#4602)!
@hypdeb contributed a PR that fixes vLLM deprecation of disable_log_requests (#4659)!
@esoba contributed a PR that adds logprobs support to TRT-LLM backend (#4759)!
@gtbai contributed a PR that installs Run:ai model streamer for vLLM (#4848)!
@MatejKosec contributed a PR that fixes vLLM multi-node support for TP and DP modes (#5006)!
Major Features & Improvements
Infrastructure Modernization
etcd Dependency Removal
Kubernetes-native service discovery replaces etcd dependency for simpler K8s deployments.
Kubernetes-Native Service Discovery: Introduced pluggable discovery system (#4070) with Kubernetes-native implementation via EndpointSlices (#4136), made K8s discovery the default (#5024), and added instance unregistration for clean scaling (#4459).
etcd-free Router and Operator: Updated Operator for etcd-less operation (#4214), made Router use the discovery pattern instead of etcd (#4244, #4597), removed etcd client dependency (#4489), and disabled etcd PodDisruptionBudget by default (#4602).
FileStore Auto-Expiring Leases: File-based key-value store entries now support automatic expiration with configurable TTL for local development without etcd. (#4301)
Remove Static Mode: Removed static endpoint functionality. Deployments using static endpoints must migrate to discovery-based endpoints. (#4235)
Namespace Computation: Normalized Dynamo namespace computation for consistent service discovery. (#5231)
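The auto-expiring FileStore leases (#4301) behave like a TTL key-value store. A minimal in-memory sketch of the semantics only (not the actual FileStore API):

```python
import time

class LeaseStore:
    """Entries carry a deadline; reads lazily drop entries whose TTL has elapsed."""
    def __init__(self):
        self._entries = {}  # key -> (value, deadline)

    def put(self, key, value, ttl_s):
        self._entries[key] = (value, time.monotonic() + ttl_s)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, deadline = entry
        if time.monotonic() >= deadline:
            del self._entries[key]  # lease expired: purge on read
            return None
        return value
```

A worker that stops refreshing its lease simply disappears from reads after the TTL, which is how expiring leases stand in for etcd's lease mechanism in local development.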
NATS Dependency Removal
Dynamo 0.8.0 introduces a transport-agnostic request plane, enabling deployments without NATS for simpler infrastructure.
Transport-Agnostic Request Plane: Introduced transport-agnostic request plane (#4246), added --request-plane CLI flag for tcp, http, or nats selection (#4365), and made TCP the default transport (#4845).
NATS Infrastructure Cleanup: Made NATS metrics conditional on NATS usage (#4442), removed legacy stats handler (#4680), added Helm option to fully disable NATS deployment (#5035), and cleaned up internal NATS code (#4513, #4591).
Decentralized Router with NATS Core: Added support for NATS Core event routing mode as an alternative to JetStream. (#4921)
Multimodality Support
Expanded multimodal support across all backends with video, audio, and improved media handling.
vLLM Multimodal Audio: Added audio support for multimodal inference with Qwen2-Audio models (#2760), enabled efficient decoded media transfer via NIXL (#3988), and added security controls for multimodal requests (#4556).
Frontend Video Decoder: Added video decoder in the frontend preprocessor (#4719) with runtime-configurable settings for frame sampling and memory limits (#5011). Supports all multimodal backends (SGLang, TRT-LLM, vLLM).
Llama4 Multimodal Disaggregated Support: Migrated Llama4 multimodal support to disaggregated serving architecture. (#4213)
KV-Aware Routing Multimodal Support: Added multimodal support to KV-aware routing with standalone TRT-LLM example. (#4577)
OpenAI API
Enhanced OpenAI-compatible API with tool calling support for popular models.
DeepSeek Tool Calling: Added tool call parser for DeepSeek V3 and R1 (#4253), chat template support for V3.2 (#4797), and V3.2 tool calling support (#4822).
Qwen3 Coder Tool Parser: Added support for the Qwen3Coder tool-call format with detection and parsing of tool calls. (#4415)
Tool Choice Support: Added support for Named and Required tool choice modes, enabling explicit control over which tools the model uses. (#4722)
Tool Definitions to Parsers: Tool definitions with parameter metadata can now be supplied to improve parsing accuracy. Parameter values are automatically converted to correct types based on schemas. (#4948)
prompt_tokens_details Support: Added prompt_tokens_details field in usage response for detailed token accounting. (#4239)
nvext Extension Field: Added nvext extension field to OpenAI APIs with worker_id reporting (#4372), and added TTFT and total request time (#4880).
include_stop_str_in_output Support: Added support for include_stop_str_in_output field in completions. (#4924)
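As an illustration of the Named tool choice mode above, a request body following the standard OpenAI schema (the model name and tool are examples only):

```json
{
  "model": "deepseek-ai/DeepSeek-V3",
  "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}}
      }
    }
  }],
  "tool_choice": {"type": "function", "function": {"name": "get_weather"}}
}
```

With a named `tool_choice`, the model must call the specified tool; `"tool_choice": "required"` forces a call to some tool without naming one.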
Version Upgrades
SGLang 0.5.6.post2 + Upstream Runtime Container: Updated SGLang to 0.5.6.post2 and switched to upstream SGLang runtime container, reducing divergence from upstream and improving long-term maintainability. (#4227, #4716, #4762)
TensorRT-LLM 1.2.0rc6.post1: Updated TensorRT-LLM to 1.2.0rc6.post1 with CUDA 13 support. (#4405, #4645, #4836, #5138, #5356)
vLLM 0.12.0: Updated vLLM to 0.12.0 with CUDA 13 support, bringing upstream performance improvements and expanded GPU compatibility. (#4476, #4736, #4997)
NIXL 0.8.0 + UCX 1.20: Updated NIXL to 0.8.0 and UCX to 1.20, enabling KVBM to work with both CUDA 12 and CUDA 13, with build fixes for dual CUDA version support. (#4281, #5000, #5007)
AIConfigurator 0.5.0: Updated AIConfigurator to version 0.5.0 to match Dynamo 0.8.0. (#5039)
mistral.rs CUDA 13: Upgraded mistral.rs to support CUDA 13. (#4474)
minijinja 2.14.0: Updated minijinja templating library to version 2.14.0. (#4949)
CUDA 13 Container Builds: Added CUDA 13 builds for vLLM and SGLang containers. (#5041)
Performance, Framework, & Multi-Hardware Support
Intel Gaudi Support: Enabled Intel Gaudi accelerators on Dynamo with example configuration for multi-worker setup with KV-aware routing. (#4209)
Dynamic max_tokens: Added support for dynamic default max_tokens in vLLM backend, computed from model capacity and input length. (#4156)
TRT-LLM Autodeploy Backend: Enabled autodeploy as a backend option for TRT-LLM with automatic configuration validation. (#4347)
Logprobs Support: Added logprobs support to both vLLM (#4697) and TRT-LLM (#4759) backends for token-level probability information.
libfabric Transport: Added libfabric transport support for high-performance networking. (#4407)
Run:ai Model Streamer: Installed Run:ai model streamer for vLLM to load models from local paths. (#4848)
Mistral-3-Large Support: Added support for Mistral-3-Large model with automatic format detection. (#4885)
Fault Tolerance & Observability
Unified Tracing: Implemented distributed trace propagation from Dynamo to SGLang (#4248) and vLLM (#4918), enabling end-to-end request tracing across the stack.
OTEL Flag for SGLang: Added --enable-otel flag to SGLang launch scripts for OpenTelemetry support. (#4243)
Planner Grafana Dashboard: Added Grafana dashboard for real-time monitoring of SLA Planner performance metrics. (#4815)
Planner Metrics: Added more metrics to Planner for better observability. (#4710)
Cached Tokens Metric: Added cached_tokens Prometheus metric for prefix cache hit tracking. (#4534)
KVBM Cache Hit Rate: Added reporting for KVBM cache hit rate metrics. (#4333)
Request Migration Metrics: Added request migration metrics tracking by model and migration type. (#5029)
Unified Per-Request Metrics: Added prefill_wait_time_ms, prefill_time_ms, and kv_hit_rate to the nvext timing response. (#5004)
Canary Health Check Default: Enabled health checks by default in the runtime configuration. (#4368)
Health Check Auto-Enable in K8s: Health checks are now disabled by default outside Kubernetes and auto-enabled in K8s via the operator. (#4804)
Graceful Exit Handling: Improved graceful exit handling; health check returns 503 during shutdown. (#4914)
Kubernetes Deployment
Fault Injection Framework
CUDA Fault Injection Framework: Added complete GPU fault injection framework for resilience testing, including core library, testing utilities, API service, GPU fault injector agent, Kubernetes manifests, and runtime toggling without pod restarts. (#4038, #4040, #4041, #4042, #4043, #4044, #4679)
Operator & CRD Improvements
Validation Webhooks: Added validating webhooks for CRD validation at submission time. (#4416)
Custom GPU Type in CRD: Added custom GPU type field to CRD for specifying GPU types like Intel Xe. (#4408)
DGD Scaling Adapter: Added scaling adapter for DynamoGraphDeployment integrated with Planner for dynamic scaling, disabled by default for stability. (#4699, #4825, #5180)
maxSurge/maxUnavailable: Enabled maxSurge/maxUnavailable configuration for Deployment-backed components via annotations. (#4990)
DGD Service Status Replicas: Added replica information to DGD service status for monitoring deployment health. (#4863)
useMocker Field: Added useMocker field to DGDR for testing and validation without GPU backends. (#4813)
KServe Endpoints: Added metrics endpoint to KServe gRPC service and readiness endpoints (ServerLive, ServerReady, ModelReady). (#4400, #4708)
PVC Mounting for DGDR: Added optional PVC mounting option to DGDR for profiling output. (#4503)
KV Block Manager
TRT-LLM KV Consolidator: Extended KV Event Consolidator to support TensorRT-LLM for event deduplication. (#4533)
Worker-Local KvIndexer: Added worker-local KvIndexer in KvEventPublisher with configurable buffering. (#4519)
Local Indexers for SGLang/TRT-LLM: Enabled local indexers for SGLang and TRT-LLM via --enable-local-indexer flag. (#4932)
Non-Blocking Radix Snapshot: Made radix snapshot upload non-blocking for improved performance. (#4839)
NIXL Concurrency: Improved NIXL concurrency support with async-first APIs. (#4433)
NIXL Descriptor Serializable: Made NIXL descriptor serializable for KV cache transfer. (#4222)
Scheduling
Router
Max Tree Size Pruning: Added max tree size based pruning for router decisions with configurable thresholds. (#4057)
Route to Available Instances: KV-aware routing now routes to available instances only. (#4225)
Request Cancellation P→D: Added request cancellation when transitioning from Prefill to Decode at KV Prefill Router. (#4449)
Dynamic Rejection Thresholds: Added dynamic setting of thresholds for rejection via HTTP endpoints. (#4673)
Early Rejection: Added early rejection based on active prefill tokens. (#4837)
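Early rejection by active prefill tokens can be sketched as a simple admission gate (illustrative only; in Dynamo the thresholds are set dynamically via HTTP endpoints, #4673):

```python
class PrefillAdmission:
    """Track tokens currently in prefill and reject new work past a budget."""
    def __init__(self, budget_tokens: int):
        self.budget_tokens = budget_tokens
        self.active_tokens = 0

    def try_admit(self, prompt_tokens: int) -> bool:
        if self.active_tokens + prompt_tokens > self.budget_tokens:
            return False  # reject early, before the request queues behind others
        self.active_tokens += prompt_tokens
        return True

    def complete(self, prompt_tokens: int) -> None:
        self.active_tokens -= prompt_tokens  # prefill finished; free the budget
```

Rejecting at admission time lets the client retry another worker immediately instead of waiting behind a saturated prefill queue.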
Planner
Mocker-Planner Integration: Integrated mocker with planner profiler data, enabling realistic performance simulation for capacity planning without running inference workloads. (#4370, #4422, #4590, #4651)
Profiler WebUI: Added interactive web interface for profiler configuration with GPU cost input for TCO analysis, improved UX, and explanatory visualizations. (#4544, #4935, #4968, #4998)
MQA + MoE Support: Added support for MQA + MoE (Qwen3 MoE) TEP/DEP in Planner Profiler. (#4612)
AIConfigurator 0.4 Support: Dynamo Planner profiler now uses hf_id for AIConfigurator 0.4 compatibility. (#4167)
Multi-LoRA Support
LoRA Management: Added comprehensive LoRA adapter support with deterministic ID generation, management APIs and local registry, Kubernetes deployment example, and ready-to-use public LoRA model examples. (#4457, #4464, #4644, #4714, #4807)
KV-Aware LoRA Routing: Added KV-aware LoRA request routing for vLLM with deployment scripts. (#4810)
Recipes
Qwen3-235B-A22B-FP8 Recipes: Added Qwen3-235B-A22B-FP8 recipes for aggregated and disaggregated modes with subsequent updates. (#4179, #5255)
GAIE Recipes: Updated GAIE (Gateway API Inference Extension) recipes with disaggregated serving support. (#4756, #4761)
Aggregated vs Disaggregated Comparison: Added recipe comparing aggregated round-robin routing and disaggregated KV-aware routing with results video. (#5021, #5022)
Bug Fixes
Top bug fixes organized by area.
Infrastructure Modernization
KeyValueStore etcd impl: Fixed to use internal etcd client correctly. (#4212)
K8s Native Discovery Fixes: Multiple fixes for Kubernetes-native service discovery, made K8s-backed discovery the default, fixed discovery diff bug, and added support for dynamic metadata updates. (#4305, #4505, #4988, #5024, #5341)
TCP Ingress Performance: Added zero-copy decoder for handling high concurrency and request bursts on TCP transport. (#5376)
Health Check Warnings: Fixed spurious warning messages during health check operations. (#4793)
Graceful Worker Shutdown: Workers now complete in-flight requests before shutting down. (#4838)
Kubernetes Deployment
DYN_SYSTEM_ENABLED Injection: Fixed environment variable injection for system components. (#4282)
Operator Reconciliation: Fixed reconciliation for user edits, namespace handling, cluster-wide operator, pod normalization, and added EndpointSlices RBAC for Planner. (#4470, #4574, #4853, #4868, #5213)
GAIE Recipe Fixes: Multiple fixes for Gateway API Inference Extension (GAIE) recipes including failure threshold and multi-GPU configurations. (#4260, #4313, #4324, #4445, #4472, #4475, #4525, #4557, #4643, #5027)
vLLM Multi-Node TP/DP: Fixed vLLM multi-node deployments for both tensor/pipeline parallel and data parallel modes. (#5006)
Tolerations Apply Everywhere: Fixed Helm chart to ensure tolerations apply to all pods. (#5094)
Recipe Namespace/PVC Fixes: Multiple fixes for recipe namespace, PVC configurations, DGD YAML, and model cache paths. (#4252, #4289, #4290, #4302, #4410, #4471, #4481, #4517, #4566, #5125, #5296)
EPP etcd-less RBAC: Created RBAC structure for EndpointPicker (EPP) etcd-less deployment. (#5373, #5385)
DeepEP Kernel Path: Fixed incorrect path for deep_ep kernel. (#4205)
DSR1 DEP Memory: Increased GPU memory utilization to 0.95 for DeepSeek R1 DEP. (#5199)
KV Block Manager
Logging Init Defer: Fixed startup ordering when OTEL export is enabled. (#4123)
Port Collision: Fixed port collision in KVBM connector. (#4411)
KVBM Known Issues
KVBM in Dynamo 0.8.0 has several known limitations:
vLLM Disaggregated Mode - TTFT Degradation Under Load
In vLLM disaggregated mode, TTFT degrades under load in disk offload scenarios. A fix is in progress.
TensorRT-LLM - Performance Degradation Due to Unconstrained NUMA Domains
When using KVBM with TensorRT-LLM, performance degrades compared to running without KVBM because some processes start in NUMA domains other than those closest to the devices running the model. This does not affect single-CPU-socket systems or systems with a single NUMA domain. If observed, the issue can be mitigated by setting the environment variable TLLM_NUMA_AWARE_WORKER_AFFINITY=1. See #8805 for more details.
CPU Offload Corruption Under High Concurrency with vLLM KVBM connector
When using KVBM CPU offload (DYN_KVBM_CPU_CACHE_GB) with vLLM 0.12.0 under high concurrency, KV cache blocks restored from CPU memory may contain corrupted or incorrect data, causing the model to generate wrong or garbled responses. This appears to be a race condition or synchronization issue during cache restoration. Workaround: Disable CPU offload by leaving DYN_KVBM_CPU_CACHE_GB unset and use only disk offload (DYN_KVBM_DISK_CACHE_GB).
Planner Known Issues
Planner Scaling Test Flaky Under High Load
Under high request load, scaling operations (e.g., 1P1D → 2P1D) may fail to reach expected replica counts. The Dynamo runtime may drop connections before streams are fully established, resulting in "Stream disconnected... recreating stream" errors.
EndpointPicker (EPP) Known Issues
Helm Chart EPP Installation Fails to Build
When building the Dynamo custom EPP using the Helm chart installation path, the build script fails with "No patch file found" error. Fix planned for 0.8.1.
LMCache Known Issues
LMCache Incompatible with CUDA 13
LMCache examples fail with "No module named lmcache" when running on CUDA 13 builds. Workaround: Use CUDA 12.x builds if LMCache functionality is required.
LMCache B200 Corrupted Output
On B200 GPUs, LMCache KV cache restoration may return corrupted output. This is an upstream vLLM/LMCache issue, not a Dynamo bug. Workaround: Use KVBM instead of LMCache on B200, or use H100/A100 GPUs with LMCache.
SGLang Known Issues
SGLang WideEP NIXL Backend Initialization Failure
SGLang WideEP deployments (e.g., sglang/ds1-wideep) may fail to launch the decoder worker with NIXL_ERR_BACKEND errors. This is an upstream SGLang issue, not a Dynamo bug. Workaround: Use --disaggregation-transfer-backend mooncake instead of nixl.
SGLang DeepSeek R1 Disaggregated 16-GPU Recipe JIT Compilation Failure
The DeepSeek R1 disaggregated 16-GPU recipe (dsr1-disagg-16gpu) fails with an "unknown variant 'abort'" error during JIT compilation. Workaround: Set the environment variable SGLANG_JIT_DEEPGEMM_PRECOMPILE=false and redeploy.
vLLM Known Issues
vLLM DeepSeek R1 Hopper 16-GPU Gibberish Output
The DeepSeek R1 vLLM Hopper 16-GPU recipe may return gibberish output. This is an upstream vLLM issue that occurs without Dynamo in vLLM v0.12.0 and v0.13.0 containers. Tracked in vllm#32190.
GPT-OSS-120B Fails on H100/H200 with vLLM Backend
The GPT-OSS-120B model fails on H100/H200 GPUs due to an upstream vLLM v0.12.0 regression affecting MXFP4 quantization on Hopper GPUs. Tracked in vllm#28894. Workaround: Set export VLLM_MXFP4_USE_MARLIN=1 before launching. This fix is included in the next vLLM release (vllm#30528). For the latest guidance, refer to the vLLM GPT-OSS documentation.
TRT-LLM Known Issues
Llama4 Multimodal with Precomputed Embeddings Fails
Llama4 models (Llama-4-Scout-17B-16E, Llama-4-Maverick-17B-128E) cannot run multimodal inference using precomputed embeddings with the TRT-LLM backend. This is an upstream TRT-LLM issue affecting TRT-LLM v1.2.0rc3 and later.
TRT-LLM Wide EP GB200 Recipe Does Not Run
The trtllm/disagg/wide_ep/gb200 recipe for DeepSeek R1 returns an empty model list from the frontend. The frontend service lacks the model-cache PVC mount that the prefill/decode workers have, so it cannot access model config files. Fix planned for 0.8.1.
TRT-LLM RuntimeError on AWS GB200 Clusters
The GPT-OSS-120B TRT-LLM recipe fails with RuntimeError:800 on certain AWS GB200 clusters. The recipe works correctly on OCI K8s and standard AWS K8s environments. The root cause is under investigation; it may be related to GPU NVLink connectivity in specific AWS configurations.
What's Next
As we look ahead to Dynamo v0.9.0, we are focused on completing our Kubernetes-native infrastructure transition and expanding performance benchmarks. Planned highlights include:
Dynamo v0.9.0 (Release Date: February 11, 2026):
Complete NATS Removal: Eliminate NATS dependency for the KV events plane, making Dynamo fully deployable without external message brokers.
Separation of Frontend and Router: Decouple the frontend processor from the router for more flexible deployment topologies.
KVBM Object Store Support: Add object store backend for distributed KV cache persistence across nodes.
Multimodal KV-Aware Routing: Extend KV-aware routing with hash-based support for multimodal requests in TRT-LLM.
Topology-Aware Serving: Full support for topology-aware serving including GB200 NVL72 configurations.
Consolidated Benchmarks: Published benchmarks for Qwen3-32B with disaggregated serving, KV-aware routing, and KVBM.
In H1 2026: Looking toward Dynamo 1.0 at GTC '26, we will expand capabilities across five key areas:
Performance: AIConfigurator improvements for all backends, fully composed recipes combining KV-aware routing, disaggregated serving, and KV cache offloading.
Production Grade Serving: Hierarchical Planner for heterogeneous worker pools, request rejection for overload protection, fast recovery with continuous availability, and WideEP fault tolerance.
Kubernetes Platform: Grove topology-aware orchestration with GB200 automation, ModelExpress performance optimization for model loading.
Agentic Workflows: Predictive routing with proactive load balancing, intelligent KV cache retention for high-reuse sessions, and KV cache offloading/prefetching for tool calls.
Multimodality and Diffusion: Multimodal hash router support for vLLM and SGLang, E/P/D disaggregation optimization, and support for SGLang Diffusion/Omni and vLLM Omni.