0.9.0
Summary
NVIDIA® NIXL Release 0.9.0 delivers significant new capabilities and performance improvements. Key highlights include the introduction of the UCCL backend for optimized collective communication, a new Telemetry Plugin infrastructure with Prometheus support, and the NIXL-EP example demonstrating expert-parallel dispatch. This release also adds support for Python 3.14, enables Shared Memory for Libfabric intra-node transfers, and includes important performance optimizations for core request handling.
This version contains breaking changes, specifically the removal of support for Python 3.9.
Major Features
- UCCL Backend Integration: Added support for the UCCL P2P backend, enabling efficient GPU memory transfers over RDMA. (#895)
- Telemetry Plugin Infrastructure: A new extensible telemetry plugin system has been introduced, allowing for custom metric exporters.
- Prometheus Exporter: A new plugin to export metrics to Prometheus. (#1091)
- Cyclic Buffer Exporter: A plugin for high-performance cyclic buffer telemetry. (#1088)
- Plugin Manager: Infrastructure to support loading and managing telemetry plugins. (#1070)
- NIXL-EP (Expert Parallelism Example): Introduced `examples/device/ep`, a comprehensive example demonstrating expert-parallel dispatch and combine operations using the NIXL device API. This includes improved metadata fetching and CI integration. (#1043, #1132, #1104, #1077)
- Enable CUDA IPC NVLINK Backend: NIXL-EP now enables the CUDA IPC NVLINK backend for improved intra-node GPU communication. (#1099)
- Libfabric Shared Memory Support: Enabled the shared memory provider (`shm`) for NVLink intra-node transfers in the Libfabric backend. This improves performance for local GPU-to-GPU communication by leveraging NVLink without requiring network transport. (#1076)
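The cyclic buffer exporter listed above revolves around a bounded ring of recent samples: memory stays constant no matter how fast metrics arrive, at the cost of dropping the oldest entries. As an illustration of the idea only (stdlib Python, not the plugin's actual interface):

```python
from collections import deque
from dataclasses import dataclass
import time

@dataclass
class Sample:
    name: str
    value: float
    timestamp: float

class CyclicTelemetryBuffer:
    """Fixed-capacity ring: the newest samples overwrite the oldest,
    so memory stays bounded regardless of the event rate."""

    def __init__(self, capacity: int = 1024):
        self._buf = deque(maxlen=capacity)

    def record(self, name: str, value: float) -> None:
        self._buf.append(Sample(name, value, time.monotonic()))

    def snapshot(self) -> list:
        # Copy out so consumers never race with concurrent writers.
        return list(self._buf)

buf = CyclicTelemetryBuffer(capacity=3)
for i in range(5):
    buf.record("xfer_bytes", float(i))

# Only the 3 newest samples survive.
print([s.value for s in buf.snapshot()])  # → [2.0, 3.0, 4.0]
```

A scraper-style exporter (like the Prometheus plugin) can then read `snapshot()` periodically without pausing the producer.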
Breaking Changes
- Python 3.9 Support Removed: Support for Python 3.9 has been removed. The supported Python versions are now 3.10 through 3.14. (#1071)
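With the floor moved to 3.10, downstream code that might still run on older interpreters can fail fast with a clear message. A minimal, application-side sketch (this guard is illustrative, not part of NIXL):

```python
import sys

# NIXL 0.9.0 supports Python 3.10 through 3.14 (3.9 was dropped).
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 14)

def check_python_version(version=sys.version_info):
    """Return True if (major, minor) falls in the supported window."""
    major_minor = (version[0], version[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED

print(check_python_version((3, 9, 0)))   # → False
print(check_python_version((3, 12, 1)))  # → True
```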
API Changes
- [Python] Python 3.14 Support: Added official support for Python 3.14. (#1071)
- [Python] Explicit API Exports: Python APIs are now explicitly exported using `__all__` to control the public namespace and provide cleaner imports. (#1062)
- [Rust] Custom Backend Parameters: Added support for passing custom backend parameters in Rust bindings. (#900)
- [Rust] Descriptor Serialization: Added `Serde` serialization support for `RegDescList` and `XferDescList`. (#829)
- [Rust] Indexing Support: Added `Index`/`IndexMut` and `get`/`get_mut` methods for descriptor lists. (#1003)
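The effect of the `__all__` change is easy to demonstrate with plain Python. The module name `fake_nixl_api` below is invented for the demo and is not NIXL's real module layout:

```python
import sys
import types

# Build a throwaway module that mimics explicit exports via __all__.
mod = types.ModuleType("fake_nixl_api")
exec(
    """
__all__ = ["create_agent"]        # the public surface

def create_agent(name):           # exported
    return "agent:" + name

def _internal_helper():           # hidden from star-imports
    return "private"
""",
    mod.__dict__,
)
sys.modules["fake_nixl_api"] = mod

# Star-import honors __all__: only the listed names come through.
ns = {}
exec("from fake_nixl_api import *", ns)
print("create_agent" in ns)      # → True
print("_internal_helper" in ns)  # → False
```

Explicit imports (`from fake_nixl_api import _internal_helper`) still work, so `__all__` shapes the advertised namespace without hard-hiding internals.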
Enhancements
Performance
- [Core] Request Handling Optimization: Implemented request handling optimizations to reduce overhead for large batches of small messages. (#1009)
- [UCX] Relaxed Ordering: Set `UCX_IB_PCI_RELAXED_ORDERING=try` by default to improve PCIe performance where supported. (#1012)
- [Libfabric] EFA Unsolicited Write Recv: Added the `FI_OPT_EFA_USE_UNSOLICITED_WRITE_RECV` option to disable unsolicited write receives on EFA RDM, reducing CQ overflows under high load. (#1084)
- [Libfabric] Large Message Notifications: Implemented notification fragmentation for large messages to better support large transfers (e.g., TensorRT-LLM disaggregated workloads). (#1182)
- [NIXL-EP] Parallel Metadata Fetch: Optimized connection establishment in NIXL-EP by parallelizing metadata fetches. (#1132)
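The benefit of parallelizing metadata fetches is that N peer round-trips overlap instead of serializing. A stdlib sketch of the pattern; `fetch_metadata` here is a stand-in for the real per-peer exchange, not NIXL's API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_metadata(peer: str) -> str:
    # Stand-in for one network round-trip (~50 ms per peer).
    time.sleep(0.05)
    return "md-for-" + peer

peers = ["rank%d" % i for i in range(8)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(peers)) as pool:
    # pool.map preserves input order, so results zip back to peers.
    metadata = dict(zip(peers, pool.map(fetch_metadata, peers)))
elapsed = time.monotonic() - start

# All 8 fetches overlap, so total time is ~1 round-trip, not 8.
print(len(metadata))  # → 8
print(elapsed < 0.05 * len(peers))  # → True
```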
Build & CI
- Selective Plugin Building: Added Meson options to selectively enable or disable specific plugins during build (`-Denable_plugins=...`). (#951)
- CUDA 13 Support: Updated CI infrastructure to support CUDA 13 for GPU tests. (#996)
- POSIX Plugin Dependencies: Fixed POSIX plugin dependency handling in Meson to resolve build issues with TRTLLM. (#1086)
- Python License: Fixed license identifier in Python packages to match the LICENSE file. (#1119)
- manylinux Wheel Packaging: Added the `uring` library to the manylinux 0.9.0 Docker image so it is included with wheels, enabling POSIX plugin access to performant async I/O options. (#1185)
Documentation
- Libfabric Guide: Improved clarity and grammar in the Libfabric README. (#1007)
- Examples: Enhanced the basic Python examples and `nixl_ep` documentation. (#1078, #1093)
Bugfixes
- [POSIX] AIO Resubmission: Fixed a bug where the Linux AIO plugin could not correctly resubmit I/O requests. (#1020)
- [Libfabric] Sockets Deadlock: Fixed a connection deadlock with the sockets provider by reducing the CQ read timeout. (#1080)
- [Libfabric] Topology Grouping: Fixed GPU NIC grouping logic in `libfabric_topology` when multiple GPUs share a NIC. (#1024)
- [Libfabric] GPU-to-EFA Mapping: Fixed GPU-to-EFA mapping by using PCI bus IDs instead of GPU IDs, ensuring correct device association. (#1184)
- [Core] Metadata Crash: Fixed a crash that occurred if a peer closed the connection during metadata exchange. (#854)
- [Telemetry] Dangling Pointer: Resolved a dangling pointer issue in `getName`/`getVersion` in the telemetry plugin. (#1148)
- [Benchmark] nixlbench Memory: `nixlbench` now allocates page-aligned memory by default to ensure consistency. (#1060)
- [Python] venv Support: Fixed Python test scripts to work correctly inside `uv` virtual environments. (#1106)
- [UCX] Device API Detection: Fixed detection logic for UCX GPU device API support. (#990)
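The page-alignment fix for `nixlbench` reflects a general rule: RDMA registration and direct I/O paths behave most predictably on page-aligned buffers. A stdlib sketch of obtaining and verifying such a buffer (illustrative, not `nixlbench` code):

```python
import ctypes
import mmap

PAGE = mmap.PAGESIZE  # typically 4096

# Anonymous mmap regions are page-aligned by construction, which is what
# RDMA registration and O_DIRECT-style I/O paths generally expect.
region = mmap.mmap(-1, 1 << 20)
addr = ctypes.addressof(ctypes.c_char.from_buffer(region))
print(addr % PAGE == 0)  # → True

# A plain bytearray carries no such guarantee; its start address is
# often not on a page boundary.
raw = bytearray(1 << 20)
raw_addr = ctypes.addressof(ctypes.c_char.from_buffer(raw))
print(raw_addr % PAGE)   # may be nonzero
```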
Benchmarks & Test Infrastructure
- [kvbench] FLOPs Estimation: Added Tensor Parallel (TP) scaling and MLP FLOPs to compute-time estimates in `kvbench`. (#1083)
- [nixlbench] Consistency Check: Added data validation consistency checks to `nixlbench`. (#1103)
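A data-validation consistency check of this kind typically fills the source with a deterministic, position-dependent pattern and compares digests after the transfer, so shifted or truncated copies cannot accidentally pass. A hedged stdlib sketch (not the actual `nixlbench` implementation):

```python
import hashlib

def make_pattern(size: int, seed: int = 0) -> bytes:
    # Position-dependent bytes: a shifted or truncated transfer
    # produces a different digest and fails validation.
    return bytes((i * 131 + seed) & 0xFF for i in range(size))

def validate(src: bytes, dst: bytes) -> bool:
    return hashlib.sha256(src).digest() == hashlib.sha256(dst).digest()

src = make_pattern(64 * 1024)
dst = bytes(src)  # stands in for the buffer after a completed transfer

print(validate(src, dst))                  # → True
print(validate(src, dst[:-1] + b"\x00"))   # corrupted tail → False
```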
Known Issues
Full Changelog: https://github.com/ai-dynamo/nixl/compare/0.8.0...0.9.0