0.9.0
Summary
NVIDIA® NIXL Release 0.9.0 delivers significant new capabilities and performance improvements. Key highlights include the introduction of the UCCL backend for optimized collective communication, a new Telemetry Plugin infrastructure with Prometheus support, and the NIXL-EP example demonstrating expert-parallel dispatch. This release also adds support for Python 3.14, enables Shared Memory for Libfabric intra-node transfers, and includes important performance optimizations for core request handling.
This version contains breaking changes, specifically the removal of support for Python 3.9.
Major Features
- UCCL Backend Integration: Added support for the UCCL P2P backend, enabling efficient GPU memory transfers over RDMA. (#895)
- Telemetry Plugin Infrastructure: A new extensible telemetry plugin system has been introduced, allowing for custom metric exporters.
- Prometheus Exporter: A new plugin to export metrics to Prometheus. (#1091)
- Cyclic Buffer Exporter: A plugin for high-performance cyclic buffer telemetry. (#1088)
- Plugin Manager: Infrastructure to support loading and managing telemetry plugins. (#1070)
- NIXL-EP (Expert Parallelism Example): Introduced `examples/device/ep`, a comprehensive example demonstrating expert-parallel dispatch and combine operations using the NIXL device API. This includes improved metadata fetching and CI integration. (#1043, #1132, #1104, #1077)
- Enable CUDA IPC NVLINK Backend: NIXL-EP now enables the CUDA IPC NVLINK backend for improved intra-node GPU communication. (#1099)
- Libfabric Shared Memory Support: Enabled the shared memory provider (`shm`) for NVLink intra-node transfers in the Libfabric backend. This improves performance for local GPU-to-GPU communication by leveraging NVLink without requiring network transport. (#1076)
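The cyclic buffer exporter listed above revolves around a bounded ring of recent samples: memory stays constant no matter how fast metrics arrive, at the cost of dropping the oldest entries. As an illustration of the idea only (stdlib Python, not the plugin's actual interface):

```python
from collections import deque
from dataclasses import dataclass
import time

@dataclass
class Sample:
    name: str
    value: float
    timestamp: float

class CyclicTelemetryBuffer:
    """Fixed-capacity ring: the newest samples overwrite the oldest,
    so memory stays bounded regardless of the event rate."""

    def __init__(self, capacity: int = 1024):
        self._buf = deque(maxlen=capacity)

    def record(self, name: str, value: float) -> None:
        self._buf.append(Sample(name, value, time.monotonic()))

    def snapshot(self) -> list:
        # Copy out so consumers never race with concurrent writers.
        return list(self._buf)

buf = CyclicTelemetryBuffer(capacity=3)
for i in range(5):
    buf.record("xfer_bytes", float(i))

# Only the 3 newest samples survive.
print([s.value for s in buf.snapshot()])  # → [2.0, 3.0, 4.0]
```

A scraper-style exporter (like the Prometheus plugin) can then read `snapshot()` periodically without pausing the producer.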
Breaking Changes
- Python 3.9 Support Removed: Support for Python 3.9 has been removed. The supported Python versions are now 3.10 through 3.14. (#1071)
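With the floor moved to 3.10, downstream code that might still run on older interpreters can fail fast with a clear message. A minimal, application-side sketch (this guard is illustrative, not part of NIXL):

```python
import sys

# NIXL 0.9.0 supports Python 3.10 through 3.14 (3.9 was dropped).
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 14)

def check_python_version(version=sys.version_info):
    """Return True if (major, minor) falls in the supported window."""
    major_minor = (version[0], version[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED

print(check_python_version((3, 9, 0)))   # → False
print(check_python_version((3, 12, 1)))  # → True
```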
API Changes
- [Python] Python 3.14 Support: Added official support for Python 3.14. (#1071)
- [Python] Explicit API Exports: Python APIs are now explicitly exported using `__all__` to control the public namespace and provide cleaner imports. (#1062)
- [Rust] Custom Backend Parameters: Added support for passing custom backend parameters in Rust bindings. (#900)
- [Rust] Descriptor Serialization: Added `Serde` serialization support for `RegDescList` and `XferDescList`. (#829)
- [Rust] Indexing Support: Added `Index`/`IndexMut` and `get`/`get_mut` methods for descriptor lists. (#1003)
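The effect of the `__all__` change is easy to demonstrate with plain Python. The module name `fake_nixl_api` below is invented for the demo and is not NIXL's real module layout:

```python
import sys
import types

# Build a throwaway module that mimics explicit exports via __all__.
mod = types.ModuleType("fake_nixl_api")
exec(
    """
__all__ = ["create_agent"]        # the public surface

def create_agent(name):           # exported
    return "agent:" + name

def _internal_helper():           # hidden from star-imports
    return "private"
""",
    mod.__dict__,
)
sys.modules["fake_nixl_api"] = mod

# Star-import honors __all__: only the listed names come through.
ns = {}
exec("from fake_nixl_api import *", ns)
print("create_agent" in ns)      # → True
print("_internal_helper" in ns)  # → False
```

Explicit imports (`from fake_nixl_api import _internal_helper`) still work, so `__all__` shapes the advertised namespace without hard-hiding internals.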
Enhancements
Performance
- [Core] Request Handling Optimization: Implemented request handling optimizations to reduce overhead for large batches of small messages. (#1009)
- [UCX] Relaxed Ordering: Set `UCX_IB_PCI_RELAXED_ORDERING=try` by default to improve PCIe performance where supported. (#1012)
- [Libfabric] EFA Unsolicited Write Recv: Added the `FI_OPT_EFA_USE_UNSOLICITED_WRITE_RECV` option to disable unsolicited write receives on EFA RDM, reducing CQ overflows under high load. (#1084)
- [Libfabric] Large Message Notifications: Implemented notification fragmentation for large messages to better support large transfers (e.g., TensorRT-LLM disaggregated workloads). (#1182)
- [NIXL-EP] Parallel Metadata Fetch: Optimized connection establishment in NIXL-EP by parallelizing metadata fetches. (#1132)
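The benefit of parallelizing metadata fetches is that N peer round-trips overlap instead of serializing. A stdlib sketch of the pattern; `fetch_metadata` here is a stand-in for the real per-peer exchange, not NIXL's API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_metadata(peer: str) -> str:
    # Stand-in for one network round-trip (~50 ms per peer).
    time.sleep(0.05)
    return "md-for-" + peer

peers = ["rank%d" % i for i in range(8)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(peers)) as pool:
    # pool.map preserves input order, so results zip back to peers.
    metadata = dict(zip(peers, pool.map(fetch_metadata, peers)))
elapsed = time.monotonic() - start

# All 8 fetches overlap, so total time is ~1 round-trip, not 8.
print(len(metadata))  # → 8
print(elapsed < 0.05 * len(peers))  # → True
```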
Build & CI
- Selective Plugin Building: Added Meson options to selectively enable or disable specific plugins during build (`-Denable_plugins=...`). (#951)
- CUDA 13 Support: Updated CI infrastructure to support CUDA 13 for GPU tests. (#996)
- POSIX Plugin Dependencies: Fixed POSIX plugin dependency handling in Meson to resolve build issues with TRTLLM. (#1086)
- Python License: Fixed license identifier in Python packages to match the LICENSE file. (#1119)
- manylinux Wheel Packaging: Added the `uring` library to the manylinux 0.9.0 Docker image so it is included with wheels, enabling POSIX plugin access to performant async I/O options. (#1185)
Documentation
- Libfabric Guide: Improved clarity and grammar in the Libfabric README. (#1007)
- Examples: Enhanced the basic Python examples and `nixl_ep` documentation. (#1078, #1093)
Bugfixes
- [POSIX] AIO Resubmission: Fixed a bug where the Linux AIO plugin could not correctly resubmit I/O requests. (#1020)
- [Libfabric] Sockets Deadlock: Fixed a connection deadlock with the sockets provider by reducing the CQ read timeout. (#1080)
- [Libfabric] Topology Grouping: Fixed GPU NIC grouping logic in `libfabric_topology` when multiple GPUs share a NIC. (#1024)
- [Libfabric] GPU-to-EFA Mapping: Fixed GPU-to-EFA mapping by using PCI bus IDs instead of GPU IDs, ensuring correct device association. (#1184)
- [Core] Metadata Crash: Fixed a crash that occurred if a peer closed the connection during metadata exchange. (#854)
- [Telemetry] Dangling Pointer: Resolved a dangling pointer issue in `getName`/`getVersion` in the telemetry plugin. (#1148)
- [Benchmark] nixlbench Memory: `nixlbench` now allocates page-aligned memory by default to ensure consistency. (#1060)
- [Python] venv Support: Fixed Python test scripts to work correctly inside `uv` virtual environments. (#1106)
- [UCX] Device API Detection: Fixed detection logic for UCX GPU device API support. (#990)
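The page-alignment fix for `nixlbench` reflects a general rule: RDMA registration and direct I/O paths behave most predictably on page-aligned buffers. A stdlib sketch of obtaining and verifying such a buffer (illustrative, not `nixlbench` code):

```python
import ctypes
import mmap

PAGE = mmap.PAGESIZE  # typically 4096

# Anonymous mmap regions are page-aligned by construction, which is what
# RDMA registration and O_DIRECT-style I/O paths generally expect.
region = mmap.mmap(-1, 1 << 20)
addr = ctypes.addressof(ctypes.c_char.from_buffer(region))
print(addr % PAGE == 0)  # → True

# A plain bytearray carries no such guarantee; its start address is
# often not on a page boundary.
raw = bytearray(1 << 20)
raw_addr = ctypes.addressof(ctypes.c_char.from_buffer(raw))
print(raw_addr % PAGE)   # may be nonzero
```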
Benchmarks & Test Infrastructure
- [kvbench] FLOPs Estimation: Added Tensor Parallel (TP) scaling and MLP FLOPs to compute-time estimates in `kvbench`. (#1083)
- [nixlbench] Consistency Check: Added data validation consistency checks to `nixlbench`. (#1103)
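A data-validation consistency check of this kind typically fills the source with a deterministic, position-dependent pattern and compares digests after the transfer, so shifted or truncated copies cannot accidentally pass. A hedged stdlib sketch (not the actual `nixlbench` implementation):

```python
import hashlib

def make_pattern(size: int, seed: int = 0) -> bytes:
    # Position-dependent bytes: a shifted or truncated transfer
    # produces a different digest and fails validation.
    return bytes((i * 131 + seed) & 0xFF for i in range(size))

def validate(src: bytes, dst: bytes) -> bool:
    return hashlib.sha256(src).digest() == hashlib.sha256(dst).digest()

src = make_pattern(64 * 1024)
dst = bytes(src)  # stands in for the buffer after a completed transfer

print(validate(src, dst))                  # → True
print(validate(src, dst[:-1] + b"\x00"))   # corrupted tail → False
```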
Known Issues
Full Changelog: https://github.com/ai-dynamo/nixl/compare/0.8.0...0.9.0