0.8.0
Summary
NVIDIA® NIXL Release 0.8.0 delivers significant performance improvements, major new capabilities, and important dependency updates. Key highlights include a request-handling optimization for large-batch workloads in the UCX backend, Linux AIO support in the POSIX backend for high-performance storage I/O, and direct CUDA memory registration support for the Libfabric backend.
This version contains breaking changes, including the removal of the legacy Multi-Object UCX backend and an update to the minimum required Libfabric version. It also introduces support for Python 3.13 and changes the default build type to Release for optimized performance out of the box.
Major Features & Improvements
- UCX Performance for Large-Batch Workloads: The request handling mechanism in the UCX backend has been overhauled to reduce overhead. For workloads with large batches (~64k) of small messages (<1 KB), as in sglang and other LLM inference engines that use paged attention, this change significantly reduces latency and improves time-to-first-token (TTFT). Internal benchmarks show a ~50% performance increase in nixlbench and a ~20% TTFT reduction in sglang for these scenarios. (#982)
- Linux AIO plugin for the POSIX backend: The POSIX backend now leverages the Linux Asynchronous I/O (AIO) API where available. This provides a high-performance, asynchronous interface for data transfers to and from local storage. Internal benchmarks show an increase in read throughput for read sizes above 100 kB. (#885)
- Libfabric CUDA Memory Registration: The Libfabric backend can now directly register CUDA memory regions using fi_mr_regattr. This adds support for extended memory registration attributes and optimized RDMA behavior. (#960)
- Python: Added support for Python 3.13. (#994)
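A rough back-of-the-envelope calculation shows why per-request overhead matters for the large-batch case described above. The batch size (~64k) and message size (<1 KB) come from the note; the per-request overhead is a purely hypothetical placeholder, not a measured NIXL or UCX figure.

```python
# Back-of-the-envelope sizing for a large batch of small messages.
BATCH_SIZE = 64 * 1024      # ~64k requests per batch (from the release note)
MAX_MESSAGE_BYTES = 1024    # each message is under 1 KB (from the release note)
ASSUMED_OVERHEAD_US = 1.0   # HYPOTHETICAL per-request software overhead, microseconds


def batch_payload_bytes(batch_size: int, message_bytes: int) -> int:
    """Upper bound on the payload carried by one batch."""
    return batch_size * message_bytes


def batch_overhead_ms(batch_size: int, overhead_us: float) -> float:
    """Total per-request overhead for one batch, in milliseconds."""
    return batch_size * overhead_us / 1000.0


if __name__ == "__main__":
    payload_mib = batch_payload_bytes(BATCH_SIZE, MAX_MESSAGE_BYTES) / 2**20
    overhead_ms = batch_overhead_ms(BATCH_SIZE, ASSUMED_OVERHEAD_US)
    print(f"payload upper bound: {payload_mib:.0f} MiB")
    print(f"overhead at assumed {ASSUMED_OVERHEAD_US} us/request: {overhead_ms:.1f} ms")
```

Under that assumption the payload is at most 64 MiB, while the fixed per-request cost alone adds up to tens of milliseconds per batch, so trimming request-handling overhead shows up directly in latency.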
Breaking Changes
- UCX Multi-Object Backend Removed: The legacy Multi-Object (UCX_MO) backend has been removed. Users should migrate to the primary UCX backend, which now incorporates multi-device support. (#898)
- Libfabric Minimum Version Increased: The minimum required version of Libfabric has been raised to v1.21.0 to support new features. (#961)
- Default Build Type is now Release: When building from source, the default build type is now Release instead of Debug. This ensures that default builds are optimized for performance. (#869)
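When checking an installed Libfabric against the new minimum, version components should be compared numerically rather than as strings. The sketch below is illustrative only: parse_version and meets_minimum are hypothetical helpers, not part of NIXL or Libfabric, and the installed version string is assumed to have been obtained separately.

```python
# Compare a Libfabric version string against the minimum stated in this release.
MIN_LIBFABRIC = (1, 21, 0)  # v1.21.0, from the release note


def parse_version(text: str) -> tuple:
    """Turn 'v1.21.0' or '1.21' into (1, 21, 0). Hypothetical helper.

    Only plain dotted numeric versions are handled; suffixes such as
    'rc1' would raise ValueError.
    """
    parts = [int(part) for part in text.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as '1.21'
    return tuple(parts)


def meets_minimum(installed: str, minimum: tuple = MIN_LIBFABRIC) -> bool:
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    for candidate in ("1.9.0", "1.20.1", "1.21.0", "v1.22.0"):
        print(candidate, meets_minimum(candidate))
```

A plain string comparison would get this wrong: "1.9.0" sorts after "1.21.0" lexicographically even though it is the older version.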
API Changes
- [Rust] New RegDescList and XferDescList APIs have been added to the Rust interface for descriptor management. (#828)
- [Python] Obsolete and unused code from the Python API has been removed. (#985)
Enhancements
- [Build] The build system now searches for libraries in paths specified by the NIXL_PREFIX environment variable, making it easier to link against custom builds. (#998)
Bugfixes
- [Core] Metadata exchanges over sockets have been improved with better error handling. (#999)
- [Core] The communication queue used for metadata exchange over sockets is now fully flushed before the listener thread stops, preventing potential data loss during shutdown. (#830)
- [Libfabric] Fixed multiple issues with metadata handling for partial loads and offset calculations that could lead to data corruption. (#969, #978)
- [Bindings] Resolved a crash in Python examples that occurred when the NIXL_PLUGIN_DIR environment variable was not set. (#963)
- [Rust] Fixed an issue that prevented Rust stubs from building correctly. (#1001)
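The shutdown-ordering fix in #830 follows a general pattern: pending messages are drained before the listener thread stops. The sketch below shows that pattern with Python's standard queue and threading modules. It is not NIXL's implementation, and every name in it (Listener, send, stop) is hypothetical.

```python
import queue
import threading


class Listener:
    """Hypothetical listener that drains its queue before stopping."""

    _STOP = object()  # sentinel marking the end of the queue

    def __init__(self):
        self._queue = queue.Queue()
        self.delivered = []
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        """Enqueue a message for the listener thread."""
        self._queue.put(message)

    def _run(self):
        # Handle messages in FIFO order until the sentinel arrives.
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self.delivered.append(item)

    def stop(self):
        # The sentinel is enqueued behind every pending message, so the
        # thread handles all of them before it sees the stop request.
        self._queue.put(self._STOP)
        self._thread.join()


if __name__ == "__main__":
    listener = Listener()
    for i in range(1000):
        listener.send(i)
    listener.stop()
    print(len(listener.delivered))
```

A stop flag checked at the top of the loop could instead let the thread exit with messages still queued, which is the kind of loss a flush-before-stop ordering guards against.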
Benchmarks & Test Infrastructure
- [nixlbench] Fixed a memory management bug related to cudaFree. (#965)
- [nixlbench] The tool will now correctly exit with a failure code if an I/O vector consistency check fails. (#992)
- [CI] Switched CI jobs from PyTorch-based images to cuda-dl-base images for better unification and consistency. (#924)
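The exit-code change in #992 matters for scripted benchmark runs, where a nonzero status is often the only signal a wrapper script sees. A minimal sketch of the pattern follows. The names check_consistency and run are hypothetical, and the element-by-element comparison is an assumption about what such a check might do, not nixlbench's actual logic.

```python
import sys


def check_consistency(expected: list, received: list) -> bool:
    """Hypothetical check: every buffer in the I/O vector matches."""
    return len(expected) == len(received) and all(
        e == r for e, r in zip(expected, received)
    )


def run(expected: list, received: list) -> int:
    """Return a process exit status: 0 on success, 1 on mismatch."""
    if not check_consistency(expected, received):
        print("consistency check failed", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sent = [b"abc", b"def"]
    status = run(sent, list(sent))
    # Propagate failure to the shell; falling through exits with status 0.
    if status != 0:
        sys.exit(status)
```

Returning the status from run and exiting only at the top level keeps the check easy to test while still letting CI detect a failed run.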
Known Issues
- [GPUNETIO] The GPUNETIO plugin is not available in CUDA 13 environments. This will be addressed in a future release.
Full Changelog: https://github.com/ai-dynamo/nixl/compare/0.7.1...0.8.0