0.8.0

Summary

NVIDIA® NIXL Release 0.8.0 delivers significant performance improvements, major new capabilities, and important dependency updates. Key highlights include a massive optimization for large-batch workloads in the UCX backend, the introduction of a new POSIX backend using Linux AIO for high-performance storage I/O, and direct CUDA memory registration support for the Libfabric backend.

This version contains breaking changes, including the removal of the legacy Multi-Object UCX backend and an update to the minimum required Libfabric version. It also introduces support for Python 3.13 and changes the default build type to Release for optimized performance out of the box.

Major Features & Improvements

UCX Performance for Large-Batch Workloads: The request handling mechanism in the UCX backend has been overhauled to reduce overhead. For workloads with large batches (~64k) of small messages (< 1 KB), as in sglang and other LLM inference engines that use paged attention, this change significantly reduces latency and improves time-to-first-token (TTFT). Internal benchmarks show a ~50% performance increase in nixlbench and a ~20% TTFT reduction in sglang for these scenarios. (#982)
Linux AIO plugin for the POSIX backend: The POSIX backend now leverages the Linux Asynchronous I/O (AIO) API where available. This provides a high-performance, asynchronous interface for data transfers to and from local storage. Internal benchmarks show an increase in read throughput for read sizes above 100 kB. (#885)
Libfabric CUDA Memory Registration: The Libfabric backend can now directly register CUDA memory regions using fi_mr_regattr. This adds support for extended memory registration attributes and optimized RDMA behavior. (#960)
Python: Added support for Python 3.13. (#994)

Breaking Changes

UCX Multi-Object Backend Removed: The legacy Multi-Object (UCX_MO) backend has been removed. Users should migrate to the primary UCX backend, which now incorporates multi-device support. (#898)
Libfabric Minimum Version Increased: The minimum required version of Libfabric has been raised to v1.21.0 to support new features. (#961)
Default Build Type is now Release: When building from source, the default build type is now Release instead of Debug. This ensures that default builds are optimized for performance. (#869)
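
For the Libfabric version bump, a pre-build check along the following lines may help catch an outdated installation early. This is a minimal sketch and not part of NIXL: the helper names parse_version and libfabric_is_supported are hypothetical, and how the installed version string is obtained (for example from the package manager or from the output of fi_info --version) is left out.

```python
def parse_version(text):
    """Turn 'v1.21.0' or '1.21.0' into a tuple of ints such as (1, 21, 0)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


# Minimum Libfabric version stated in these release notes.
MIN_LIBFABRIC = parse_version("v1.21.0")


def libfabric_is_supported(installed):
    """Compare numerically, so that '1.9.0' is correctly older than '1.21.0'."""
    return parse_version(installed) >= MIN_LIBFABRIC
```

Comparing integer tuples rather than raw strings avoids the common mistake of treating "1.9.0" as newer than "1.21.0".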

API Changes

[Rust] New RegDescList and XferDescList APIs have been added to the Rust interface for descriptor management. (#828)
[Python] Obsolete and unused code from the Python API has been removed. (#985)

Enhancements

[Build] The build system now searches for libraries in paths specified by the NIXL_PREFIX environment variable, making it easier to link against custom builds. (#998)
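
As an illustration of the NIXL_PREFIX lookup, the sketch below derives candidate library directories from the environment variable. It is an assumption-laden example, not the build system's actual logic: the function name candidate_lib_dirs is hypothetical, and the lib and lib64 subdirectories are assumed here because they are common install layouts.

```python
import os
from pathlib import Path


def candidate_lib_dirs(environ=None):
    """Return directories to search for libraries, based on NIXL_PREFIX.

    The 'lib' and 'lib64' subdirectories are an assumption for this sketch.
    """
    environ = os.environ if environ is None else environ
    prefix = environ.get("NIXL_PREFIX")
    if not prefix:
        # Variable unset or empty: no extra search paths.
        return []
    root = Path(prefix)
    return [root / "lib", root / "lib64"]
```

Passing a dictionary instead of reading os.environ directly keeps the helper easy to test without modifying the real environment.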

Bugfixes

[Core] Metadata exchanges over sockets have been improved with better error handling. (#999)
[Core] The communication queue used for metadata exchange over sockets is now fully flushed before the listener thread stops, preventing potential data loss during shutdown. (#830)
[Libfabric] Fixed multiple issues with metadata handling for partial loads and offset calculations that could lead to data corruption. (#969, #978)
[Bindings] Resolved a crash in Python examples that occurred when the NIXL_PLUGIN_DIR environment variable was not set. (#963)
[Rust] Fixed an issue that prevented Rust stubs from building correctly. (#1001)
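
The flush-before-stop behavior described for the socket communication queue (#830) follows a general pattern that can be sketched as below. This is not NIXL's implementation (the core is C++); the names listener, stop_listener, and demo are hypothetical and only illustrate draining a queue before a listener thread is allowed to exit.

```python
import queue
import threading

_STOP = object()  # sentinel telling the listener to exit


def listener(work_queue, delivered):
    """Consume items until the stop sentinel arrives."""
    while True:
        item = work_queue.get()
        try:
            if item is _STOP:
                return
            delivered.append(item)
        finally:
            work_queue.task_done()


def stop_listener(work_queue, thread):
    """Flush everything already queued, then stop the thread."""
    work_queue.join()      # wait until every queued item has been processed
    work_queue.put(_STOP)  # only now ask the listener to exit
    thread.join()


def demo():
    work_queue = queue.Queue()
    delivered = []
    thread = threading.Thread(target=listener, args=(work_queue, delivered))
    thread.start()
    for i in range(5):
        work_queue.put(f"metadata-{i}")
    stop_listener(work_queue, thread)
    return delivered
```

Stopping the thread with a flag that is checked before the queue is empty would risk dropping pending messages; waiting on the queue first avoids that.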

Benchmarks & Test Infrastructure

[nixlbench] Fixed a memory management bug related to cudaFree. (#965)
[nixlbench] The tool will now correctly exit with a failure code if an I/O vector consistency check fails. (#992)
[CI] Switched CI jobs from PyTorch-based images to cuda-dl-base images for better unification and consistency. (#924)
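
The exit-code behavior noted for nixlbench (#992) matters mainly for scripts and CI jobs that rely on the process status. A minimal sketch of the idea, with hypothetical function names and no relation to nixlbench's actual C++ code:

```python
def iov_consistent(expected, actual):
    """Return True when both I/O vectors hold identical buffers in the same order."""
    return len(expected) == len(actual) and all(
        e == a for e, a in zip(expected, actual)
    )


def run_check(expected, actual):
    """Map the check result to a process exit code: 0 on success, 1 on failure."""
    return 0 if iov_consistent(expected, actual) else 1
```

A wrapper script would typically finish with sys.exit(run_check(sent, received)) so that a failed check is visible to the calling shell or CI runner.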

Known Issues

[GPUNETIO] The GPUNETIO plugin is not available in CUDA 13 environments. This will be addressed in a future release.

Full Changelog: https://github.com/ai-dynamo/nixl/compare/0.7.1...0.8.0
