v25.08.00
This is a beta release of cuPyNumeric.
Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.08/.
New features
Added functionality
- Multi-node multi-GPU capable SVD, specialized for tall-skinny matrices
- cupynumeric.cross
- cupynumeric.insert
- cupynumeric.logspace
- cupynumeric.real_if_close
- cupynumeric.roots
- cupynumeric.ravel_multi_index
- cupynumeric.copyto
- cupynumeric.diagflat
- cupynumeric.delete
- cupynumeric.nan_to_num
- Support multi-axis reductions
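The additions above can be exercised with a short sketch like the one below. It assumes that the new functions follow the signatures of their NumPy counterparts and that multi-axis reductions accept a tuple for `axis`, as in NumPy; the NumPy fallback import is only there so the snippet runs on a machine without cuPyNumeric. Whether the SVD call takes the new multi-node tall-skinny path depends on the machine configuration and is not something this snippet verifies.

```python
try:
    import cupynumeric as np  # assumes cuPyNumeric 25.08 is installed
except ImportError:
    import numpy as np  # fallback so the sketch still runs without cuPyNumeric

# Reduced SVD of a tall-skinny matrix (many rows, few columns).
A = np.arange(12, dtype=float).reshape(6, 2)
u, s, vh = np.linalg.svd(A, full_matrices=False)
reconstructed = (u * s) @ vh  # should reproduce A up to rounding

# A few of the newly added functions, called with NumPy-style arguments.
c = np.cross(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
grid = np.logspace(0, 2, num=3)                     # powers of ten: 1, 10, 100
x_ins = np.insert(np.arange(5), 2, 99)              # insert 99 before index 2
x_del = np.delete(np.arange(5), 0)                  # drop the first element
cleaned = np.nan_to_num(np.array([1.0, np.nan]))    # NaN is replaced by 0.0
d = np.diagflat(np.array([1, 2]))                   # 2x2 matrix with 1, 2 on the diagonal

# Multi-axis reduction: sum over the first and last axes at once.
cube = np.arange(24).reshape(2, 3, 4)
total = cube.sum(axis=(0, 2))                       # one value per middle-axis index

print(c, grid, x_ins, x_del, cleaned, d, total, sep="\n")
```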
Performance Improvements
- Improve robustness & speed of cupynumeric.sort, by combining allocations where possible, and adding synchronization barriers around NCCL collectives.
- Remove some extraneous blocking that was only necessary to match the behavior of NumPy 1.x.
- Improve performance of NumPy fallback, in particular removing extraneous array copies, and adding special cases for quick fallback to functions such as cupynumeric.concatenate.
Miscellaneous
- Unify all environment variables that control cuPyNumeric's NumPy fallback heuristics, to a single one, CUPYNUMERIC_MAX_EAGER_VOLUME.
- Allow any available BLAS implementation to be used in a source build.
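As a rough illustration of the unified fallback setting, the sketch below sets CUPYNUMERIC_MAX_EAGER_VOLUME from Python before importing the library. These notes do not state the unit, the default, or when the variable is read, so the sketch assumes the value is an element-count threshold that is picked up at import time, and the number 4096 is purely hypothetical. The NumPy fallback import only keeps the snippet runnable without cuPyNumeric.

```python
import os

# Hypothetical threshold, chosen only for illustration; an existing setting
# in the environment is left untouched.
os.environ.setdefault("CUPYNUMERIC_MAX_EAGER_VOLUME", "4096")

# Assumption: the variable must be set before the library is imported.
try:
    import cupynumeric as np
except ImportError:
    import numpy as np  # fallback so the sketch still runs without cuPyNumeric

# A small array, the kind of workload the fallback heuristic is aimed at.
small = np.arange(16)
small_total = int(small.sum())
print(small_total)
```

Exporting the variable in the shell before launching the program is an equivalent approach and avoids any dependence on import order.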
Full Changelog: https://github.com/nv-legate/cupynumeric/compare/v25.07.00...v25.08.00