v25.10.00
This is a beta release of cuPyNumeric.
Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.10/.
Highlights
Added functionality
- Implement cupynumeric.in1d.
- Add DLPack import/export support to cuPyNumeric ndarrays.
- Allow batched input for cupynumeric.linalg.solve.
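As an illustration of the batched linalg.solve item, here is a minimal sketch. It assumes that cuPyNumeric follows NumPy's stacking convention, where the leading axis indexes independent systems; the release notes do not spell out the accepted shapes. The NumPy fallback is only there so the snippet runs on machines without cuPyNumeric, and the sample matrices are made up for the example.

```python
# Assumption: cuPyNumeric mirrors numpy.linalg.solve for stacked inputs.
# Fall back to NumPy so the sketch also runs where cuPyNumeric is not installed.
try:
    import cupynumeric as np
except ImportError:
    import numpy as np

# Three independent 2x2 systems stacked along the leading axis: shape (3, 2, 2).
a = np.array([
    [[2.0, 0.0], [0.0, 4.0]],
    [[1.0, 1.0], [0.0, 1.0]],
    [[3.0, 0.0], [1.0, 1.0]],
])

# One right-hand-side column per system: shape (3, 2, 1).
b = np.array([
    [[2.0], [4.0]],
    [[3.0], [1.0]],
    [[3.0], [2.0]],
])

# A single call solves every system in the batch.
x = np.linalg.solve(a, b)

# Check the result by substituting back: a @ x should reproduce b.
residual = np.matmul(a, x) - b
max_residual = float(abs(residual).max())
print(x.shape, max_residual)
```

Stacking the systems into one array lets a single call replace a Python loop over individual solves, which is usually what makes the batched form worthwhile.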
Performance improvements
- Optimized implementation for the special axis= case of cupynumeric.take.
- Improve heuristics for choosing between batched and unbatched matrix multiplication.
- Improved implementation of cupynumeric.nonzero that uses no additional scratch space.
- Identify special cases of advanced indexing that can be executed faster using cupynumeric.einsum.
Documentation / profiling
- Add a tutorial on using Legate Tasks to extend cuPyNumeric.
- Add a user warning when an operation (e.g. printing to the console) causes a sharded array to be gathered onto a single memory.
- Add sub-boxes to the Legate profiler, showing how long the Python interpreter spends inside cuPyNumeric API calls.
Breaking changes
- Move nightly conda packages to a dedicated channel, -c legate-nightly.
Known issues
- We are aware of hangs occurring on certain platforms and UCC configurations when using cuSolverMp-backed multi-GPU operations (Cholesky factorization and linear solve). We expect these to be fixed by the 25.11 release, which updates to cuSolverMp 0.7.
Full Changelog: https://github.com/nv-legate/cupynumeric/compare/v25.08.00...v25.10.00