mlx
MLX: An array framework for Apple silicon
v0.30.1
Highlights
- RDMA over Thunderbolt with the JACCL backend (macOS >= 26.2) (some numbers)
- JIT compilation for NAX kernels so they can be used in MLX Swift
- CUDA improvements
- Many improvements to SDPA (masking, T_q != T_kv)
- Faster quantize/dequantize
- QQMM to make use of faster tensor cores
- Fix in col reduce speeds up training
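The SDPA highlights above (array masks, support for T_q != T_kv) refer to MLX's fused attention kernels; the actual API is `mx.fast.scaled_dot_product_attention`. As a reference for the semantics only, here is a dependency-free sketch of scaled dot-product attention with a boolean array mask and differing query/key lengths. This is not MLX's implementation, and the function name `sdpa` is made up for illustration:

```python
import math

def sdpa(q, k, v, mask=None):
    """Scaled dot-product attention on plain nested lists.

    q: T_q x d queries, k/v: T_kv x d keys/values (T_q and T_kv
    may differ), mask: T_q x T_kv booleans, True = may attend.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i, qi in enumerate(q):
        # Scaled dot products of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        # Masked positions get -inf so softmax assigns them zero weight.
        if mask is not None:
            scores = [s if mask[i][j] else float("-inf")
                      for j, s in enumerate(scores)]
        # Numerically stable softmax over the key axis.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        # Weighted sum of values.
        out.append([sum(wj * vj[t] for wj, vj in zip(w, v))
                    for t in range(len(v[0]))])
    return out
```

With a mask row of `[True, False]`, the single query attends only to the first key, so the output row equals `v[0]`.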
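The faster quantize/dequantize highlight concerns MLX's affine group quantization, exposed as `mx.quantize` / `mx.dequantize`, which return per-group scales and biases. A minimal pure-Python sketch of that scheme on a flat list, for intuition only (the helper names here are hypothetical and bear no relation to MLX's optimized kernels):

```python
def quantize(w, group_size=4, bits=4):
    """Affine group quantization of a flat list of floats.

    Each group of `group_size` values is mapped to integers in
    [0, 2**bits - 1] using a per-group scale and bias (zero point).
    """
    levels = (1 << bits) - 1
    q, scales, biases = [], [], []
    for g in range(0, len(w), group_size):
        group = w[g:g + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # avoid div-by-zero for flat groups
        scales.append(scale)
        biases.append(lo)
        q.extend(round((x - lo) / scale) for x in group)
    return q, scales, biases

def dequantize(q, scales, biases, group_size=4):
    """Invert quantize(): x ≈ scale * q + bias per group."""
    out = []
    for g in range(0, len(q), group_size):
        s, b = scales[g // group_size], biases[g // group_size]
        out.extend(s * x + b for x in q[g:g + group_size])
    return out
```

Values that land exactly on quantization levels round-trip losslessly; in general the reconstruction error is bounded by half a step, `scale / 2`, per element.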
What's Changed
- patch + fix docs build by @awni in https://github.com/ml-explore/mlx/pull/2799
- Fix macos release target and linux arm release by @awni in https://github.com/ml-explore/mlx/pull/2802
- Fix cuda allocator copy condition by @awni in https://github.com/ml-explore/mlx/pull/2800
- [CUDA] Partly fix random for large sizes by @awni in https://github.com/ml-explore/mlx/pull/2798
- patch bump for future version by @awni in https://github.com/ml-explore/mlx/pull/2804
- Centralize NAX condition by @awni in https://github.com/ml-explore/mlx/pull/2811
- Tolerance for some ops tests on cuda by @awni in https://github.com/ml-explore/mlx/pull/2815
- Fix typo: refs/head/main => refs/heads/main by @zcbenz in https://github.com/ml-explore/mlx/pull/2818
- Add float64 Eig and complex64 SVD/Eig support (Fixes #2708) by @harsh-sutariya in https://github.com/ml-explore/mlx/pull/2737
- Fix mx.core.load type annotation by @CC-Yeh in https://github.com/ml-explore/mlx/pull/2819
- Force cudaGraphExec reinstantiation when clusters are used by @andportnoy in https://github.com/ml-explore/mlx/pull/2813
- Bump actions/checkout from 5 to 6 by @dependabot[bot] in https://github.com/ml-explore/mlx/pull/2828
- Fix mx.core.linspace type annotation by @CC-Yeh in https://github.com/ml-explore/mlx/pull/2820
- [CUDA] Exit on crash and more helpful errors by @awni in https://github.com/ml-explore/mlx/pull/2830
- [CUDA] Add debug env to save cuda graphs to dot files by @zcbenz in https://github.com/ml-explore/mlx/pull/2825
- [CUDA] Output of SDPA should have same layout with inputs by @zcbenz in https://github.com/ml-explore/mlx/pull/2826
- Merge build-cuda and build-linux actions by @zcbenz in https://github.com/ml-explore/mlx/pull/2783
- [CUDA] Support array mask in SDPA by @zcbenz in https://github.com/ml-explore/mlx/pull/2822
- [CUDA] Faster rms norm for small dimension by @awni in https://github.com/ml-explore/mlx/pull/2838
- Added clarification to apply_fn parameter of apply_to_modules by @yuchaoran2011 in https://github.com/ml-explore/mlx/pull/2831
- [CUDA] Use cuDNN attention when T_q != T_kv by @zcbenz in https://github.com/ml-explore/mlx/pull/2843
- [CUDA] Migrate conv code to new cuDNN APIs by @zcbenz in https://github.com/ml-explore/mlx/pull/2847
- Support more Numpy interfaces for masked_scatter by @CC-Yeh in https://github.com/ml-explore/mlx/pull/2832
- Use thread local capture mode by @awni in https://github.com/ml-explore/mlx/pull/2850
- Fix export scatters by @awni in https://github.com/ml-explore/mlx/pull/2852
- Reduce JVP by @awni in https://github.com/ml-explore/mlx/pull/2854
- Fix graph updating by @awni in https://github.com/ml-explore/mlx/pull/2857
- Fix init from double by @awni in https://github.com/ml-explore/mlx/pull/2861
- Update gumbel function signature parameters by @tianenchong in https://github.com/ml-explore/mlx/pull/2868
- Added support for pytree types that inherit from tuple and typing.namedtuple by @romanoneg in https://github.com/ml-explore/mlx/pull/2845
- Layer norm throws on dimension mismatch by @awni in https://github.com/ml-explore/mlx/pull/2870
- Fix compile copying by @awni in https://github.com/ml-explore/mlx/pull/2871
- Do a PyPi release for cuda on arm by @awni in https://github.com/ml-explore/mlx/pull/2866
- Add a 2-pass col reduce for CUDA by @angeloskath in https://github.com/ml-explore/mlx/pull/2863
- [CUDA] Faster general copy by @awni in https://github.com/ml-explore/mlx/pull/2873
- [CUDA] Release build for cuda 13 by @awni in https://github.com/ml-explore/mlx/pull/2872
- Make allocator::malloc throw on allocation failure by @zcbenz in https://github.com/ml-explore/mlx/pull/2874
- [Metal] No copy array init by @awni in https://github.com/ml-explore/mlx/pull/2875
- Try not to fail when there should be memory available by @awni in https://github.com/ml-explore/mlx/pull/2869
- [CUDA] Enable more graphs to be updatable by @awni in https://github.com/ml-explore/mlx/pull/2883
- Fix docs: replace nonexistent mx.random.randn with mx.random.normal by @Satyam12singh in https://github.com/ml-explore/mlx/pull/2890
- Allow events in sub graph to be updatable by @awni in https://github.com/ml-explore/mlx/pull/2886
- Bump minimum required Python version by @ngoldbaum in https://github.com/ml-explore/mlx/pull/2891
- Do not use simd neon intrinsics on x86 by @davidkoski in https://github.com/ml-explore/mlx/pull/2893
- Fix input buffer donation in compile by @CC-Yeh in https://github.com/ml-explore/mlx/pull/2897
- Update nanobind pin to most recent version by @ngoldbaum in https://github.com/ml-explore/mlx/pull/2896
- fp quantize by @nastya236 in https://github.com/ml-explore/mlx/pull/2892
- Fix grad in place updates by @awni in https://github.com/ml-explore/mlx/pull/2899
- [CUDA] Add host nodes to subgraph types for graph update by @awni in https://github.com/ml-explore/mlx/pull/2901
- Fix: possible heap-buffer-overflow in RandomBits::eval_cpu (follow for new ASAN CI tests) by @incertum in https://github.com/ml-explore/mlx/pull/2877
- Fix ccache getting disabled by @zcbenz in https://github.com/ml-explore/mlx/pull/2905
- Fix attention for large sizes by @awni in https://github.com/ml-explore/mlx/pull/2903
- No VJP for mask or sinks in attention by @awni in https://github.com/ml-explore/mlx/pull/2909
- Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in https://github.com/ml-explore/mlx/pull/2911
- Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in https://github.com/ml-explore/mlx/pull/2912
- Use CUDA runtime headers from local python package by @zcbenz in https://github.com/ml-explore/mlx/pull/2906
- DOC: Add compile state example by @Satyam12singh in https://github.com/ml-explore/mlx/pull/2910
- qqmm by @nastya236 in https://github.com/ml-explore/mlx/pull/2789
- Thunderbolt RDMA communications backend by @angeloskath in https://github.com/ml-explore/mlx/pull/2808
- Add JIT support for NAX kernels by @jagrit06 in https://github.com/ml-explore/mlx/pull/2916
- Fix warnings for the NAX build by @angeloskath in https://github.com/ml-explore/mlx/pull/2921
New Contributors
- @dependabot[bot] made their first contribution in https://github.com/ml-explore/mlx/pull/2828
- @yuchaoran2011 made their first contribution in https://github.com/ml-explore/mlx/pull/2831
- @tianenchong made their first contribution in https://github.com/ml-explore/mlx/pull/2868
- @romanoneg made their first contribution in https://github.com/ml-explore/mlx/pull/2845
- @Satyam12singh made their first contribution in https://github.com/ml-explore/mlx/pull/2890
- @ngoldbaum made their first contribution in https://github.com/ml-explore/mlx/pull/2891
Full Changelog: https://github.com/ml-explore/mlx/compare/v0.30.0...v0.30.1