v0.30.4

## Highlights

- Metal: Much faster vector fused grouped-query attention for long context
- CUDA: Several improvements to speed up LLM inference for CUDA backend
- CUDA: Support for dense MoEs
- CUDA: Better support for consumer GPUs (4090, 5090, RTX 6000, ...)

## What's Changed

* patch bump for next release by @awni in https://github.com/ml-explore/mlx/pull/2991
* Fix fence by @awni in https://github.com/ml-explore/mlx/pull/2998
* Reverts changing the MLX_IBV_DEVICES to MLX_JACCL_DEVICES by @angeloskath in https://github.com/ml-explore/mlx/pull/2999
* fix distributed all_to_sharded bias shard axis from -2 to -1 by @gufengc in https://github.com/ml-explore/mlx/pull/2987
* Fix sharding of quantized models with non-power-of-2 bits by @kernelpool in https://github.com/ml-explore/mlx/pull/3006
* Update CCCL to v3.1.3 by @zcbenz in https://github.com/ml-explore/mlx/pull/3012
* Fix python package install path in stubgen by @zcbenz in https://github.com/ml-explore/mlx/pull/3009
* Type Enhancement for Func Transforms and Bug Fix by @XXXXRT666 in https://github.com/ml-explore/mlx/pull/3003
* Do not clear disk space in setup-linux by @zcbenz in https://github.com/ml-explore/mlx/pull/3013
* Do not give workflow boolean inputs default values by @zcbenz in https://github.com/ml-explore/mlx/pull/3014
* Fix negative dim indexing by @MillaFleurs in https://github.com/ml-explore/mlx/pull/2994
* Windows CI by @zcbenz in https://github.com/ml-explore/mlx/pull/3021
* Optimize erf function with expm1f in Metal backend by @bjornefisk in https://github.com/ml-explore/mlx/pull/3025
* [CUDA] Faster grouped mm by @zcbenz in https://github.com/ml-explore/mlx/pull/3011
* PR 3007 Fix Seg Fault by @MillaFleurs in https://github.com/ml-explore/mlx/pull/3008
* Use higher precision for linspace with double by @awni in https://github.com/ml-explore/mlx/pull/3029
* Handle data smaller than BUFFER_SIZE in jaccl recv by @rltakashige in https://github.com/ml-explore/mlx/pull/3033
* build 26.0 release in actions by @awni in https://github.com/ml-explore/mlx/pull/3035
* Remove xmlrunner from macOS CI by @zcbenz in https://github.com/ml-explore/mlx/pull/3032
* Columnwise quantize by @nastya236 in https://github.com/ml-explore/mlx/pull/2989
* Turn nccl_stub into a normal target by @zcbenz in https://github.com/ml-explore/mlx/pull/3037
* Use cuda::std for math ops by @zcbenz in https://github.com/ml-explore/mlx/pull/3041
* win: symbol exports and minor fixes by @dhiltgen in https://github.com/ml-explore/mlx/pull/3024
* CUDA gather mv by @angeloskath in https://github.com/ml-explore/mlx/pull/3039
* Link with prebuilt OpenBLAS and fix shared libs build on Windows by @zcbenz in https://github.com/ml-explore/mlx/pull/3036
* Allow take on empty array when it makes sense by @awni in https://github.com/ml-explore/mlx/pull/3046
* Add missing include to buffer_cache.h by @Anri-Lombard in https://github.com/ml-explore/mlx/pull/3053
* Build and test python package on Windows CI by @zcbenz in https://github.com/ml-explore/mlx/pull/3049
* Fix some MSVC compilation errors by @zcbenz in https://github.com/ml-explore/mlx/pull/3048
* Use C++20 by @zcbenz in https://github.com/ml-explore/mlx/pull/3050
* Faster two pass sdpa by @awni in https://github.com/ml-explore/mlx/pull/3023
* Find system-installed cuDNN on Windows by @zcbenz in https://github.com/ml-explore/mlx/pull/3052
* Fix some NVCC warnings when building CUDA backend with MSVC by @zcbenz in https://github.com/ml-explore/mlx/pull/3038
* Hide symbols by default for mac/linux by @zcbenz in https://github.com/ml-explore/mlx/pull/3057
* [CUDA] Fast sorting by @awni in https://github.com/ml-explore/mlx/pull/3060
* Fix flaky macOS test by @awni in https://github.com/ml-explore/mlx/pull/3063
* Update pre-commit hooks and versions for clang-format, black, and isort by @NripeshN in https://github.com/ml-explore/mlx/pull/3059
* GPU discovery by @dhiltgen in https://github.com/ml-explore/mlx/pull/3055
* Add NAX Split-K GEMM for large-K matmuls to improve performance by @hxu296 in https://github.com/ml-explore/mlx/pull/3018
* Improve CPU discovery by @dhiltgen in https://github.com/ml-explore/mlx/pull/3068
* Fix long cache file path on Windows by @zcbenz in https://github.com/ml-explore/mlx/pull/3065
* Better support consumer CUDA GPUs by @jessegross in https://github.com/ml-explore/mlx/pull/3056
* Delay load CUDA libs and resolve DLL paths at runtime by @zcbenz in https://github.com/ml-explore/mlx/pull/3061
* Do not require ConcurrentManagedAccess when not used by @zcbenz in https://github.com/ml-explore/mlx/pull/3062
* Fp qmv by @awni in https://github.com/ml-explore/mlx/pull/2984
* remove thrust by @awni in https://github.com/ml-explore/mlx/pull/3067

## New Contributors

* @gufengc made their first contribution in https://github.com/ml-explore/mlx/pull/2987
* @kernelpool made their first contribution in https://github.com/ml-explore/mlx/pull/3006
* @bjornefisk made their first contribution in https://github.com/ml-explore/mlx/pull/3025
* @rltakashige made their first contribution in https://github.com/ml-explore/mlx/pull/3033
* @dhiltgen made their first contribution in https://github.com/ml-explore/mlx/pull/3024
* @hxu296 made their first contribution in https://github.com/ml-explore/mlx/pull/3018
* @jessegross made their first contribution in https://github.com/ml-explore/mlx/pull/3056

**Full Changelog**: https://github.com/ml-explore/mlx/compare/v0.30.3...v0.30.4
