Blog post: https://huggingface.co/blog/EricB/mistralrs-v0-5-0
Thank you to all contributors for this release! This release includes the following highlights but also countless improvements, fixes, and optimizations.
Support for many more models:
Gemma 3
Qwen 2.5 VL
Mistral Small 3.1
Phi 4 Multimodal (image only)
Native tool calling support for:
Llama 3.1/3.2/3.3
Mistral Small 3
Mistral Nemo
Hermes 2 Pro
Hermes 3
Tensor Parallelism support (NCCL)!
FlashAttention V3 support and integration in PagedAttention
30x reduction in ISQ times on Metal!
Revamped prefix cacher system
What's Changed
Allow using library in CurrentThread runtime by @sgrebnov in https://github.com/EricLBuehler/mistral.rs/pull/1082
Improve accuracy of uqff auto device map by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1084
DeepSeekV3 sigmoid support by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1092
GPU-accelerated sampling (+5% decode perf) by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1094
Fix missing perceiver_config in qwen2vl by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1096
More topk methods for deepseek 2/3 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1097
More accurate layer size computation for deepseek 2/3 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1098
Improve streaming UX by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1102
Faster fp8 blockwise dequant by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1100
DS2/3 paged attn by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1103
Faster bincount by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1104
PagedAttention prompt chunking support by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1105
Refactor server SSE by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1107
PagedAttention + FlashAttention (and FlashAttention V3) by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1109
Take KEEP_ALIVE_INTERVAL into account by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1111
Refactor enable of flash attn by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1110
Fix imatrix isq quantize_onto by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1112
Tensor parallelism and pipeline parallelism by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1113
Bump openssl from 0.10.69 to 0.10.70 by @dependabot in https://github.com/EricLBuehler/mistral.rs/pull/1121
Allow chat streaming to use tools by @Jeadie in https://github.com/EricLBuehler/mistral.rs/pull/1088
New file format for imatrix: .cimatrix by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1004
Fix isq with bias for column parallel by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1128
Multi-node support for tensor parallelism by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1125
Add an NCCL feature flag by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1129
Fix mistral 2501 gguf by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1131
Add jinja strftime_now function by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1132
Multiple models multi node by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1136
Remove unexpected cp behavior by @jncraton in https://github.com/EricLBuehler/mistral.rs/pull/1141
Revamp speculative decoding! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1027
Fuse MLP mul-and-act by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1142
Short-circuit dry sampling: +6% T/s by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1143
Integrate fused MLP mul-act for more models! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1144
Use cudarc 0.13.5 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1145
Handle HF_HUB_CACHE env var by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1146
FlashAttention V2/V3 metadata with support for device location by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1148
FP8 blockwise dequant cuda kernel by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1149
Blockwise FP8 CUDA for cc < 800 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1150
Fix chat sampling response by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1154
Multiple processes for TP by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1152
Ensure we do not bind the port for daemon processes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1158
Handle CUDA_NVCC_FLAGS in flash attn v3 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1160
build fix for arm. by @jamesvren in https://github.com/EricLBuehler/mistral.rs/pull/1164
Working PrefixCacherV2! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1168
Implement Phi-4 Multimodal! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1163
No extra split/cat pair in rope by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1169
Remove gpu<>cpu sync for faster long-context by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1170
Refactor NCCL device mappers by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1172
Bump ring from 0.17.11 to 0.17.13 by @dependabot in https://github.com/EricLBuehler/mistral.rs/pull/1179
DSV3/R1 fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1173
Fix diffusion device mapping by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1187
Internal abstraction for distributed op by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1188
Make Sequence::set_toks more safe by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1190
Fix CI tests out of storage? by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1191
Internal abstraction for distributed op by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1189
Fix build_cuda_all.yaml CI by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1193
Support tensor parallelism for vision models! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1194
Always pass _USE_MATH_DEFINES for CUDA by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1195
Remove matmul via f16 framework by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1196
Remove API for matmul_via_f16 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1197
Add UQFF text/vision model API by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1198
Complete qwen2_5_vl, and some fixes by @brrr in https://github.com/EricLBuehler/mistral.rs/pull/1184
Implement Gemma 3 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1201
Add Gemma 3 vision support! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1202
Manually fixup sentencepiece detok by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1204
More vision models with TP by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1200
Fix topology link in the docs by @etiennebalit in https://github.com/EricLBuehler/mistral.rs/pull/1205
Gemma3 1b support and optimized rotating cache by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1206
Improve rotating kv cache, prefix cacher system by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1207
Better handling for kvcache set_len by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1208
Update deps and use rand 0.9 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1210
Update hf hub dep, add initial blockwise fp8 GEMM tests by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1212
Growable RotatingKvCache and fixes for Phi-4 mini by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1215
Gemma 3 cuda fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1217
Add pydantic schema examples! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1219
Sliding window attention fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1220
adapt to rig crate as client by @benliao in https://github.com/EricLBuehler/mistral.rs/pull/1214
Implement Mistral 3! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1221
Metal SDPA with masking by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1225
Send [DONE] SSE chunk per openai spec by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1226
Fix handling of device when compiled for but disabled nccl by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1227
Fix nccl blocking case by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1228
Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1229
OpenAI API compatability fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1230
[Breaking] Automatic server logging by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1231
Use default stream for flash attn by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1232
Bump version to 0.5.0 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1233
New Contributors
@sgrebnov made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1082
@jncraton made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1141
@jamesvren made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1164
@brrr made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1184
@etiennebalit made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1205
@benliao made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1214
Full Changelog: https://github.com/EricLBuehler/mistral.rs/compare/v0.4.0...v0.5.0