mistral.rs
Fast, flexible LLM inference
🔥 Highlights from v0.6.0
🚀 Major Features
- Llama 4 support and Qwen 3 / MoE / VL models, including DeepSeek and DeepCoder integrations
- Multimodal prefix caching, paged attention scheduler improvements, and faster Metal/CUDA backends
- Web chat app with chat history, file uploads, speech generation, and revamped tool-calling/search
- Fast sampler and CPU FlashAttention with improved performance and accuracy
- Metal and CUDA: major improvements in quantization (AFQ, ISQ), UQFF handling, and memory optimizations
- MCP (Model Context Protocol): new server endpoints, docs, and integrated client
- Vision and audio expansion: support for SIGLIP, Dia 1.6b TTS, conformer backbone (Phi-4MM), auto loaders, and vision tool prefixes
🧠 Inference Optimizations
- Lightning-fast AFQ on CPU, optimized Qwen 3 MoE on Metal, and paged attention fixes
- Unified FlashAttention backend and automatic method selection for ISQ
- Metal precompilation support and reduced autorelease thrashing
🧰 Dev Improvements
- Refactored engine architecture, KV cache, attention backends, and device mapping logic
- Centralized dependency management and cleaner internal abstractions
- Streamlined and faster LoRA support
🎉 Other
- Revamped README, AGENTS.md, and new benchmarking scripts
- Interactive mode now shows throughput, supports Gumbel sampling, and offers better runtime sampling controls
- Expanded quant and GGUF support: AWQ, Qwen3 GGUF, and prequantized MLX compatibility
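For readers unfamiliar with the Gumbel sampling mentioned above, the general Gumbel-max trick can be sketched as follows. This is an illustrative, standalone sketch of the textbook technique, not the code in mistral.rs: the `XorShift64` generator and the `gumbel_max_sample` function are hypothetical names introduced here, and a real project would more likely use the `rand` crate. The idea is to add independent Gumbel(0, 1) noise to each logit and take the argmax, which draws an index from the softmax distribution without computing the softmax.

```rust
/// Tiny xorshift64* generator so the sketch needs no external crates.
/// Hypothetical helper for illustration only; the seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample strictly inside (0, 1), so both logarithms below stay finite.
    fn next_open01(&mut self) -> f64 {
        // Top 52 bits plus half a step, divided by 2^52.
        ((self.next_u64() >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }
}

/// Gumbel-max trick: argmax over (logit_i + g_i) with g_i ~ Gumbel(0, 1).
/// Returns `None` for an empty slice.
fn gumbel_max_sample(logits: &[f64], rng: &mut XorShift64) -> Option<usize> {
    let mut best: Option<(usize, f64)> = None;
    for (i, &logit) in logits.iter().enumerate() {
        let u = rng.next_open01();
        let g = -(-u.ln()).ln(); // Gumbel(0, 1) noise from a uniform sample
        let score = logit + g;
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((i, score));
        }
    }
    best.map(|(i, _)| i)
}

fn main() {
    let logits = [2.0_f64, 1.0, 0.0];
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    let n = 100_000;

    // Count how often each index is drawn.
    let mut counts = [0usize; 3];
    for _ in 0..n {
        let idx = gumbel_max_sample(&logits, &mut rng).expect("non-empty logits");
        counts[idx] += 1;
    }

    // Compare empirical frequencies with the softmax probabilities.
    let z: f64 = logits.iter().map(|l| l.exp()).sum();
    for (i, &c) in counts.iter().enumerate() {
        let empirical = c as f64 / n as f64;
        let softmax = logits[i].exp() / z;
        println!("index {i}: empirical {empirical:.4}, softmax {softmax:.4}");
    }
}
```

To apply a temperature, divide each logit by it before calling the function. One reason this formulation is attractive for a fast sampler is that it reduces sampling to an elementwise add followed by an argmax.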
⸻
What's Changed
- Fix handling of Metal fused attn head dims by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1234
- Support paged attn for vision model rust api by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1235
- [Breaking] Support setting HF cache path by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1237
- Support tool calling for DeepSeek models by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1239
- Server image processing refactor by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1244
- Optimized CUDA RoPE kernels by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1247
- Typo fix (add_speial_tokens to add_special_tokens) by @edwko in https://github.com/EricLBuehler/mistral.rs/pull/1246
- Fixes for UQFF + distributed layers by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1250
- Automatic agentic search integration (web_search_options) by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1243
- Format kernels by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1251
- Add quantize guards for UQFF deserialize by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1252
- Refactor cuBLASlt-related code by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1253
- Update deps, bump pyo3 version by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1259
- Faster cuda FP8 performance by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1257
- Rust 1.86 clippy by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1260
- Refactor engine arch by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1262
- Revamped LoRA support - removing the Ordering system by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1263
- Fast Metal-specific quantization method: AFQ by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1264
- Support prequantized models from MLX by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1265
- Automatic ISQ to select fastest & most accurate method by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1266
- Improved usage metrics by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1267
- Bump tokio from 1.44.1 to 1.44.2 by @dependabot in https://github.com/EricLBuehler/mistral.rs/pull/1270
- Gather MM ops in mistralrs-quant by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1272
- Improve performance of deepseek models by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1274
- Implement Llama 4 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1268
New Contributors
- @edwko made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1246
- @beeender made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1306
- @Slowki made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1314
- @omahs made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1329
- @szepeviktor made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1331
- @matthewhaynesonline made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1353
- @sempervictus made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1419
Full Changelog: https://github.com/EricLBuehler/mistral.rs/compare/v0.5.0...v0.6.0
- Fixes for Llama 4 UQFF loading by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1275
- Support sharding for UQFF by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1276
- Fix bug for group-topk (group_limited_greedy) in deepseek models by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1278
- Support the DeepCoder model by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1279
- Improved PagedAttn scheduling accuracy by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1282
- Fixes for scheduling image seqs with pagedattn by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1283
- update to llguidance 0.7.16 by @mmoskal in https://github.com/EricLBuehler/mistral.rs/pull/1284
- Update dependencies by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1286
- Much faster image inputs processing by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1289
- Add more SDPA head dims for much faster SIGLIP by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1290
- Show throughput in interactive mode by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1291
- Unify bitwise operations by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1288
- Multimodal prefix caching support! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1298
- Interactive mode improvements by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1299
- Add the Qwen 3 and Qwen 3 MoE models by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1285
- Revamped and streaming web search support by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1301
- Handle vision messages or different tool call prefixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1302
- Simplify prefix cacher by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1305
- Use rustyline to handle non-ascii in interactive mode by @beeender in https://github.com/EricLBuehler/mistral.rs/pull/1306
- Add more tools for automatic search by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1307
- Fix CPU hogging in interactive mode by @beeender in https://github.com/EricLBuehler/mistral.rs/pull/1309
- Add Metal precompilation support by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1311
- Reduce thrashing of Metal autorelease by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1313
- make AdapterPaths and LoraAdapterPaths public by @Slowki in https://github.com/EricLBuehler/mistral.rs/pull/1314
- Refactor KV cache manager by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1315
- Add Audio and Speech model categories by @Slowki in https://github.com/EricLBuehler/mistral.rs/pull/1317
- Remove has_conv2d from vision model API by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1318
- Unified/automatic flash attention enabler by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1319
- Fix cublaslt 4d mask by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1320
- Qwen VL models fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1322
- Fixes for all vision models by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1323
- Improved+faster LRU prefix cacher and sampler! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1321
- Inplace ISQ support and default to mmap by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1277
- Fix typos by @omahs in https://github.com/EricLBuehler/mistral.rs/pull/1329
- Fix Idefics 3 arch chat templating by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1330
- Remove two spaces from PR comment by @szepeviktor in https://github.com/EricLBuehler/mistral.rs/pull/1331
- Add automatic vision loader type by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1332
- Add the Dia 1.6b TTS model! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1304
- update llguidance to 0.7.20 by @Slowki in https://github.com/EricLBuehler/mistral.rs/pull/1334
- Add model category <> messages check by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1335
- Improve normalization integration test by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1340
- Fix streaming example print statement by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1339
- Fix normalization formula in comment by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1338
- Fix image_to_pixels for non-RGB images by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1337
- Fix typo in expect messages by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1342
- Don't use mmap on cuda by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1336
- Support AWQ format models by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1350
- Fix uqff dummy layer ISQ application by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1351
- Disable immediate isq if write_uqff by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1352
- Fixes for cuda UQFF by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1354
- Refactor Option references for model paths by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1347
- Add a script for server benchmarking by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1355
- Optimized Metal qmv_fast path by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1356
- Fast sampler by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1327
- Fix metal parallel sampling by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1357
- Add immediate isq predicates for qwen3 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1358
- Regressions fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1359
- Revamped and smaller readme by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1360
- Add a web chat app by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1362
- Add chat history support to web chat app by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1363
- Refactor web chat, fix multichat image restore by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1364
- Fix repeated immediate isq init by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1365
- Fix missing vision weights in Mistral3 UQFF by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1366
- Rolling shard creation for uqff files by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1367
- Fix unstability during isq of afq by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1368
- Support web chat file uploading by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1370
- Add speech generation support to the web chat! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1373
- Prefix caching for PagedAttention by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1369
- Metal PagedAttention accuracy improvements by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1374
- Handle images in paged attn scheduler by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1375
- Include schemas needed for chatcompletions endpoint by @matthewhaynesonline in https://github.com/EricLBuehler/mistral.rs/pull/1353
- Fix case where prefix cacher returns no toks by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1377
- Faster UQFF serialization by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1379
- Experimental AFQ on CPU support by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1380
- Add CPU flash attention by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1382
- Refactor attention backends by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1384
- Set MacOS thread affinity for cpu attn by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1385
- Faster Qwen 3 MoE support on Metal by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1387
- Fix PagedAttention block leaks by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1388
- Fix cuda build again by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1389
- Bump version to 0.6.0 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1390
- Fewer .contiguous calls for qwen3 moe by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1391
- Allow speech models to accept batched inputs by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1393
- Ring distributed backend for Metal by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1238
- Add auto loader for vision/text detection by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1402
- Proposal: Create Mistral.rs Server Core Lib by @matthewhaynesonline in https://github.com/EricLBuehler/mistral.rs/pull/1346
- Support linear rope for llama3 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1408
- Fix vllama4 uqff loading by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1409
- Handle receiver disconnects by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1410
- Fix Qwen3 MoE device mapping irregularities by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1411
- Fix interactive mode URL parsing by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1412
- Refactor auto device map by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1413
- Enable runtime sampling tweaks in interactive mode by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1414
- Gumbel sampling for fast sampler by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1416
- Improved CPU flash attention accuracy & performance by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1417
- Provide chat_templates to container users by @sempervictus in https://github.com/EricLBuehler/mistral.rs/pull/1419
- Faster cpu flash attn by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1418
- Web search improvements (bm25, web chat) by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1420
- Propely handle consecutive searches by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1421
- Update docs by @matthewhaynesonline in https://github.com/EricLBuehler/mistral.rs/pull/1422
- Better tool call detection logic by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1424
- Add web search hook callbacks by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1426
- Fix CUDA context switching, bind thread on CudaStorage drop by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1428
- Conditionally build seqlens tensors by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1429
- Add AGENTS.md by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1430
- Support QWen3 GGUF model by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1432
- Improved paged attn prefix caching by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1434
- Temporary fix for qwen3 gguf tokenizer by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1433
- Add tool callback support by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1427
- centralize crate dependencies by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1438
- Fix bug in tokenizer created with gguf metadata by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1440
- Update deps by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1441
- Doc fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1442
- Downgrade rustyline 16.0.0 -> 15.0.0 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1444
- Support max_completion_tokens alias by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1451
- Add the conformer backbone (phi4mm audio) by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1448
- Fix offline cache issue for gguf models by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1452
- Add MCP server endpoints by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1453
- MCP documentation pass by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1455
- Integrate an MCP client by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1456