Prefix Caching: We have implemented Prefix Caching for PagedAttention (#1750). This significantly accelerates multi-turn conversations and RAG workflows by reusing KV cache for shared prompt prefixes.
Major model expansion: Added support for EmbeddingGemma, Qwen 3 Embedding, Gemma 3n, GLM-4, Granite Hybrid MoE, GLM-4 MoE, and GLM-4 MoE Lite.
Dynamic model loading: The server now supports loading and unloading models at runtime (#1828).
Performance: Added support for CUDA 13.0/13.1 (#1767) and introduced highly optimized fused kernels (GEMV, GLU) and blockwise FP8 kernels for significant speedups on NVIDIA GPUs.
candle 0.9.2: We have migrated to the official crates.io release of candle 0.9.2, stabilizing our backend dependencies!
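To illustrate the idea behind the Prefix Caching highlight above, here is a minimal sketch of block-level prefix reuse. It is not the mistral.rs implementation: `PrefixCache`, `chain_hash`, `allocate`, and the `BLOCK_SIZE` of 4 tokens are hypothetical names and values chosen for illustration, and the toy cache tracks only block IDs, with no eviction and no actual KV tensors. Each full block of tokens is hashed together with the hash of everything before it, so two prompts that share a prefix map to the same leading blocks.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical number of tokens per KV block (an assumption for this sketch).
const BLOCK_SIZE: usize = 4;

/// Toy prefix cache: maps a chained block hash to a physical block ID.
#[derive(Default)]
struct PrefixCache {
    blocks: HashMap<u64, usize>,
    next_block_id: usize,
}

/// Hash one full block together with the hash of the preceding blocks,
/// so equal hashes imply an equal prefix (ignoring hash collisions).
fn chain_hash(parent: u64, block: &[u32]) -> u64 {
    let mut hasher = DefaultHasher::new();
    parent.hash(&mut hasher);
    block.hash(&mut hasher);
    hasher.finish()
}

impl PrefixCache {
    /// Assigns a block ID to every full block of the prompt.
    /// Returns the block IDs and the number of tokens served from cached blocks.
    /// Trailing tokens that do not fill a block are ignored in this sketch.
    fn allocate(&mut self, tokens: &[u32]) -> (Vec<usize>, usize) {
        let mut ids = Vec::new();
        let mut reused_tokens = 0;
        let mut parent = 0u64;
        for block in tokens.chunks_exact(BLOCK_SIZE) {
            let key = chain_hash(parent, block);
            let hit = self.blocks.get(&key).copied();
            match hit {
                // Same block content after the same prefix: reuse it.
                Some(id) => {
                    ids.push(id);
                    reused_tokens += BLOCK_SIZE;
                }
                // First time this prefix is seen: register a fresh block.
                None => {
                    let id = self.next_block_id;
                    self.next_block_id += 1;
                    self.blocks.insert(key, id);
                    ids.push(id);
                }
            }
            parent = key;
        }
        (ids, reused_tokens)
    }
}
```

In a multi-turn chat, the second request repeats the first request's tokens and appends new ones. Calling `allocate` on both prompts with the same `PrefixCache` would report the shared leading blocks as reused, which is the work a real engine avoids recomputing during prefill.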
Improve automatic tool call by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1460
chore: Dockerfile.cuda-all configurable threads by @polarathene in https://github.com/EricLBuehler/mistral.rs/pull/1458
chore: Dockerfile.cuda-all - Merge RUN for apt-get install by @polarathene in https://github.com/EricLBuehler/mistral.rs/pull/1459
Add fallback definition for metal::isnan by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1463
chore: Dockerfile - Drop runtime rayon thread ENV by @polarathene in https://github.com/EricLBuehler/mistral.rs/pull/1465
Remove duplicate calls for api_dir_list by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1474
Fix transient pyo3 dep in mistralrs_mcp by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1478
Fix objc dep when non macos by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1480
Fix phi4 mini + nccl cache issue by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1481
Fix phi3.5 moe (#1447) by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1482
Support GLM4 model by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1437
Refactor distributed backend by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1484
Cap metal paged attn kv allocation by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1485
Better paged attn metal cap by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1486
Server core: consolidate and unify route handlers and API surface by @matthewhaynesonline in https://github.com/EricLBuehler/mistral.rs/pull/1423
Support qwen3 gguf by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1488
Make bos/eos token IDs optional by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1493
Remove python deps from CUDA dockerfiles by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1487
Handle noncontiguous v in naive_sdpa by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1499
Server Core: refactor Paged Attention configuration (breaking change) by @matthewhaynesonline in https://github.com/EricLBuehler/mistral.rs/pull/1500
Use StorageModePrivate for Metal PA kv cache by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1506
fix(stream): emit field in tool-call deltas for schema compliance by @Sbargaoui in https://github.com/EricLBuehler/mistral.rs/pull/1507
PagedAttention kv-cache quantization (F8E4M3) by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1400
New Contributors
@Sbargaoui made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1507
@GaetanLepage made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1510
@ammar-elsabe made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1535
@AlpineVibrations made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1545
@rubiktubik made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1548
@christer-eriksson made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1588
@ryanli made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1611
@sonicrules1234 made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1614
@emmanuel-ferdman made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1635
@semihpolat made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1572
@anchpop made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1675
@parasyte made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1662
@htiennv made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1691
@llamaha made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1712
@lmzuccarelli made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1742
@Cooksey99 made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1752
@vigsterkr made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1793
@krampenschiesser made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1818
Full Changelog: https://github.com/EricLBuehler/mistral.rs/compare/v0.6.0...v0.7.0
Validate model name in OpenAI API by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1509
Fix mcp import in doc string by @GaetanLepage in https://github.com/EricLBuehler/mistral.rs/pull/1510
Add multi-model support by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1512
Add stars label to readme by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1513
Handle base_model.model case in lora by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1514
Add thread_local! for engine-specific const/static by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1517
Fix MCP doc test by @GaetanLepage in https://github.com/EricLBuehler/mistral.rs/pull/1511
Allow disabling metal precompilation by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1518
Rust 1.88 clippy by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1522
Fix cuda warnings by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1526
Fix panic on error decoding tokens by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1527
Split Marlin and Paged Attention kernels for faster build by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1525
chore: update llguidance by @ammar-elsabe in https://github.com/EricLBuehler/mistral.rs/pull/1535
Add the SmolLM3 model by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1501
Add Gemma 3n support by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1519
Fix sequence length check by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1546
update candle version by @AlpineVibrations in https://github.com/EricLBuehler/mistral.rs/pull/1545
Add the capability to build for ios by @rubiktubik in https://github.com/EricLBuehler/mistral.rs/pull/1548
chore: Dockerfile - Remove redundant symlink creation + ENV by @polarathene in https://github.com/EricLBuehler/mistral.rs/pull/1504
Apply changes from Gemma 3n weight & config reupload by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1553
Faster multimodal merging for gemma3n by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1558
Improved Rust UQFF api by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1563
Uqff minor tweaks and optimizations by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1565
Update Candle backend, support new dtypes, CUDA 12.9 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1566
Fix nccl regression by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1569
Fix metal regression by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1575
Initial support for OpenAI Responses API by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1580
Sanitize returned server errors by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1581
Fix invalid HTML warning by @szepeviktor in https://github.com/EricLBuehler/mistral.rs/pull/1583
Make typos configuration stricter by @szepeviktor in https://github.com/EricLBuehler/mistral.rs/pull/1582
fix: use try_init when initialize tracing by @christer-eriksson in https://github.com/EricLBuehler/mistral.rs/pull/1588
Add blockwise fp8 quantize kernels by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1586
Support old gemma3n intermediate size config by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1594
Handle when there are invalid vision tower weights by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1595
Fix needless bool warning by @jncraton in https://github.com/EricLBuehler/mistral.rs/pull/1599
Use smaller model in streaming example by @jncraton in https://github.com/EricLBuehler/mistral.rs/pull/1598
Simplify streaming example output by @jncraton in https://github.com/EricLBuehler/mistral.rs/pull/1597
Add vector 1xK fp8 kernels by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1600
Target generic CPUs when building Python package by @jncraton in https://github.com/EricLBuehler/mistral.rs/pull/1596
Fix tests in CI by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1603
Add tiktoken -> Tokenizer conversion utilities by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1604
Reworked attention chunking by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1591
Fix cuda ring compilation by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1608
mistralrs-quant: Fix build when feature=+cuda,-ring. by @ryanli in https://github.com/EricLBuehler/mistral.rs/pull/1611
chore: Dockerfile - Add SHELL instruction by @polarathene in https://github.com/EricLBuehler/mistral.rs/pull/1612
Send the mcp server an initialization notification to let it know the client is done initializing plus finish off making all 3 supported mcp transports properly increment their request ids by @sonicrules1234 in https://github.com/EricLBuehler/mistral.rs/pull/1614
Add Claude Code GitHub Workflow by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1616
Add MXFP4 gather gemm support by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1615
Update ISQ Python example by @jncraton in https://github.com/EricLBuehler/mistral.rs/pull/1617
Rust 1.89 clippy by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1621
Support Qwen3 MoE GGUF model with fast MoE kernel by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1622
Disable CUDA event tracking, use cudarc 0.17 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1623
Enforce workspace msrv by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1631
Fix bench async crash by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1632
Bump tracing-subscriber from 0.3.19 to 0.3.20 by @dependabot[bot] in https://github.com/EricLBuehler/mistral.rs/pull/1633
Update MCP server guide source by @emmanuel-ferdman in https://github.com/EricLBuehler/mistral.rs/pull/1635
Batch of bugfixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1634
Add repetition_penalty to SamplingParams. by @ryanli in https://github.com/EricLBuehler/mistral.rs/pull/1638
Rust 1.90 clippy by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1639
Drain dummy run receiver to fix sender dropping by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1645
Handle cpu dtype requirements for conv by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1650
Audio processing functions have been added: normalize, apply_fade, an… by @semihpolat in https://github.com/EricLBuehler/mistral.rs/pull/1572
Handle case where gemma 3n q != (k=v) devices by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1653
Refactor/simplify paged attn modules by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1654
Fix engine busyloop by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1655
Include stubs in maturin source builds by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1656
Implement Qwen 3 VL! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1657
Refactor flash attn dispatch to handle CPU by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1671
Fix cpu flash attn mask case by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1672
Remove restriction on qwen vl batch size by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1673
Fix gemma3 example by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1674
feat: Remove ram limits for CPU by @anchpop in https://github.com/EricLBuehler/mistral.rs/pull/1675
Fix hang and performance drop with Metal by @parasyte in https://github.com/EricLBuehler/mistral.rs/pull/1662
Fix inverted logic bug in DRY sampler sequence breaker encoding by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1679
Fix panic in prompt token truncation logic by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1678
Slightly faster mistralrs-vision normalize path by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1681
Correctly handle tied embeddings in qwen3_vl config by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1682
Improved flexibility for Topology by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1683
Temporarily disable fast sampler by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1685
Support embedding models: EmbeddingGemma by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1684
Implement Qwen 3 Embedding by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1686
Fix apply chat template tool call case by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1689
Fixes for qwen 2.5 vl by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1690
Remove once_cell dependency from multiple crates for improved performance and consistency. by @htiennv in https://github.com/EricLBuehler/mistral.rs/pull/1691
Ensure paged attn is disabled for embedding models by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1693
Add example for batching in embedding models by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1694
Expose max seqlen api on Engine by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1695
Support cuda 13.0 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1697
Refactor search embedding to use EmbeddingGemma by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1698
Fix flash attn on cuda 13.0 build by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1704
Fix embedding inputs processor in search by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1705
Fix auto loader confusion by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1707
Fixes for qwen 2.5 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1708
Fix server hang with mcp by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1706
Cuda clippy fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1709
Docs: Clarify correct GPU arch for FlashAttention by @llamaha in https://github.com/EricLBuehler/mistral.rs/pull/1712
Add custom logging example by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1719
Respect silent setting in progress bar by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1720
Fix overcounting on nonmapped params in device mapping by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1721
Misc bugfixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1730
Support the IBM Granite Hybrid MoE models! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1731
Fix isnan metal build failure by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1732
Fix toplevel quantization config location by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1733
Switch candle source to HF main! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1722
Optimize FP8 KV Cache with dedicated scale_update kernel by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1651
Support Fused MoE for unquantized Qwen3 Models by @guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1723
Add metal kv scale update kernels by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1736
Update cudarc version 0.17 -> 0.18 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1745
Bucket and preempt paged attn sequences by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1746
Optimized cpu-side sampling routines by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1747
Support qwen 3 vl moe by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1748
Implement prefix caching for paged attn by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1750
Webchat various improvements, fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1751
Add cudnn ubi9 dockerfile for rhel based systems by @lmzuccarelli in https://github.com/EricLBuehler/mistral.rs/pull/1742
Fix FP8 handling for stacked format MoE experts by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1755
No longer getting cache issue from HF runs by @Cooksey99 in https://github.com/EricLBuehler/mistral.rs/pull/1752
Use mmap for more cases during loading by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1756
Add blockwise fp8 gemm and grouped gemm by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1759
Update candle dep by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1764
[Granite Hybrid] Fix Mamba softplus NaNs, GPU mismatches, paged_attn by @haricot in https://github.com/EricLBuehler/mistral.rs/pull/1761
Support cuda 13.1 (cudarc 0.18.2) by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1767
Fix cc 75 moe kernel compilation by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1768
Gemma3n audio encoder various fixes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1769
Support optimized AFQ on CUDA by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1770
Use candle 0.9.2-alpha.2 from crates.io! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1775
Fix candle dep, flash-attn doesnt build on alpha.2 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1779
Support GPT-OSS and harmony! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1760
Add specialized GEMV cuda kernels by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1788
Add fused GLU kernels by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1789
Add GB10 gpu runner by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1790
Support cuda integrated memory systems by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1791
Fix is_prefill case for MoEExperts by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1792
Update candle dep by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1795
Update reqwest 0.12 -> 0.13 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1796
Add utils.metal include to fused_glu.metal by @vigsterkr in https://github.com/EricLBuehler/mistral.rs/pull/1793
fix #1794 by @vigsterkr in https://github.com/EricLBuehler/mistral.rs/pull/1797
Add cuda moe gemv kernel by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1798
Integrate custom gguf indexed_moe kernels by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1800
Fix shared memory mismatch on v100 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1802
chore(deps): update candle revision by @vigsterkr in https://github.com/EricLBuehler/mistral.rs/pull/1801
More complete docs by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1803
Add builtin agentic loop by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1807
Automatic Reference Counting for PagedAttention Blocks by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1808
Fixes for x-lora models by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1812
Bump cudarc to 0.18.2, support cuda 13.1 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1813
Support cuda 13.1 in mistralrs-quant by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1814
Support Ministral 3 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1816
Fix UQFF generation on CUDA by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1817
Fix cublaslt device management on multi-gpu by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1819
Fix nccl vision model case by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1820
Hotfix: Remove fastmath from candle-kernels by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1821
Support the OpenResponses specification by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1822
Fix missing default for gemma 2 tie word embeddings by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1823
Attempt to improve metal paged attn stability by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1824
Fix stack buffer overflow in cuda topk_softmax by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1825
Metal normalization accumulate in f32 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1826
Support model loading/unloading by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1828
Clarify paged attn prefix caching/sequence prefix caching by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1829
Support the GLM 4 Flash model! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1830
Replace serde_yaml with serde_saphyr by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1832
Switch to cudarc 0.19, fix d2d cuda copies by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1833
Improve mistralrs-mcp API with connection/tool/lifecycle management, update MCP schema to by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1834
Update most deps to latest versions by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1835
reuses cuda devices instead of instantiating multiple main contexts by @krampenschiesser in https://github.com/EricLBuehler/mistral.rs/pull/1818
Add MLA attention decode kernels for DSv2/v3 and GLM4.7 Flash by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1837
Switch deepseek v2 and v3 to new expert implementation by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1838
Use crates.io candle 0.9.2! by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1839
Fix doc mistakes by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1840
Support auto pipeline for diffusion and speech models by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1841
Support aliasing for multi models by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1843
v0.7.0 preparation: mistralrs-cli, revamped documentation, readme update by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1848
Use feature flags to handle fp8 gating in CUDA by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1849
Handle xcode 26+ default metal standard by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1846
Handle missing metal toolchain in build script, troubleshooting docs by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1850
Support fp8 in llama 4 models by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1851
Simplify mistralrs rust sdk exposed items by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1852
Final changes for v0.7.0 by @EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1853