MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba. Full multimodal LLM Android App: [MNN-LLM-Android](./apps/Android/MnnLlmChat/README.md). MNN TaoAvatar Android (Local 3D Avatar Intelligence): apps/Android/Mnn3dAvatar/README.md
MNN 3.4.0 Release Notes
Release Date: February 2026
📌 Overview
MNN 3.4.0 focuses on three core themes: deepening GPU/QNN backend capabilities, optimizing attention computation and long-context memory usage, and improving GPU runtime stability:
GPU/QNN Capability Enhancement: the Vulkan backend adds LLM inference support with CoopMat matrix acceleration; the Metal backend supports TensorAPI and Flash Attention; the QNN backend extends support to the Qwen3 series and VL models, with direct Python export and OmniQuant quantization.
Attention & Long-Context Memory Optimization: the CPU and Metal backends fully support Flash Attention; the CPU supports KV Cache quantization; Prefix KV Cache support is added; a new unified attention_mode configuration option significantly reduces memory usage in long-context scenarios.
GPU Runtime Stability: added an iOS background-detection mechanism; when the app moves to the background the system rejects GPU computation, and MNN now correctly returns an error code; fixed multiple GPU backend stability issues.
🚀 Highlights
Vulkan LLM Support: the Vulkan backend now supports LLM inference, extending coverage to more Android devices
Vulkan CoopMat Acceleration: Vulkan supports Cooperative Matrix (CoopMat) instructions for significantly faster matrix multiplication
Metal TensorAPI Support: the Metal backend supports TensorAPI, with a significant performance boost on M5 chips
Metal Flash Attention: the Metal backend implements Flash Attention, significantly reducing memory usage
CPU Flash Attention: the CPU backend supports Flash Attention, with a new unified attention_mode option
CPU KV Cache Quantization: CPU Attention supports KV Cache quantization to reduce memory usage
Prefix KV Cache: support for Prefix KV Cache, improving long-context inference efficiency
Speculative Decoding Enhancement: Metal supports Eagle3 and Lookahead speculative decoding; Eagle3 weights for the Qwen3 series are provided
QNN Enhancement: the QNN backend supports the Qwen3 series and VL models, direct Python export, and OmniQuant quantization
Mixed Precision Quantization: llmexport supports mixed-precision quantization via a config file
llmexport Refactoring: refactored model-export logic with improved model abstraction
Loop Op GPU Optimization: OpenCL and Metal optimize the Loop operator with a pure GPU implementation, removing the CPU fallback
RISC-V Vector (RVV) Optimization: comprehensive intrinsic optimization across core operations
Vulkan Buffer FP16: full FP16 compute path for Vulkan buffer mode
KleidiAI Integration: added a compile option, enabled by default, for KleidiAI fp32 depth-wise kernels
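Flash Attention appears in several of the highlights above. As a rough illustration of why it cuts long-context memory, here is a minimal NumPy sketch of the tiled online-softmax idea; the function name, tile size, and layout are assumptions for illustration, not MNN's actual kernel:

```python
# Sketch of the Flash Attention idea: compute softmax(Q K^T / sqrt(d)) V in
# key/value tiles, keeping only running per-row statistics instead of
# materializing the full seq_len x seq_len score matrix.
import numpy as np

def flash_attention(Q, K, V, tile=64):
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)          # running row maximum of the scores
    l = np.zeros(n)                  # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for start in range(0, K.shape[0], tile):
        Kt = K[start:start + tile]
        Vt = V[start:start + tile]
        S = (Q @ Kt.T) * scale       # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])
        correction = np.exp(m - m_new)   # rescale previous partial sums
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vt
        m = m_new
    return out / l[:, None]
```

Because each tile is folded into the running statistics `m` and `l`, the full score matrix is never held in memory, which is where the long-context savings come from.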
✨ New Features
LLM/VLM
Flash Attention: CPU and Metal backends support Flash Attention; Metal offers three selectable modes
KV Cache Quantization: CPU Attention supports 8-bit asymmetric quantization of Query/Key/Value
Prefix KV Cache: support for Prefix KV Cache, improving long-context inference efficiency
Speculative Decoding: Metal supports Eagle3 and Lookahead; Eagle3 weights for the Qwen3 series are provided
llmexport Refactoring: refactored model-export logic; supports mixed-precision quantization and weight-free export
New attention_mode Option: unified configuration of Attention behavior; deprecates quant_qkv
CPU: 0, 1, 2, 8, 9, 10 (default 8); controls Flash Attention and QKV quantization
GPU (Metal): 0, 8, 16 (default 8); controls the Flash Attention implementation
Added Fun-Audio-Chat-8B support; LoRA supports cloning LayerNorm; llm_bench supports JSON output
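The 8-bit asymmetric KV Cache quantization listed above can be sketched as a per-row scale/zero-point mapping into uint8. This is a hypothetical illustration of the general technique, not MNN's actual cache layout or kernel:

```python
# Asymmetric 8-bit quantization: each row of a K/V tensor is mapped to uint8
# with its own scale and zero point, roughly quartering cache memory vs fp32.
import numpy as np

def quantize_u8(x):
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 255.0
    scale = np.where(scale == 0, 1.0, scale)   # guard flat rows against /0
    q = np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)
    return q, scale, lo                         # lo acts as the zero point

def dequantize_u8(q, scale, zero):
    return q.astype(np.float32) * scale + zero
```

The asymmetric (zero-point) form matters here because attention keys and values are not centered on zero; a symmetric scheme would waste part of the 8-bit range.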
GPU/QNN Backends
Vulkan: added LLM inference support; CoopMat matrix acceleration; full FP16 buffer mode
Metal: TensorAPI support (performance boost on M5 chips); iOS background detection; Loop op optimization
OpenCL: pure GPU Loop operator implementation; mmap weight storage
QNN: supports the Qwen3 series and VL models; direct Python export; OmniQuant quantization
Tools & Apps
MNNConvert: added a dump pass and ConvPad fusion optimization
mnncli refactoring: added tests, a Linux build, and QNN library download
Added Supertonic TTS, Sana diffusion, and Sherpa-MNN TTS demo support
MNN Chat (Android): model-market optimization, multi-image input, debug tools
⚡ Performance Optimizations
CPU: LayerNorm/MatMul/binary-broadcast optimization; reduced ThreadPool overhead; Attention Softmax cache optimization
RISC-V Vector: comprehensive optimization of pack/unpack, transpose, math, convolution, Softmax, and CV functions
KleidiAI: added a compile option, enabled by default, for fp32 depth-wise kernels
🐛 Bug Fixes
Core: fixed a nullptr crash in axis(); added _SSE_MNNSoftmax to fix a crash when AVX2 is disabled
GPU: fixed an iOS LLM background-execution error; fixed multiple OpenCL/Vulkan/Metal computation and stability issues
LLM: fixed FastVLM/Eagle export issues; fixed a CPU Attention quantization overflow
Tools: fixed MNN2QNNModel, diffusion export, and mnncli issues
📚 Other Improvements
Documentation: added NPU export and Attention quantization parameter docs
CI: added a GitHub/internal code auto-sync script; SmolVLM test; LLM nightly test
Build: KleidiAI enabled by default; cross-platform RVV macro support; Linux mnncli build
🙏 Acknowledgements
We sincerely thank all external contributors for their valuable contributions to this release:
@ihb2032 - Comprehensive RISC-V Vector (RVV) optimization
@jxt1234 - Performance optimizations and bug fixes
@HenryDen - KleidiAI integration
@Juude - TTS/App optimization and mnncli refactoring
@vra - Supertonic TTS support
@zlaazlaa - Diffusion export fixes
@bolun365 - MNN library build script
@JunKnows - OpenCL/Vulkan fixes
@rainyl - CMake improvements
@LudovicoYIN - Quantization tool fixes
@codefuturedalao - Code optimizations
@EricMoin - Variable initialization fixes
@Edward-Elric233 - MinGW build support
@AliasJeff - Documentation and CLI improvements
@jules-ai - Typo corrections
@vvverily - App text corrections
📦 Breaking Changes
The quant_qkv option is deprecated; use attention_mode instead.
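As a hypothetical sketch of the migration (the attention_mode key and its CPU default of 8 come from these notes; any surrounding fields in a real config.json are omitted here), replace the deprecated key with the new one:

```json
{
    "attention_mode": 8
}
```

A config that still sets quant_qkv should drop that key and express the same intent through the attention_mode values listed under New Features.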
Full Changelog: https://github.com/alibaba/MNN/compare/3.3.0...3.4.0
This document was generated with assistance from Claude Opus 4.5.