v0.9.3: Llama4, Gemma3, Qwen3, InternVL3, Qwen2.5-Omni
We will attend the AWS Summit Shanghai 2025 on June 20th! See you in Shanghai 🎉
- Event info: https://aws.amazon.com/cn/events/summits/shanghai/
New features
- 🔥 InternVL2.5/InternVL3 model by @Kuangdd01 in #7258
- 🔥 Qwen2.5-Omni model by @Kuangdd01 in #7537
- 🔥 Llama 4 and Gemma 3 multimodal models by @hiyouga in #7273 and #7611
- 🔥 Official GPU docker image by @yzoaim in #8181
- 🔥 SGLang inference by @Qiaolin-Yu and @jhinpan in #7278
- GLM-4-0414 and GLM-Z1 model by @zRzRzRzRzRzRzR in #7695
- Kimi-VL model by @Kuangdd01 in #7719
- Qwen3 model by @hiyouga in #7885
- MiMo and MiMo-VL model by @Kuangdd01 in #7946 #8249
- SmolLM/SmolLM2 model by @akshatsehgal in #8050 #8220
- MiniCPM4 model by @LDLINGLINGLING in #8314
- Mistral-Small-3.1 model by @Kuangdd01 in #8335
- Add `scripts/eval_bleu_rouge.py` by @SnowFox4004 in #7419
- Add Muon optimizer by @tianshijing in #7749
- Support video/audio inference with vLLM by @hiyouga in #7566
- Support S3/GCS cloud data by @erictang000 in #7567
- Support vLLM-ascend by @leo-pony in #7739
- Support OmegaConf by @hiyouga in #7793
- Support early-stopping by @hiyouga in #7797
- Add `enable_thinking` argument for reasoning models by @hiyouga in #7928
- PyTorch-elastic and fault-tolerant launch by @hubutui in #8286
- Length Desensitization DPO (LD-DPO) by @amangup in #8362
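Several of the new options above are plain training arguments, so they slot into an ordinary LLaMA-Factory YAML config. A minimal sketch is below; the argument names `enable_thinking` and `use_muon` are taken from the PR titles in this release and the other keys are standard, but treat the exact spellings as assumptions and check the documentation before use:

```yaml
# Hypothetical SFT config sketch; verify argument names against the docs.
model_name_or_path: Qwen/Qwen3-8B
stage: sft
do_train: true
finetuning_type: lora
dataset: identity
template: qwen3
enable_thinking: false   # from #7928: toggle reasoning-mode chat template
use_muon: true           # from #7749: Muon optimizer (assumed flag name)
output_dir: saves/qwen3-8b-lora
per_device_train_batch_size: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
```

With the OmegaConf support from #7793, individual values can presumably also be overridden from the command line in dotlist style, e.g. `llamafactory-cli train config.yaml learning_rate=5e-5` (assumed syntax).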
New models
- Base models
  - SmolLM/SmolLM2 (135M/360M/1.7B) 🆕
  - Qwen3 Base (0.6B/1.7B/4B/8B/14B/30B) 🆕
  - Gemma 3 (1B/4B/12B/27B) 🆕🖼️
  - MedGemma (4B) 🆕🩺
  - MiMo Base (7B) 🆕
  - Seed-Coder Base (8B) 🆕⌨️
  - Mistral-Small-3.1 Base (24B) 🆕🖼️
  - GLM-4-0414 Base (32B) 🆕
  - Llama 4 (109B/402B) 🆕🖼️
- Instruct/Chat models
  - SmolLM/SmolLM2 Instruct (135M/360M/1.7B) 🆕🤖
  - MiniCPM4 (0.5B/8B) 🆕🤖
  - Qwen3 (0.6B/1.7B/4B/8B/14B/30B/32B/235B) 🆕🤖🧠
  - Gemma 3 Instruct (1B/4B/12B/27B) 🆕🤖🖼️
  - InternVL2.5/3 Instruct/MPO (1B/2B/8B/14B/38B/78B) 🆕🤖🖼️
  - Qwen2.5-Omni (3B/7B) 🆕🤖🖼️🔊
  - MedGemma Instruct (4B/27B) 🆕🤖🩺
  - MiMo SFT/RL (7B) 🆕🤖
  - MiMo-VL SFT/RL (7B) 🆕🤖🖼️
  - Hunyuan Instruct (7B) 🆕🤖
  - Seed-Coder Instruct/Reasoning (8B) 🆕🤖🧠⌨️
  - GLM-4-0414/GLM-Z1 Instruct (9B/32B) 🆕🤖🧠
  - DeepSeek-R1-0528 (8B/671B) 🆕🤖🧠
  - Kimi-VL Instruct/Thinking (17B) 🆕🤖🧠🖼️
  - Mistral-Small-3.1 Instruct (24B) 🆕🤖🖼️
  - Qwen2.5-VL Instruct (32B) 🆕🤖🖼️
  - Llama 4 Instruct (109B/402B) 🆕🤖🖼️
New datasets
- Preference datasets
  - COIG-P (zh) 🆕
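Preference datasets such as COIG-P are registered in `data/dataset_info.json` with `"ranking": true` so they can be used for DPO-style training. A minimal sketch of such an entry follows; the hub id and column names are illustrative, so check the dataset card and the data README for the exact values:

```json
{
  "coig_p_demo": {
    "hf_hub_url": "m-a-p/COIG-P",
    "ranking": true,
    "columns": {
      "prompt": "prompt",
      "chosen": "chosen",
      "rejected": "rejected"
    }
  }
}
```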
Bug fixes
- Fix add new tokens by @flashJd in #7253
- Fix ultrachat_200k dataset by @felladrin in #7259
- Add efficient 4D attention mask for neat packing by @BlackWingedKing in #7272
- Fix WSD lr scheduler by @x22x22 in #7304
- Fix position ids in neat packing by @BlackWingedKing in #7318
- Fix proxy setting in webui by @taoharry in #7332
- Improve entrypoint by @ENg-122 in #7345
- Fix ray destroy process group by @erictang000 in #7395
- Fix SGLang dependencies by @guoquan in #7432
- Upgrade docker package version by @rumichi2210 in #7442
- Update liger kernel for qwen2.5-vl by @xiaosu-zhu in #7453
- Fix lora on quant models by @GuoCoder in #7456
- Enable liger kernel for gemma3 by @kennylam777 in #7462
- Enable liger kernel for paligemma by @eljandoubi in #7466
- Add Swanlab lark notification by @Xu-pixel in #7481
- Fix gemma3 use cache attribute by @ysjprojects in #7500
- Fix pixtral plugin by @Kuangdd01 in #7505
- Fix KTO mismatch pair strategy by @himalalps in #7509
- Support `dataset_shards` by @aliencaocao in #7530
- Fix qwen2.5omni plugin by @Kuangdd01 in #7573 #7578 #7883
- Fix ppo trainer by @gechengze in #7576
- Fix workflow by @Shawn-Tao in #7635
- Support qwen2.5omni audio+video2text by @Kuangdd01 in #7638
- Upgrade deps for SGLang by @adarshxs in #7639
- Allow ray env setting by @erictang000 in #7647
- Fix CUDA warning on intel xpus by @jilongW in #7655
- Fix liger kernel patch by @danny980521 in #7660
- Fix rocm dockerfile by @fluidnumerics-joe in #7725
- Fix qwen2vl with neat packing by @GeoffreyChen777 in #7754
- Fix a constant by @AlphaBladez in #7765
- Fix autogptq for Gemma by @ddddng in #7786
- Fix internvl models by @Kuangdd01 in #7801 #7803 #7817 #8129
- Fix DeepSpeed ZeRO3 on moe models by @hiyouga in #7826 #7879
- Fix gradient checkpoint func for vit by @hiyouga in #7830
- Support S3 ray storage by @erictang000 in #7854
- Fix Kimi-VL attention by @Kuangdd01 in #7867
- Fix minicpm-o vllm inference by @hiyouga in #7870
- Unfreeze multimodal projector in freeze training by @zhaop-l in #7872
- Fix Qwen2.5-omni plugin by @hiyouga in #7875 #7962
- Add warp support link by @ericdachen in #7887
- Replace eos token for base model by @hiyouga in #7911
- Add `eval_on_each_dataset` arg by @hiyouga in #7912
- Fix qwen3 loss by @hiyouga in #7923 #8109
Full Changelog: https://github.com/hiyouga/LLaMA-Factory/compare/v0.9.2...v0.9.3