NVIDIA Megatron Core 0.15.0
- Features
  - Performance
  - MoE
  - Model support
  - FSDP
    - Enable joint training of parallel modules (MR !3850)
  - Inference
  - Post-training
  - RL
  - Ease of use
- Bug fixes
  - Fix convergence bug in MXFP8 parameter gradient buffer reuse (MR !3999)
  - Fix loss mask cloning to prevent incorrect updates (MR !4164)
  - Fix metadata loss in checkpoints (MR !4182)
  - Fix FSDP grad accum fusion support (MR !4018)
  - Fix non-TE optimizer checkpoint issue (MR !3931)
  - Fix BERT virtual pipeline parallelism (MR !3993)
  - Fix gc.freeze() slowdown by adding gc.collect() on the last layer (MR !4003)
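The gc.freeze() fix above follows a common CPython pattern: run a full collection first, then freeze, so that surviving long-lived objects move into the permanent generation and are skipped by every later garbage-collection pass. A minimal sketch of that pattern, with a hypothetical model-setup stand-in (not Megatron code):

```python
import gc

def build_model_layers(num_layers):
    # Hypothetical stand-in for model construction; each layer
    # allocates many long-lived Python objects.
    return [{"layer": i, "weights": [0.0] * 1024} for i in range(num_layers)]

layers = build_model_layers(num_layers=8)

# Collect first so cyclic garbage created during setup is reclaimed,
# then freeze: surviving objects move to the permanent generation and
# are excluded from all future collections, so the collector no longer
# rescans the large, stable set of model objects on every pass.
gc.collect()
gc.freeze()

assert gc.get_freeze_count() > 0  # objects are now in the permanent generation
```

Freezing before collecting would pin dead setup-time garbage in the permanent generation; collecting first keeps the frozen set to genuinely live objects.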
- Known issues
- New Contributors
  - @marksverdhei made their first contribution in #1980
  - @Skylion007 made their first contribution in #2047
  - @azzhipa made their first contribution in 5db6704
  - @vicoooo26 made their first contribution in 5db6704
  - @A-transformer made their first contribution in e002b5c
  - @chaitanyadwivedii made their first contribution in 20b3954
- External Contributor Acknowledgements
  We'd like to thank all our external contributors whose work was merged in this release:
  - Fix ImportError and NameError in examples/run_simple_mcore_train_loop.py by @marksverdhei in #1980
  - Optimizer refactor: clean up public get_megatron_optimizer interface by @Skylion007 in #2047
  - Typo fixes from the community, with co-authors @vicoooo26, @azzhipa, and @A-transformer, in 5db6704 and e002b5c
  - Fix router input jitter dtype by @chaitanyadwivedii in 20b3954
Note: Some contributions came through internal MRs and are therefore listed by commit hash instead of PR number. Megatron Core is now GitHub-first, so all future PRs will be tested and merged publicly.