v0.5.0: agentic RL rollout, prototypes for disaggregated async training & GenerativeRM, better rollout load balance & improved sglang+megatron/vlm support
Highlights
Agentic RL rollout interface [beta]
verl v0.5 introduces the AgentLoop abstraction that allows easy extension to custom rollout with tool/agent interactions. Server-based asynchronous rollout is adopted to efficiently utilize GPUs. verl provides a few example agent loop implementations including:
- Multi-turn conversations and tool calls
- LangGraph-based Agent
Please check the documentation for the system architecture design.
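To make the idea of a multi-turn agent loop with tool calls concrete, here is a minimal, self-contained sketch. It does not use verl's actual AgentLoop interface: `Trajectory`, `toy_policy`, `TOOLS`, and `run_agent_loop` are all hypothetical names, the "model" is a stub function, and masking tool output with 0 in `response_mask` is an assumption about how such trajectories are typically prepared for training.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    tokens: List[str] = field(default_factory=list)
    # 1 = token produced by the model, 0 = token produced by a tool (assumed convention)
    response_mask: List[int] = field(default_factory=list)


def toy_policy(context: List[str]) -> List[str]:
    """Stand-in for a call to an inference server: request a tool once, then answer."""
    if "<obs>" not in context:
        return ["<call>", "add", "2", "3"]
    return ["answer:", context[context.index("<obs>") + 1]]


# Hypothetical tool registry: tool name -> callable taking string arguments.
TOOLS: Dict[str, Callable[..., str]] = {"add": lambda a, b: str(int(a) + int(b))}


def run_agent_loop(prompt, policy, tools, max_turns=4) -> Trajectory:
    traj = Trajectory()
    context = list(prompt)
    for _ in range(max_turns):
        out = policy(context)                 # model turn
        traj.tokens += out
        traj.response_mask += [1] * len(out)
        context += out
        if out[0] != "<call>":                # no tool requested: conversation ends
            break
        name, *args = out[1:]
        obs = ["<obs>", tools[name](*args)]   # tool turn
        traj.tokens += obs
        traj.response_mask += [0] * len(obs)
        context += obs
    return traj


traj = run_agent_loop(["what", "is", "2+3?"], toy_policy, TOOLS)
print(traj.tokens)
print(traj.response_mask)
```

In a server-based setup, the `policy` call would be an asynchronous request to a rollout server, which lets many such loops run concurrently while individual loops wait on tools.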
Disaggregated placement & async training [prototype]
verl v0.5 includes a community-contributed one-step-off async training recipe, with trainer and rollout deployed on disaggregated resources and off-policy model updates with staleness = 1. In a small-scale experiment, the reference recipe provides a 20-40% throughput gain over the on-policy baseline, depending on the configuration. Please check out the code and documentation for example configurations.
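The following sketch illustrates what "staleness = 1" means for a one-step-off schedule. It is a sequential simulation of the bookkeeping only, not the recipe's implementation: the function name `one_step_off_schedule` and the notion of an integer weight version are assumptions made for illustration. The rollout for the next batch is launched with the weights available before the current update, so from the second step onward every batch is trained on one version after it was generated.

```python
def one_step_off_schedule(num_steps):
    """Return (train_step, rollout_weight_version, trainer_weight_version, staleness)."""
    log = []
    version = 0  # current trainer weight version
    # Warm-up: the first batch is generated with the initial weights.
    pending_gen_version = version
    for step in range(num_steps):
        # The next rollout starts now, using the weights that exist *before* this update.
        next_gen_version = version
        # Train on the batch generated earlier; in a real system this overlaps
        # in time with the rollout launched on the line above.
        staleness = version - pending_gen_version
        log.append((step, pending_gen_version, version, staleness))
        version += 1  # the update produces new weights
        pending_gen_version = next_gen_version
    return log


for row in one_step_off_schedule(4):
    print(row)
```

The first step is on-policy (staleness 0) because of the warm-up rollout; every later step has staleness 1, which is where the overlap between generation and training comes from.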
Remote generative reward models [prototype]
A recipe is provided as a prototype to demonstrate the recommended way to use generative reward models in verl. See the documentation and code.
New features
- LoRA RL support for VLMs: https://github.com/volcengine/verl/pull/2182
- Better checkpoint manager support for SFT trainer https://github.com/volcengine/verl/pull/2292/
- Support for rollout trajectory tracing and RolloutViewer, improving debuggability and visualization
- Megatron with mbridge integration, which better supports hf model loading into megatron https://github.com/volcengine/verl/pull/2064
Important fixes & improvements
- Fixed an issue with FSDP2 state_dict memory usage caused by torch 2.6. Using either verl v0.5 or torch 2.7 avoids OOMs https://github.com/volcengine/verl/pull/2606
- Significantly reduced the overhead of the vllm async server (vs. the vllm engine) https://github.com/volcengine/verl/pull/2246/
- Fixed sglang + Megatron TP16 https://github.com/volcengine/verl/pull/2336
- Improved SGLang + Megatron weight resharding by 10x https://github.com/volcengine/verl/pull/2418 and MoE weight resharding by 3x https://github.com/volcengine/verl/pull/2692
- Significantly improved rollout load balancing for GRPO-like algorithms by repeating samples before dispatching them https://github.com/volcengine/verl/pull/2324
Breaking changes and deprecations
Full list: https://github.com/volcengine/verl/discussions/2270
Rollout
- When generate_sequences is called with sampling params n>1, the DataProto repeat behavior changes:
  - chunk-dispatch-repeat: DataProto is chunked and dispatched to rollout workers, then repeated in rollout workers.
  - repeat-chunk-dispatch: DataProto is repeated by n in driver, then chunked and dispatched to rollout workers.

  The behavior switches from chunk-dispatch-repeat to repeat-chunk-dispatch. This change may break almost all recipes and projects using verl GRPO as submodules. https://github.com/volcengine/verl/pull/2324
- verl.workers.rollout.sglang_rollout.AsyncSglangServer is now renamed as AsyncSGLangServer
- vllm <= v0.6 support is dropped
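The difference between chunk-dispatch-repeat and repeat-chunk-dispatch for n>1 can be sketched with plain Python lists standing in for DataProto. This is only an illustration under stated assumptions: the helper names are hypothetical, chunks are contiguous and near-equal, and repetition is interleaved (each prompt's n copies are adjacent). verl's real DataProto API is not used here.

```python
def chunk(items, num_workers):
    """Split items into num_workers contiguous, near-equal chunks."""
    size, rem = divmod(len(items), num_workers)
    chunks, start = [], 0
    for w in range(num_workers):
        end = start + size + (1 if w < rem else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def repeat_interleave(items, n):
    """Repeat each item n times, keeping the copies adjacent."""
    return [x for x in items for _ in range(n)]


def chunk_dispatch_repeat(prompts, n, num_workers):
    # Old behavior: chunk in the driver, each worker repeats its own chunk.
    return [repeat_interleave(c, n) for c in chunk(prompts, num_workers)]


def repeat_chunk_dispatch(prompts, n, num_workers):
    # New behavior: repeat in the driver, then chunk the repeated batch.
    return chunk(repeat_interleave(prompts, n), num_workers)


prompts = ["p0", "p1", "p2"]
old = chunk_dispatch_repeat(prompts, n=4, num_workers=2)
new = repeat_chunk_dispatch(prompts, n=4, num_workers=2)
print([len(w) for w in old])  # [8, 4]
print([len(w) for w in new])  # [6, 6]
```

In this toy case the new order gives each worker the same number of sequences, at the cost that the n samples of one prompt (here `p1`) may be split across workers. Code that assumed all n samples of a prompt live on the same worker, or that the batch seen by the driver has the un-repeated size, is the kind of code this change can break.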
Multi-turn
- We are moving multi-turn support from ChatScheduler to AgentLoop to improve usability. https://github.com/volcengine/verl/pull/2124
Megatron
- Megatron recomputation options are moved to *.megatron.override_transformer_config. https://github.com/volcengine/verl/pull/2651 Default values are:
    override_transformer_config:
      recompute_granularity: null
      recompute_modules:
        - core_attn
      recompute_method: null
      recompute_num_layers: null
- Merged config actor_rollout_ref.(actor, ref, rollout).profiler to actor_rollout_ref.profiler
What's Changed
Trainer & FSDP
- [fsdp] fix: Change the data in the update_actor function from to.('cpu') to to.(get_device_id()) by @Keilo001 in https://github.com/volcengine/verl/pull/2477
- [fsdp] fix: vlm dynamic batch & unify dynamic batch api by @hiyouga in https://github.com/volcengine/verl/pull/2524
- [fsdp] fix: change geo3k model name from non-vl to vl by @nanjiangwill in https://github.com/volcengine/verl/pull/2555
- [trainer, recipe] feat: add support for external generative reward models by @yyDing1 in https://github.com/volcengine/verl/pull/2121
- [trainer] fix: fix split placement by @vermouth1992 in https://github.com/volcengine/verl/pull/2227
- [trainer, vllm] feat: add lora exclude_modules to support VL model lora training by @Cccei000 in https://github.com/volcengine/verl/pull/2182
- [trainer] fix: pre-commit broken by #2354 by @ETOgaosion in https://github.com/volcengine/verl/pull/2358
- [trainer, cfg] feat: add BaseConfig for all dataclass configs. Introduce dataclass for algorithm related configs by @eric-haibin-lin in https://github.com/
- [trainer] fix: Use safe masked mean/sum to handle NaN values outside the mask by @Yangruipis in https://github.com/volcengine/verl/pull/2377
- [trainer, data] feat: Dynamic Data Generation by @jwong8314 in https://github.com/volcengine/verl/pull/2312
- [trainer] fix: use .keys() to check 'response_mask' in TensorDict by @askender in https://github.com/volcengine/verl/pull/2491
- [trainer] fix: Allow FSDP2 when doing strategy check by @HollowMan6 in https://github.com/volcengine/verl/pull/2497
- [trainer] refactor: no need to call load_reward_manager in compute_reward_async by @eric-haibin-lin in https://github.com/volcengine/verl/pull/2557
- [trainer, fsdp, vllm, recipe] feat: one step off async training recipe by @imh966 in https://github.com/volcengine/verl/pull/2231
- [trainer] fix: maybe_filter_out_long_prompts on image and video by @firefighter-eric in https://github.com/volcengine/verl/pull/2553
- [trainer] refactor: Training Engine Interface and Development Plan by @ZihengJiang in https://github.com/volcengine/verl/pull/1977
- [trainer] feat: Add FSDPCheckpointManager for SFTtrainer, support resume training, manage the number of CKPTS in keep by @Pursuer-Hsf in https://github.com/volcengine/verl/pull/2292/
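One entry in the list above mentions a "safe" masked mean/sum that handles NaN values outside the mask. The sketch below shows why this matters, using plain Python floats rather than tensors. It is an illustration of the general idea, not verl's implementation, and both function names are hypothetical: multiplying by a 0/1 mask does not remove NaN (since NaN * 0 is NaN), whereas selecting the masked-in values first does.

```python
import math


def naive_masked_mean(values, mask):
    # Multiplies by the mask; a NaN at a masked-out position still poisons the sum.
    return sum(v * m for v, m in zip(values, mask)) / sum(mask)


def safe_masked_mean(values, mask):
    # Selects masked-in values first, so masked-out NaNs never enter the sum.
    # Raises ZeroDivisionError if the mask is all zeros.
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)


values = [1.0, 3.0, math.nan]
mask = [1, 1, 0]
print(naive_masked_mean(values, mask))  # nan
print(safe_masked_mean(values, mask))   # 2.0
```

With tensors, the same effect is usually obtained by replacing masked-out entries with zero before reducing, instead of multiplying by the mask.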
Rollout & SGLang
- [rollout] feat: add agent loop by @wuxibin89 in https://github.com/volcengine/verl/pull/2124
- [rollout] feat: add zeromq vllm distributed executor by @wuxibin89 in https://github.com/volcengine/verl/pull/2246
- [BREAKING][rollout] refactor: drop vllm v0.5.4 and v0.6.3 support by @eric-haibin-lin in https://github.com/volcengine/verl/pull/2257
- [rollout] feat: Allow customization of async server class by @ultmaster in https://github.com/volcengine/verl/pull/2326
- [rollout] fix: fix hf rollout and add single gpu test by @eric-haibin-lin in https://github.com/volcengine/verl/pull/2371
- [BREAKING][rollout] feat: repeat DataProto when n>1 in driver instead of rollout workers by @wuxibin89 in https://github.com/volcengine/verl/pull/2324
- [misc] feat: trace rollout generation and tool calls using weave by @chenhaiq in https://github.com/volcengine/verl/pull/2345
- [cfg] refactor: make the rollout & ref configs more modular by @eric-haibin-lin in https://github.com/volcengine/verl/pull/2410
- [perf] feat: add range tag to start/stop profile; clean actor_rollout_ref.profiler by @davidmlw in https://github.com/volcengine/verl/pull/2456
- [rollout] feat: support mlflow in rollout trace by @chenhaiq in https://github.com/volcengine/verl/pull/2440
- [rollout] feat: add ReactAgentLoop based on LangGraph by @wuxibin89 in https://github.com/volcengine/verl/pull/2463
- [rollout] fix: fix bug for remax when the rollout mode is async by @none0663 in https://github.com/volcengine/verl/pull/2574
- [tool] chore: introduce RolloutViewer TUI tools by @Yangruipis in https://github.com/volcengine/verl/pull/2469
- [rollout,vllm] fix: A major issue in random sampling of vllm engine by @guanning03 in https://github.com/volcengine/verl/pull/2646
- [tool] chore: Add log for AsyncRolloutRequest ID, and rollout viewer to support request id display and search by @Hecate0821 in https://github.com/volcengine/
- [rollout] fix: use flashattn3 backend in sglang to avoid error in tool call by @chenhaiq in https://github.com/volcengine/verl/pull/2244
- [rollout] fix: Make free_cache_engine option workable in latest vLLM/SGLang by @HollowMan6 in https://github.com/volcengine/verl/pull/1464
- [rollout] fix: #1646 stop words for sglang rollout by @linxxx3 in https://github.com/volcengine/verl/pull/1991
- [sglang, rollout] refactor: use torch.Tensor in async rollout schemas by @nanjiangwill in https://github.com/volcengine/verl/pull/2362
- [rollout] fix: sglang async fail with Multi-stage Awake feature by @chenhaiq in https://github.com/volcengine/verl/pull/2365
- [sglang] feat: Add multi-interaction registry support and testing by @SwordFaith in https://github.com/volcengine/verl/pull/2184
- [sglang] feat: Repeat sampling parameter n into requests of GRPO in SGLang by @zhaochenyang20 in https://github.com/volcengine/verl/pull/2258
- [sglang,tool] feat: Add support for tools that generate multimodal data by @nanjiangwill in https://github.com/volcengine/verl/pull/2146
- [sglang] fix: only wake up weights on infer_tp 0 by @zhaochenyang20 in https://github.com/volcengine/verl/pull/2403
- [sglang] fix: Import Error in the latest sglang by @yyDing1 in https://github.com/volcengine/verl/pull/2275
- [sglang] fix: Fix qwen2vl weight keys issue by @hebiao064 in https://github.com/volcengine/verl/pull/2434
- [sglang] fix: Only flush cache on TP rank=0. by @SuperCB in https://github.com/volcengine/verl/pull/2455
- [sglang] feat: update weights in batch with FSDP by @zhaochenyang20 in https://github.com/volcengine/verl/pull/2559
- [sglang] fix: adding missing param for sgl async unit test by @zhaochenyang20 in https://github.com/volcengine/verl/pull/2561
- [sglang] fix: update response handling and scoring method in GSM8K interaction by @aaronyeeio in https://github.com/volcengine/verl/pull/2428
- [sglang] fix: rename Sglang to SGLang following SGLang's fashion by @zhaochenyang20 in https://github.com/volcengine/verl/pull/2672
- [sglang] fix: Bug in megatron+sglang TP16 update_weights. by @SuperCB in https://github.com/volcengine/verl/pull/2336
- [sglang, megatron, perf] feat: speed up megatron sglang weight update by 10x by @Yangruipis in https://github.com/volcengine/verl/pull/2418
- [megatron] fix: wrong response_mask for megatron + sglang multi-turn by @Yangruipis in https://github.com/volcengine/verl/pull/2543
Megatron
- [megatron] feat: add megatron memory log by @ETOgaosion in https://github.com/volcengine/verl/pull/2272
- [megatron] feat: use mbridge as megatron adaptor by @ISEEKYAN in https://github.com/volcengine/verl/pull/2064
- [megatron] fix: optimizer scheduler misalignment with FSDP by @ETOgaosion in https://github.com/volcengine/verl/pull/2303
- [cfg] refactor: split fsdp/megatron specific configs, consolidate shared ones for reward_model and critic by @eric-haibin-lin in https://github.com/volcengine
- [megatron] feat: fused kernel lightweight by @ISEEKYAN in https://github.com/volcengine/verl/pull/2210
- [megatron] feat: allow override DistributedDataParallelConfig by @ETOgaosion in https://github.com/volcengine/verl/pull/2523
- [data, megatron] feat: add dynamic batching computational workload balance by @conver334 in https://github.com/volcengine/verl/pull/2452
- [megatron] feat: support distributed megatron model converter and merger by @Yangruipis in https://github.com/volcengine/verl/pull/2281
- [cfg] refactor: add flatten megatron trainer config generation and verification script by @eric-haibin-lin in https://github.com/volcengine/verl/pull/2582
- [BREAKING][megatron] refactor: activation checkpointing APIs by @ETOgaosion in https://github.com/volcengine/verl/pull/2651
- [megatron] fix: CUDA_DEVICE_MAX_CONNECTIONS not taking effect by @ETOgaosion in https://github.com/volcengine/verl/pull/2687
Hardware
- [hardware] feat: support ray actor sharing situation on ASCEND NPU by @FightingZhen in https://github.com/volcengine/verl/pull/2341
- [Hardware] feat: Support AMD (ROCm Kernel) - Update Dockerfile/Docker Image by @yushengsu-thu in https://github.com/volcengine/verl/pull/2390
- [hardware] fix: enable sleep mode on ASCEND NPU by @as12138 in https://github.com/volcengine/verl/pull/2459
- [hardware] chore: Enable Generation of Wheel File During Docker Build by @rhiremat in https://github.com/volcengine/verl/pull/2332
Misc fixes
- [ckpt] feat: support esi by @plutoZZZZ in https://github.com/volcengine/verl/pull/2192
- [model] fix: separate minicpmo data by @hiyouga in https://github.com/volcengine/verl/pull/2212
- [misc] chore: pin transformers under 4.53 by @hiyouga in https://github.com/volcengine/verl/pull/2241
- [worker] fix: OOM on first iteration in multi-turn RL by @zTonyZhao in https://github.com/volcengine/verl/pull/2253
- [algo] fix: correctly aggregate kl metrics in PPO actor by @0x404 in https://github.com/volcengine/verl/pull/2259
- [recipe] feat: add retool recipe by @wuxibin89 in https://github.com/volcengine/verl/pull/2233
- [cfg] fix: Security Enhancement Block Dangerous Modules in Sandbox Environment by @none0663 in https://github.com/volcengine/verl/pull/2170
- [cfg] chore: add non-negative expected_len assertion by @LeavesLei in https://github.com/volcengine/verl/pull/2330
- [algo] feat: mask out observation token in GAE by @wuxibin89 in https://github.com/volcengine/verl/pull/2337
- [tool] fix: avoid exception when sandbox return None by @chenhaiq in https://github.com/volcengine/verl/pull/2346
- [perf] feat: support entropy checkpointing without rmpad or sp by @FightingZhen in https://github.com/volcengine/verl/pull/2342
- [ckpt] fix: edit esi doc by @plutoZZZZ in https://github.com/volcengine/verl/pull/2354
- [docker] refactor: Migrate images to verlai, support latest flash attention and newer CUDA versions in future by @ETOgaosion in https://github.com/volcengine/verl/pull/2085
- [data] feat: add interface for user-defined curriculum sampler by @frrad in https://github.com/volcengine/verl/pull/2314
- [cfg] fix: pickling error in multiprocessing in the reward_fn by @none0663 in https://github.com/volcengine/verl/pull/2239
- [ray] refactor: Separate the constants into different file by @YeonwooSung in https://github.com/volcengine/verl/pull/2025
- [misc] refactor: replace pkg_resources with importlib.metadata by @askender in https://github.com/volcengine/verl/pull/2392
- [tool] fix: Add MCP usage documentation by @AlecHenx in https://github.com/volcengine/verl/pull/2261
- [cfg] refactor: make actor config more modular by @eric-haibin-lin in https://github.com/volcengine/verl/pull/2379
- [misc] fix: huggingface model config max_position_embeddings assertion for model with extended context length by @Wangmerlyn in https://github.com/volcengine/verl/pull/737
- [data] refactor: move sampler api to experimental by @eric-haibin-lin in https://github.com/volcengine/verl/pull/2381
- [perf] feat: add npu profiler for FSDP backend by @tongtong0613 in https://github.com/volcengine/verl/pull/2194
- [misc] refactor: Replace deepcopy with tensor.clone by @ji-huazhong in https://github.com/volcengine/verl/pull/2442
- [misc] fix: add *.yaml to pyproject due to modular config by @nanjiangwill in https://github.com/volcengine/verl/pull/2468
- [misc] feat: add py.typed file to verl/ by @frrad in https://github.com/volcengine/verl/pull/2467
- [env] feat: upgrade tensordict version by @vermouth1992 in https://github.com/volcengine/verl/pull/2460
- [docker] feat: provide images with deepep by @ETOgaosion in https://github.com/volcengine/verl/pull/2480
- [training_utils] feat: log_generations_to_swanlab use table by @Zeyi-Lin in https://github.com/volcengine/verl/pull/2489
- [env] feat: safely bump py version to 3.10 by @Tavish9 in https://github.com/volcengine/verl/pull/2421
- [BUG] fix bug for #2506, when passing as response_mask to policy_loss_fn by @none0663 in https://github.com/volcengine/verl/pull/2513
- [single_controller] fix: replace unittest.mock.patch with context manager for env var handling by @PeterSH6 in https://github.com/volcengine/verl/pull/2498
- [recipe] fix: DAPO rewards using sandbox fusion by @HollowMan6 in https://github.com/volcengine/verl/pull/2496
- [cfg] refactor: support +extra.any_key usage for the base dataclass config in verl by @eric-haibin-lin in https://github.com/volcengine/verl/pull/2502
- [ray] refactor: Use public method to get node IP by @kevin85421 in https://github.com/volcengine/verl/pull/2521
- [env] fix: bump tensordict to 0.9.1 by @ultmaster in https://github.com/volcengine/verl/pull/2541
- [data] fix: Add missing init files in verl experimental data folders by @JoostvDoorn in https://github.com/volcengine/verl/pull/2548
- [ray] fix: strip [] for ipv6 address by @wuxibin89 in https://github.com/volcengine/verl/pull/2545
- [tool] fix: correctly convert 'None' to null in sandbox fusion _process_single_case by @mathewjhan in https://github.com/volcengine/verl/pull/2409
- [training_utils] fix: uneven support in split by @ultmaster in https://github.com/volcengine/verl/pull/2560
- [perf] feat: Clip gsm8k solution string to optimize reward calculation by @PopSoda2002 in https://github.com/volcengine/verl/pull/2568
- set use_kl_in_reward=True in reinforce_plus_plus by @Titanpku in https://github.com/volcengine/verl/pull/2580
- [cfg] feat: add critic config class by @eric-haibin-lin in https://github.com/volcengine/verl/pull/2583
- [tool] fix: supports variable arguments for marked_timer by @tardis-key in https://github.com/volcengine/verl/pull/2576
- [single_controller] fix: padding for kwargs by @ShareLer in https://github.com/volcengine/verl/pull/2585
- [docker] fix: downgrade TransformerEngine version 2.2.1 to allow mcore image using rope fusion and provide another set of v0.5 image by @ETOgaosion in https://github.com/volcengine/verl/pull/2611
- [recipe] feat: add QWen 30b moe dapo script that can run on a single 80GB node by @vermouth1992 in https://github.com/volcengine/verl/pull/2645
- [perf] feat: mistral and gemma3_text mfu compute support by @xihuai18 in https://github.com/volcengine/verl/pull/2622
- [misc] fix: fix prompt and response key in gemma7b example by @apeforest in https://github.com/volcengine/verl/pull/2610
- [data, recipe] fix: remove redundant json parsing by @zhxieml in https://github.com/volcengine/verl/pull/2671
New Contributors
Welcome new contributors to the verl community! @rhiremat @LeavesLei @diqiuzhuanzhuan @frrad @shuyhere @askender @Tavish9 @Wangmerlyn @SuperCB @tongtong0613 @jwong8314 @ji-huazhong @Keilo001 @conver334 @JoostvDoorn @mathewjhan @PopSoda2002 @rudeigerc @Titanpku @firefighter-eric @meituan-search @xihuai18 @tardis-key @ZihengJiang @Pursuer-Hsf @beep-bebop @aaronyeeio @Hecate0821 @apeforest @zhxieml
Full Changelog: https://github.com/volcengine/verl/compare/v0.4.1...v0.5.0