World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) are often misaligned with the task-specific goals of world models, i.e., transition prediction metrics such as accuracy or perceptual quality. In this paper, we present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. While formulating world modeling as autoregressive prediction of tokenized sequences, RLVR-World evaluates metrics of decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly.
World models are typically trained with surrogate objectives such as maximum likelihood estimation (MLE), which are misaligned with the task-specific goal of state transition prediction. Reinforcement learning with verifiable rewards (RLVR) offers a promising emerging approach for tuning pre-trained models directly toward the target task.
We introduce RLVR-World, a unified framework where (1) world models across various modalities are unified under a sequence modeling formulation, and (2) task-specific prediction metrics serve as verifiable rewards. (Top) Language-based world models predict verbal state transitions in response to verbal actions. (Bottom) Video-based world models, equipped with a visual tokenizer, predict future visual observations conditioned on action vectors.
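To make the recipe concrete, here is a minimal Python sketch of the core idea. The helper names (decode, metric) are hypothetical placeholders for modality-specific components, and the group-normalized advantage is a common RLVR choice (e.g., GRPO-style) rather than a transcription of the authors' implementation.

```python
# Hypothetical sketch of the RLVR-World reward pipeline: sample several token
# sequences per prompt from the world model, decode them into predicted next
# states, and score each prediction with a task metric as the verifiable reward.

def verifiable_rewards(sampled_token_seqs, target_state, decode, metric):
    """decode: tokens -> predicted state; metric: (prediction, target) -> scalar reward."""
    return [metric(decode(tokens), target_state) for tokens in sampled_token_seqs]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a sampling group (a common RLVR choice, e.g., GRPO)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```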
Beyond the success of large language models (LLMs) in math and code domains, we introduce the world modeling task as a new testbed for RLVR in LLMs. This task, predicting the transition of verbal world states, naturally lends itself to using prediction accuracy as a verifiable reward.
We evaluate on a dataset of text game state transitions, where RLVR improves a 1.5B LLM to better serve as a text-based world simulator, achieving a +30.7% accuracy gain and rivaling the overall performance of GPT-4.
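As a rough illustration, an accuracy-style reward for verbal state prediction can be as simple as comparing the decoded prediction against the ground-truth next state. The exact-match rule and normalization below are assumptions for illustration, not the paper's precise metric.

```python
# Hedged sketch: exact-match accuracy as a verifiable reward for verbal state
# transitions (whitespace/case normalization is an illustrative assumption).

def text_state_reward(predicted_state: str, target_state: str) -> float:
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(predicted_state) == normalize(target_state) else 0.0
```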
We further evaluate on more realistic web navigation scenarios, using a web page state transition dataset collected from the WebArena benchmark. RLVR also substantially enhances this world model of the Internet (+15.1% F1 score).
RLVR-trained world models, in turn, enable more powerful web agents, with a relative +18.4% improvement in WebArena success rate.
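For web page states, which are long and only partially predictable, an overlap-based F1 score is a natural verifiable reward. The token-level variant below is a sketch; the exact tokenization and matching granularity used in the paper are assumptions.

```python
# Hedged sketch: token-level F1 between the predicted and ground-truth page
# states, usable directly as a verifiable reward.

from collections import Counter

def f1_reward(predicted: str, target: str) -> float:
    pred_tokens, tgt_tokens = predicted.split(), target.split()
    overlap = sum((Counter(pred_tokens) & Counter(tgt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(tgt_tokens)
    return 2 * precision * recall / (precision + recall)
```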
We pioneer RLVR fine-tuning of autoregressive video world models by directly measuring and optimizing perceptual metrics of the decoded predicted frames, offering analyses and insights for generative models more broadly, beyond reasoning models.
We use the RT-1 robotic manipulation dataset. RLVR bridges the gap between the pre-training objective and visual prediction metrics, leading to more accurate predictions (a relative +9.2% improvement in LPIPS), improved training efficiency (about four orders of magnitude faster), and reduced artifacts such as repetition.
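For video world models, the reward can be computed on decoded frames with a perceptual metric such as LPIPS, available via the open-source lpips package. The sign flip and mean aggregation below are illustrative assumptions about how such a metric could be turned into a reward.

```python
# Hedged sketch: negative LPIPS distance between decoded predicted frames and
# ground-truth frames as a verifiable reward (frames in [-1, 1], shape (T, 3, H, W)).

import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")  # pretrained perceptual-distance network

def video_reward(pred_frames: torch.Tensor, target_frames: torch.Tensor) -> float:
    with torch.no_grad():
        dist = lpips_fn(pred_frames, target_frames)  # per-frame perceptual distances
    return -dist.mean().item()  # higher reward = perceptually closer to ground truth
```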
[Qualitative video prediction comparisons: ground truth vs. RLVR-World vs. base model. RLVR-World removes the repetition artifacts present in the base model's predictions.]
Video world models can serve as real-world simulators for policy evaluation.
We evaluate four policy checkpoints from RT-1 and RT-1-X on six tasks involving opening and closing the top, middle, and bottom drawers.
Compared to handcrafted SIMPLER simulators, video world models yield smaller discrepancies between real and simulated success rates, suggesting world models as a scalable approach to bridging the sim-to-real gap.
Among the video world models, RLVR further improves upon the base model.
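One simple way to quantify such discrepancies is the mean absolute error between real and simulated success rates across policy checkpoints; this is an illustrative proxy, not necessarily the evaluation metric used in the paper or in SIMPLER.

```python
# Illustrative sketch: mean absolute gap between real-robot and simulated
# success rates, aggregated over policy checkpoints (hypothetical data layout).

def sim_to_real_gap(real_rates: list[float], simulated_rates: list[float]) -> float:
    assert len(real_rates) == len(simulated_rates) and real_rates
    return sum(abs(r - s) for r, s in zip(real_rates, simulated_rates)) / len(real_rates)
```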
[Policy evaluation comparisons for the RT-1 Converged and RT-1 Begin checkpoints.]
You may also be interested in iVideoGPT, the architecture of our base video world model. It features compressive tokenization of visual observations and an autoregressive transformer, and its scalability enables interactive world models pre-trained on millions of human and robotic manipulation trajectories.
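Conceptually, such a tokenizer-plus-transformer world model rolls out by encoding the current observation into discrete tokens, conditioning on each action, and decoding the predicted tokens back into frames. The interface below is a hypothetical sketch of that loop, not iVideoGPT's actual API.

```python
# Hypothetical interactive rollout loop for a tokenizer + autoregressive
# transformer world model (method names are placeholders, not iVideoGPT's API).

def rollout(world_model, tokenizer, obs, actions):
    frames = []
    tokens = tokenizer.encode(obs)  # compress the observation into discrete tokens
    for action in actions:
        tokens = world_model.predict_next(tokens, action)  # predict next-state tokens
        frames.append(tokenizer.decode(tokens))  # decode tokens back into a frame
    return frames
```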
@article{wu2025rlvr,
title={RLVR-World: Training World Models with Reinforcement Learning},
author={Jialong Wu and Shaofeng Yin and Ningya Feng and Mingsheng Long},
journal={arXiv preprint arXiv:2505.13934},
year={2025},
}