🌏 RLVR-World: Training World Models with Reinforcement Learning

¹School of Software, BNRist, Tsinghua University  ²Zhili College, Tsinghua University

Abstract

World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) often misalign with the task-specific goals of world models, i.e., transition prediction metrics like accuracy or perceptual quality. In this paper, we present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. Although world modeling is formulated as autoregressive prediction of tokenized sequences, RLVR-World evaluates metrics computed on decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly.

Method Overview

Surrogate vs. Direct Optimization

World models are typically trained with surrogate objectives such as maximum likelihood estimation (MLE), which misalign with the task-specific goal of state transition prediction. Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising approach for fine-tuning pre-trained models to directly optimize the target task.

RLVR-World Framework

We introduce RLVR-World, a unified framework in which (1) world models across various modalities are unified under a sequence modeling formulation, and (2) task-specific prediction metrics serve as verifiable rewards. (Top) Language-based world models predict verbal state transitions in response to verbal actions. (Bottom) Video-based world models, equipped with a visual tokenizer, predict future visual observations conditioned on action vectors.
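To make the training signal concrete, here is a minimal sketch (not the authors' released code) of how a verifiable reward can be computed from decoded predictions and turned into group-relative advantages, assuming a GRPO-style optimizer; all function and variable names are illustrative only.

```python
# Minimal sketch of the RLVR-World training signal (illustrative names, not
# the authors' actual API). For each (state, action) prompt, a group of
# tokenized predictions is sampled, decoded, scored against the ground truth
# with a task metric, and normalized into group-relative advantages as in
# GRPO-style policy optimization (assumed here).
from typing import Callable, List
import numpy as np

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within a sampled group (group-relative baseline)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def verifiable_rewards(
    decoded_predictions: List[str],
    ground_truth: str,
    metric: Callable[[str, str], float],
) -> List[float]:
    """Score decoded predictions with a task metric (accuracy, F1, LPIPS, ...)."""
    return [metric(pred, ground_truth) for pred in decoded_predictions]

# Example: exact-match accuracy as the metric for a verbal state transition.
exact_match = lambda pred, gt: float(pred.strip() == gt.strip())
rewards = verifiable_rewards(["door: open", "door: closed"], "door: open", exact_match)
advantages = group_relative_advantages(rewards)  # e.g. [ 1., -1.]
```

The same scaffold applies to any modality: only the decoder and the metric change.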

Evaluating Language World Models

Beyond the success of large language models (LLMs) in math and code domains, we introduce the world modeling task as a new testbed for RLVR in LLMs. This task, predicting the transition of verbal world states, naturally lends itself to using prediction accuracy as a verifiable reward.

Text Game State Prediction

We evaluate on a dataset of text game state transitions: RLVR fine-tuning enables a 1.5B LLM to better serve as a text-based world simulator, improving accuracy by +30.7% and rivaling the overall performance of GPT-4.

Web Page State Prediction

We further evaluate on more realistic web navigation scenarios, using a web page state transition dataset collected from the WebArena benchmark. This "world model of the Internet" is also enhanced substantially by RLVR (+15.1% F1 score).
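For intuition on an F1-based reward, the sketch below scores a predicted next-state description against the ground truth with token-level precision and recall; the actual matching granularity used for the WebArena-derived dataset may differ.

```python
# Illustrative token-level F1 reward between a predicted and a reference
# next-state description. The exact matching granularity used in the paper
# may differ; this is only a sketch.
from collections import Counter

def f1_reward(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_reward("tab: inbox page loaded", "tab: inbox page loaded successfully"))  # ~0.889
```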

Application: Model Predictive Control for Web Agents 💻️

RLVR-trained world models in turn enable more powerful web agents, yielding a relative +18.4% improvement in WebArena success rates.
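Below is a hypothetical sketch of how such a world model can support model predictive control for a web agent: propose candidate actions, simulate each with the world model, score the simulated outcomes against the goal, and execute the best one. The callables `policy_propose`, `world_model_predict`, and `value_fn` are placeholders, not a released interface.

```python
# Hypothetical MPC loop for a web agent backed by a language world model.
# `policy_propose`, `world_model_predict`, and `value_fn` stand in for an LLM
# action proposer, the RLVR-trained world model, and an outcome scorer.
from typing import Callable, List, Tuple

def mpc_select_action(
    observation: str,
    goal: str,
    policy_propose: Callable[[str, str, int], List[str]],
    world_model_predict: Callable[[str, str], str],
    value_fn: Callable[[str, str], float],
    num_candidates: int = 5,
) -> Tuple[str, str]:
    """Pick the candidate action whose simulated next state scores highest."""
    candidates = policy_propose(observation, goal, num_candidates)
    scored = []
    for action in candidates:
        predicted_state = world_model_predict(observation, action)  # simulate the click/typing
        scored.append((value_fn(predicted_state, goal), action, predicted_state))
    best_value, best_action, best_state = max(scored, key=lambda x: x[0])
    return best_action, best_state
```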

Evaluating Video World Models

We pioneer RLVR fine-tuning of autoregressive video world models by directly measuring and optimizing perceptual metrics of the decoded predicted frames, offering analyses and insights for applying RLVR to generative models beyond reasoning models.

Robot Manipulation Trajectory Prediction

We use the RT-1 robotic manipulation dataset. RLVR bridges the gap between pre-training objectives and visual prediction metrics, leading to more accurate predictions (a relative 9.2% improvement in LPIPS), improved training efficiency (roughly four orders of magnitude faster), and reduced artifacts such as repetition.
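As an illustration of a perceptual metric serving as a verifiable reward, the sketch below computes a negative-LPIPS reward over decoded predicted frames using the public `lpips` package; the paper's exact reward shaping may differ.

```python
# Sketch of an LPIPS-based reward for decoded video predictions.
# Frames are torch tensors of shape (T, 3, H, W) scaled to [-1, 1], as
# expected by the `lpips` package. Lower LPIPS means perceptually closer,
# so the distance is negated to act as a reward.
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

@torch.no_grad()
def video_reward(pred_frames: torch.Tensor, gt_frames: torch.Tensor) -> float:
    distances = lpips_fn(pred_frames, gt_frames)  # per-frame perceptual distances
    return -distances.mean().item()               # higher reward = perceptually closer
```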

Video Samples

Ground truth · RLVR-World · Base model (two sample prediction rollouts)

Repetition Reduction

Ground truth · RLVR-World (no repetition) · Base model (repetition) (two sample rollouts)

Application: Real2Sim Policy Evaluation 🤖

Video world models can serve as real-world simulators for policy evaluation. We evaluate four policy checkpoints from RT-1 and RT-1-X on six tasks involving opening and closing the top, middle, and bottom drawers. Compared to handcrafted SIMPLER simulators, video world models yield smaller discrepancies between real and simulated success rates, suggesting world models as a scalable approach to bridging the sim-to-real gap. Among the video world models, RLVR further improves upon the base model.
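A rough sketch of this protocol, with placeholder rollout callables: each policy checkpoint is rolled out inside the video world model, its simulated success rate is estimated, and the absolute gap to the real-robot success rate measures simulator fidelity.

```python
# Hypothetical real-to-sim policy evaluation with a video world model.
# `rollout_in_world_model` is a placeholder for rolling a policy out inside
# the learned simulator and returning whether the task succeeded.
from typing import Callable

def simulated_success_rate(
    policy: Callable,
    task: str,
    rollout_in_world_model: Callable[[Callable, str], bool],
    num_episodes: int = 50,
) -> float:
    successes = sum(rollout_in_world_model(policy, task) for _ in range(num_episodes))
    return successes / num_episodes

def real_to_sim_gap(real_rate: float, sim_rate: float) -> float:
    """Absolute discrepancy between real and simulated success rates (smaller = more faithful)."""
    return abs(real_rate - sim_rate)
```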

Open Drawer: RT-1 Converged Checkpoint · RT-1 Begin Checkpoint

Close Drawer: RT-1 Converged Checkpoint · RT-1 Begin Checkpoint

Related Research

You may also be interested in iVideoGPT, the architecture underlying our base video world model. It features compressive tokenization of visual observations and an autoregressive transformer, and its scalability enables interactive world models pre-trained on millions of human and robotic manipulation trajectories.

Citation

@article{wu2025rlvr,
    title={RLVR-World: Training World Models with Reinforcement Learning}, 
    author={Jialong Wu and Shaofeng Yin and Ningya Feng and Mingsheng Long},
    journal={arXiv preprint arXiv:2505.13934},
    year={2025},
}