World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals—visual observations, actions, and rewards—into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications.
iVideoGPT is a generic and efficient world model architecture: (a) Compressive tokenization utilizes a conditional VQGAN that discretizes future frames conditioned on context frames to handle temporal redundancy, reducing the number of video tokens asymptotically by 16x. (b) An autoregressive transformer integrates multimodal signals—visual observations, actions, and rewards—into a sequence of tokens, enabling interactive agent experience through next-token prediction.
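To make this layout concrete, below is a minimal PyTorch sketch of how a trajectory could be flattened into one interleaved token sequence. The special-token ids, helper names, and grid sizes are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the interleaved token layout (special-token ids, helper
# names, and grid sizes are illustrative, not the released code).
import torch

BOS, EOS = 8192, 8193                    # hypothetical frame-delimiter token ids
N_CTX_TOKENS, N_FUT_TOKENS = 256, 16     # e.g. dense 16x16 vs. compressed 4x4 grids

def build_sequence(ctx_tokens, fut_tokens, action_tokens=None, reward_tokens=None):
    """Flatten one trajectory into a single autoregressive token sequence.

    ctx_tokens:    (T_ctx, N_CTX_TOKENS) dense tokens for context frames
    fut_tokens:    (T_fut, N_FUT_TOKENS) compressed tokens for future frames
    action/reward: optional (T_fut, k) tensors of discretized multimodal tokens
    """
    seq = []
    for t in range(ctx_tokens.shape[0]):                 # context frames, fully tokenized
        seq += [torch.tensor([BOS]), ctx_tokens[t], torch.tensor([EOS])]
    for t in range(fut_tokens.shape[0]):                 # future frames, compressed
        if action_tokens is not None:
            seq.append(action_tokens[t])                 # action conditions the next frame
        seq += [torch.tensor([BOS]), fut_tokens[t], torch.tensor([EOS])]
        if reward_tokens is not None:
            seq.append(reward_tokens[t])                 # reward predicted after the frame
    return torch.cat(seq)                                # 1-D input for next-token prediction

# Example: 2 context frames and 8 future frames from a codebook of size 8192.
seq = build_sequence(torch.randint(0, 8192, (2, 256)),
                     torch.randint(0, 8192, (8, 16)))
```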
iVideoGPT is scalable for action-free video pre-training on a mixture of 1.5 million robotic and human manipulation trajectories. The pre-trained iVideoGPT serves as a versatile foundation that can be adapted into interactive world models for various downstream tasks. These include action-conditioned video prediction, visual planning, and visual model-based RL.
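As a rough illustration of what "interactive" means here, the following sketch wraps a pre-trained tokenizer and transformer behind a gym-like reset/step interface. The class, the `generate()` sampling helper, and the codec objects are our assumptions rather than the released API.

```python
# Sketch of a gym-like wrapper around the pre-trained tokenizer and transformer
# (class name, generate() helper, and codec objects are illustrative assumptions).
import torch

class ImaginedEnv:
    def __init__(self, tokenizer, transformer, action_codec, reward_codec):
        self.tokenizer = tokenizer        # conditional VQGAN
        self.model = transformer          # autoregressive transformer
        self.action_codec = action_codec  # continuous action <-> discrete tokens
        self.reward_codec = reward_codec  # reward token <-> scalar reward
        self.seq = None                   # running 1-D token sequence

    @torch.no_grad()
    def reset(self, context_frames):
        # Tokenize the observed context frames to start the imagined rollout.
        self.seq = self.tokenizer.encode_context(context_frames)
        return context_frames[-1]

    @torch.no_grad()
    def step(self, action):
        # Append action tokens, then sample the next frame and reward tokens
        # autoregressively via next-token prediction.
        self.seq = torch.cat([self.seq, self.action_codec.encode(action)])
        frame_tokens = self.model.generate(self.seq, num_tokens=16)
        self.seq = torch.cat([self.seq, frame_tokens])
        reward_token = self.model.generate(self.seq, num_tokens=1)
        self.seq = torch.cat([self.seq, reward_token])
        obs = self.tokenizer.decode(frame_tokens)   # decode (conditioned on context) to pixels
        return obs, self.reward_codec.decode(reward_token)
```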
Our pre-training dataset, Open X-Embodiment, is a diverse collection of robot learning datasets from a variety of robot embodiments, scenes, and tasks. These datasets are highly heterogeneous but can be easily unified in the action-free video prediction task.
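A toy sketch of this unification, assuming each trajectory exposes its image stream under an illustrative `images` field: actions and rewards are simply dropped, and the frames are sliced into fixed-length clips for action-free video prediction.

```python
# Toy sketch of the unification: keep only the image stream of each trajectory
# and slice it into fixed-length clips (the 'images' field name is illustrative).
def to_video_clips(trajectory, clip_len=16):
    frames = trajectory["images"]        # actions and rewards are simply dropped
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]
```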
In each pair, the left video is the ground truth and the right is the model's prediction. Red borders mark context frames; green borders mark predicted frames.
We analyze the zero-shot video prediction capability of the large-scale pre-trained iVideoGPT on the unseen BAIR dataset. Interestingly, we observe that iVideoGPT, without fine-tuning, predicts natural movements of a robot gripper, albeit rendering it as a gripper from our pre-training dataset. This generalization gap can be closed with simple tokenization adaptation, allowing the transformer to transfer its pre-trained knowledge and predict movements for the new robot type, as sketched below. This property is particularly important for scaling GPT-like transformers to large sizes, enabling lightweight cross-domain alignment while keeping the transformer itself intact.
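A minimal sketch of tokenization adaptation under these assumptions (the loader format, tokenizer call signature, and optimizer settings are illustrative): only the tokenizer is updated on the new domain while the transformer's parameters stay frozen.

```python
# Sketch of tokenization adaptation: update only the tokenizer on the new
# domain while the transformer stays frozen (loader format, tokenizer call
# signature, and optimizer settings are assumptions).
import torch
import torch.nn.functional as F

def adapt_tokenizer(tokenizer, transformer, new_domain_loader, steps=10_000):
    for p in transformer.parameters():
        p.requires_grad_(False)                              # keep the transformer intact
    opt = torch.optim.Adam(tokenizer.parameters(), lr=1e-4)
    for step, (context, future) in enumerate(new_domain_loader):
        if step >= steps:
            break
        recon, vq_loss = tokenizer(future, context=context)  # conditional reconstruction
        loss = F.mse_loss(recon, future) + vq_loss            # reconstruction + commitment
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tokenizer
```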
A control-centric benchmark that evaluates video prediction models for visual model-predictive control (MPC) has observed that excellent perceptual metrics do not always translate into effective control performance. Nevertheless, iVideoGPT outperforms all baselines by a large margin on two RoboDesk tasks and achieves average performance comparable to the strongest model.
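For intuition, here is a minimal random-shooting planner on top of an interactive world model such as the wrapper sketched earlier. It is a simplified stand-in for the benchmark's sampling-based MPC, and the task-specific `score_fn` is assumed to be given.

```python
# Minimal random-shooting planner over an interactive world model (a simplified
# stand-in for the benchmark's sampling-based MPC; score_fn is task-specific).
import torch

@torch.no_grad()
def plan_action(world_model, context_frames, score_fn,
                horizon=5, num_candidates=64, action_dim=4):
    best_score, best_action = -float("inf"), None
    for _ in range(num_candidates):
        actions = torch.rand(horizon, action_dim) * 2 - 1   # candidate sequence in [-1, 1]
        world_model.reset(context_frames)
        total = 0.0
        for a in actions:
            obs, reward = world_model.step(a)                # roll out in imagination
            total += score_fn(obs, reward)                   # e.g. predicted reward or goal distance
        if total > best_score:
            best_score, best_action = total, actions[0]
    return best_action                                       # execute only the first action
```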
Leveraging iVideoGPT as an interactive world model, we develop a model-based RL method adapted from MBPO, which augments the replay buffer with synthetic rollouts to train a standard actor-critic RL algorithm (our implementation builds upon DrQ-v2). Powerful world models like iVideoGPT open up the opportunity to eliminate the need for latent imagination, a common strategy in advanced MBRL systems that trains policies on rollouts of latent states within world models. Latent imagination enables more efficient and accurate rollouts, but it complicates algorithmic design by tightly coupling model and policy learning.
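A sketch of this loop under illustrative buffer and actor interfaces: short imagined rollouts are branched from real states and stored in a separate model buffer, which the actor-critic then samples alongside real data.

```python
# Sketch of the MBPO-style data augmentation: branch short imagined rollouts
# from real states and store them in a model buffer (buffer and actor
# interfaces are illustrative).
import torch

def augment_with_synthetic_rollouts(world_model, actor, real_buffer, model_buffer,
                                    num_rollouts=32, rollout_len=5):
    for _ in range(num_rollouts):
        context = real_buffer.sample_context()          # start from a real state
        obs = world_model.reset(context)
        for _ in range(rollout_len):
            with torch.no_grad():
                action = actor(obs)                     # current policy acts in imagination
            next_obs, reward = world_model.step(action)
            model_buffer.add(obs, action, reward, next_obs)   # synthetic transition
            obs = next_obs
    # The actor-critic (e.g. DrQ-v2) then trains on a mixture of real and model data.
```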
On six robotic manipulation tasks from Meta-World, our model-based algorithm not only markedly improves sample efficiency over its model-free counterpart but also matches or exceeds the performance of DreamerV3. To our knowledge, this marks the first successful application of MBPO to visual continuous control tasks.
True and predicted rewards are labeled in the top-left corner of each frame.
You may also be interested in ContextWM, a pioneering work that leverages real-world videos for world model pre-training, enabling sample-efficient model-based RL for visual control tasks across various domains. This project also includes unified PyTorch implementations of DreamerV2 and APV.
@article{wu2024ivideogpt,
title={iVideoGPT: Interactive VideoGPTs are Scalable World Models},
author={Jialong Wu and Shaofeng Yin and Ningya Feng and Xu He and Dong Li and Jianye Hao and Mingsheng Long},
journal={arXiv preprint arXiv:2405.15223},
year={2024},
}