🌏 iVideoGPT: Interactive VideoGPTs are Scalable World Models

Jialong Wu*1, Shaofeng Yin*1,2, Ningya Feng1, Xu He3, Dong Li3, Jianye Hao3,4, Mingsheng Long#1
1School of Software, BNRist, Tsinghua University, 2Zhili College, Tsinghua University
3Huawei Noah’s Ark Lab, 4College of Intelligence and Computing, Tianjin University
* Equal Contribution, # Corresponding Author

Abstract

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals—visual observations, actions, and rewards—into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications.

iVideoGPT

Architecture

iVideoGPT is a generic and efficient world model architecture: (a) Compressive tokenization utilizes a conditional VQGAN that discretizes future frames conditioned on context frames to handle temporal redundancy, reducing the number of video tokens asymptotically by 16x. (b) An autoregressive transformer integrates multimodal signals—visual observations, actions, and rewards—into a sequence of tokens, enabling interactive agent experience through next-token prediction.
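
As a rough illustration of this interleaved layout, the sketch below (not the released implementation) packs context-frame tokens, compressed future-frame tokens, and per-step action and reward slots into one flat sequence for next-token prediction. The grid sizes (16x16 context vs. 4x4 future tokens, matching the 16x reduction above), the special-token ids, and the choice to represent actions and rewards as discrete tokens are all simplifying assumptions.

import torch

N_CTX_TOKENS = 16 * 16  # tokens per context frame (assumed grid size)
N_FUT_TOKENS = 4 * 4    # tokens per compressed future frame (assumed grid size)
BOS, SLOT = 0, 1        # hypothetical special-token ids

def pack_trajectory(ctx_tokens, fut_tokens, action_ids, reward_ids):
    """Flatten one trajectory into a single token sequence.

    ctx_tokens: (T_ctx, N_CTX_TOKENS) codes of context frames
    fut_tokens: (T_fut, N_FUT_TOKENS) codes of compressed future frames
    action_ids: (T_fut,) discretized action per step (illustrative)
    reward_ids: (T_fut,) discretized reward per step (illustrative)
    """
    pieces = [torch.tensor([BOS])]
    for t in range(ctx_tokens.shape[0]):
        pieces += [torch.tensor([SLOT]), ctx_tokens[t]]
    for t in range(fut_tokens.shape[0]):
        pieces += [torch.tensor([SLOT]), action_ids[t : t + 1],
                   fut_tokens[t], reward_ids[t : t + 1]]
    return torch.cat(pieces)

# Example: 1 context frame followed by 2 future steps with random codes.
ctx = torch.randint(2, 1024, (1, N_CTX_TOKENS))
fut = torch.randint(2, 1024, (2, N_FUT_TOKENS))
seq = pack_trajectory(ctx, fut, torch.tensor([7, 8]), torch.tensor([3, 4]))
print(seq.shape)  # torch.Size([296])

At rollout time, the transformer would consume everything up to the latest action slot and autoregressively generate the next frame's tokens and reward before the agent chooses its next action.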

Pre-training & Fine-tuning

iVideoGPT scales to action-free video pre-training on a mixture of robotic and human manipulation data comprising 1.5 million trajectories. The pre-trained iVideoGPT serves as a versatile foundation that can be adapted into interactive world models for various downstream tasks, including action-conditioned video prediction, visual planning, and visual model-based RL.
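
One lightweight way such an adaptation could look is sketched below under assumed module names: actions pass through a small, zero-initialized linear projection and are added to each step's slot-token embedding, so action-conditioned fine-tuning starts exactly from the action-free pre-trained behavior. This is a sketch, not the released code.

import torch
import torch.nn as nn

class ActionConditioning(nn.Module):
    """Adds projected actions to per-step slot-token embeddings (hypothetical module)."""

    def __init__(self, action_dim: int, embed_dim: int):
        super().__init__()
        # Zero-initialized so the model initially reproduces its action-free behavior.
        self.proj = nn.Linear(action_dim, embed_dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, slot_embeddings: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # slot_embeddings: (B, T, embed_dim), one slot embedding per step
        # actions:         (B, T, action_dim), continuous robot actions
        return slot_embeddings + self.proj(actions)

# Usage: apply to slot embeddings before the transformer blocks during fine-tuning.
cond = ActionConditioning(action_dim=7, embed_dim=768)
out = cond(torch.randn(2, 8, 768), torch.randn(2, 8, 7))
print(out.shape)  # torch.Size([2, 8, 768])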

Video Prediction

Video Samples: Open X-Embodiment (Action-free)

Our pre-training dataset, Open X-Embodiment, is a diverse collection of robot learning datasets spanning a variety of robot embodiments, scenes, and tasks. These datasets are highly heterogeneous but can be easily unified under the action-free video prediction task.

In each pair, the left video shows the ground truth and the right video shows the prediction; red borders mark context frames and green borders mark predicted frames.


More video samples on BAIR Robot Pushing and RoboNet

BAIR Robot Pushing (Action-free)


BAIR Robot Pushing (Action-conditioned)


RoboNet (Action-conditioned, 64x64)


RoboNet (Action-conditioned, 256x256)


Zero-shot Prediction

We analyze the zero-shot video prediction capability of the large-scale pre-trained iVideoGPT on the unseen BAIR dataset. Interestingly, without fine-tuning, iVideoGPT predicts natural movements of a robot gripper, albeit rendering it as one of the robots from our pre-training dataset. This generalization gap can be closed by a simple tokenization adaptation: fine-tuning only the tokenizer on the new domain allows the transformer to transfer its pre-trained knowledge and predict movements for the new robot type. This property is particularly important for scaling GPT-like transformers to large sizes, as it enables lightweight cross-domain alignment while keeping the transformer itself intact.
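
A minimal sketch of this tokenization adaptation is given below, assuming a tokenizer interface that returns a reconstruction and a vector-quantization loss: only the tokenizer is updated on frames from the unseen domain, while the pre-trained transformer stays frozen.

import torch
import torch.nn.functional as F

def adapt_tokenizer(tokenizer, transformer, dataloader, lr=1e-4, max_steps=1000):
    # Keep the pre-trained transformer intact; only the tokenizer is trained.
    for p in transformer.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(tokenizer.parameters(), lr=lr)

    for step, (context, future) in enumerate(dataloader):
        recon, vq_loss = tokenizer(context, future)  # assumed interface: reconstruction + VQ loss
        loss = F.mse_loss(recon, future) + vq_loss   # simplified reconstruction objective
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step + 1 >= max_steps:
            break
    return tokenizer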

Visual Planning

Although a control-centric benchmark that evaluates video prediction models for visual model-predictive control (MPC) has observed that strong perceptual metrics do not always correlate with effective control performance, iVideoGPT outperforms all baselines by a large margin on two RoboDesk tasks and achieves average performance comparable to the strongest model.
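
For intuition, a bare-bones version of such a visual MPC planner is sketched below: sample candidate action sequences, roll each out in the world model, score the final predicted frame against a goal image, and execute only the first action of the best sequence. The world_model.rollout interface, the pixel-distance cost, and the random-shooting search are simplifying assumptions; the benchmark's actual cost functions and optimizer are more sophisticated.

import torch

def plan_action(world_model, obs, goal, horizon=5, n_samples=64, action_dim=4):
    # Random-shooting MPC: sample candidate action sequences uniformly in [-1, 1].
    candidates = torch.rand(n_samples, horizon, action_dim) * 2 - 1
    costs = []
    for actions in candidates:
        pred_frames = world_model.rollout(obs, actions)       # assumed: (horizon, C, H, W) predictions
        costs.append(((pred_frames[-1] - goal) ** 2).mean())  # distance of final frame to the goal image
    best = torch.stack(costs).argmin()
    return candidates[best, 0]  # execute only the first action, then replan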

Video Samples





Visual Model-based RL

Leveraging iVideoGPT as an interactive world model, we develop a model-based RL method adapted from MBPO, which augments the replay buffer with synthetic rollouts to train a standard actor-critic RL algorithm (our implementation builds upon DrQ-v2). Such a powerful world model highlights the opportunity to eliminate the need for latent imagination, a common strategy in advanced MBRL systems that trains policies on rollouts of latent states within the world model. Latent imagination enables more efficient and accurate rollouts but complicates algorithmic design by tightly coupling model and policy learning.
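
The sketch below outlines this MBPO-style loop under assumed interfaces for the agent, world model, and replay buffers; rollout length and update counts are placeholders. Short imagined rollouts are branched from real states and mixed into standard actor-critic updates, with no latent-imagination coupling between model and policy learning.

def mbpo_update(agent, world_model, real_buffer, model_buffer,
                rollout_length=3, n_rollouts=32, n_updates=64):
    # 1) Branch short imagined rollouts from states sampled out of the real replay buffer.
    for obs in real_buffer.sample_states(n_rollouts):
        for _ in range(rollout_length):
            action = agent.act(obs)
            next_obs, reward = world_model.step(obs, action)  # interactive next-frame + reward prediction
            model_buffer.add(obs, action, reward, next_obs)
            obs = next_obs

    # 2) Train the actor-critic (e.g., a DrQ-v2-style agent) on a mixture of
    #    real and imagined transitions.
    for _ in range(n_updates):
        agent.update(real_buffer.sample(), model_buffer.sample())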

Sample Efficiency

On six robotic manipulation tasks from Meta-World, our model-based algorithm not only substantially improves sample efficiency over its model-free counterpart but also matches or exceeds the performance of DreamerV3. To our knowledge, this marks the first successful application of MBPO to visual continuous control tasks.

Video Samples





True and predicted rewards are labeled at the top-left corner of each video.

Related Research

You may also be interested in ContextWM, a pioneering work that leverages real-world videos for world model pre-training, enabling sample-efficient model-based RL for visual control tasks across various domains. This project also includes unified PyTorch implementations of DreamerV2 and APV.

Citation

@article{wu2024ivideogpt,
    title={iVideoGPT: Interactive VideoGPTs are Scalable World Models}, 
    author={Jialong Wu and Shaofeng Yin and Ningya Feng and Xu He and Dong Li and Jianye Hao and Mingsheng Long},
    journal={arXiv preprint arXiv:2405.15223},
    year={2024},
}