Visual Generation Unlocks Human-Like Reasoning
through Multimodal World Models 🌏

Jialong Wu1,2, Xiaoying Zhang#2, Hongyi Yuan2, Xiangcheng Zhang1,2, Tianhao Huang1, Changjing He1, Chaoyi Deng1,2, Renrui Zhang2, Youbin Wu2, Mingsheng Long#1
1Tsinghua University, 2ByteDance Seed
# Corresponding Authors
TL;DR

From a world-model perspective, we study when and how visual generation enabled by unified multimodal models (UMMs) benefits reasoning.

Abstract

Humans construct internal models of the world and reason by manipulating the concepts within these models. Recent advances in artificial intelligence (AI), particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Current systems have achieved expert-level performance in formal and abstract domains such as mathematics and programming, relying predominantly on verbal reasoning as their primary information-processing pathway. However, they still lag far behind humans in domains such as physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though a clear consensus on their benefits has not yet been reached. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks, particularly those grounded in the physical world, visual generation more naturally serves as a world model, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of deliberate CoT reasoning and analyze the distinctions among different forms of world models in terms of both informativeness and knowledge. Empirically, we identify and design tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Through controlled experiments on a state-of-the-art UMM, we show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling. Conversely, it offers no clear advantage on tasks that do not require explicit visual modeling. Together, these insights and findings clarify the applicability and potential of multimodal world modeling and reasoning for more powerful, human-like multimodal AI. We publicly release our evaluation suite to facilitate further research.

🌏 A World Model Perspective of Multimodal Reasoning

  • World Model in Human Minds: Humans construct mental models of the world, representing information and knowledge through two complementary channels, verbal and visual, to support reasoning, planning, and decision-making.
  • Reasoning with Verbal World Models: Recent advances in large language models (LLMs) and vision language models (VLMs) largely rely on verbal chain-of-thought reasoning, leveraging primarily verbal and symbolic world knowledge.
  • Reasoning with Visual World Models: Unified multimodal models (UMMs) open a new paradigm by using visual generation for visual world modeling, advancing more human-like reasoning on tasks grounded in the physical world.

⚙️ Visual Superiority Hypothesis

Formulation

  • Multiple Observations of the World: Observations of the same underlying world state can span multiple modalities, including verbal and visual observations.
  • Atomic Capabilities of World Models:
    • World reconstruction: infers complete structure from partial observations and enables novel view synthesis;
    • World simulation: models dynamics to predict future observations.
  • World Model-Based Chain-of-Thought Formulations: CoT reasoning can incorporate internal world modeling by explicitly maintaining an evolving sequence of observations as evidence for reasoning (see the sketch after this list).
    • Reasoning with implicit world modeling: purely verbal CoTs, with no explicit observations generated;
    • Reasoning with verbal world modeling: purely verbal CoTs, with observations expressed as verbal descriptions;
    • Reasoning with visual world modeling: interleaved verbal-visual CoTs, with observations expressed as generated images.
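
As a rough formalization of these three patterns (the notation below is ours for illustration and may differ from the paper's formal definitions), let s denote the latent world state, o an observation of it (verbal or visual), a_t a described action or intervention, and z_t a verbal thought:

    % Illustrative notation only; not necessarily the paper's formal definitions.
    \begin{align*}
      \text{world reconstruction:}\; & p_\theta(o' \mid o)
        && \text{infer a novel observation of the same state from a partial one} \\
      \text{world simulation:}\; & p_\theta(o_{t+1} \mid o_{\le t}, a_t)
        && \text{predict the next observation under the world's dynamics} \\
      \text{world-model-based CoT:}\; & (z_1, o_1, z_2, o_2, \dots, z_T, \hat{y})
        && \text{verbal thoughts interleaved with explicit observations}
    \end{align*}
    % Implicit world modeling drops the o_t entirely; verbal world modeling emits each o_t
    % as a textual description; visual world modeling emits each o_t as a generated image.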

Position

Based on our analysis of world models in terms of both informativeness and knowledge:

The Visual Superiority Hypothesis

In multimodal reasoning tasks grounded in the physical world, visual generation as a world model yields representations that are more informative and knowledge-rich than those produced by verbal world models.
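
One illustrative way to make "more informative" precise (an assumption of ours, not necessarily the paper's formalization) is to view verbal and visual observations as two channels carrying the same latent physical state and compare how much of that state each channel preserves:

    % Illustrative: s is the latent physical world state, o_vis a generated image of it,
    % o_verb a finite verbal description, and I(\cdot\,;\cdot) mutual information.
    I(s;\, o_{\mathrm{vis}}) \;\ge\; I(s;\, o_{\mathrm{verb}})
    % Intuition: for physically grounded tasks, a short verbal description is a lossier
    % channel for continuous spatial structure than a generated image, and the pretrained
    % visual generator additionally contributes priors (geometry, physics) that are hard
    % to verbalize, matching the "knowledge-rich" part of the hypothesis.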

🏆 VisWorld-Eval: Task Suite for Reasoning with Visual World Modeling

While prior work has primarily designed evaluation tasks heuristically, VisWorld-Eval comprises seven tasks spanning both synthetic and real-world domains, designed in a principled manner to isolate and demand specific atomic world-model capabilities (see the sketch after the task list below).

  • World simulation: Paper folding, Multi-hop manipulation, Ball tracking, Maze, and Sokoban
  • World reconstruction: Cube 3-view projection, Real-world spatial reasoning
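
A minimal sketch of how the suite's task-to-capability structure and per-task scoring could be organized (the field names, exact-match scoring, and data layout below are illustrative assumptions, not the released format):

    # Illustrative sketch only; the actual VisWorld-Eval format and metrics may differ.
    from dataclasses import dataclass

    # Task -> atomic world-model capability, as listed above.
    TASK_CAPABILITY = {
        "paper_folding": "world_simulation",
        "multi_hop_manipulation": "world_simulation",
        "ball_tracking": "world_simulation",
        "maze": "world_simulation",
        "sokoban": "world_simulation",
        "cube_3_view_projection": "world_reconstruction",
        "real_world_spatial_reasoning": "world_reconstruction",
    }

    @dataclass
    class Example:
        uid: str             # unique example id
        task: str            # one of TASK_CAPABILITY's keys
        images: list[str]    # paths to the input images
        question: str        # verbal query about the depicted world state
        answer: str          # ground-truth answer

    def accuracy(predictions: dict[str, str], examples: list[Example]) -> float:
        """Fraction of examples whose prediction matches the ground-truth answer."""
        hits = sum(
            predictions.get(ex.uid, "").strip().lower() == ex.answer.strip().lower()
            for ex in examples
        )
        return hits / max(len(examples), 1)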

Leaderboard

Zero-shot evaluation of advanced VLMs on VisWorld-Eval. We report the average accuracy over five tasks (excluding Maze and Sokoban) and over all seven tasks.

| Models               | Paper Folding | Multi-Hop Manip. | Ball Tracking | Cube 3-View | MMSI (Pos. Rel.) | Maze | Sokoban | Overall (5 tasks) | Overall (7 tasks) |
|----------------------|---------------|------------------|---------------|-------------|------------------|------|---------|-------------------|-------------------|
| Proprietary Models   |               |                  |               |             |                  |      |         |                   |                   |
| Gemini 3 Flash       | 25.6          | 75.4             | 55.3          | 52.7        | 41.3             | 73.9 | 99.3    | 50.0              | 60.5              |
| Gemini 3 Pro         | 27.0          | 74.5             | 44.7          | 53.3        | 49.6             | 33.5 | 90.2    | 49.8              | 53.2              |
| Seed 1.8             | 10.6          | 75.2             | 24.4          | 42.5        | 38.8             | 83.9 | 68.3    | 38.3              | 49.1              |
| GPT 5.1              | 6.4           | 73.9             | 34.8          | 44.5        | 44.8             | 0.6  | 62.8    | 40.8              | 38.2              |
| o3                   | 13.5          | 68.1             | 24.7          | 37.7        | 44.4             | 0.0  | 36.0    | 37.6              | 32.0              |
| Open-Source Models   |               |                  |               |             |                  |      |         |                   |                   |
| Qwen3-VL-8B-Thinking | 11.0          | 49.3             | 17.8          | 21.2        | 27.7             | 0.0  | 5.8     | 25.4              | 18.9              |
| BAGEL-7B-MoT         | 11.2          | 31.6             | 19.4          | 26.8        | 27.2             | 0.0  | 0.2     | 23.2              | 16.6              |
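
The two Overall columns are averages of the per-task accuracies; a quick sanity check, assuming plain unweighted averaging over tasks (the reported overalls are presumably computed from unrounded per-task numbers, so recomputing from the rounded table entries can differ by 0.1 for some rows):

    # Recompute the Overall columns from the table's per-task accuracies (Seed 1.8 row).
    TASKS_5 = ["paper_folding", "multi_hop_manip", "ball_tracking", "cube_3_view", "mmsi_pos_rel"]
    TASKS_7 = TASKS_5 + ["maze", "sokoban"]

    def overall(scores: dict[str, float], tasks: list[str]) -> float:
        return round(sum(scores[t] for t in tasks) / len(tasks), 1)

    seed_1_8 = {"paper_folding": 10.6, "multi_hop_manip": 75.2, "ball_tracking": 24.4,
                "cube_3_view": 42.5, "mmsi_pos_rel": 38.8, "maze": 83.9, "sokoban": 68.3}
    print(overall(seed_1_8, TASKS_5), overall(seed_1_8, TASKS_7))  # 38.3 49.1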

🧪 Key Findings

Settings

  • Data construction: For each task, we construct SFT data by designing different CoT patterns with implicit, verbal, or visual world modeling, enabling controlled comparative evaluations (a toy construction is sketched after this list).
  • Post-training UMMs: We perform SFT and RLVR on a state-of-the-art open-source unified multimodal model, BAGEL.
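
As a rough illustration of the three CoT supervision patterns above (the message schema, <image> placeholder convention, and helper below are our assumptions for illustration, not the paper's released pipeline):

    # Illustrative sketch of the three CoT supervision patterns; the actual SFT data
    # construction used in the paper may differ.
    def build_target(pattern: str, verbal_steps: list[str], states: list[str]) -> list[dict]:
        """Compose the CoT supervision target for one training example.

        pattern:       "implicit", "verbal", or "visual" world modeling
        verbal_steps:  verbal reasoning steps, one per intermediate world state
        states:        the matching world states, as text descriptions or image paths
        """
        target = []
        for step, state in zip(verbal_steps, states):
            target.append({"type": "text", "content": step})
            if pattern == "verbal":
                # Explicit observation written out as a verbal description of the state.
                target.append({"type": "text", "content": f"Current state: {state}"})
            elif pattern == "visual":
                # Explicit observation generated as an image; <image> marks its position.
                target.append({"type": "text", "content": "<image>"})
                target.append({"type": "image", "content": state})  # path to rendered image
            # "implicit": no explicit observation between verbal steps.
        return target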

Results

  1. Visual World Simulation Boosts Multimodal Reasoning: on paper folding, multi-hop manipulation, and ball tracking.
  2. Visual World Reconstruction Boosts Multimodal Reasoning: on cube 3-view projection and real-world spatial reasoning.
  3. Visual World Modeling is Unhelpful for Certain Tasks: on maze and sokoban. For these tasks with simple world states, verbal world modeling, or even implicit world modeling, is sufficient.
  4. Do UMMs Compromise Verbal Reasoning Capabilities and Bias Comparisons? No. SFT performance of Qwen2.5-VL with implicit and verbal world modeling is comparable to that of BAGEL.
  5. RL Enhances Various CoTs, Yet Does Not Close the Gap.

Performance of SFT-trained UMMs across seven tasks from VisWorld-Eval.

Performance of RLVR-trained VLMs and UMMs across three representative tasks.

🌌 Showcase

Showcases of interleaved verbal-visual chain-of-thought reasoning generated by post-trained UMMs, where visual generation serves as the world model.
<image> denotes a placeholder indicating the position of a generated image.

Citation

@article{wu2026visual,
    title={Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models}, 
    author={Jialong Wu and Xiaoying Zhang and Hongyi Yuan and Xiangcheng Zhang and Tianhao Huang and Changjing He and Chaoyi Deng and Renrui Zhang and Youbin Wu and Mingsheng Long},
    journal={arXiv preprint arXiv:2601.xxxxx},
    year={2026},
}