Humans construct internal models of the world and reason by manipulating the concepts within
these models. Recent advances in artificial intelligence (AI), particularly chain-of-thought (CoT)
reasoning, approximate such human cognitive abilities, with world models believed to be
embedded within large language models. Current systems, which rely on verbal reasoning as their
primary information-processing pathway, have achieved expert-level performance in formal and
abstract domains such as mathematics and programming. However, they
still lag far behind humans in domains like physical and spatial intelligence, which require richer
representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable
of both verbal and visual generation has therefore sparked interest in more human-like reasoning
grounded in complementary multimodal pathways, though a clear consensus on their benefits has
not yet been reached. From a world-model perspective, this paper presents the first principled study
of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis:
for certain tasks, particularly those grounded in the physical world, visual
generation serves more naturally as a world model, whereas purely verbal world models encounter
bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically,
we formalize internal world modeling as a core component of deliberate CoT reasoning and analyze
distinctions among different forms of world models from both informativeness and knowledge
aspects. Empirically, we identify and design tasks that necessitate interleaved visual-verbal CoT
reasoning, constructing a new evaluation suite, VisWorld-Eval. Through controlled
experiments on a state-of-the-art UMM, we show that interleaved CoT significantly outperforms
purely verbal CoT on tasks that favor visual world modeling. Conversely, it offers no clear advantage
for tasks that do not require explicit visual modeling. Together, these insights and findings clarify the
applicability and potential of multimodal world modeling and reasoning for more powerful, human-like
multimodal AI. We publicly release our evaluation suite to facilitate further research.