14 6 月, 2026

Introduction to World Models (Part I): What’s a World Model?

If you have been following the…

If you have been following the artificial intelligence space lately, you’ve likely noticed a massive buzzword dominating the conversation: World Models.

Every major research lab is racing to claim they’ve built one. OpenAI hinted that Sora is a “world simulator,” World Labs raised hundreds of millions for “spatial intelligence,” and Yann LeCun insists that true autonomous AI is impossible without a world model at its core.

But what actually is a world model? Is it just a glorified video generator, or is it something fundamentally deeper? As an engineer working in autonomous driving, this isn’t just an academic debate for me—it’s the core engineering puzzle of how we teach vehicles to navigate a chaotic world safely.

Let’s strip away the marketing hype and look at the actual physics, math, and architecture.

Table of Contents

The Two Formulas: POMDP vs. Yann LeCun

To understand world models, we have to look at the two distinct mathematical formulations that dominate the literature.

1. The Classical Classical RL View (POMDP)

In traditional robotics and reinforcement learning, the world is often modeled as a Partially Observable Markov Decision Process (POMDP). Standard RL assumes an agent can perfectly see everything. A POMDP is more realistic: it acknowledges that a robot’s sensors (cameras, LiDAR) only give a partial, noisy, and ambiguous slice of reality.

In this tradition, a world model breaks down into three probabilistic components:

Transition Model $P(s_{t+1} \mid s_t, a_t)$ : How the hidden state of the world (s) evolves when the agent takes an action (a).
Observation Model $P(o_t \mid s_t)$ : How sensor readings (o) are generated from that hidden state.
Reward Model $P(r_t \mid s_t)$ : What those states are worth to the agent.

2. Yann LeCun’s Modern Definition

In his 2022 position paper, Meta’s Chief AI Scientist Yann LeCun proposed a cleaner, more deterministic formulation.

Given an observation x(t), a previous state estimate s(t), an action proposal a(t), and a latent variable z(t), a world model computes:

$\text{Representation: } h(t) = \text{Enc}(x(t))$

\text{Next State Prediction: } s(t+1) = \text{Pred}(h(t), s(t), z(t), a(t))

Three critical things distinguish LeCun’s view:

No Rewards Inside: Costs and rewards live in an entirely separate module. The world model’s only job is to understand physics, not goals.
Deterministic with Latent Variables: The predictor itself is deterministic. The messy, unpredictable nature of the real world is entirely pushed into the latent variable z(t), which parameterizes the set of plausible futures.
No Pixels Required: The model predicts the next representation $s(t+1)$ in an abstract embedding space. Decoding it back into actual pixels is a separate, optional engineering choice (one that LeCun’s JEPA family proudly skips).

The 3 Rules of the “World Model Test”

How do you tell if a model is a true world model or just a fancy video generator? A model must pass a strict membership test based on three core properties:

1. It Must Be Action-Conditioned

A world model doesn’t just predict what happens next; it answers: “What happens next if I take action a(t)?” This is the line in the sand between a world model and a video model like OpenAI’s Sora or Google’s Veo. Sora can generate a gorgeous clip of a jeep driving through dust, but it doesn’t accept a steering or braking command at every single timestep to alter the jeep’s trajectory mid-generation. DeepMind’s Genie 3, however, does—it takes user input (WASD keys) to dynamically alter the environment.

2. It is Causal, Not Correlational

Video generators are trained to maximize pixel likelihood. They find the closest match in their training data and output a plausible next frame.

Imagine dropping a glass beaker onto a trampoline. A video model, having seen thousands of falling glasses shatter, will likely predict the glass shatters on impact. It models correlation. A true world model needs to understand causality—the underlying intuitive physics. It should predict that the glass will bounce.

A 2024 paper from ByteDance Seed (How Far Is Video Generation from World Model) proved this exact flaw: when tested out-of-distribution (OOD) with unseen combinations of shapes and velocities, video generation diffusion models completely collapsed, prioritizing pixel statistics over physical laws.

3. It Must Maintain Multi-Step Consistency

A world model must “close the loop” on itself. Its prediction at $t+1$ becomes the input for its prediction at $t+2$ . In these open-loop rollouts, errors compound exponentially. This is where most models die. While standard video models degrade into a hallucinatory blur within seconds, advanced generative world models like Dreamer 4 can maintain structural coherence over tens of thousands of actions.

Mapping the 5 Philosophical Camps

The AI research ecosystem is currently split into five distinct camps based on two main questions: Is it action-conditioned? and Does it predict in pixel space or abstract latent space?

Camp	Predicts In	Action-Conditioned?	Headline Example	Core Purpose
1. Video Generation	Pixel Space	No	Sora, Veo, Kling	Creative tools; video art
2. Spatial / 3D Intelligence	3D Scene Space	No	World Labs (Marble)	Persistent 3D asset generation
3. Generative World Models	Pixel/Token Space	Yes	Genie 3, Wayve GAIA-2	Agent training inside imagination
4. Latent World Models	Embedding Space	Yes	V-JEPA 2	Highly sample-efficient planning
5. Infrastructure	Platform Space	N/A	NVIDIA Cosmos	“Picks and shovels” computational base

Why This Matters for Autonomous Vehicles

In the autonomous vehicle (AV) industry, we care deeply about Camp 3 (Generative World Models).

Consider Wayve’s GAIA-2. It generates multi-camera driving scenes with precise control over ego-vehicle actions (steering, braking) and environmental factors.

Why not just use real-world driving data? Because the real world is full of “boring” miles. The safety of an AV lives in the long tail—the highly improbable, dangerous edge cases. We can’t safely or easily collect thousands of real-world clips of a pedestrian suddenly stepping off a curb behind a parked truck in heavy rain.

A generative world model lets us synthesize these highly specific, physics-compliant scenarios on demand. It turns an AV training pipeline into an infinitely varied classroom, allowing our driving policies to practice corner cases inside a “dream” before ever hitting real asphalt.

To put it simply: Recall how we mitigate overfitting by using data augmentation techniques? The birth of Generative World Models essentially aims to augment our normal training data with realistic, long-tail scenarios (vehicle cut-ins, pedestrians jaywalking, and active emergency vehicles passing through). By using a high-quality world model, we can generate infinite long-tail scenarios, training our driving policies to understand and handle these critical cases safely.

You might think: once we augment our dataset with these so-called “long-tail scenarios”, would these long-tail cases still be “long-tail”?

The short answer is: Statistically, yes, they are still long-tail in the real world. But computationally inside your training loop, they are intentionally flattened.

Looking Ahead

The great philosophical war in AI right now is between the tribes who believe rendering pixels is necessary (to force the model to stay grounded in visual reality) and those who believe pixels are a harmful distraction (forcing the network to waste capacity memorizing the texture of tree leaves instead of focusing on the macro-physics of motion).

Next week, we will dig directly into that battlefield, unpacking Camp 4 (JEPA) and Camp 5 (NVIDIA Cosmos) to see how abstract embeddings might completely change how robots plan their futures. Stay tuned!

What are your thoughts? Should a self-driving car model have to imagine every pixel of a rainy windshield, or should it only care about the abstract physics of traction and obstacle paths? Let me know in the comments below!

Goro Yeh 56

Introduction to World Models (Part I): What’s a World Model?

The Two Formulas: POMDP vs. Yann LeCun

1. The Classical Classical RL View (POMDP)

2. Yann LeCun’s Modern Definition

The 3 Rules of the “World Model Test”

1. It Must Be Action-Conditioned

2. It is Causal, Not Correlational

3. It Must Maintain Multi-Step Consistency

Mapping the 5 Philosophical Camps

Why This Matters for Autonomous Vehicles

Looking Ahead

Recommended Readings:

Comments

發佈留言取消回覆

Introduction to World Models (Part I): What’s a World Model?

The Two Formulas: POMDP vs. Yann LeCun

1. The Classical Classical RL View (POMDP)

2. Yann LeCun’s Modern Definition

The 3 Rules of the “World Model Test”

1. It Must Be Action-Conditioned

2. It is Causal, Not Correlational

3. It Must Maintain Multi-Step Consistency

Mapping the 5 Philosophical Camps

Why This Matters for Autonomous Vehicles

Looking Ahead

Recommended Readings:

Comments

發佈留言 取消回覆

發佈留言取消回覆