World Models | Loghman Samani

The core problem of reinforcement learning is deceptively simple. An agent interacts with an environment, receives rewards, and learns a policy that maximizes long-term return. Model-free methods attack this problem through trial and error. They work, eventually, but they are profligate with data. Training a policy from scratch on a complex task can require millions of environment interactions.

Model-based RL takes a fundamentally different approach. The agent first learns a model of the environment and then uses that model to plan, generate synthetic data, or compute gradients for policy improvement. By utilizing an internal model, the agent can plan a trajectory of actions ahead of time that leads to the desired behavior, rather than relying purely on trial and error.

Model-based methods broadly subdivide into two directions. The first is planning, which can be advantageous over a learned policy but becomes prohibitively expensive over long horizons. The second is augmenting model-free methods by generating synthetic rollouts from a learned model, improving sample efficiency at the risk of propagating model biases into the policy. Managing this tension between efficiency and accuracy has driven most of the field's progress.

Over three decades, this paradigm has evolved from tabular architectures into foundation-scale world models that generate interactive 3D environments in real time. This post surveys that trajectory through 26 papers, following the thread from Sutton's Dyna to DeepMind's Genie 3.

Planning from Experience

Richard Sutton's Dyna architecture [1] introduced one of the clearest formulations of model-based RL. The central idea is that an agent can plan by "trying things in its head," using an internal model of the environment. In Dyna, the agent learns an action model that predicts the consequences of actions. Given a state and an action, the model predicts the resulting reward and next state. This learned model then generates hypothetical experiences, which are treated identically to real interactions.

The Dyna algorithm is elegant. At each step, the agent observes the environment, updates its policy from the real transition, updates its dynamics model, and then performs K additional planning steps using model-generated transitions. By applying RL updates to both real and simulated experience, the agent extracts far more value from every real interaction than a purely model-free approach.
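The loop above can be sketched in a few lines. This is a minimal tabular Dyna-Q sketch, not Sutton's original code: the corridor environment `chain_step`, the deterministic model dictionary, and all hyperparameter values are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, max_steps=200, seed=0):
    """Tabular Dyna-Q: learn from each real step, then replay K model-generated steps."""
    rng = random.Random(seed)
    q = defaultdict(float)   # Q[(state, action)]
    model = {}               # learned deterministic model: (s, a) -> (r, s', done)

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = env_step(s, a)       # real interaction
            update(s, a, r, s2, done)          # direct RL update
            model[(s, a)] = (r, s2, done)      # model learning
            for _ in range(planning_steps):    # K planning updates on imagined transitions
                ps, pa = rng.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            if done:
                break
            s = s2
    return q

def chain_step(s, a):
    """Hypothetical 6-state corridor: a=+1 moves right, a=-1 moves left, reward 1 on reaching state 5."""
    s2 = min(5, max(0, s + a))
    return (1.0 if s2 == 5 else 0.0), s2, s2 == 5

q_values = dyna_q(chain_step, start_state=0, actions=[-1, 1])
```

Setting `planning_steps=0` recovers plain Q-learning, which makes the contribution of the imagined updates easy to compare.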

But uniform planning is wasteful in large state spaces. Most imagined transitions will not change the value function significantly. Moore and Atkeson [2] addressed this with prioritized sweeping, which maintains a priority queue that focuses computation on states whose values are most likely to change. When a real transition triggers a significant update, the algorithm traces backward through predecessor states in order of estimated impact, propagating information efficiently. Prioritized sweeping demonstrated faster convergence than both classical dynamic programming and model-free methods in large Markov decision processes. The core idea of prioritizing which experience to update from would later resurface in deep RL as prioritized experience replay.
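A simplified sketch of the priority-queue idea follows. It is not Moore and Atkeson's exact algorithm: it assumes a known deterministic model, works on Q-values rather than state values, and the corridor model and threshold `theta` are made up for the example.

```python
import heapq

def prioritized_sweeping(model, actions, start_pair, gamma=0.95, theta=1e-4, max_updates=1000):
    """model maps (s, a) -> (reward, next_state); states with no entries are terminal."""
    q = {sa: 0.0 for sa in model}
    predecessors = {}
    for (s, a), (_, s2) in model.items():
        predecessors.setdefault(s2, []).append((s, a))

    def error(sa):
        r, s2 = model[sa]
        best_next = max((q.get((s2, b), 0.0) for b in actions), default=0.0)
        return r + gamma * best_next - q[sa]

    heap = [(-abs(error(start_pair)), start_pair)]   # max-priority via negated keys
    updates = 0
    while heap and updates < max_updates:
        _, sa = heapq.heappop(heap)
        delta = error(sa)                 # recompute: the queue entry may be stale
        if abs(delta) <= theta:
            continue
        q[sa] += delta                    # full backup, valid for a deterministic model
        updates += 1
        for pred in predecessors.get(sa[0], []):   # sweep backward to predecessors
            priority = abs(error(pred))
            if priority > theta:
                heapq.heappush(heap, (-priority, pred))
    return q, updates

# Hypothetical corridor: states 0..4 are non-terminal, state 5 is the rewarding terminal state.
chain_model = {}
for s in range(5):
    for a in (-1, 1):
        s2 = min(5, max(0, s + a))
        chain_model[(s, a)] = (1.0 if s2 == 5 else 0.0, s2)

q, n_updates = prioritized_sweeping(chain_model, actions=(-1, 1), start_pair=(4, 1))
```

Starting from the single rewarding transition, the value information flows backward along the corridor in order of impact instead of being discovered by uniformly sampled updates.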

Taming Model Bias

Moving from tabular methods to continuous domains revealed a fundamental challenge. When an agent plans using an imperfect learned dynamics model, errors compound over long rollouts. The policy may exploit inaccuracies in the model rather than learning genuinely useful behaviors. This problem, model bias, limited the practical applicability of model-based methods for years.

PILCO [3] confronted model bias directly. Rather than fitting a single deterministic dynamics model, PILCO learns a Gaussian process (GP) that maintains a full distribution over possible transition functions. The key insight is illustrated in Figure 1. With limited data, many different dynamics models could plausibly explain the observed transitions. A deterministic model picks one and commits to it with false confidence. Outside the training region, its predictions are essentially arbitrary, yet it claims them with certainty. A probabilistic model, by contrast, expresses honest uncertainty about what it doesn't know.

PILCO model uncertainty illustration — **Figure 1.** With limited transition data (left), multiple dynamics models may plausibly explain observations (center). A deterministic model produces overconfident predictions outside the training region, while a probabilistic model maintains calibrated uncertainty (right). From Deisenroth and Rasmussen [3].

PILCO propagates this uncertainty through the entire planning horizon using deterministic approximate inference over predicted state distributions. Policy gradients are computed analytically with respect to the policy parameters, without requiring an explicit value function. The result is extreme data efficiency, with control policies learned from scratch in a small number of real-world interactions.

Heess et al. introduced the Stochastic Value Gradient (SVG) framework [4], which found a way to backpropagate gradients through stochastic dynamics models. The re-parameterization trick expresses stochastic variables as deterministic functions of exogenous noise. A stochastic transition $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$ is rewritten as $s_{t+1} = f(s_t, a_t, \xi)$ with $\xi \sim \rho(\xi)$, and similarly for stochastic policies. This allows gradients to flow through the model, policy, and value function along real trajectories rather than model-generated rollouts, reducing sensitivity to model inaccuracies while still exploiting the model for analytic gradient computation. SVG bridged model-based and model-free RL in a principled way.
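The re-parameterization trick itself fits in a short example. The sketch below is not the SVG algorithm; it only illustrates the pathwise gradient for a one-dimensional Gaussian "transition" $s = \mu + \sigma\xi$ with an assumed quadratic cost, so the estimates can be checked against the closed-form answers $2\mu$ and $2\sigma$.

```python
import random

def pathwise_gradients(mu, sigma, cost_grad, n_samples=20000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[cost(s)] with s = mu + sigma * xi."""
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(n_samples):
        xi = rng.gauss(0.0, 1.0)     # exogenous noise, independent of the parameters
        s = mu + sigma * xi          # deterministic function of (mu, sigma, xi)
        dc_ds = cost_grad(s)
        g_mu += dc_ds * 1.0          # chain rule: ds/dmu = 1
        g_sigma += dc_ds * xi        # chain rule: ds/dsigma = xi
    return g_mu / n_samples, g_sigma / n_samples

# Assumed cost(s) = s**2, so cost_grad(s) = 2 * s and E[cost] = mu**2 + sigma**2.
g_mu, g_sigma = pathwise_gradients(mu=1.5, sigma=0.5, cost_grad=lambda s: 2.0 * s)
```

Because the noise is sampled outside the differentiated path, the same chain rule extends through a sequence of transitions and policy samples.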

On the representation side, Finn et al. [5] showed that deep spatial autoencoders could extract low-dimensional feature points from raw images, producing state representations specifically designed for control. By learning dynamics models in this compact spatial coordinate space and defining costs based on distances between current and target features extracted from goal images, robotic systems acquired complex visuomotor skills like pushing objects and scooping with a spatula, entirely from raw sensory input.

Deep Dynamics

As neural networks grew more capable, they became the natural choice for dynamics modeling. Nagabandi et al. [6] demonstrated that medium-sized neural networks, combined with model predictive control (MPC), could achieve strong sample efficiency on complex locomotion tasks. Their approach models transitions as $s_{t+1} = s_t + f_\theta(s_t, a_t)$, a residual formulation that simplifies learning when consecutive states are similar. The model is trained by minimizing squared prediction error on observed transitions, and control happens through MPC, which repeatedly solves a finite-horizon optimization, executes only the first action, and replans at the next step. This receding-horizon strategy compensates for modeling errors. They also proposed a hybrid strategy where a policy is first trained to imitate the MPC controller and then refined with model-free RL, combining the sample efficiency of model-based planning with the asymptotic performance of model-free optimization.
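As a rough illustration of the residual model plus receding-horizon control, here is a toy sketch. It swaps the neural network for a one-parameter linear residual and uses simple random shooting as the MPC optimizer; the 1-D environment `true_step`, the goal, and all sizes are assumptions for the example.

```python
import random

def fit_residual_model(transitions):
    """Least-squares fit of delta = theta * a, so that s' = s + f_theta(s, a)."""
    num = sum(a * (s2 - s) for s, a, s2 in transitions)
    den = sum(a * a for _, a, _ in transitions)
    return num / den

def mpc_action(state, theta, goal, rng, horizon=5, n_candidates=200):
    """Random-shooting MPC: score sampled action sequences under the model, keep the best first action."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, cost = state, 0.0
        for a in seq:
            s = s + theta * a              # residual model prediction
            cost += (s - goal) ** 2
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

def true_step(s, a):
    """Hypothetical environment, unknown to the controller."""
    return s + 0.1 * a

rng = random.Random(0)
data, s = [], 0.0
for _ in range(50):                        # collect random transitions
    a = rng.uniform(-1.0, 1.0)
    s2 = true_step(s, a)
    data.append((s, a, s2))
    s = s2
theta = fit_residual_model(data)

state = 0.0
for _ in range(40):                        # execute only the first action, then replan
    state = true_step(state, mpc_action(state, theta, goal=1.0, rng=rng))
```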

Ha and Schmidhuber's World Models [7] brought a conceptual shift that reframed the entire field. Their architecture decomposes the problem into three learnable components. A vision model (V), implemented as a variational autoencoder, compresses each observation frame into a compact latent vector $z_t$. A memory model (M), built as an MDN-RNN, predicts the distribution of the next latent state $P(z_{t+1}|a_t, z_t, h_t)$, where $h_t$ is the RNN's hidden state summarizing past information. A controller (C) selects actions from the current latent state and memory through a simple linear mapping $a_t = W_c[z_t, h_t] + b_c$.

World Models V-M-C architecture — **Figure 2.** World Models architecture. The vision model (V) compresses observations into a compact latent vector, the memory model (M) predicts temporal dynamics, and the controller (C) selects actions based on learned representations. From Ha and Schmidhuber [7].

The revolutionary idea was that once the world model is learned, policy training can happen entirely "inside the dream." The agent generates imagined trajectories using the learned model and optimizes the controller without touching the real environment. Policies trained in this compressed, hallucinated world transferred back to reality, establishing that model-based agents could learn through imagination.

But how far should you trust your learned model? Janner et al. gave a rigorous answer with MBPO [8]. They derived a bound relating the true return $\eta[\pi]$ of a policy in the real environment to its estimated return under the learned model $\hat{\eta}[\pi]$, of the form $\eta[\pi] \geq \hat{\eta}[\pi] - C(\epsilon_m, \epsilon_\pi)$, where $C$ grows with the model error $\epsilon_m$ and the policy divergence $\epsilon_\pi$. Improvements under the model are guaranteed to carry over to the real environment only when they exceed $C$, so the bound is useful only when $C$ is small. Since long rollouts amplify model errors, the practical prescription is clear: use short model rollouts branched from real environment states, and combine these synthetic transitions with real data for training a model-free optimizer like Soft Actor-Critic. By decoupling the task horizon from the model rollout horizon, MBPO avoids compounding errors while still gaining the data efficiency of model-based learning.
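The branched-rollout idea can be sketched as follows. This is only the data-generation step under assumed interfaces: `model_step` and `policy` are hypothetical callables, and the toy model and policy below stand in for the learned ensemble and the Soft Actor-Critic policy.

```python
import random

def branched_rollouts(real_states, model_step, policy, rollout_len, n_branches, rng):
    """Generate short model rollouts that each start from a state seen in the real environment."""
    synthetic = []
    for _ in range(n_branches):
        s = rng.choice(real_states)        # branch point: a real, previously visited state
        for _ in range(rollout_len):       # short horizon, independent of the task horizon
            a = policy(s)
            r, s2 = model_step(s, a)
            synthetic.append((s, a, r, s2))
            s = s2
    return synthetic

# Toy stand-ins (assumptions): 1-D state, reward for staying near zero.
rng = random.Random(0)
real_states = [0.0, 1.0, 2.0]
toy_model = lambda s, a: (-abs(s), s + 0.1 * a)
toy_policy = lambda s: -1.0 if s > 0 else 1.0
batch = branched_rollouts(real_states, toy_model, toy_policy, rollout_len=3, n_branches=4, rng=rng)
```

The resulting transitions would be mixed with real data in the replay buffer of the model-free learner.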

PETS [9] attacked model uncertainty from a complementary angle, combining an ensemble of bootstrapped probabilistic neural networks with sampling-based uncertainty propagation. Each network outputs a Gaussian distribution to capture aleatoric uncertainty (the inherent stochasticity of the environment), while ensemble disagreement captures epistemic uncertainty (what the model does not yet know). Trajectory sampling propagates this dual uncertainty through the planning horizon using particles, with each particle assigned to a bootstrap and re-sampled according to its probabilistic prediction at each timestep. The result was striking: model-free asymptotic performance with up to 125x fewer samples on complex continuous control tasks.
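One common way to make the two kinds of uncertainty concrete is the law of total variance applied to the ensemble's predictions. The sketch below is an illustration of that decomposition for a scalar prediction, not code from PETS, and the ensemble outputs are made-up numbers.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(means, variances):
    """Split predictive variance of an equally weighted Gaussian ensemble into two parts."""
    aleatoric = fmean(variances)     # average noise each member predicts
    epistemic = pvariance(means)     # disagreement between members' mean predictions
    return aleatoric, epistemic, aleatoric + epistemic

# Hypothetical outputs of three ensemble members for one (state, action) pair.
aleatoric, epistemic, total = decompose_uncertainty(
    means=[1.0, 1.2, 0.8], variances=[0.04, 0.05, 0.03]
)
```

With more data the members tend to agree and the epistemic term shrinks, while the aleatoric term stays at the level of the environment's own noise.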

Learning to Dream

PlaNet [10] introduced the Recurrent State Space Model (RSSM), an architecture that became the foundation for the next generation of world models. The key design decision, shown in Figure 3, is splitting the latent state into deterministic and stochastic components. Purely deterministic transitions, as in a standard RNN, prevent the model from representing multiple possible futures and make it vulnerable to planner exploitation. Purely stochastic transitions struggle to maintain information over long horizons. The RSSM combines both:

$$h_t = f(h_{t-1}, s_{t-1}, a_{t-1}), \quad s_t \sim p(s_t | h_t), \quad o_t \sim p(o_t | h_t, s_t)$$

The deterministic path, implemented as an RNN, provides long-range memory. The stochastic path captures the irreducible uncertainty of future observations. PlaNet trains this model with a generalized variational objective called latent overshooting, which regularizes the model to maintain consistency between single-step and multi-step predictions. For control, PlaNet performs all planning in latent space using the Cross-Entropy Method, reaching final performance competitive with model-free algorithms like D4PG while requiring far fewer environment interactions.
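The Cross-Entropy Method used for planning is simple enough to sketch directly. In this illustration the `score` function is a stand-in for the return predicted by rolling an action sequence through the latent model; the quadratic toy score, population sizes, and iteration count are assumptions.

```python
import random
from statistics import fmean, pstdev

def cem_plan(score, horizon, rng, iterations=10, population=100, n_elite=10):
    """Iteratively refit a diagonal Gaussian over action sequences to the top-scoring samples."""
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(population)]
        elites = sorted(candidates, key=score, reverse=True)[:n_elite]
        mean = [fmean([e[t] for e in elites]) for t in range(horizon)]
        std = [pstdev([e[t] for e in elites]) + 1e-6 for t in range(horizon)]
    return mean   # a planner would execute mean[0], then replan at the next step

# Toy score (assumption): the best action sequence is `target`.
target = [0.5, -0.3, 0.8]
toy_score = lambda seq: -sum((a - t) ** 2 for a, t in zip(seq, target))
plan = cem_plan(toy_score, horizon=3, rng=random.Random(0))
```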

PlaNet RSSM designs — **Figure 3.** Latent dynamics designs. (a) Purely deterministic RNN transitions prevent modeling multiple futures. (b) Purely stochastic transitions lose long-term memory. (c) The RSSM splits the state into deterministic and stochastic parts. The model observes the first two timesteps and predicts the third. From Hafner et al. [10].

But online planning at every step is computationally expensive. Dreamer [11] replaced it with learned behaviors. Instead of searching for action sequences at each decision point, Dreamer trains an actor-critic system entirely within the RSSM's latent world model. The agent imagines trajectories in latent space and backpropagates analytic gradients through the imagined dynamics to update the policy. The value function uses $\lambda$-returns to balance bias and variance across imagined horizons:

$$V_\lambda(s_\tau) = (1-\lambda)\sum_{n=1}^{H-1}\lambda^{n-1}V_N^n(s_\tau) + \lambda^{H-1}V_N^H(s_\tau)$$
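Here $V_N^k$ is the $k$-step return: $k$ discounted imagined rewards followed by a bootstrap from the critic. The sketch below evaluates the formula for the first state of an imagined trajectory; the reward and value lists are placeholder numbers, and `rewards[n]` and `values[n]` are assumed to hold the predicted reward and critic value at imagined step `n`.

```python
def n_step_return(rewards, values, k, gamma):
    """V_N^k: k discounted rewards, then bootstrap from the value k steps ahead."""
    return sum(gamma ** n * rewards[n] for n in range(k)) + gamma ** k * values[k]

def lambda_return(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted mix of n-step returns over an imagination horizon H."""
    H = len(rewards)
    assert len(values) == H + 1, "need one value per state, including the final one"
    weighted = sum(lam ** (n - 1) * n_step_return(rewards, values, n, gamma)
                   for n in range(1, H))
    return (1 - lam) * weighted + lam ** (H - 1) * n_step_return(rewards, values, H, gamma)

# Placeholder imagined trajectory of length H = 4.
rewards = [1.0, 0.0, 0.5, 2.0]
values = [3.0, 2.5, 2.0, 1.5, 1.0]
v_lambda = lambda_return(rewards, values)
```

Setting `lam=0` gives the one-step target (low variance, more bias from the critic), and `lam=1` gives the full $H$-step return (less bias, more variance).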

Dreamer components — **Figure 4.** Dreamer operates in three interleaved phases. (a) Learning to encode observations and predict rewards from the dataset of past experience. (b) Learning state values and actions that maximize predicted returns by backpropagating through imagined trajectories. (c) Executing the policy in the environment. From Hafner et al. [11].

Dreamer established the core pattern: learn a world model, then learn a policy inside it. It achieved model-free asymptotic performance with substantially better sample efficiency.

DreamerV2 [12] pushed this framework forward by replacing continuous Gaussian latents with discrete categorical variables: vectors of 32 variables with 32 classes each, optimized using straight-through gradients. Discrete representations capture the non-smooth, multi-modal transitions typical of complex visual environments better than continuous ones. To handle the bootstrap problem of regularizing representations toward an initially poor prior, DreamerV2 introduced KL balancing:

$$\mathcal{L}_{KL} = \alpha \, KL[\text{sg}(q) \| p] + (1-\alpha) \, KL[q \| \text{sg}(p)]$$

where $\text{sg}$ denotes stop-gradient. This applies asymmetric learning rates, pulling the prior toward the representations more aggressively than the reverse. DreamerV2 became the first model-based agent to achieve human-level performance across the Atari benchmark, surpassing model-free methods like Rainbow and IQN.
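Without an autodiff library, the effect of the stop-gradients can still be shown by writing the gradients out by hand for a single categorical variable. This is a sketch under that assumption: $q$ and $p$ are softmaxes of posterior and prior logits, and $\alpha = 0.8$ is used as an example value.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_categorical(q, p):
    return sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p))

def kl_balanced_grads(post_logits, prior_logits, alpha=0.8):
    """Return the loss value and its gradients w.r.t. posterior and prior logits."""
    q, p = softmax(post_logits), softmax(prior_logits)
    kl = kl_categorical(q, p)   # both terms have the same value, so the loss equals KL[q || p]
    # alpha * KL[sg(q) || p]: only the prior logits receive this gradient.
    grad_prior = [alpha * (pj - qj) for qj, pj in zip(q, p)]
    # (1 - alpha) * KL[q || sg(p)]: only the posterior logits receive this gradient.
    grad_post = [(1 - alpha) * qj * (math.log(qj) - math.log(pj) - kl)
                 for qj, pj in zip(q, p)]
    return kl, grad_post, grad_prior

kl, g_post, g_prior = kl_balanced_grads([1.0, 0.0, -1.0], [0.0, 0.0, 0.0])
```

The loss value is unchanged by the balancing; only the relative size of the two gradient paths differs, which is why it behaves like separate learning rates for prior and posterior.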

DreamerV3 [13] completed the trilogy by solving the generality problem. Using a single set of fixed hyperparameters, it mastered diverse domains from Atari to DMLab to Minecraft. The core innovation is symlog predictions, which compress targets to handle varying reward magnitudes:

$$\text{symlog}(x) = \text{sign}(x) \ln(|x| + 1)$$

The reward predictor and critic are parameterized as categorical distributions over exponentially spaced bins using a symexp twohot loss, fully decoupling gradient scales from reward magnitude. DreamerV3 collected diamonds in Minecraft from scratch, a long-standing challenge in AI that requires both deep exploration and long-horizon planning, without human demonstrations or curricula.
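A small sketch of both pieces follows. The bin range and count are assumptions chosen for the example rather than DreamerV3's settings; the point is that a scalar target becomes a soft two-bin label whose expectation recovers the original value.

```python
import math

def symlog(x):
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x):
    return math.copysign(math.expm1(abs(x)), x)

def twohot(value, bins):
    """Encode a scalar as weights on its two neighboring bins (bins sorted ascending)."""
    value = min(max(value, bins[0]), bins[-1])     # clamp into the supported range
    weights = [0.0] * len(bins)
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if lo <= value <= hi:
            w_hi = (value - lo) / (hi - lo)
            weights[i], weights[i + 1] = 1.0 - w_hi, w_hi
            return weights
    raise ValueError("bins must contain at least two ascending values")

# Exponentially spaced bins: symexp of 41 evenly spaced points in [-5, 5] (assumed sizes).
bins = [symexp(-5.0 + 10.0 * i / 40) for i in range(41)]
target = twohot(37.5, bins)                        # soft label for a categorical loss
decoded = sum(w * b for w, b in zip(target, bins)) # expectation under the label
```

Because the network only ever predicts a distribution over fixed bins, the size of its gradients no longer depends on whether rewards are around 0.01 or around 1000.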

DreamerV3 training process — **Figure 5.** DreamerV3 training loop. The world model encodes sensory inputs into discrete representations $z_t$ predicted by a sequence model with recurrent state $h_t$ given actions $a_t$. Inputs are reconstructed to shape the representations. The actor and critic learn from imagined trajectories predicted by the world model. From Hafner et al. [13].

Transformers, Diffusion, and Foundation Models

IRIS [14] reframed world modeling as sequence prediction. A discrete autoencoder compresses raw images into a vocabulary of visual tokens. An autoregressive, GPT-like transformer then predicts future tokens conditioned on past observations and actions, generating rewards and termination signals alongside state predictions. The agent trains entirely within this transformer-simulated imagination, achieving human-level Atari 100k performance after just two hours of gameplay. The approach demonstrated that the same architectural principles driving progress in language modeling could power world models for RL.

IRIS imagination unrolling — **Figure 6.** Imagination in IRIS. The policy (purple) selects actions while the transformer (blue) autoregressively simulates dynamics. The discrete autoencoder (green) provides the symbolic vocabulary for frames, enabling the agent to predict rewards and episode endings within the latent space. From Micheli et al. [14].

TD-MPC [15] took the opposite approach to representation. Rather than tokenizing observations, it learns a task-oriented latent dynamics (TOLD) model purely through reward and value signals. The representation $z_t = h_\theta(s_t)$, together with the latent dynamics $z_{t+1} = d_\theta(z_t, a_t)$, is trained via temporal difference learning while minimizing a multi-step prediction loss that ensures temporal consistency. At each step, the agent performs local trajectory optimization (MPC) over a short horizon using the learned model for immediate reward estimates and a terminal value function $V_\phi(z_{t+H})$ for long-term returns. No reconstruction. No pixel generation. Just a latent space shaped entirely by what matters for the task.
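The quantity the planner maximizes can be written as a short function. This is a sketch with hypothetical callables `dynamics`, `reward`, and `terminal_value` standing in for the learned networks; the scalar latent and toy functions are assumptions for illustration.

```python
def trajectory_value(z, actions, dynamics, reward, terminal_value, gamma=0.99):
    """Discounted model-predicted rewards over a short horizon plus a bootstrapped tail."""
    total, discount = 0.0, 1.0
    for a in actions:
        total += discount * reward(z, a)   # short-term reward from the model
        z = dynamics(z, a)                 # step the latent state forward
        discount *= gamma
    return total + discount * terminal_value(z)   # long-term return from the value function

# Toy stand-ins (assumptions): scalar latent, reward penalizes distance from zero.
value = trajectory_value(
    z=1.0,
    actions=[-1.0, 0.0],
    dynamics=lambda z, a: z + a,
    reward=lambda z, a: -z * z,
    terminal_value=lambda z: 10.0,
    gamma=0.5,
)
```

A sampling-based optimizer would evaluate many candidate action sequences with this function and execute the first action of the best one.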

Perhaps the most significant architectural shift came with DIAMOND [16], which introduced diffusion models as world models. Rather than compressing observations into discrete tokens that might lose critical visual details like distant objects or small UI elements, DIAMOND models environment transitions $p(x_{t+1}|x_{\leq t}, a_{\leq t})$ directly in pixel space. The core is a score-based diffusion model trained to denoise a perturbed observation conditioned on the history of past frames and actions:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\|D_\theta(x_{t+1}^\tau, \tau, x_{\leq t}^0, a_{\leq t}) - x_{t+1}^0\|^2\right]$$

DIAMOND imagination unrolling — **Figure 7.** DIAMOND imagination. Environment time flows horizontally while denoising steps ($\tau$) flow vertically, transforming noisy samples into clean next observations. The process is autoregressive: predicted frames become conditioning for subsequent steps. From Alonso et al. [16].

DIAMOND achieved a mean human-normalized score of 1.46 on Atari 100k, the best among agents trained entirely within a world model. Its generative capacity was further demonstrated as an interactive neural game engine for Counter-Strike: Global Offensive, simulating complex 3D dynamics, muzzle flashes, and movement with enough consistency to remain interactive.

**Figure 8.** DIAMOND's diffusion world model trained to simulate Counter-Strike: Global Offensive, replacing the game engine entirely with a neural network. From Alonso et al. [16].

UniSim [17] scaled the world model concept to real-world simulation. A conditional diffusion model, trained on diverse datasets spanning human activities, navigation, and robotics, learns to model the transition distribution $p(x_t|x_{<t}, a_{t-1})$ where different action modalities (text instructions, motor torques, camera trajectories) are mapped into a unified continuous embedding space. The factorized trajectory distribution $p(x_{1:T}|a_{0:T-1}, x_0) = \prod_{t=1}^T p(x_t|x_{t-k:t-1}, a_{t-1})$ keeps generation temporally consistent by conditioning each step on a fixed window of the $k$ most recent frames. By training on a broad distribution of real-world data, UniSim bridges the sim-to-real gap: embodied planners and control policies trained in the simulator can be deployed in physical environments, with sparse robotics data benefiting from rich visual priors in internet video.
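The windowed factorization corresponds to a simple autoregressive loop. In this sketch `predict_next` is a hypothetical stand-in for sampling from the conditional model, and frames are plain numbers so the bookkeeping is easy to follow.

```python
from collections import deque

def autoregressive_rollout(first_frames, actions, predict_next, window):
    """Generate one frame per action, conditioning each step on the last `window` frames."""
    context = deque(first_frames, maxlen=window)   # old frames fall out automatically
    frames = list(first_frames)
    for a in actions:
        nxt = predict_next(list(context), a)       # stands in for sampling x_t from the model
        frames.append(nxt)
        context.append(nxt)                        # generated frames become conditioning
    return frames

# Toy "model" (assumption): next frame is the sum of the context plus the action.
frames = autoregressive_rollout(
    first_frames=[1, 2], actions=[1, 1, 1],
    predict_next=lambda ctx, a: sum(ctx) + a, window=2,
)
```

Since generated frames feed back in as context, any prediction error is also fed back, which is why long-horizon consistency is the hard part for every model in this section.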

The Genie series from Google DeepMind then pushed toward foundation world models. Genie [18] demonstrated that a controllable world model could be learned from unlabeled internet gaming videos, starting from a raw corpus of over 200,000 hours that was filtered down to roughly 30,000 hours for training. The key innovation is learning a latent action space without ground-truth action annotations. A spatiotemporal video tokenizer maps frames to discrete tokens, a Latent Action Model discovers the structure of transitions between frames, and a MaskGIT-based dynamics model autoregressively predicts future states. At inference, the dynamics model predicts $z_{t+1} \sim p(z_{t+1}|z_{1:t}, a_{1:t})$ given past frame tokens and user-provided latent actions, generating a diverse range of interactive 2D environments from platformers to sketches.

**Figure 9.** Genie inference. The video tokenizer converts frames into discrete tokens while the dynamics model iteratively predicts future frames conditioned on tokens and user-defined latent actions, enabling frame-by-frame interaction within generated environments. From Bruce et al. [18].

Genie 2 [19] marked the transition from 2D to rich 3D. Built as an autoregressive latent diffusion model with a causal transformer backbone, it generates interactive 3D environments from a single image prompt. A major technical advance is long-horizon memory: the model remembers and accurately re-renders parts of the environment that have temporarily left the agent's field of view, maintaining consistency for up to a minute. Genie 2 also demonstrated emergent capabilities including complex character animations, realistic physics (collisions, gravity, fluid dynamics), and NPC behavior, supporting diverse perspectives and keyboard-and-mouse control through classifier-free guidance.

Genie 3 [20] achieved what its predecessors could not: genuine real-time interactivity. Generating photorealistic 3D worlds at 720p resolution and 24 FPS, Genie 3 maintains spatial and visual consistency for several minutes, with visual memory extending as far as one minute into the past. It also introduced promptable world events, allowing users to alter the simulation mid-session through natural language, such as changing weather conditions or spawning new characters. The SIMA embodied agent was deployed inside Genie 3 worlds for goal-directed navigation, validating these AI-generated environments as practical training grounds.

**Figure 10.** Genie 2 (left) vs Genie 3 (right) as a robot navigates a city street. Genie 3 exhibits significantly higher visual clarity, more stable geometry, and consistent lighting over long durations, while Genie 2 begins to lose coherence as the agent explores further.

From Imagination to Reality

DayDreamer [21] proved that dream-based training works on physical hardware. Applying the Dreamer algorithm directly to real robots, the system uses an asynchronous architecture where one process collects experience in the real world while a separate process continuously updates the world model and policy through latent imagination. An A1 quadruped learned to roll, stand, and walk from scratch in one hour of real-time interaction. Model-free methods would require days for the same task. The result demonstrated that world models can sidestep the complexities of sim-to-real transfer by learning the dynamics of reality directly.

DayDreamer architecture — **Figure 11.** DayDreamer runs two processes in parallel. The environment process collects experience and populates a replay buffer, while the training process learns a world model and optimizes a policy through latent imagination. The robot learns continuously during interaction. From Wu et al. [21].

UniPi [22] reimagined the control problem entirely, replacing the standard mapping from states to actions with text-conditioned video generation. A diffusion model acts as a high-level planner, synthesizing a video depicting the completion of a text-specified goal. A low-level inverse dynamics model then extracts the control actions needed to follow the generated trajectory. This "policy as video" formulation uses images as a universal interface to bridge different environments and robot embodiments. The framework achieves combinatorial generalization, solving novel tasks without additional fine-tuning, and bypasses manual reward design by using natural language as the primary goal specification.

The Neural Engine Era

GameNGen [23] demonstrated that a neural model could replace a traditional game engine entirely. The pipeline has two phases. First, an RL agent plays DOOM and its gameplay trajectories are recorded. Second, a generative diffusion model (fine-tuned from Stable Diffusion v1.4) is trained to predict the next frame conditioned on a history of past frames and actions. To prevent sampling divergence during autoregressive generation, conditioning augmentations add varying noise levels to context frames during training. The result is a real-time neural simulation at 20 FPS on a single TPU, with human raters barely exceeding chance when distinguishing the neural simulation from the real game.

GameNGen methodology — **Figure 12.** GameNGen pipeline. An RL agent generates gameplay trajectories (left) that train a diffusion model to predict next frames conditioned on past observations and actions (right). From Valevski et al. [23].

Oasis [24], developed by Decart and Etched, extended this to open-world 3D. Using a Diffusion Transformer (DiT) architecture, it generates Minecraft gameplay at 20 FPS directly from keyboard and mouse inputs. A spatial autoencoder compresses frames into latent space, and a latent diffusion backbone predicts subsequent frames autoregressively. Diffusion forcing, which allows varying noise levels across the context window, maintains temporal consistency. Oasis became the first publicly playable neural world model, demonstrating that transformer-based generation (without any game engine) can sustain 3D environment consistency.

NVIDIA's Cosmos [25] addresses the broader infrastructure problem. As a world foundation model platform for physical AI, Cosmos is pre-trained on over 100 million high-quality video clips to internalize physical laws across a wide range of scenarios. Offering both transformer-based diffusion and autoregressive architectures, the platform is designed for post-training on task-specific datasets, whether for autonomous driving, robotic manipulation, or camera control. Released as open-weight and open-source, Cosmos provides the community with a general-purpose foundation for building specialized world models.

V-JEPA 2 [26] from Meta introduces a fundamentally different perspective. Rather than generating pixels, it predicts future states in a learned latent representation space using a Joint-Embedding Predictive Architecture. Pre-trained on over one million hours of internet video through self-supervised masking objectives, V-JEPA 2 captures physical dynamics without generating any visual output. By predicting abstract representations rather than reconstructing observations, the model filters out unpredictable visual noise and focuses on essential dynamics. Its action-conditioned variant, V-JEPA 2-AC, is post-trained on less than 62 hours of unlabeled robot video and enables zero-shot robotic planning for pick-and-place tasks in novel environments, without task-specific data, fine-tuning, or reward engineering. This validates an important hypothesis: world models do not need to be generative to be effective for planning and control.

Looking Ahead

The path from Dyna's tabular planning to Genie 3's real-time world generation spans more than three decades, but the foundational insight has not changed. An agent that can simulate the consequences of its actions internally gains a decisive advantage in sample efficiency, planning capability, and generalization.

What has changed is the scale and fidelity of that simulation. Dyna imagined transitions in a lookup table. PILCO modeled uncertain dynamics with Gaussian processes. The Dreamer series taught agents to dream in compact latent spaces. And the current generation of foundation world models generates entire interactive environments from text prompts, at resolutions and frame rates that were the domain of traditional game engines just a few years ago.

Two competing paradigms are taking shape. Generative world models, from diffusion-based systems like DIAMOND, GameNGen, and Oasis to autoregressive token transformers like IRIS, with Genie 3 and Cosmos at foundation scale, argue that high-fidelity generation is necessary for agents to reason about their environments. Latent predictive models like V-JEPA 2 argue that predicting in representation space is sufficient, more efficient, and more robust. Whether one of these approaches will dominate, or whether some synthesis of both will emerge, is among the most interesting open questions in the field.

What seems clear is that the boundary between reinforcement learning, generative modeling, and physical simulation is dissolving. World models sit at the convergence point, and the rate of progress suggests we are still in the early chapters of this story.