How to Improve World Models

Ongoing Research and the Future

April 6, 2026

Even though we train an LLM to do nothing more than predict the next token, the model learns an abstract representation of how our language describes the world. This abstract internal understanding is often called a world model.

Think of it this way: imagine we feed an LLM a long detective story with many different characters, each engaged in different activities. Near the end of the story there is a sentence like: "and then the great detective said, the person who committed the murder is...". We ask the LLM to predict the next token, fill in the blank, and tell us who did it based on all of the previous text in the entire detective story up to that point.
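Next-token prediction can be made concrete with a toy sketch. The vocabulary, logits, and suspects below are all made up for illustration; a real LLM produces logits over tens of thousands of tokens from a deep network, but the final step is the same: turn scores into a probability distribution and pick a continuation.

```python
import numpy as np

def softmax(logits):
    # Convert raw scores into a probability distribution over the vocabulary.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

# Hypothetical tiny vocabulary and the logits a model might emit after
# reading the whole story up to "the person who committed the murder is...".
vocab = ["butler", "gardener", "detective", "victim"]
logits = np.array([3.1, 0.4, -1.2, -2.0])

probs = softmax(logits)
next_token = vocab[int(np.argmax(probs))]
print(next_token)  # the model's single most likely continuation
```

Everything the model "knows" about the story has to be compressed into those four numbers: the better its internal world model, the more probability mass lands on the actual culprit.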

To answer correctly, the model has to have tracked the motivations, whereabouts, and relationships of every character throughout the entire story. The better an LLM is at predicting the next token (and telling us who committed the murder after reading the entire detective story), the better the abstract representation of how our world works that it has learned.

We train LLMs only to predict the next text token. But in the process of learning to predict the next token more and more accurately, an LLM learns to better understand (1) the grammatical rules of how we use language and manipulate words, (2) basic facts about the world that we represent using language, and (3) the relationships between different objects and facts about the world that we encode when using language.

The better a model becomes at predicting the next token, the more accurate and complete its internal representation of the world becomes: its world model improves.

Because of this, training LLMs to correctly predict the next token has been an extremely fruitful direction for building an AI assistant that possesses some basic but genuine knowledge of the world.

To push this further, there are several promising directions.

Multimodal Next-Token Prediction

One direction for improving future LLMs is to feed the model more, and richer, data.

We have already used essentially all of the text data available across the internet.

But when a child is learning about the world, only a small fraction of what it learns comes from written text. The vast majority comes from direct observation of the world: watching how physical objects behave, and seeing and hearing how people behave and talk to each other.

There is a wealth of additional information about how the world works that we as humans gather simply by observing it. When we see a situation, we have a general intuition about what will happen next.

When it comes to training AI models, there is an enormous amount of additional information about how the world works encoded in audio and video that we have not yet used to train our models. If we can train a model to not only predict the next text token, but the next frame of a video, or more generally the next state of the world, then we can use all of the audio and video data we have to improve its world model.

This approach (generalising the ideas of LLMs to use and combine multiple input data sources) is often called training multimodal world models, and it is one of the most active areas of research today.

Rather than making a model predict the exact pixel values of the next video frame (which is astronomically complex and mostly irrelevant — we don't care about the exact colour of each pixel, we care about what happens), we can train it to predict an abstract representation of the next frame. Similar to how an LLM also predicts an abstract representation of what the next text tokens describe.

This approach of working with embeddings rather than raw data turns out to be extremely powerful. These representations are stored as high-dimensional vectors, where nearby points correspond to similar concepts. The space they live in is the latent space.
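The phrase "nearby points correspond to similar concepts" can be checked directly with a similarity measure on vectors. The 4-dimensional embeddings and their values below are invented for illustration (real latent spaces have hundreds or thousands of dimensions), but the geometry is the same: cosine similarity is high for related concepts and low for unrelated ones.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings in a tiny 4-d latent space.
dog_photo   = np.array([0.9, 0.1, 0.0, 0.2])
dog_sketch  = np.array([0.8, 0.2, 0.1, 0.3])
spreadsheet = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(dog_photo, dog_sketch))   # high: nearby concepts
print(cosine_similarity(dog_photo, spreadsheet))  # low: distant concepts
```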

Joint-Embedding Predictive Architectures

Yann LeCun has proposed a different architectural blueprint for world models called a Joint-Embedding Predictive Architecture, or JEPA. The idea here is that predicting the next abstract representation in latent space might still result in very limited intelligent systems.

Instead of storing all of the data as points in a latent space, and training a model to predict the next abstract representation in latent space, it might be much more powerful to train a model to change and manipulate these abstract representations.

Consider two different images of the same underlying scene or concept (for example, images of the same object or person seen from different perspectives, or partially overlapping segments of the same video). Each view can be independently passed through an encoder network that maps it to a point (an embedding) in a shared latent space.

We do not want both of these views to map to the exact same point in latent space. This is a frequently encountered problem when training encoders: very similar-looking inputs (like the same dog, seen from slightly different angles) end up being encoded to the exact same point in latent space. The resulting encoder is not able to distinguish between these different views of the same object; everything gets mapped to the same abstract representation. State-of-the-art machine learning recipes have accumulated a toolkit of heuristics to prevent this from happening.

When lots of slightly different representations of the same object all get mapped to the same point in latent space, a predictive network (which has to predict the next latent-space representation) can 'cheat', and achieve very high prediction scores in the low-resolution representation space it is working with. (...) When we train encoders for next-token prediction alone, different views of the same object are likely to still get mapped to the same point in latent space.
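The 'cheating' collapse can be shown in two lines. The degenerate encoder below is deliberately broken for illustration: it achieves a perfect agreement score between any two views precisely because it throws all information away.

```python
import numpy as np

# A degenerate encoder that "cheats": it maps every input to the same
# point in latent space, so any two views trivially agree.
def collapsed_encoder(view):
    return np.array([1.0, 0.0, 0.0])

rng = np.random.default_rng(5)
view_a = rng.normal(size=10)
view_b = -view_a  # a completely different input

z_a, z_b = collapsed_encoder(view_a), collapsed_encoder(view_b)
agreement_loss = float(np.sum((z_a - z_b) ** 2))
print(agreement_loss)  # 0.0: a perfect score, with zero information retained
```

This is why naive "make the embeddings agree" objectives need extra machinery to keep the latent space from imploding.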

By improving the encoder, we improve the quality of the representations, which means we are working in a much richer, higher-resolution representation space. Everything we train on top of it (like predictive networks) becomes more useful as a consequence, because high prediction scores in this higher-resolution representation space actually mean something.

The core of Yann LeCun's idea is an elegant proposal to improve the training process for these encoders, and achieve this:

The Joint Embedding architecture runs the encoder twice (once per view) and then asks: can we make the embeddings of these two views end up close together in latent space? Because both embeddings live in the same latent space, they are called a joint embedding. This way the model is forced to learn representations that are invariant to superficial transformations (brightness, crop, rotation) but sensitive to deep semantic content.
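A minimal sketch of that objective, with invented stand-ins: the "encoder" is a single random linear map (real systems use deep networks), and the two "views" are the same scene with small additive noise playing the role of augmentations. The training signal is just the distance between the two embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear encoder: maps a 6-d "view" to a 3-d embedding.
W = rng.normal(size=(3, 6))

def encode(view):
    z = W @ view
    return z / np.linalg.norm(z)  # unit-normalise the embedding

# Two augmented views of the same underlying scene (noise stands in
# for crops, brightness changes, etc.).
scene = rng.normal(size=6)
view_a = scene + 0.05 * rng.normal(size=6)
view_b = scene + 0.05 * rng.normal(size=6)

z_a, z_b = encode(view_a), encode(view_b)

# Joint-embedding objective: push the two embeddings together.
invariance_loss = float(np.sum((z_a - z_b) ** 2))
print(invariance_loss)  # small when the encoder is invariant to the augmentation
```

Minimising this loss alone is exactly what invites the collapse problem above, which is why JEPA adds a predictor rather than relying on agreement by itself.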

This is arguably closer to how biological intelligence works. When you watch someone pick up a glass, you don't consciously simulate the exact position of every photon that will bounce off the glass next. You predict, in an abstract and compressed way, how the glass will rise and tilt.

Where JEPA differs from classical joint embedding methods (SimCLR, DINO), is the "Predictive" in Joint-Embedding Predictive Architecture:

We want the encoder to understand the relationship between different views. We do this by training a small predictor network to take the embedding of one view (the context) and predict the embedding of the other view (the target).

By training our encoder so that it can predict the (target) embedding of one view based on the (context) embedding of the other view, we force it to learn rich, structured representations of the world. It has to understand enough about the scene to anticipate what the hidden part (visible from the other view) looks like at an abstract level, without getting lost in irrelevant pixel-level details.
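The predictor step can be sketched in isolation. Everything below is an illustrative toy: the two embeddings are fixed random vectors (standing in for encoder outputs of the two views), and the predictor is a single linear map trained by gradient descent on the mean-squared error between its prediction and the target embedding.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 3

# Stand-ins for the encoder outputs of the context and target views.
z_context = rng.normal(size=dim)
z_context /= np.linalg.norm(z_context)  # unit norm keeps the updates stable
z_target = rng.normal(size=dim)

# Hypothetical linear predictor P: context embedding in, predicted
# target embedding out. Real JEPA predictors are small neural networks.
P = np.zeros((dim, dim))
lr = 0.1
for _ in range(200):
    pred = P @ z_context
    error = pred - z_target
    P -= lr * np.outer(error, z_context)  # gradient of 0.5*||error||^2 w.r.t. P

final_loss = float(np.sum((P @ z_context - z_target) ** 2))
print(final_loss)  # approaches 0: the predictor learns the view-to-view map
```

In a full system the encoder and predictor are trained jointly, so the prediction loss also shapes the representations themselves.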

Neurosymbolic AI

Ideally, we even want our latent-space representations to work in such a way that we can perform changes of perspective, or other manipulations, directly on the latent-space embedding of an object, allowing the network to capture a much richer structure in the abstract representations we are teaching it.

Notice the predictor in JEPA is a learned function that takes an embedding as input and produces an embedding as output. It is, in precise terms, an operator on latent space — a transformation T: Z → Z that acts entirely in the abstract representational space, never touching raw pixels or tokens.

Now imagine not just one predictor, but a whole library of learned operators, each encoding a different kind of transformation.

Each of these operators is a neural network — learned, continuous, differentiable. But the composition of operators is symbolic: you can chain them and use them as building blocks in a plan.

This is the central promise of neurosymbolic AI: neural representations combined with symbolic composability. The boundary between "neural" and "symbolic" dissolves: the operators are neural (learned, approximate, robust to noise), but their composition is symbolic (discrete, structured, interpretable). You get the best of both worlds.
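Here is what that interface looks like in miniature. The operator names and the linear maps below are invented placeholders (a trained system would use neural networks for each operator), but they show the key point: each operator is T: Z → Z, and a plan is just a symbolic list of operator names that gets chained.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4

# Hypothetical learned operators on the latent space Z = R^4.
# Linear maps stand in for trained neural networks.
operators = {
    "rotate_view": 0.5 * rng.normal(size=(dim, dim)),
    "move_forward": 0.5 * rng.normal(size=(dim, dim)),
}

def apply(op_name, z):
    # Neural step: one operator acts on a latent state.
    return operators[op_name] @ z

# Symbolic step: a plan is a discrete, inspectable sequence of names.
plan = ["rotate_view", "move_forward", "rotate_view"]

z = rng.normal(size=dim)  # current latent state
for op_name in plan:
    z = apply(op_name, z)

print(z.shape)  # the result is still a point in latent space
```

The dictionary of names is the symbolic layer; the matrices inside it are the neural layer. Swapping a matrix for a deep network changes nothing about how plans are written or composed.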

An agent built this way does not reason in language. It reasons in latent space. Its planning loop looks something like this:

The agent searches over sequences of operators — essentially programs written in latent space — to find a path from the current state to a desired goal state.
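That search can be sketched as brute force over short operator programs. The operator library and the goal below are synthetic: the goal is constructed by applying a known sequence to the start state, so the search has something recoverable to find. A real planner would use a learned heuristic or gradient-based search instead of enumeration.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
dim = 3

# Hypothetical operator library (stand-ins for trained networks).
ops = {
    "A": rng.normal(size=(dim, dim)),
    "B": rng.normal(size=(dim, dim)),
}

def rollout(plan, z):
    # Simulate a program: apply each operator in sequence, in latent space.
    for name in plan:
        z = ops[name] @ z
    return z

start = rng.normal(size=dim)
goal = rollout(("A", "B"), start)  # a goal state we know is reachable

# Search over all operator programs of length 1 and 2.
best_plan, best_dist = None, float("inf")
for length in (1, 2):
    for plan in product(ops, repeat=length):
        dist = float(np.linalg.norm(rollout(plan, start) - goal))
        if dist < best_dist:
            best_plan, best_dist = plan, dist

print(best_plan)  # the program that reaches the goal state
```

The output of the search is itself symbolic: a readable sequence of operator names, even though every step of its evaluation ran through neural components.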

Neurosymbolic Agentic Systems

This has a direct and practical implication for agentic systems. Today's orchestration layers pass text strings between an LLM and the tools it calls. Every step involves serialising the model's internal state into language, calling a tool, and deserialising the result back. This is lossy and slow.

In a neurosymbolic architecture, an agent would call operators with embedding inputs and receive embedding outputs, translating back to language or pixels only at the very edges of the system — at the point of human interaction or physical actuation.
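As a sketch of that boundary, consider two hypothetical tools that exchange embeddings directly, with a decode step only at the edge. The tool functions and the decoding below are invented placeholders, not a real agent framework's API.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical tools in a neurosymbolic agent: embedding in, embedding out,
# with no text serialisation between steps.
def retrieve_tool(z):
    return z + rng.normal(scale=0.01, size=z.shape)

def summarise_tool(z):
    return 0.5 * z

def decode_to_text(z):
    # Only at the system's edge is the latent state turned into language.
    return f"latent state with norm {np.linalg.norm(z):.2f}"

z = rng.normal(size=8)                # agent's internal latent state
z = summarise_tool(retrieve_tool(z))  # tool calls stay in latent space
print(decode_to_text(z))              # language only for the human at the edge
```

Contrast this with today's orchestration loops, where every arrow in that pipeline would pass through a lossy round-trip of text.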
