Yann LeCun Is Betting AI Needs a World, Not Just More Words

by Patrix | May 22, 2026

The most interesting argument in AI right now isn’t about which chatbot gives the best answer. It’s about what kind of prediction should sit at the center of intelligence.

The LLM path says, roughly: learn from enormous amounts of language, predict the next token extremely well, then use scale, tools, multimodality, and reinforcement learning to turn that into something useful. That path has already changed the way a lot of us write, code, search, brainstorm, and work. I use these systems every day, and I don’t think you can honestly call them parlor tricks anymore.

Yann LeCun’s JEPA path starts from a different itch. What if the core problem isn’t generating the next word, or even the next pixel? What if the core problem is learning an internal model of the world, one that can understand what matters in a scene, predict what might happen next, and plan actions before taking them?

That’s the part that grabbed me. JEPA feels less like “make the chatbot smarter” and more like “give the machine a mental sketchpad for reality.”

The short version of JEPA

JEPA stands for Joint Embedding Predictive Architecture. The phrase is clunky, but the idea is surprisingly clean once you get past the acronym.

In a typical generative setup, a model tries to reconstruct or generate the thing itself. A language model predicts the next token. An image or video generator may predict missing pixels, frames, or patches. The model is rewarded for getting the observable surface right.

JEPA does something different. It predicts in representation space. Instead of asking a model to fill in every pixel of a missing part of an image or video, JEPA asks it to predict the abstract representation, or embedding, of the missing part. In LeCun’s 2022 position paper, he describes JEPA as non-generative in this specific sense: it doesn’t try to generate the target directly. It tries to capture the dependency between one thing and another by predicting the target’s representation.

That sounds subtle, but it changes the job. If you’re watching a video of a tree, you probably don’t need to predict the exact position of every leaf a fraction of a second later. That detail is mostly noise. You care that the tree is there, that the wind is moving it, that a branch might block the path, that a person walking behind it may be partly hidden. JEPA is built around the idea that a useful world model should be allowed to ignore unpredictable details and focus on the parts of the world that matter for understanding and action.
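Here is a toy way to see the tree argument in numbers. It is a sketch under loud assumptions, not how any JEPA model is implemented: the `encode` function is a fixed average-pooling step I made up as a stand-in for a learned encoder, the "frame" is a 16x16 array, and a real JEPA predictor would output the embedding directly instead of a frame that gets encoded afterward. The point is only that a prediction with the right structure and none of the leaf-level noise is penalized much less when the error is measured on a coarse representation than when it is measured on pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame, pool=4):
    """Toy encoder (an assumption): average-pool 4x4 blocks into a coarse embedding."""
    h, w = frame.shape
    return frame.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3)).ravel()

# Smooth structure stands in for "the tree"; the noise stands in for leaf-level detail.
structure = np.outer(np.linspace(0.0, 1.0, 16), np.linspace(0.0, 1.0, 16))
actual_next_frame = structure + 0.1 * rng.standard_normal((16, 16))

# A prediction that gets the structure right and makes no attempt at the noise.
predicted_frame = structure

# Error measured on every pixel versus error measured on the coarse embedding.
pixel_loss = float(np.mean((predicted_frame - actual_next_frame) ** 2))
latent_loss = float(np.mean((encode(predicted_frame) - encode(actual_next_frame)) ** 2))

print(f"pixel-space loss:  {pixel_loss:.5f}")
print(f"latent-space loss: {latent_loss:.5f}")
```

The pooling throws away exactly the detail the prediction could never have known, which is the permission JEPA is trying to give a model, except that in the real thing the encoder learns what to throw away.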

This is why Meta’s V-JEPA work has centered on video. Video is a brutal test for this idea because the physical world is messy, continuous, and full of irrelevant detail. Pixel-perfect prediction is expensive and often beside the point. Latent prediction, meaning prediction in an abstract learned space, gives the model permission to learn the structure underneath the mess.

What LLMs are good at, and where LeCun thinks they hit a wall

Large language models are usually trained to predict the next token in a sequence. That description can sound dismissive, so it needs a little care. “Next-token prediction” does not mean the system only learns grammar or autocomplete tricks. At modern scale, predicting text well forces a model to absorb a shocking amount of structure from the data: facts, styles, relationships, programming patterns, social conventions, math tricks, fragments of reasoning, and all the weird compressed residue of human culture that shows up in text.
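For readers who want to see the objective itself, this is a minimal sketch of the loss at a single position. The four-word vocabulary and the scores are made up for illustration; a real model produces scores over tens of thousands of tokens from a neural network.

```python
import numpy as np

def next_token_loss(logits, next_token_id):
    """Cross-entropy at one position: minus the log-probability of the true next token."""
    shifted = logits - logits.max()                      # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax over the vocabulary
    return float(-log_probs[next_token_id])

vocab = ["the", "cup", "falls", "floats"]   # hypothetical vocabulary
logits = np.array([0.1, 0.2, 3.0, -1.0])    # made-up scores after the context "the cup"

loss_falls = next_token_loss(logits, vocab.index("falls"))
loss_floats = next_token_loss(logits, vocab.index("floats"))

print(f"loss if the text continues with 'falls':  {loss_falls:.3f}")
print(f"loss if the text continues with 'floats': {loss_floats:.3f}")
```

Training pushes this number down across an enormous amount of text, and doing that well is what forces the model to pick up the structure described above.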

That’s why LLMs feel magical. Text is not just text. It’s a record of what humans noticed, argued about, built, measured, feared, wanted, and explained. A model trained on enough of it can become an eerily useful interface to that record.

LeCun’s objection is not that LLMs are useless. His argument is that text is not enough for human-level or animal-level intelligence. In his 2022 paper, he points out that much of common sense doesn’t live in language at all. A child doesn’t need a paragraph about gravity every time a cup gets pushed toward the edge of a table. They learn from watching, touching, moving, failing, and trying again.

That distinction matters. LLMs can write a good explanation of why the cup will fall. A world model should help a robot avoid knocking it off the table in the first place, or imagine the result of nudging it before the nudge happens.

This is where JEPA is aimed. Not at replacing language models for writing blog posts or answering questions, but at building systems that learn from sensory experience and can use that learning for prediction and planning.

JEPA is not just a theory anymore

LeCun laid out the broader vision in “A Path Towards Autonomous Machine Intelligence” in 2022. That paper proposed a system built around self-supervised learning, predictive world models, intrinsic objectives, and hierarchical representations. At the time, a lot of it read like a research agenda rather than a finished technology.

Since then, Meta FAIR and collaborators have been turning pieces of that agenda into working models.

I-JEPA, introduced in 2023, applied the idea to images. The model learned by predicting representations of target blocks in an image from a context block, instead of reconstructing pixels. The paper reported strong downstream performance with efficient training, including training a ViT-Huge model on ImageNet in under 72 hours using 16 A100 GPUs.
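The context-block and target-block setup is easier to picture with a sketch. Everything here is a simplification I am assuming for illustration: the encoders and predictor are small random linear maps where the real model uses Vision Transformers, the sizes are arbitrary, and the toy predictor makes one shared guess for all target patches where the real one is told which positions to predict. The slow-moving target encoder updated as an exponential moving average is how I understand the I-JEPA paper, not something stated in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH_DIM, EMBED_DIM = 48, 16      # hypothetical sizes

# Toy linear maps; the real models use Vision Transformers.
context_encoder = 0.1 * rng.standard_normal((PATCH_DIM, EMBED_DIM))
target_encoder = context_encoder.copy()
predictor = 0.1 * rng.standard_normal((EMBED_DIM, EMBED_DIM))

def jepa_loss(patches, context_idx, target_idx):
    """Predict target-block embeddings from the context block only."""
    context_repr = patches[context_idx] @ context_encoder   # (n_context, EMBED_DIM)
    pooled = context_repr.mean(axis=0)                       # (EMBED_DIM,)
    predicted = pooled @ predictor                           # one guess shared by all targets
    targets = patches[target_idx] @ target_encoder           # (n_target, EMBED_DIM)
    return float(np.mean((predicted - targets) ** 2))        # error in embedding space

def ema_update(target_w, context_w, momentum=0.99):
    """Move the target-encoder weights slowly toward the context encoder."""
    return momentum * target_w + (1.0 - momentum) * context_w

patches = rng.standard_normal((16, PATCH_DIM))   # a 4x4 grid of flattened patches
context_idx = np.array([0, 1, 4, 5])             # one context block
target_idx = np.array([10, 11, 14, 15])          # one target block, disjoint from the context

loss = jepa_loss(patches, context_idx, target_idx)
target_encoder = ema_update(target_encoder, context_encoder)
print(f"embedding-space loss: {loss:.4f}")
```

Notice that no pixel is ever reconstructed: the only thing compared is a predicted embedding against a target embedding.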

V-JEPA, released in February 2024, moved the idea into video. Meta described it as a non-generative model that predicts masked parts of a video in an abstract representation space. The key claim was efficiency: compared with generative approaches that try to fill in every missing pixel, V-JEPA could discard unpredictable information and improve training and sample efficiency.

Then V-JEPA 2 arrived in 2025, and the work became much more concrete. Meta described it as a 1.2-billion-parameter world model trained primarily on video. The technical report says V-JEPA 2 was pretrained on more than 1 million hours of internet video and 1 million images, then post-trained with less than 62 hours of unlabeled robot video from the DROID dataset to create an action-conditioned world model, V-JEPA 2-AC.

That second stage is the important turn. The model isn’t just learning to recognize what’s in a clip. It learns to predict future representations conditioned on actions, which makes it useful for planning. In Meta’s demos and report, V-JEPA 2-AC could do zero-shot robot planning for tasks like reaching, grasping, and pick-and-place in new environments, using goal images and model-predictive control.
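Planning with a goal image and model-predictive control can be sketched in a few lines. This is a toy under stated assumptions, not Meta's planner: the latent state is a 2-D vector, `world_model` is a hypothetical stand-in that simply adds the action to the state, and the search is plain random shooting, a simpler sampler than a production system would likely use. The loop is the part that matters: imagine many action sequences, score each imagined end state against the goal embedding, take only the first action of the best one, then replan.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(latent, action):
    """Hypothetical learned dynamics: predict the next latent state given an action."""
    return latent + action

def plan_first_action(latent, goal_latent, horizon=3, n_candidates=256):
    """Random-shooting MPC: sample action sequences, imagine each rollout, keep the best."""
    candidates = rng.uniform(-0.2, 0.2, size=(n_candidates, horizon, latent.size))
    states = np.tile(latent, (n_candidates, 1))
    for t in range(horizon):
        states = world_model(states, candidates[:, t])   # roll every candidate forward
    costs = np.abs(states - goal_latent).sum(axis=1)     # distance to the goal in latent space
    return candidates[np.argmin(costs), 0]               # first action of the best sequence

latent = np.array([0.0, 0.0])         # embedding of the current camera frame (toy)
goal_latent = np.array([1.0, -0.5])   # embedding of the goal image (toy)
start_distance = float(np.abs(latent - goal_latent).sum())

for _ in range(15):
    action = plan_first_action(latent, goal_latent)
    latent = world_model(latent, action)   # act, observe the result, then replan

final_distance = float(np.abs(latent - goal_latent).sum())
print(f"distance to goal: {start_distance:.2f} at the start, {final_distance:.2f} after 15 steps")
```

In a real system the last line of the loop would be a robot moving and a camera frame being re-encoded, which is why the quality of the learned dynamics matters so much.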

As of March 2026, the latest major step I found is V-JEPA 2.1. That paper focuses on dense visual understanding. Instead of only applying the prediction loss to masked tokens, V-JEPA 2.1 uses a dense predictive loss where both visible and masked tokens contribute to training. It also adds deep self-supervision across intermediate layers and multimodal tokenizers for images and videos. The reported results include stronger object-interaction anticipation, action anticipation, depth estimation, navigation, and a 20-point improvement in real-robot grasping success over V-JEPA 2-AC.
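The difference between a masked-only loss and a dense loss is small enough to show directly. This sketch assumes random embeddings, an arbitrary mask, an L1 error, and an equal weight for every token; the weighting the paper actually uses is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TOKENS, EMBED_DIM = 8, 4    # hypothetical sizes

predicted = rng.standard_normal((N_TOKENS, EMBED_DIM))   # predictor outputs, one row per token
targets = rng.standard_normal((N_TOKENS, EMBED_DIM))     # target embeddings, one row per token
is_masked = np.array([0, 0, 1, 1, 0, 1, 0, 0], dtype=bool)

# L1 error per token (the choice of L1 is an assumption for this sketch).
per_token_error = np.abs(predicted - targets).mean(axis=1)

masked_only_loss = float(per_token_error[is_masked].mean())   # only hidden tokens contribute
dense_loss = float(per_token_error.mean())                    # visible and hidden tokens contribute

print(f"tokens contributing: {int(is_masked.sum())} (masked only) vs {N_TOKENS} (dense)")
print(f"masked-only loss: {masked_only_loss:.3f}, dense loss: {dense_loss:.3f}")
```

With the dense version, every token position gets a training signal, which fits the paper's stated focus on dense visual understanding.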

That’s still research, not a consumer product. But it is not just a philosophical complaint about chatbots anymore.

The real difference: words versus worlds

The cleanest way I can explain the difference is this:

An LLM learns the structure of language so well that language becomes an interface to knowledge.

JEPA tries to learn the structure of the world so well that prediction becomes an interface to action.

Those are not the same ambition.

The LLM approach is strongest where the world has already been translated into tokens: documents, code, transcripts, chat logs, diagrams with captions, databases, tool outputs, and the giant sedimentary layer of internet text. You give it a prompt, it produces a continuation. With enough scaffolding, that continuation can become a plan, a program, an image prompt, a query, or an action through a tool.

The JEPA approach is strongest where the world is not naturally text-shaped. Video, robotics, movement, physical causality, object permanence, occlusion, touch, sound, and timing are all awkward to squeeze into a token stream. You can tokenize anything if you try hard enough, but LeCun’s argument is that the generative-token approach is a bad fit for continuous high-dimensional reality.

This is also why JEPA can sound less immediately impressive than an LLM. A chatbot talks back. A JEPA model may produce an embedding you can’t inspect directly. That makes the demo problem harder. If a language model writes a paragraph, you can judge it. If a world model predicts a future latent state, you need downstream tests: Can it classify the action? Can it anticipate what happens next? Can it guide a robot arm? Can it stay robust when the video is noisy, occluded, or physically tricky?

That’s a less flashy kind of intelligence. But it might be closer to the kind we use when we move through the world.

The approaches are already blending

The easy version of this story is “LLMs are one camp, JEPA is the other.” The current research is messier, and more interesting.

V-JEPA 2 itself can be aligned with a language model for video question answering. That means the JEPA-trained video encoder can provide grounded visual representations, while the language model provides the interface for questions and answers. In other words, language doesn’t disappear. It becomes the way we talk to a model that learned part of its understanding from video.

There is also an ICLR 2026 paper called “LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures,” coauthored by Hai Huang, Yann LeCun, and Randall Balestriero. The paper asks whether language training can borrow JEPA-style embedding-space objectives. Their reported results say LLM-JEPA can outperform standard LLM training objectives across several datasets and model families.

That complicates the argument in a good way. JEPA may not be a replacement for LLMs so much as a missing ingredient. Maybe the future system has language models for communication, JEPA-style world models for perception and prediction, memory systems for persistence, planning modules for action, and tools for doing real work. That sounds less like one giant model and more like an architecture.

LeCun has been arguing for that kind of architecture for years. His move from Meta to his own startup, Advanced Machine Intelligence, announced in late 2025, makes the bet even more explicit. The new company is reportedly focused on AI systems that understand the physical world, have persistent memory, reason, and plan complex action sequences. That is almost a direct continuation of the JEPA/world-model program.

Why artists and builders should care

Here’s the ArtsyGeeky angle for me: LLMs gave us a creative partner that can talk. JEPA points toward creative systems that can watch.

That difference is huge.

A language model can help write a concept statement, generate code for an interactive piece, critique a photo series, or brainstorm a weird installation idea. That’s already useful. But it mostly works through description. You translate the work into words, and the model translates words back into suggestions.

A world-model approach hints at something else. Imagine a creative tool that understands the motion of a dancer, not just the caption attached to the video. Or a camera assistant that can anticipate where a subject is moving, not just identify what is in the frame. Or a robot fabrication tool that can reason about how material bends, slips, blocks, collapses, or responds to pressure. Or an editing system that understands continuity, cause and effect, and physical plausibility in footage, not just scene labels.

I’m not saying V-JEPA 2.1 gives us that tomorrow. It doesn’t. The current work is still mostly benchmark-driven and robotics-focused. But the direction feels different from “ask the chatbot to describe the image.” It points toward tools that understand process, motion, and consequence.

For creative technology, that could matter a lot.

The caveats matter

There are plenty of reasons to stay cautious. JEPA models still need better long-horizon planning. Meta’s V-JEPA 2 blog says future work includes hierarchical JEPA models that can reason across multiple temporal and spatial scales, plus multimodal JEPA models that include senses like audio and touch. That is a polite way of saying the current models are still limited.

The robotics results are also early. A robot arm doing pick-and-place with novel objects is meaningful, but it is not general intelligence. Benchmarks can improve faster than real-world reliability. And because JEPA representations are abstract, it can be harder for outsiders to see exactly what the model has learned.

LLMs are not standing still either. The best systems are already multimodal, tool-using, memory-augmented, and increasingly agentic. The next few years probably won’t be a clean victory for one paradigm. They will be a negotiation between approaches.

Still, JEPA gives a name and a technical shape to a discomfort a lot of people have with LLM-only AI. The world is not made of words. Words are one beautiful, powerful projection of the world, but they are not the thing itself.

The bet

LeCun’s bet is that intelligence needs a world model. Not just a bigger pile of text. Not just a better chatbot. A system that can learn from observation, predict in abstraction, ignore irrelevant detail, and plan before acting.

The LLM bet has already paid off in public. We can use it. We can argue with it. We can see its strengths and its weird failure modes every day.

The JEPA bet is earlier and quieter. It lives in embeddings, video encoders, robot arms, masked regions, and benchmarks with names that sound like lab equipment. But it is trying to answer a deeper question: can a machine learn the hidden structure of the physical world well enough to act in it?

That is why I keep coming back to it. The next phase of AI may not be about making machines that talk more convincingly. It may be about making machines that can look at the world and know what is likely to happen next.

And if that works, the chatbot era may end up looking like the first sketch, not the finished painting.