Meta's New AI Just Changed Everything: The End of Token-Based Models?

[HPP] Yann LeCun · January 19, 2026 · 7 min

The Inefficiency of Current AI

⚠️ Most current AI models, including chatbots and image generators, function as "text prediction machines" that guess the next word or token.
💡 This approach is inefficient: models spend capacity learning many ways to phrase the same idea instead of the core meaning itself.
⏳ The latency of generating text word-by-word makes it unsuitable for real-time applications like robotics and smart glasses.

VL-JEPA's Core Innovation

🚀 VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) radically shifts focus from predicting words to predicting "meaning embeddings."
🧠 An embedding is conceptualized as a coordinate or point on a map representing an idea, allowing the model to operate in a world of concepts, not grammar.
✅ This allows the model to predict the meaning of an entire answer all at once, rather than generating it word by word.
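To make the "point on a map" idea concrete, here is a minimal sketch in Python. The three-dimensional vectors and the sentences are hand-picked for illustration and do not come from VL-JEPA; `cosine` and `nearest_answer` are hypothetical helper names. It shows two phrasings of the same idea sitting close together, and a single predicted point being matched to an answer in one step rather than word by word.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Assumption: toy embeddings chosen by hand, not produced by any real model.
# The first two sentences express the same idea, so their points are close.
CANDIDATES = {
    "the light is off": [0.90, 0.10, 0.00],
    "the lamp has been switched off": [0.85, 0.15, 0.05],
    "a dog is running": [0.00, 0.20, 0.95],
}

def nearest_answer(predicted, candidates):
    """Return the candidate text whose embedding is closest to the prediction."""
    return max(candidates, key=lambda text: cosine(predicted, candidates[text]))

# Assumption: a made-up "predicted meaning" for an entire answer.
predicted = [0.88, 0.12, 0.02]
best = nearest_answer(predicted, CANDIDATES)
print(best)
```

Because the two "light is off" phrasings land near the same point, a model working in this space would not need to learn them as separate targets.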

Architectural Breakdown

🔬 The architecture includes a Visual Encoder to process images into essential meaning, and a powerful Predictor that guesses the answer's meaning.
🎯 During training, a Y encoder provides the correct meaning for the Predictor to aim for, ensuring accurate learning.
💬 A Y decoder is used only as an "afterthought" to translate the pure meaning back into human language when necessary; keeping it out of the core learning loop makes the learning process more stable and efficient.
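The division of labor among these components can be sketched as a toy training loop. Everything here is an assumption made for illustration: the two-number "visual encoder", the lookup-table "Y encoder", the linear predictor, and the squared-distance loss are stand-ins, since the summary does not describe the real networks or the loss VL-JEPA uses. The point is only the data flow: the loss is computed between embeddings, and the decoder never appears during training.

```python
# Assumption: hand-written target embeddings standing in for a Y encoder.
TARGETS = {
    "left side is bright": [1.0, 0.0],
    "right side is bright": [0.0, 1.0],
}

def visual_encoder(pixels):
    """Toy stand-in: mean brightness of the left and right halves."""
    half = len(pixels) // 2
    return [sum(pixels[:half]) / half, sum(pixels[half:]) / (len(pixels) - half)]

def y_encoder(text):
    """Toy stand-in: look up the target meaning embedding for a text."""
    return TARGETS[text]

def predictor(weights, visual):
    """Linear map from visual features to a predicted meaning embedding."""
    return [sum(w * v for w, v in zip(row, visual)) for row in weights]

def embedding_loss(pred, target):
    """Squared distance in embedding space (an assumed loss, for illustration)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def train_step(weights, visual, target, lr=0.1):
    """One gradient-descent step on the squared-distance loss."""
    pred = predictor(weights, visual)
    return [
        [w - lr * 2 * (p - t) * v for w, v in zip(row, visual)]
        for row, p, t in zip(weights, pred, target)
    ]

def y_decoder(embedding):
    """Used only after training: map an embedding to the closest known text."""
    return min(TARGETS, key=lambda text: embedding_loss(embedding, TARGETS[text]))

DATA = [
    ([1.0, 1.0, 0.0, 0.0], "left side is bright"),
    ([0.0, 0.0, 1.0, 1.0], "right side is bright"),
]

weights = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):
    for pixels, text in DATA:
        weights = train_step(weights, visual_encoder(pixels), y_encoder(text))

final_loss = sum(
    embedding_loss(predictor(weights, visual_encoder(p)), y_encoder(t))
    for p, t in DATA
)
# Decoding happens once, at the end, on a new input.
caption = y_decoder(predictor(weights, visual_encoder([0.9, 0.8, 0.1, 0.0])))
print(final_loss, caption)
```

Note that `y_decoder` is called only on the last line. The training loop compares predicted and target embeddings directly, which is the sense in which the decoder is an afterthought.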

Impressive Performance & Efficiency

📈 VL-JEPA, with only 1.6 billion parameters, achieved double the score in video captioning compared to token-based AIs, using half the trainable parameters.
⚡ It demonstrates nearly three times faster inference due to "selective decoding," only generating output when meaningful changes occur.
🏆 The model set a new state-of-the-art score (65.7%) on physical cause-and-effect tasks, outperforming GPT-4o and Claude 3.5 by directly understanding the world.
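One way to picture "selective decoding" is a loop that watches a stream of meaning embeddings and calls the decoder only when the meaning has moved far enough from the last decoded point. The sketch below is an assumption about how such a rule could look, not the method used in VL-JEPA: the cosine-distance threshold, the toy two-dimensional stream, and the `toy_decode` function are all invented for illustration.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def selective_decode(stream, decode, threshold=0.1):
    """Decode only when the embedding drifts past `threshold` from the
    last embedding that was decoded. Returns (step, text) pairs."""
    outputs = []
    last_decoded = None
    for step, embedding in enumerate(stream):
        if last_decoded is None or cosine_distance(embedding, last_decoded) > threshold:
            outputs.append((step, decode(embedding)))
            last_decoded = embedding
    return outputs

def toy_decode(embedding):
    """Hypothetical decoder: reads off which axis dominates."""
    return "object on left" if embedding[0] > embedding[1] else "object on right"

# Assumption: six made-up per-frame embeddings with one real change at step 3.
stream = [
    [1.00, 0.00],
    [0.99, 0.05],
    [0.98, 0.10],
    [0.10, 1.00],
    [0.05, 1.00],
    [0.00, 1.00],
]

outputs = selective_decode(stream, toy_decode)
print(outputs)  # [(0, 'object on left'), (3, 'object on right')]
```

In this toy stream the decoder runs twice for six frames. Comparing against the last decoded embedding, rather than the previous frame, means a slow drift still triggers an update once it accumulates past the threshold.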

Implications for Future AI

🤖 This "meaning-first" approach is essential for real-time applications like robotics and smart glasses, enabling instant reactions and relevant information delivery.
🔑 VL-JEPA represents a fundamental shift in AI, moving the focus from language generation to a deeper understanding of meaning and semantic world modeling.
❓ It prompts a re-evaluation of intelligence, suggesting that a deep functional understanding of the world might be achievable without mastering human grammar.


Topics Discussed

Meta AI, VL-JEPA, Large Language Models, Token-based models, Meaning embeddings, Predictive architecture, Visual encoder, Selective decoding, World modeling, Robotics, Real-time applications, Artificial intelligence, Machine learning, Deep learning, Semantic world modeling
