
Meta's New AI Just Changed Everything: The End of Token-Based Models?

[HPP] Yann LeCun · January 19, 2026 · 7 min

The Inefficiency of Current AI

  • ⚠️ Most current AI models, including chatbots and image generators, function as "text prediction machines" that guess the next word or token.
  • 💡 This approach is inefficient: models spend capacity learning many ways to phrase the same idea instead of the core meaning itself.
  • ⏳ Generating text word by word adds latency, which makes these models poorly suited to real-time applications like robotics and smart glasses.

VL-JEPA's Core Innovation

  • 🚀 VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) radically shifts focus from predicting words to predicting "meaning embeddings."
  • 🧠 An embedding is conceptualized as a coordinate or point on a map representing an idea, allowing the model to operate in a world of concepts, not grammar.
  • ✅ This allows the model to predict the meaning of an entire answer all at once, rather than generating it word by word.
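
The "point on a map" idea can be made concrete with a minimal sketch. The vectors below are hand-picked for illustration and do not come from VL-JEPA or any real model; the helper names `cosine_similarity` and `nearest_phrase` are hypothetical. The sketch only shows that two different phrasings of one idea can sit close together in an embedding space, while an unrelated idea sits far away.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy embeddings (an assumption for illustration only).
TOY_EMBEDDINGS = {
    "the light is off": [0.9, 0.1, 0.0],
    "the lamp has been switched off": [0.85, 0.15, 0.05],
    "a dog is running": [0.0, 0.2, 0.95],
}

def nearest_phrase(query_vec, table):
    """Return the phrase whose embedding points in the most similar direction."""
    return max(table, key=lambda phrase: cosine_similarity(query_vec, table[phrase]))

same_idea = cosine_similarity(
    TOY_EMBEDDINGS["the light is off"],
    TOY_EMBEDDINGS["the lamp has been switched off"],
)
other_idea = cosine_similarity(
    TOY_EMBEDDINGS["the light is off"],
    TOY_EMBEDDINGS["a dog is running"],
)
print(f"paraphrases: {same_idea:.3f}, unrelated: {other_idea:.3f}")
```

In this toy setup a model that predicts a single point near the "light is off" region has captured the answer's meaning without committing to either wording.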

Architectural Breakdown

  • 🔬 The architecture includes a Visual Encoder to process images into essential meaning, and a powerful Predictor that guesses the answer's meaning.
  • 🎯 During training, a Y-Encoder provides the correct meaning for the Predictor to aim for, ensuring accurate learning.
  • 💬 A Y-Decoder is used only as an "afterthought" to translate the pure meaning back into human language when necessary, which keeps learning focused on meaning and makes it more stable and efficient.
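
The data flow described above can be sketched as follows. This is a toy under stated assumptions, not Meta's implementation: every function body is a stand-in (the "visual encoder" just summarises brightness, the "predictor" is one linear layer, the target encoder and decoder are table lookups), and cosine distance is assumed as the loss because the summary does not name one. The point it illustrates is that the training loss is computed between the predicted embedding and the target embedding, and the decoder never appears in it.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_distance(a, b):
    """0.0 when two vectors point the same way, larger as they diverge."""
    a, b = normalize(a), normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def visual_encoder(pixels):
    # Stand-in: summarise the input as [mean brightness, brightness range].
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]

def predictor(visual_embedding, weights):
    # Stand-in: one linear layer producing a predicted answer embedding.
    return [sum(w * v for w, v in zip(row, visual_embedding)) for row in weights]

def y_encoder(answer_text, table):
    # Stand-in: look up the target embedding of the reference answer.
    return table[answer_text]

def y_decoder(embedding, table):
    # Stand-in: return the known answer whose embedding is closest.
    return min(table, key=lambda text: cosine_distance(embedding, table[text]))

def training_loss(pixels, answer_text, weights, table):
    """Loss lives entirely in embedding space; no text is generated."""
    predicted = predictor(visual_encoder(pixels), weights)
    target = y_encoder(answer_text, table)
    return cosine_distance(predicted, target)

# Hypothetical toy data.
ANSWERS = {"bright scene": [1.0, 0.0], "dark scene": [0.0, 1.0]}
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
pixels = [0.9, 0.8, 1.0]

loss_right = training_loss(pixels, "bright scene", IDENTITY, ANSWERS)
loss_wrong = training_loss(pixels, "dark scene", IDENTITY, ANSWERS)
decoded = y_decoder(predictor(visual_encoder(pixels), IDENTITY), ANSWERS)
print(loss_right < loss_wrong, decoded)
```

Only the last step calls `y_decoder`, which mirrors the "afterthought" role: text is produced on demand from a predicted embedding, while learning compares embeddings directly.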

Impressive Performance & Efficiency

  • 📈 VL-JEPA, with only 1.6 billion parameters, achieved double the video-captioning score of token-based models while using half the trainable parameters.
  • ⚡ It demonstrates nearly three times faster inference due to "selective decoding," only generating output when meaningful changes occur.
  • 🏆 The model set a new state-of-the-art score (65.7%) on physical cause-and-effect tasks, outperforming GPT-4o and Claude 3.5 by directly understanding the world.
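
A minimal sketch of the selective-decoding idea, assuming that a "meaningful change" is detected by the cosine distance between the current embedding and the last one that was decoded. The function name `selective_decode`, the threshold value, and the toy `decode` callback are all hypothetical; the summary does not say how VL-JEPA actually measures change.

```python
import math

def cosine_distance(a, b):
    """0.0 when two non-zero vectors point the same way, up to 2.0 when opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def selective_decode(embedding_stream, decode, threshold=0.1):
    """Run the (expensive) decoder only when the meaning has shifted.

    Returns a list of (step, decoded_text) pairs for the steps that were decoded.
    """
    outputs = []
    last_decoded = None
    for step, embedding in enumerate(embedding_stream):
        changed = (
            last_decoded is None
            or cosine_distance(embedding, last_decoded) > threshold
        )
        if changed:
            outputs.append((step, decode(embedding)))
            last_decoded = embedding
    return outputs

# Toy stream: three near-identical frames, then a scene change, then one more similar frame.
stream = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0], [0.05, 1.0]]

def toy_decode(embedding):
    # Hypothetical decoder: label the dominant axis.
    return "A" if embedding[0] > embedding[1] else "B"

events = selective_decode(stream, toy_decode)
print(events)  # [(0, 'A'), (3, 'B')]
```

Here five frames trigger only two decoder calls. The savings in a real system depend on how often the scene actually changes and on the threshold chosen.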

Implications for Future AI

  • 🤖 This "meaning-first" approach is essential for real-time applications like robotics and smart glasses, enabling instant reactions and relevant information delivery.
  • 🔑 VL-JEPA represents a fundamental shift in AI, moving the focus from language generation to a deeper understanding of meaning and semantic world modeling.
  • ❓ It prompts a re-evaluation of intelligence, suggesting that a deep functional understanding of the world might be achievable without mastering human grammar.

Topics

Meta AI · VL-JEPA · Large Language Models · Token-based models · Meaning embeddings · Predictive architecture · Visual encoder · Selective decoding · World modeling · Robotics · Real-time applications · Artificial intelligence · Machine learning · Deep learning · Semantic world modeling