JEPA Models Explained: I-JEPA, V-JEPA, VL-JEPA & Real-World AI Learning

[HPP] Yann LeCun · December 31, 2025 · 16 min

The JEPA Philosophy: Learning by Observation

  • 💡 Joint Embedding Predictive Architectures (JEPA) mark a pivot in model design, away from pixel-perfect generation and toward prediction in an abstract embedding space.
  • 🧠 The core idea is for AI models to learn by observing and understanding the world, much as humans and animals do, rather than painstakingly reconstructing every detail.
  • 🔑 Championed by researchers such as Yann LeCun, this approach aims to build reliable internal world models through passive observation, with the goal of more efficient and scalable AI.

Abstract Representations vs. Raw Output

  • 🎯 JEPA's key advantage is that it focuses on the predictable signal and ignores noise, unlike traditional generative models (e.g., diffusion models or NVIDIA Cosmos), which must model every detail.
  • 🗑️ Generative models spend compute trying to predict unpredictable, high-frequency details (such as individual blades of grass or light flickers) that are often irrelevant to the task.
  • ✅ JEPA learns representations only for the predictable parts of a scene, such as object trajectories or dominant motion, which makes the learning objective simpler and more stable to scale.
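
The contrast between predicting pixels and predicting abstract representations can be illustrated with a toy experiment. The sketch below is not JEPA itself: the 1-D "frame", the `make_frame` helper, and the average-pooling `encode` function are all hypothetical stand-ins (a real JEPA encoder is learned, not hand-chosen). It only shows that even a predictor that knows the object trajectory perfectly is left with a large irreducible error in pixel space, while most of that error averages away in a coarser representation.

```python
import numpy as np

def make_frame(position, rng, n_pixels=64, noise_std=0.5):
    """Return (observed, clean): a smooth bump at `position` plus per-pixel noise."""
    x = np.arange(n_pixels)
    clean = np.exp(-0.5 * ((x - position) / 4.0) ** 2)             # predictable part
    observed = clean + noise_std * rng.standard_normal(n_pixels)   # plus unpredictable detail
    return observed, clean

def encode(frame, pool=8):
    """Hypothetical stand-in for a learned encoder: average pooling over pixel blocks."""
    return frame.reshape(-1, pool).mean(axis=1)

def compare_losses(n_trials=200, seed=0):
    """Average pixel-space and embedding-space error of an idealized predictor."""
    rng = np.random.default_rng(seed)
    pixel_errors, latent_errors = [], []
    for _ in range(n_trials):
        next_position = rng.uniform(16, 48)
        observed_next, clean_next = make_frame(next_position, rng)
        # Idealized predictor: it knows the trajectory, so it predicts the clean frame.
        prediction = clean_next
        pixel_errors.append(np.mean((prediction - observed_next) ** 2))
        latent_errors.append(np.mean((encode(prediction) - encode(observed_next)) ** 2))
    return float(np.mean(pixel_errors)), float(np.mean(latent_errors))

pixel_mse, latent_mse = compare_losses()
print(f"pixel-space error:     {pixel_mse:.3f}")
print(f"embedding-space error: {latent_mse:.3f}")
```

In this toy setup the pixel-space error stays close to the noise variance no matter how good the predictor is, while the pooled representation averages most of the noise away. The two errors live in different spaces, so the comparison is qualitative only.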

V-JEPA2: Scaling for Robustness

  • 🚀 V-JEPA2 uses a mask-denoising feature prediction objective: it infers the missing high-level structure of a video by predicting learned representations of the masked regions, not low-level details.
  • 🛑 The stop gradient is a crucial component that prevents representation collapse: no gradient flows into the target branch, so the predictor has to match a rich target representation it cannot trivially alter, which drives genuine semantic learning.
  • 🛠️ Architectural changes such as 3D Rotary Position Embedding (3D RoPE) stabilize training of very large models by encoding relative position across space and time.
  • 📈 V-JEPA2's success was driven by four key ingredients: data scaling (2M to 22M videos), model scaling (a billion-parameter encoder), longer training, and higher-resolution progressive training (8.4x GPU speedup).
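
To make the "relative sense of position" concrete, here is a minimal NumPy sketch of a 3D rotary embedding. It rests on an assumption not stated in the summary: the feature vector is split into three equal chunks, each rotated by a standard 1-D RoPE using the token's time, height, and width index respectively. The function names (`rope_1d`, `rope_3d`, `attention_score`) and the base frequency are illustrative choices, not the V-JEPA2 code.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) feature pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature size must be even to form (even, odd) pairs"
    freqs = base ** (-np.arange(0, d, 2) / d)                 # one frequency per pair
    angles = np.asarray(pos, dtype=float)[..., None] * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_3d(x, t, h, w):
    """Assumed layout: one third of the features each for time, height, and width."""
    d = x.shape[-1]
    assert d % 6 == 0, "need three even-sized chunks"
    c = d // 3
    return np.concatenate(
        [rope_1d(x[..., :c], t), rope_1d(x[..., c:2 * c], h), rope_1d(x[..., 2 * c:], w)],
        axis=-1,
    )

def attention_score(q, q_pos, k, k_pos):
    """Dot-product attention logit between a rotated query and a rotated key."""
    return float(rope_3d(q, *q_pos) @ rope_3d(k, *k_pos))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(12), rng.standard_normal(12)

# Same (dt, dh, dw) = (3, 4, 2) offset at two different absolute locations.
score_a = attention_score(q, (1, 2, 3), k, (4, 6, 5))
score_b = attention_score(q, (11, 22, 33), k, (14, 26, 35))
print(np.isclose(score_a, score_b))
```

Because each chunk is only rotated, vector norms are preserved, and the attention logit depends on the offset between the two tokens rather than on their absolute positions. That property is what the bullet above calls a relative sense of position.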

V-JEPA2 AC: Robotics and Real-time Control

  • 🤖 V-JEPA2 AC uses the pre-trained V-JEPA2 visual encoder as the robot's visual cortex, then trains a new predictor to imagine future states from the current frames, robot state, and action vectors.
  • ⚠️ The rollout loss is essential for robotics: it forces the model to practice on its own predictions, which limits compounding error over long planning sequences.
  • ⚡ Using Model Predictive Control (MPC), the robot picks actions by minimizing an energy function, achieving up to 80% success in zero-shot manipulation in new environments.
  • ⏱️ This approach is also much faster: V-JEPA2 AC plans an action in 16 seconds compared to Cosmos's four minutes, an efficiency gain that comes from planning in representation space instead of pixels.
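
The planning loop described above can be sketched as follows. This is a toy under stated assumptions, not the V-JEPA2 AC implementation: `predictor` is a hypothetical linear stand-in for the learned action-conditioned predictor, the energy is taken to be the L1 distance between the imagined final latent state and a goal latent, and the optimizer is the cross-entropy method (CEM), one common sampling-based choice for MPC that the summary itself does not name.

```python
import numpy as np

def predictor(z, a):
    """Hypothetical stand-in for a learned action-conditioned latent predictor."""
    return z + 0.1 * a  # toy dynamics: each action nudges the latent state

def energy(actions, z0, z_goal):
    """L1 distance between imagined final latents and the goal latent.

    actions has shape (n_samples, horizon, dim); returns shape (n_samples,).
    """
    z = np.broadcast_to(z0, (actions.shape[0], z0.shape[0]))
    for t in range(actions.shape[1]):
        z = predictor(z, actions[:, t])  # roll the model forward on its own outputs
    return np.abs(z - z_goal).sum(axis=1)

def cem_plan(z0, z_goal, rng, horizon=5, n_samples=256, n_elite=32, n_iters=15):
    """Cross-entropy method: sample action sequences, refit to the lowest-energy ones."""
    dim = z0.shape[0]
    mean = np.zeros((horizon, dim))
    std = np.ones((horizon, dim))
    for _ in range(n_iters):
        samples = mean + std * rng.standard_normal((n_samples, horizon, dim))
        samples = np.clip(samples, -1.0, 1.0)            # assumed action limits
        costs = energy(samples, z0, z_goal)
        elite = samples[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

def mpc_rollout(z0, z_goal, n_steps=20, seed=0):
    """Receding horizon: plan, apply only the first action, then replan."""
    rng = np.random.default_rng(seed)
    z = z0.astype(float)
    for _ in range(n_steps):
        plan = cem_plan(z, z_goal, rng)
        # On a robot, this step would execute plan[0] and re-encode the new camera frame.
        z = predictor(z, plan[0])
    return z

z_start = np.zeros(2)
z_goal = np.array([0.3, -0.2])
plan = cem_plan(z_start, z_goal, np.random.default_rng(0))
z_final = mpc_rollout(z_start, z_goal)
print("energy of planned sequence:", energy(plan[None], z_start, z_goal)[0])
print("closed-loop distance to goal:", np.abs(z_final - z_goal).sum())
```

Every candidate action sequence is scored entirely in latent space, so no frames are ever rendered during planning. Rolling the predictor forward on its own outputs inside `energy` is also the regime that the rollout loss is meant to prepare the model for.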

VL-JEPA: Bridging Vision and Language

  • 🗣️ VL-JEPA adds language, aligning the visual encoder with human language for tasks such as video question answering without modeling task-irrelevant linguistic features.
  • 💡 Instead of generating literal text tokens, VL-JEPA predicts the answer's abstract semantic representation in embedding space, effectively handling ambiguity (e.g., mapping