JEPA Models Explained: I-JEPA, V-JEPA, VL-JEPA & Real-World AI Learning

[HPP] Yann LeCun · December 31, 2025 · 16 min

The JEPA Philosophy: Learning by Observation

💡 Joint Embedding Predictive Architectures (JEPA) represent a critical pivot in model design, moving away from pixel-perfect generation.
🧠 The core idea is for AI models to learn by observing and understanding the world, similar to how humans and animals learn, rather than painstakingly reconstructing every detail.
🔑 Championed by researchers like Yann LeCun, this approach aims to build reliable internal world models through passive observation, with the goal of making AI far more efficient and scalable.

Abstract Representations vs. Raw Output

🎯 JEPA's key advantage is that it focuses on the predictable signal and ignores noise, unlike traditional generative models (e.g., diffusion models or NVIDIA Cosmos), which must model every single detail.
🗑️ Generative models waste computational cycles by trying to predict unpredictable, high-frequency details (like individual blades of grass or light flickers), which are often irrelevant to the task.
✅ JEPA learns representations only for the predictable parts of a scene, such as object trajectories or main motion, making the learning objective simpler and more stable for scaling.
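
To make the contrast concrete, here is a minimal Python sketch (an illustration, not code from the video). It builds toy 1-D "frames" out of a predictable moving object plus random noise, then scores an ideal forecast once in pixel space and once in a coarse feature space. The `make_frame` and `encode` helpers, the average-pooling stand-in for an encoder, and every constant are assumptions chosen for illustration; a real JEPA encoder is learned, not hand-written.

```python
import random

def make_frame(position, width=32, noise_std=0.5, rng=None):
    """Toy 1-D frame: a bright object at `position`, plus noise if `rng` is given."""
    frame = []
    for i in range(width):
        signal = 1.0 if i == position % width else 0.0      # predictable part
        noise = rng.gauss(0.0, noise_std) if rng else 0.0   # unpredictable part
        frame.append(signal + noise)
    return frame

def encode(frame, pool=8):
    """Hypothetical stand-in encoder: average-pool pixels into coarse features."""
    return [sum(frame[i:i + pool]) / pool for i in range(0, len(frame), pool)]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
pixel_errors, latent_errors = [], []
for t in range(200):
    # Ideal forecast: the object moves one step; the noise cannot be forecast.
    predicted = make_frame(t + 1)
    actual = make_frame(t + 1, rng=rng)
    pixel_errors.append(mse(predicted, actual))
    latent_errors.append(mse(encode(predicted), encode(actual)))

pixel_loss = sum(pixel_errors) / len(pixel_errors)
latent_loss = sum(latent_errors) / len(latent_errors)
print(f"pixel-space loss:  {pixel_loss:.4f}")
print(f"latent-space loss: {latent_loss:.4f}")
```

Even a forecast that places the object exactly right pays roughly the full noise variance in pixel space, while the pooled features average most of that noise away. That is the sense in which a representation-space objective can be simpler to optimize.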

V-JEPA2: Scaling for Robustness

🚀 V-JEPA2 uses a mask-denoising feature prediction objective: it infers the missing high-level structure of masked video regions by predicting their learned representations rather than their low-level details.
🛑 The stop gradient is a crucial component that prevents representation collapse: no gradient flows into the target representation, so the predictor has to chase a rich target that is held constant during backpropagation, which drives genuine semantic learning.
🛠️ Architectural breakthroughs like 3D Rotary Position Embedding (3D RoPE) stabilize training for massive models by providing a robust relative sense of position across space and time.
📈 V-JEPA2's success was driven by four key ingredients: data scaling (2M to 22M videos), model scaling (billion-parameter encoder), longer training, and higher-resolution progressive training (8.4x GPU speedup).
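
The masked feature prediction objective and the stop gradient described above can be sketched as one toy training step. This is a minimal sketch under loud assumptions: the "networks" are single scalar weights, the loss is a squared error, and the target encoder is updated as an exponential moving average of the context encoder. The summary does not specify the loss or how the target is produced, so treat `jepa_step` and all of its parameters as hypothetical.

```python
def jepa_step(w, p, w_target, visible, masked, lr=0.05, momentum=0.9):
    """One toy step of masked feature prediction with a stop gradient.

    w        : context-encoder weight (trained by gradient descent)
    p        : predictor weight (trained by gradient descent)
    w_target : target-encoder weight (never receives a gradient)
    """
    mean_visible = sum(visible) / len(visible)
    context = w * mean_visible        # encode the visible tokens
    prediction = p * context          # predict the masked token's feature
    target = w_target * masked        # stop gradient: treated as a constant

    error = prediction - target
    loss = 0.5 * error ** 2

    # Gradients exist only for the context encoder and the predictor.
    grad_p = error * context
    grad_w = error * p * mean_visible

    w = w - lr * grad_w
    p = p - lr * grad_p
    # Assumed EMA update: the target slowly follows the context encoder.
    w_target = momentum * w_target + (1 - momentum) * w
    return w, p, w_target, loss

tokens = [1.0, 1.2, 0.8, 1.1]
masked_index = 2                      # hide one token from the context encoder
masked = tokens[masked_index]
visible = [x for i, x in enumerate(tokens) if i != masked_index]

w, p, w_target = 0.5, 0.1, 0.5
losses = []
for _ in range(200):
    w, p, w_target, loss = jepa_step(w, p, w_target, visible, masked)
    losses.append(loss)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

The point of the sketch is the asymmetry: the loss updates `w` and `p`, while `w_target` changes only through the moving average. In an autograd framework the same effect comes from detaching the target before computing the loss.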

V-JEPA2 AC: Robotics and Real-time Control

🤖 V-JEPA2 AC uses the pre-trained V-JEPA2 visual encoder as the robot's visual cortex, then trains a new predictor to imagine future states from the current frames, the robot state, and action vectors.
⚠️ The rollout loss is essential for robotics, forcing the model to practice with its own predictions to prevent exponential error accumulation during long planning sequences.
⚡ Using Model Predictive Control (MPC), the robot minimizes an energy function to pick actions, achieving 80% success in zero-shot manipulation in new environments.
⏱️ This approach is also much faster: V-JEPA2 AC plans an action in 16 seconds versus four minutes for Cosmos, roughly a 15x speedup that comes from planning in representation space instead of pixels.
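
The planning loop described above can be sketched in a few lines of Python. This is a hedged illustration, not the V-JEPA2 AC implementation: `predictor` is a hypothetical stand-in with trivial dynamics (next latent = latent + action), the energy is an L1 distance to a goal latent, and the optimizer is plain random shooting, chosen here for brevity rather than because the source names it. The `rollout` helper feeds the predictor its own outputs, which is the same autoregressive pattern the rollout loss is meant to train.

```python
import random

def predictor(z, action):
    """Hypothetical action-conditioned predictor: next latent = latent + action."""
    return [zi + ai for zi, ai in zip(z, action)]

def rollout(z, actions):
    """Apply the predictor to its own outputs, one action at a time."""
    for action in actions:
        z = predictor(z, action)
    return z

def energy(z, z_goal):
    """L1 distance between an imagined latent state and the goal latent."""
    return sum(abs(a - b) for a, b in zip(z, z_goal))

def plan(z_now, z_goal, horizon=3, samples=256, max_step=0.5, rng=None):
    """Random-shooting MPC: sample action sequences, keep the lowest-energy one."""
    rng = rng or random.Random()
    best_energy, best_actions = float("inf"), None
    for _ in range(samples):
        actions = [[rng.uniform(-max_step, max_step) for _ in z_now]
                   for _ in range(horizon)]
        e = energy(rollout(z_now, actions), z_goal)
        if e < best_energy:
            best_energy, best_actions = e, actions
    # Execute only the first action; the caller replans at the next step.
    return best_actions[0], best_energy

rng = random.Random(0)
z, z_goal = [0.0, 0.0], [1.0, -1.0]
start_energy = energy(z, z_goal)
for _ in range(6):
    action, _ = plan(z, z_goal, rng=rng)
    z = predictor(z, action)  # a real system would act, observe, and re-encode
final_energy = energy(z, z_goal)
print(f"energy: {start_energy:.2f} -> {final_energy:.2f}")
```

Because every candidate is scored on short latent vectors rather than generated frames, each planning step stays cheap, which matches the efficiency argument made above.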

VL-JEPA: Bridging Vision and Language

🗣️ VL-JEPA integrates language, aligning the powerful visual encoder with human language for tasks like video question answering without modeling task-irrelevant linguistic features.
💡 Instead of generating literal text tokens, VL-JEPA predicts the abstract semantic representation in the embedding space of the answer, effectively handling ambiguity (e.g., mapping
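
As a rough illustration of answering in embedding space, the sketch below picks the candidate answer whose embedding is closest to a predicted embedding. Everything in it is hypothetical: the three-dimensional vectors are hand-written, the candidate answers are invented, and nearest-neighbor decoding by cosine similarity is an assumption about how a predicted embedding might be turned into text, not a description of VL-JEPA's actual decoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical answer embeddings: two paraphrases sit close together,
# while an unrelated answer points in a different direction.
answer_embeddings = {
    "a dog is running": [0.90, 0.10, 0.00],
    "a puppy runs": [0.85, 0.20, 0.05],
    "a car is parked": [0.00, 0.10, 0.95],
}

def decode(predicted, candidates):
    """Return the candidate text whose embedding best matches the prediction."""
    return max(candidates, key=lambda text: cosine(predicted, candidates[text]))

predicted = [0.88, 0.15, 0.02]  # stand-in for the model's predicted embedding
best = decode(predicted, answer_embeddings)
print(best)
```

Both paraphrases score almost identically against the prediction, so the model is not penalized for failing to choose one exact wording, whereas a token-level loss would treat them as different targets.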

What’s Discussed

Joint Embedding Predictive Architectures (JEPA), World Models, Self-Supervised Learning, Representation Learning, V-JEPA, VL-JEPA, Robotics, Model Predictive Control (MPC), Vision Transformers, 3D Rotary Position Embedding, Stop Gradient, Contrastive Learning, Zero-Shot Learning, Autonomous Agents, Generative Models
