Sherry Yang - Learning World Models and Physical Agents

[HPP] Percy Liang · October 21, 2025 · 57 min


The Challenge of Physical Agents

⚠️ Learning physical agents is difficult due to the high cost of real-world robot interactions, including time, money, and safety risks.
💡 In contrast, agents trained in low-cost virtual environments (such as Go-playing systems or LLMs for coding) can reach superhuman performance because they can learn from extensive interaction.

Advances in World Model Learning

🧠 A world model is a dynamics model that predicts future frames based on current observations and actions, acting as a learned simulator.
🚀 Recent progress is driven by internet-scale video data and scalable video generation architectures (e.g., transformers, latent diffusion models).
✅ These models can integrate diverse data (simulated, real robot, human egocentric) and enable controllable video generation for various tasks and actions.
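The "learned simulator" idea above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the architecture described in the talk: `ToyWorldModel`, `additive_dynamics`, and the use of a short list of floats as a "frame" are all hypothetical stand-ins for a trained video model operating on pixels or latents.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]   # stand-in for an image; a real model would use pixel or latent tensors
Action = List[float]  # stand-in for a robot command


@dataclass
class ToyWorldModel:
    """Hypothetical stand-in for a learned, action-conditioned video model."""
    step_fn: Callable[[Frame, Action], Frame]

    def predict(self, frame: Frame, action: Action) -> Frame:
        # One-step dynamics: next frame given current frame and action.
        return self.step_fn(frame, action)

    def rollout(self, first_frame: Frame, actions: Sequence[Action]) -> List[Frame]:
        # Autoregressive rollout: each prediction is fed back in as the next input.
        frames = [first_frame]
        for action in actions:
            frames.append(self.predict(frames[-1], action))
        return frames


def additive_dynamics(frame: Frame, action: Action) -> Frame:
    # Toy dynamics (an assumption): the "frame" is a 2-D position shifted by the action.
    return [f + a for f, a in zip(frame, action)]


model = ToyWorldModel(step_fn=additive_dynamics)
video = model.rollout([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
```

In a real system `step_fn` would be a trained network, and errors in its predictions would compound over the rollout because each output becomes the next input.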

Evaluating Policies with World Models

📊 World models offer a cheap, safe, and reproducible way to evaluate robot policies, overcoming limitations of real-world and traditional simulated evaluations.
🤖 Policies are rolled out in the world model, and a Vision-Language Model (VLM) acts as a reward model to assess task success.
🔬 This approach allows for out-of-distribution testing using image editing tools to introduce novel objects or distractors, revealing policy robustness and generalization gaps (e.g., issues with shapes vs. colors).
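The evaluation loop described above might look roughly like the following. Everything here is an illustrative assumption: `world_model_step` replaces a learned video model, `judge_success` replaces the VLM reward model with a simple distance check, and the out-of-distribution case is simulated by an unusual start state rather than by image editing.

```python
from typing import List, Sequence

State = List[float]
GOAL: State = [3.0, 3.0]  # hypothetical task target


def world_model_step(state: State, action: State) -> State:
    # Toy stand-in for the learned simulator.
    return [s + a for s, a in zip(state, action)]


def policy(state: State) -> State:
    # Toy policy under test: move toward the goal, at most 1.0 per axis per step.
    return [max(-1.0, min(1.0, g - s)) for s, g in zip(state, GOAL)]


def judge_success(final_state: State) -> bool:
    # Stand-in for a VLM asked "was the task completed?" on the final frame.
    return all(abs(s - g) < 0.5 for s, g in zip(final_state, GOAL))


def evaluate(initial_states: Sequence[State], horizon: int) -> float:
    # Roll the policy out inside the world model and report the success rate.
    successes = 0
    for start in initial_states:
        state = list(start)
        for _ in range(horizon):
            state = world_model_step(state, policy(state))
        successes += int(judge_success(state))
    return successes / len(initial_states)


in_distribution = [[0.0, 0.0], [1.0, 2.0]]
out_of_distribution = [[10.0, 10.0]]  # start far outside the usual range

id_rate = evaluate(in_distribution, horizon=5)
ood_rate = evaluate(out_of_distribution, horizon=5)
```

Comparing `id_rate` with `ood_rate` is the toy analogue of the robustness gap the summary mentions: the same policy succeeds on familiar starts and fails on the shifted one within the fixed horizon.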

Improving Policies through RL and Planning

📈 World models facilitate reinforcement learning (RL) by providing a low-cost environment for policy optimization using VLM-derived rewards.
💡 They also enable hierarchical planning, where complex tasks are broken down into language-guided sub-steps, with the world model generating videos for each step.
🎯 This approach leverages internet-scale supervision for high-level planning, allowing for more effective sim-to-real transfer and knowledge sharing across different robot morphologies.
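As a rough illustration of policy improvement inside a world model, the sketch below tunes a single policy parameter using only simulated rollouts. It uses random-search hill climbing, a deliberately simple substitute for an RL algorithm, and the reward is a distance-based stand-in for a VLM-derived score. The names `world_model_step`, `rollout_reward`, and `improve_policy` are hypothetical.

```python
import random

GOAL = 5.0    # hypothetical 1-D target position
HORIZON = 3   # rollout length inside the world model


def world_model_step(state: float, action: float) -> float:
    # Toy stand-in for the learned simulator.
    return state + action


def rollout_reward(gain: float) -> float:
    # Roll out a proportional policy (action = gain * error) in the world model,
    # then score the final state. The score stands in for a VLM-derived reward.
    state = 0.0
    for _ in range(HORIZON):
        action = gain * (GOAL - state)
        state = world_model_step(state, action)
    return -abs(GOAL - state)


def improve_policy(initial_gain: float, iterations: int, seed: int = 0):
    # Hill climbing: perturb the parameter and keep the change only if the
    # simulated reward improves. No real-robot interaction is needed.
    rng = random.Random(seed)
    best_gain = initial_gain
    best_reward = rollout_reward(best_gain)
    for _ in range(iterations):
        candidate = best_gain + rng.gauss(0.0, 0.2)
        reward = rollout_reward(candidate)
        if reward > best_reward:
            best_gain, best_reward = candidate, reward
    return best_gain, best_reward


initial_reward = rollout_reward(0.1)
tuned_gain, tuned_reward = improve_policy(0.1, iterations=50)
```

Because candidates are accepted only when the simulated reward increases, `tuned_reward` can never be worse than `initial_reward`. Whether that gain carries over to a real robot depends on how faithful the world model is, which is the concern raised under future directions.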

Future Directions and Challenges

🔍 Key challenges include reducing hallucinations and determining the optimal temporal and spatial resolution for world models to be useful for downstream tasks.
🌱 Further work is needed to effectively utilize imperfect world models and scale up data collection to enhance their realism and robustness.


What’s Discussed

World Models, Physical Agents, Reinforcement Learning, Generative Modeling, Robotics, Internet-Scale Video Data, Video Generation Architectures, Conditional Video Generation, Policy Evaluation, Vision-Language Models (VLM), Sim-to-Real Transfer, Hierarchical Planning, Low-Level Robot Controls, Out-of-Distribution Testing, Behavioral Cloning
