Embodied Multimodal Intelligence with Robot Foundation Models

[HPP] Sergey Levine · December 20, 2025 · 56 min

Developing Generalist Robot Policies

💡 The research goal is to create generalist robot policies that enable intelligent autonomy for diverse applications, such as home assistant robots working alongside humans.
🚀 This approach is inspired by the paradigm shift in NLP and vision, where large models trained on diverse datasets have surpassed specialist models.

Challenges and Data Solutions

⚠️ Robotics faces unique challenges: data is scarce, expensive, heterogeneous, and multimodal; policies must run in real time; and evaluation procedures are complex.
🤝 The Open X-Embodiment dataset was created by aggregating over a million real robot episodes from 34 labs across 22 robot embodiments, providing a crucial foundation for training generalist models.

Octo and Cross-Embodied Learning

🤖 Octo is an open-source, transformer-based generalist policy that views robotics as a multimodal sequence prediction problem, mapping language and image tokens to robotic actions.
🎯 CrossFormer extends Octo to control a wider range of robot morphologies, including bimanual manipulation, navigation, and locomotion, using a single model checkpoint for vastly different robots.
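
The "multimodal sequence prediction" framing above can be illustrated with a toy sketch. This is not Octo's architecture or API: every name here (`patchify`, `toy_policy`, `init_params`, the readout token, the 7-dimensional action) is a hypothetical stand-in, and the weights are random. It only shows the shape of the idea: language tokens and image-patch tokens are concatenated into one sequence, a transformer-style layer mixes them, and an action is read out from the final token.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    image = image[: rows * patch, : cols * patch]  # drop any ragged border
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention over a (T, D) token sequence."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def init_params(vocab=100, dim=32, patch=8, channels=3, action_dim=7, seed=0):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    return {
        "patch": patch,
        "word_emb": rng.normal(0, 0.1, (vocab, dim)),
        "patch_proj": rng.normal(0, 0.1, (patch * patch * channels, dim)),
        "readout": rng.normal(0, 0.1, (dim,)),
        "wq": rng.normal(0, 0.1, (dim, dim)),
        "wk": rng.normal(0, 0.1, (dim, dim)),
        "wv": rng.normal(0, 0.1, (dim, dim)),
        "action_head": rng.normal(0, 0.1, (dim, action_dim)),
    }


def toy_policy(instruction_ids, image, params):
    """Map language token ids and an image to one action vector."""
    lang_tokens = params["word_emb"][np.asarray(instruction_ids)]         # (L, D)
    img_tokens = patchify(image, params["patch"]) @ params["patch_proj"]  # (P, D)
    readout = params["readout"][None, :]                                  # (1, D)
    sequence = np.concatenate([lang_tokens, img_tokens, readout], axis=0)
    mixed = self_attention(sequence, params["wq"], params["wk"], params["wv"])
    return mixed[-1] @ params["action_head"]  # action read from the last token


params = init_params()
action = toy_policy([3, 14, 15], np.zeros((32, 32, 3)), params)
print(action.shape)  # (7,)
```

Because every input becomes a token in one sequence, supporting a different robot in this sketch would mean changing which tokens are fed in and which head reads the action out, while the shared sequence model stays the same.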

Enhancing Reasoning and Multimodality

🧠 Embodied Chain of Thought has the policy produce reasoning steps (language and visual) before it predicts an action, which boosts generalization by 30% without additional robot data and makes the policy's behavior easier to interpret.
👂 To address limitations of vision-only systems, a new recipe allows fine-tuning generalist policies on heterogeneous sensor data like touch and audio, enabling cross-modal reasoning for complex tasks.
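
As a rough illustration of "reasoning before action", the sketch below builds a training target in which reasoning text comes first and discretized action tokens come last. The field names, the `ACTION:` marker, the 256-bin discretization, and the [-1, 1] action range are all assumptions made for this example, not the format used in the talk.

```python
def discretize(value, bins=256, low=-1.0, high=1.0):
    """Map a continuous action value to an integer bin index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # the top edge falls into the last bin


def build_target(reasoning_steps, action, bins=256, low=-1.0, high=1.0):
    """Serialize reasoning steps first, then the discretized action tokens.

    reasoning_steps: list of (name, text) pairs, e.g. ("PLAN", "move to cup").
    action: list of floats, one per action dimension.
    """
    reasoning = " ".join(f"{name}: {text}" for name, text in reasoning_steps)
    action_ids = " ".join(str(discretize(a, bins, low, high)) for a in action)
    return f"{reasoning} ACTION: {action_ids}"


target = build_target(
    [("PLAN", "move to cup"), ("SUBTASK", "reach")],  # hypothetical reasoning text
    [0.0, 1.0],
)
print(target)  # PLAN: move to cup SUBTASK: reach ACTION: 128 255
```

Placing the action last means that, under autoregressive decoding, each action token is predicted conditioned on the reasoning text that precedes it, and that text can be read directly when inspecting what the policy did.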

Scalable Evaluation Methods

📊 CALVIN is a widely adopted public benchmark for long-horizon instruction following in simulation, driving algorithmic progress and prototyping.
✅ SIMPLER evaluates policies trained on real-world data in simulation, aiming for simulated results that correlate with real-world performance, which would reduce evaluation costs and improve reproducibility.
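
One way to check whether simulated evaluation tracks real-world evaluation is to compare success rates for the same set of policies in both settings. The sketch below is a minimal example under that assumption: it computes the Pearson correlation and counts pairs of policies whose ranking flips between sim and real. The function names and the success rates in the usage example are placeholders, not results from the talk.

```python
import numpy as np


def pearson(sim, real):
    """Pearson correlation between per-policy sim and real success rates."""
    sim = np.asarray(sim, dtype=float)
    real = np.asarray(real, dtype=float)
    sim_c, real_c = sim - sim.mean(), real - real.mean()
    denom = np.sqrt((sim_c ** 2).sum() * (real_c ** 2).sum())
    if denom == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return float((sim_c * real_c).sum() / denom)


def rank_inversions(sim, real):
    """Count policy pairs that are ordered differently in sim than in real."""
    count = 0
    for i in range(len(sim)):
        for j in range(i + 1, len(sim)):
            if (sim[i] - sim[j]) * (real[i] - real[j]) < 0:
                count += 1
    return count


# Placeholder success rates for four hypothetical policies.
sim_success = [0.20, 0.45, 0.60, 0.80]
real_success = [0.25, 0.40, 0.65, 0.75]
print(round(pearson(sim_success, real_success), 3))
print(rank_inversions(sim_success, real_success))  # 0
```

A high correlation with few rank inversions suggests the simulator can be used to compare policies; it does not by itself show that absolute success rates match.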

Topics · 15 themes

What’s Discussed

Embodied Multimodal Intelligence · Foundation Models · Generalist Robot Policies · Robot Learning · Open X-Embodiment Dataset · Octo (Robot Policy) · Cross-Embodied Learning · Transformer Models · Embodied Chain of Thought · Intelligent Reasoning · Multisensory Integration · Simulation-to-Real Transfer · Robot Evaluation · Video Prediction Models · Spatial Intelligence
