Embodied Multimodal Intelligence with Robot Foundation Models
[HPP] Sergey Levine · December 20, 2025 · 56 min
Developing Generalist Robot Policies
- 💡 The research goal is to create generalist robot policies that enable intelligent autonomy for diverse applications, such as home assistant robots working alongside humans.
- 🚀 This approach is inspired by the paradigm shift in NLP and vision, where large models trained on diverse datasets have surpassed specialist models.
Challenges and Data Solutions
- ⚠️ Robotics faces unique challenges: multimodal data is scarce, expensive, and heterogeneous; policies must run in real time; and evaluation procedures are complex.
- 🤝 The Open X-Embodiment dataset aggregates over a million real robot episodes from 34 labs across 22 robot embodiments, providing a crucial foundation for training generalist models.
Octo and Cross-Embodied Learning
- 🤖 Octo is an open-source, transformer-based generalist policy that views robotics as a multimodal sequence prediction problem, mapping language and image tokens to robotic actions.
- 🎯 CrossFormer extends Octo to a wider range of robot morphologies, covering bimanual manipulation, navigation, and locomotion, and uses a single model checkpoint to control these very different robots.
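To make the "multimodal sequence prediction" framing above concrete, here is a toy sketch of how language and image inputs might be assembled into one token sequence, with the output shape chosen per embodiment. It is not Octo's or CrossFormer's actual code: the `Token` class, the `build_sequence` and `action_shape` helpers, the readout token, and the action dimensions in `ACTION_DIMS` are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    modality: str  # "language", "image", or "readout"
    value: str     # placeholder payload; a real model would hold an embedding


def build_sequence(instruction, image_patches):
    """Concatenate language tokens, image tokens, and one readout token."""
    language = [Token("language", word) for word in instruction.lower().split()]
    image = [Token("image", patch) for patch in image_patches]
    # The readout position is where an action head would read its input.
    return language + image + [Token("readout", "<action>")]


# Illustrative action dimensions only (assumed, not taken from the talk).
ACTION_DIMS = {"single_arm": 7, "bimanual": 14, "navigation": 2}


def action_shape(embodiment, horizon=4):
    """Shape of the predicted action chunk for a given embodiment."""
    return (horizon, ACTION_DIMS[embodiment])


seq = build_sequence("Pick up the spoon", ["patch_0", "patch_1", "patch_2"])
print([t.modality for t in seq])
print(action_shape("bimanual"))
```

The point of the sketch is that the input side is shared across robots, while only the shape of the action output depends on the embodiment. That is one plausible way a single checkpoint could serve robots with different action spaces.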
Enhancing Reasoning and Multimodality
- 🧠 Embodied Chain of Thought inserts language and visual reasoning steps before action prediction, improving generalization by a reported 30% without additional robot data and making the policy's behavior more interpretable.
- 👂 To address limitations of vision-only systems, a new recipe allows fine-tuning generalist policies on heterogeneous sensor data like touch and audio, enabling cross-modal reasoning for complex tasks.
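The "reason first, then act" idea can be illustrated with a small sketch of how a training target might be laid out so that reasoning text precedes the action. The field names in `REASONING_FIELDS`, the `ACTION:` prefix, and both helper functions are assumptions made for this example, not the format used in the work described in the talk.

```python
# Hypothetical reasoning fields, emitted in order before the action.
REASONING_FIELDS = ("plan", "subtask", "move", "gripper_position")


def format_target(reasoning, action):
    """Build a text target in which every reasoning step precedes the action."""
    parts = [f"{name.upper()}: {reasoning[name]}" for name in REASONING_FIELDS]
    parts.append("ACTION: " + " ".join(f"{a:.2f}" for a in action))
    return "\n".join(parts)


def parse_action(text):
    """Recover the action vector from generated text, ignoring the reasoning."""
    for line in text.splitlines():
        if line.startswith("ACTION: "):
            return [float(x) for x in line[len("ACTION: "):].split()]
    raise ValueError("no ACTION line found")


example = format_target(
    {
        "plan": "reach the spoon, grasp it, lift it",
        "subtask": "reach the spoon",
        "move": "move left and down",
        "gripper_position": "pixel (112, 87)",
    },
    [0.1, -0.25, 1.0],
)
print(example)
print(parse_action(example))
```

Because the action comes last in the sequence, an autoregressive model would condition its action on the reasoning it has just produced, and the reasoning lines remain available for a human to inspect.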
Scalable Evaluation Methods
- 📊 CALVIN is a widely adopted public benchmark for long-horizon instruction following in simulation, used to prototype methods and drive algorithmic progress.
- ✅ SIMPLER evaluates policies trained on real-world data in simulation, aiming to establish that simulated performance correlates with real-world performance so that evaluation becomes cheaper and more reproducible.
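One simple way to check whether simulated evaluation tracks real-world evaluation is to compute a correlation coefficient over per-policy success rates. The sketch below uses the Pearson correlation as one plausible choice; the `pearson` helper is written for this example, and the success rates are made-up placeholder numbers, not results from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Placeholder success rates for four hypothetical policies (illustrative only).
sim_success = [0.20, 0.45, 0.60, 0.80]
real_success = [0.25, 0.40, 0.65, 0.75]

r = pearson(sim_success, real_success)
print(f"sim/real Pearson r = {r:.3f}")
```

A value of r close to 1 would suggest that ranking policies in simulation is a reasonable proxy for ranking them on real hardware. A low value would mean the simulator is not yet a trustworthy substitute.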
Topics Discussed
- Embodied Multimodal Intelligence
- Foundation Models
- Generalist Robot Policies
- Robot Learning
- Open X-Embodiment Dataset
- Octo (Robot Policy)
- Cross-Embodied Learning
- Transformer Models
- Embodied Chain of Thought
- Intelligent Reasoning
- Multisensory Integration
- Simulation-to-Real Transfer
- Robot Evaluation
- Video Prediction Models
- Spatial Intelligence