Embodied Multimodal Intelligence with Robot Foundation Models

[HPP] Sergey Levine · December 20, 2025 · 56 min

Developing Generalist Robot Policies

  • 💡 The research goal is to create generalist robot policies that enable intelligent autonomy for diverse applications, such as home assistant robots working alongside humans.
  • 🚀 This approach is inspired by the paradigm shift in NLP and vision, where large models trained on diverse datasets have surpassed specialist models.

Challenges and Data Solutions

  • ⚠️ Robotics faces unique challenges: multimodal data that is scarce, expensive, and heterogeneous; the need for real-time operation; and complex evaluation procedures.
  • 🤝 The Open-X Embodiment Dataset was created by aggregating over a million real robot episodes from 34 labs across 22 robot embodiments, providing a crucial foundation for training generalist models.

Octo and Cross-Embodied Learning

  • 🤖 Octo is an open-source, transformer-based generalist policy that views robotics as a multimodal sequence prediction problem, mapping language and image tokens to robotic actions.
  • 🎯 CrossFormer extends Octo to control a wider range of robot morphologies, covering bimanual manipulation, navigation, and locomotion, with a single model checkpoint driving vastly different robots.
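
The "multimodal sequence prediction" framing above can be illustrated with a toy sketch. This is not Octo's actual code or API: the `Token` class, `build_sequence`, and `toy_policy` are hypothetical names, the embeddings are plain lists of floats, and mean-pooling stands in for the transformer. The only point is the interface: language and image tokens go into one sequence, and the same model emits an action vector whose length depends on the robot.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    modality: str            # "language" or "image" (hypothetical labels)
    embedding: List[float]   # toy embedding; a real model would learn these


def build_sequence(language: List[List[float]], image: List[List[float]]) -> List[Token]:
    """Concatenate task (language) and observation (image) tokens into one sequence."""
    sequence = [Token("language", e) for e in language]
    sequence += [Token("image", e) for e in image]
    return sequence


def toy_policy(sequence: List[Token], action_dim: int) -> List[float]:
    """Stand-in for the transformer: mean-pool the tokens, then emit action_dim values.

    A real policy would attend over the sequence and decode actions with a
    learned head; the pooling here is only an assumption for illustration.
    """
    dim = len(sequence[0].embedding)
    pooled = [sum(t.embedding[i] for t in sequence) / len(sequence) for i in range(dim)]
    return [pooled[i % dim] for i in range(action_dim)]


if __name__ == "__main__":
    seq = build_sequence(language=[[1.0, 0.0], [0.0, 1.0]], image=[[1.0, 1.0], [0.0, 0.0]])
    print(toy_policy(seq, action_dim=7))   # e.g. a 7-DoF arm
    print(toy_policy(seq, action_dim=2))   # e.g. a 2-D navigation command
```

Calling the same function with different `action_dim` values mirrors, very loosely, the idea of one checkpoint serving robots with different action spaces.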

Enhancing Reasoning and Multimodality

  • 🧠 Embodied Chain of Thought adds language and visual reasoning steps before action prediction, boosting generalization by 30% without additional robot data and improving interpretability.
  • 👂 To address limitations of vision-only systems, a new recipe allows fine-tuning generalist policies on heterogeneous sensor data like touch and audio, enabling cross-modal reasoning for complex tasks.
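
The reason-then-act ordering used by Embodied Chain of Thought can be sketched as a text format in which reasoning lines come first and the action comes last. The step names (`PLAN`, `SUBTASK`), the `ACTION:` prefix, and both helper functions are assumptions made for illustration, not the format used in the talk.

```python
from typing import List, Tuple

ACTION_PREFIX = "ACTION: "  # hypothetical marker for the final action line


def format_target(reasoning: List[Tuple[str, str]], action: List[float]) -> str:
    """Build a target string: reasoning steps first, action last."""
    lines = [f"{name}: {text}" for name, text in reasoning]
    lines.append(ACTION_PREFIX + " ".join(f"{a:.2f}" for a in action))
    return "\n".join(lines)


def parse_action(output: str) -> List[float]:
    """Recover the action from a model output that follows the format above."""
    for line in output.splitlines():
        if line.startswith(ACTION_PREFIX):
            return [float(x) for x in line[len(ACTION_PREFIX):].split()]
    raise ValueError("no action line found in model output")


if __name__ == "__main__":
    target = format_target(
        [("PLAN", "pick up the cup, then place it on the shelf"),
         ("SUBTASK", "move the gripper above the cup")],
        [0.10, -0.20, 0.05],
    )
    print(target)
    print(parse_action(target))
```

Because the reasoning is emitted as readable text before the action, a person can inspect what the policy intended when an action goes wrong, which is the interpretability benefit mentioned above.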

Scalable Evaluation Methods

  • 📊 CALVIN is a widely adopted public benchmark for long-horizon instruction following in simulation, used for prototyping and for driving algorithmic progress.
  • SIMPLER evaluates policies trained on real-world data in simulation, aiming to show that simulated results correlate with real-world performance, which would reduce evaluation costs and improve reproducibility.
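
One common way to quantify the sim-to-real correlation described above is the Pearson coefficient between per-policy success rates measured in simulation and on real robots. The sketch below assumes that metric and uses made-up success rates; neither the numbers nor the function name come from the talk.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical success rates for four policies (illustrative only).
    sim_success = [0.20, 0.50, 0.70, 0.90]
    real_success = [0.25, 0.45, 0.75, 0.85]
    print(f"sim/real Pearson r = {pearson(sim_success, real_success):.3f}")
```

A coefficient near 1 would suggest that ranking policies in simulation predicts their ranking on hardware, which is what makes cheaper simulated evaluation useful.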

Topics (15 themes)

What’s Discussed

Embodied Multimodal Intelligence · Foundation Models · Generalist Robot Policies · Robot Learning · Open-X Embodiment Dataset · Octo (Robot Policy) · Cross-Embodied Learning · Transformer Models · Embodied Chain of Thought · Intelligent Reasoning · Multisensory Integration · Simulation-to-Real Transfer · Robot Evaluation · Video Prediction Models · Spatial Intelligence