Robot Foundation Models: Sergey Levine's PyTorch 2025 Keynote
[HPP] Sergey Levine · October 28, 2025 · 14 min
Early Vision-Language-Action Models
- 💡 RT-2 was the first vision-language-action (VLA) model. It adapted multimodal LLM designs to robotic control by framing actions as numerical answers to visual questions.
- 🧠 This approach allowed for the transfer of semantic knowledge from large vision-language models into the robotic domain, leveraging existing understanding of the visual world.
- ⚠️ Initial VLA designs were limited to rudimentary tasks and used coarse discretization of actions, neglecting the continuous nature of robotic movements.
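The "actions as numerical answers" framing, and the coarseness it introduces, can be illustrated with a minimal sketch. The bin count (256), the normalized action range [-1, 1], and all function names here are assumptions chosen for illustration; the talk does not specify them.

```python
def discretize(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)        # clip into the valid range
        frac = (a - low) / (high - low)   # rescale to [0, 1]
        tokens.append(min(int(frac * bins), bins - 1))
    return tokens


def undiscretize(tokens, low=-1.0, high=1.0, bins=256):
    """Recover the bin-center value for each token (lossy)."""
    width = (high - low) / bins
    return [low + (t + 0.5) * width for t in tokens]


def as_answer(tokens):
    """Render the bin indices as the text 'answer' a language model could emit."""
    return " ".join(str(t) for t in tokens)


action = [0.1234, -0.5, 1.0]       # hypothetical 3-DoF command
tokens = discretize(action)        # integer bins, one per dimension
answer = as_answer(tokens)         # text form of the action
decoded = undiscretize(tokens)     # what the robot would actually execute
```

The round trip through `discretize` and `undiscretize` loses up to half a bin width per dimension, which is the kind of precision loss the limitation above refers to.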
The Power of Diverse Datasets
- 📊 The RT-X project was a major effort to build diverse, cross-embodiment robotic datasets, pooling data from 34 labs and 22 robot types.
- 🚀 This large-scale data enabled general-purpose robotic models to outperform specialized models in their own domains, a breakthrough not previously seen in robotics.
- ✅ The blessing of generality observed with foundation models in other fields (like LLMs) began to manifest in robotics through these diverse datasets.
Advancements in VLA Architecture
- 🛠️ Second-generation VLAs move beyond simple adaptation by adding a dedicated "motor cortex" or action decoder to the LLM backbone.
- ⚙️ This component specializes in producing smooth, continuous, and dexterous actions, often trained with methods like diffusion or flow matching, allowing for more sophisticated control.
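As a rough illustration of how an action decoder can be trained with flow matching, the sketch below builds a regression target on a straight path from Gaussian noise to a demonstrated action, then generates an action by integrating a velocity field. The time convention (t = 0 is noise, t = 1 is the action), the linear path, and every name in the code are assumptions for illustration, not details from the talk. The `oracle_velocity` function stands in for a trained network.

```python
import random


def flow_matching_pair(action, t, rng):
    """Build one training example on the straight path from noise to action.

    Returns the noisy point x_t and the velocity the decoder should predict.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, target_velocity


def mse(pred, target):
    """Regression loss between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def sample(velocity_fn, dim, steps, rng):
    """Generate an action by Euler-integrating the velocity field from noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Stand-in for a trained decoder: the exact velocity field when the data
# consists of a single action. A real decoder would be a neural network
# conditioned on images and language.
goal = [0.3, -0.7, 0.05]


def oracle_velocity(x, t):
    return [(g - xi) / (1.0 - t) for g, xi in zip(goal, x)]


rng = random.Random(0)
x_t, v_target = flow_matching_pair(goal, t=0.4, rng=rng)
loss_for_perfect_prediction = mse(oracle_velocity(x_t, 0.4), v_target)
generated = sample(oracle_velocity, dim=3, steps=10, rng=rng)
```

Unlike binned tokens, the output here is a real-valued vector, so the decoder can express fine-grained motion without a quantization step.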
PiZero: Complex Task Execution
- 🤖 The PiZero model exemplifies second-generation VLAs, pre-trained on 10,000 hours of diverse robotic data.
- ✨ It can perform intricate tasks like folding laundry and exhibits emergent behavior, such as correcting its own mistakes when encountering unfamiliar states.
- 🧩 PiZero's ability to recover from perturbations shows what large datasets combined with a pre-training/post-training recipe can deliver.
Pi0.5: Reasoning and Verbal Instruction
- 🧠 The Pi0.5 model extends VLA capabilities with an internal reasoning mechanism for long-horizon tasks, such as cleaning a bedroom in an unfamiliar house.
- 💬 It uses a chain-of-thought reasoning step, generating internal language commands before executing actions.
- 🗣️ Pi0.5 can also improve through verbal instructions: humans can coach the robot in language, which reduces the need for direct action supervision.
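The two-stage pattern described above, a language-level "thought" followed by low-level execution, can be sketched as a simple loop. Both stages below are toy, rule-based stand-ins with hypothetical names and a made-up observation format. In the real model, each would be produced by the learned network from camera images.

```python
def propose_subtask(observation, task):
    """Hypothetical stand-in for the language head: choose the next step."""
    if observation["clutter"]:
        return f"pick up the {observation['clutter'][0]}"
    return "done"


def act_on_subtask(observation, subtask):
    """Hypothetical stand-in for the action decoder: carry out one step."""
    item = subtask[len("pick up the "):]
    remaining = [c for c in observation["clutter"] if c != item]
    return {"clutter": remaining}


def run_episode(observation, task, max_steps=10):
    """Alternate between a language-level thought and low-level execution."""
    thoughts = []
    for _ in range(max_steps):
        subtask = propose_subtask(observation, task)  # internal language command
        if subtask == "done":
            break
        thoughts.append(subtask)
        observation = act_on_subtask(observation, subtask)
    return thoughts, observation


thoughts, final = run_episode({"clutter": ["pillow", "shirt"]}, "clean the bedroom")
```

Because the intermediate step is plain language, a human correction could in principle replace `subtask` before it is executed, which is one way to picture coaching the robot verbally.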
Future of Robot Foundation Models
- 📈 Although current models show promising generalization across platforms and environments, they are still trained primarily through imitation learning.
- 🎯 Future work needs to focus on optimizing for task success, robustness, and speed, alongside more sophisticated planning and generalization capabilities.
- 💻 Both the PiZero and Pi0.5 models are available with a PyTorch port for researchers to experiment with.
Topics Discussed
Robot Foundation Models, Vision-Language-Action (VLA) models, RT-2 model, Multimodal LLMs, Robotic Control, Cross-embodiment Datasets, General-purpose Robotic Models, Second-generation VLAs, Action Decoders, PiZero Model, Pre-training, Pi0.5 Model, Reasoning Mechanisms, Verbal Instructions, Generalization