
Robot Foundation Models: Sergey Levine's PyTorch 2025 Keynote

Sergey Levine · October 28, 2025 · 14 min

Early Vision-Language-Action Models

  • 💡 The RT-2 model was the first vision-language-action (VLA) model. It adapted multimodal LLM designs to robotic control by framing actions as numerical answers to visual questions.
  • 🧠 This approach allowed for the transfer of semantic knowledge from large vision-language models into the robotic domain, leveraging existing understanding of the visual world.
  • ⚠️ Initial VLA designs were limited to rudimentary tasks and used coarse discretization of actions, neglecting the continuous nature of robotic movements.
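
The coarse discretization mentioned above can be illustrated with a minimal sketch. It assumes each action dimension is normalized to [-1, 1] and split into 256 uniform bins; the range, the bin count, and the function names are illustrative assumptions, not details taken from the talk.

```python
def discretize_action(action, low=-1.0, high=1.0, num_bins=256):
    """Map each continuous action dimension to an integer bin index (a 'token')."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        # Scale to [0, 1], then to a bin index in [0, num_bins - 1].
        fraction = (clipped - low) / (high - low)
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def undiscretize_action(tokens, low=-1.0, high=1.0, num_bins=256):
    """Recover the bin-center value for each token."""
    width = (high - low) / num_bins
    return [low + (token + 0.5) * width for token in tokens]


# Hypothetical 3-dimensional action (for example, an end-effector displacement).
tokens = discretize_action([0.0, -1.0, 0.1234])
recovered = undiscretize_action(tokens)
```

The round trip loses up to half a bin width per dimension, which is one way to see why a purely token-based output struggles with fine, continuous motion.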

The Power of Diverse Datasets

  • 📊 The RT-X project was a significant effort to create diverse, cross-embodiment robotic datasets, pooling data from 34 labs and 22 robot types.
  • 🚀 This large-scale data enabled general-purpose robotic models to outperform specialized models in their own domains, a breakthrough not previously seen in robotics.
  • ✅ The blessing of generality observed with foundation models in other fields (like LLMs) began to manifest in robotics through these diverse datasets.

Advancements in VLA Architecture

  • 🛠️ Second-generation VLAs move beyond simple adaptation by adding a dedicated "motor cortex" or action decoder to the LLM backbone.
  • ⚙️ This component specializes in producing smooth, continuous, and dexterous actions, often trained with methods like diffusion or flow matching, allowing for more sophisticated control.
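
As a rough illustration of how a flow-matching action decoder works, the sketch below shows the two pieces involved: building a training pair and integrating a velocity field at inference time. It assumes a straight-line path from Gaussian noise (t = 0) to the action (t = 1), which is one common convention and not necessarily the one used by the models in the talk. `oracle_velocity` is a hypothetical stand-in for a trained network.

```python
import random


def flow_matching_pair(action, noise, t):
    """Training pair: a point on the noise-to-action line and its target velocity."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, target_velocity


def sample_action(velocity_fn, dim, num_steps=10, seed=0):
    """Start from Gaussian noise and Euler-integrate the velocity field to t = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        velocity = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


def oracle_velocity(target):
    """Hypothetical stand-in for a learned network: exact field for a single target."""
    def velocity_fn(x, t):
        return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]
    return velocity_fn


target_action = [0.3, -0.7, 0.1]
sampled = sample_action(oracle_velocity(target_action), dim=3)
```

In a real decoder the velocity function would be a network conditioned on the VLA backbone's features and trained to regress `target_velocity` from `x_t` and `t`. The output is a real-valued vector, so no binning of actions is needed.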

π0: Complex Task Execution

  • 🤖 The π0 (pi-zero) model exemplifies second-generation VLAs and is pre-trained on 10,000 hours of diverse robotic data.
  • ✨ It can perform intricate tasks like folding laundry and exhibits emergent behavior, such as correcting its own mistakes when it encounters unfamiliar states.
  • 🧩 π0's ability to recover from perturbations highlights the power of large datasets combined with a pre-training and post-training recipe.

π0.5: Reasoning and Verbal Instruction

  • 🧠 The π0.5 model extends VLA capabilities with an internal reasoning mechanism for long-horizon tasks, like cleaning a bedroom in an unfamiliar house.
  • 💬 It uses a chain-of-thought reasoning step, generating an internal language command before it executes actions.
  • 🗣️ π0.5 can improve through verbal instructions: humans can coach the robot with language, which reduces the need for direct action supervision.
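
The "language command first, actions second" pattern can be sketched as a simple control loop. Everything here is a hypothetical stand-in: the function names, the `"done"` sentinel, and the toy bedroom are illustrative assumptions, not the model's actual interface.

```python
def run_episode(task, observe, predict_subtask, predict_actions, execute, max_steps=20):
    """Alternate between choosing a language subtask and executing actions for it."""
    history = []
    for _ in range(max_steps):
        observation = observe()
        # Step 1: the reasoning step produces an intermediate language command.
        subtask = predict_subtask(observation, task)
        if subtask == "done":
            break
        # Step 2: the action decoder produces low-level actions for that command.
        for action in predict_actions(observation, subtask):
            execute(action)
        history.append(subtask)
    return history


# Toy stand-ins so the loop can run end to end.
clutter = ["pillow", "shirt"]


def observe():
    return list(clutter)


def predict_subtask(observation, task):
    return f"put away the {observation[0]}" if observation else "done"


def predict_actions(observation, subtask):
    return [("grasp", observation[0]), ("place", observation[0])]


def execute(action):
    verb, item = action
    if verb == "place":
        clutter.remove(item)


history = run_episode("clean the bedroom", observe, predict_subtask, predict_actions, execute)
```

Because the intermediate step is plain language, a human correction could in principle be supplied at the same level, as a sentence in place of a demonstrated motion.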

Future of Robot Foundation Models

  • 📈 Current models show promising generalization across platforms and environments, but they are still trained primarily through imitation.
  • 🎯 Future work needs to optimize directly for task success, robustness, and speed, and to develop more sophisticated planning and generalization capabilities.
  • 💻 Both the π0 and π0.5 models are available in a PyTorch port for researchers to experiment with.