Robot Foundation Models: Sergey Levine's PyTorch 2025 Keynote

[HPP] Sergey Levine · October 28, 2025 · 14 min

Early Vision-Language-Action Models

💡 The RT-2 model was the first vision-language-action (VLA) model, adapting multimodal LLM designs for robotic control by framing actions as numerical answers to visual questions.
🧠 This approach allowed for the transfer of semantic knowledge from large vision-language models into the robotic domain, leveraging existing understanding of the visual world.
⚠️ Initial VLA designs were limited to rudimentary tasks and used coarse discretization of actions, neglecting the continuous nature of robotic movements.
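
To make the "coarse discretization" point concrete, here is a minimal sketch of how a continuous action vector could be binned into integer tokens that a language model can emit, and then decoded again. The 256-bin count, the [-1, 1] action range, and the function names are illustrative assumptions, not details given in the talk.

```python
def discretize(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer bin index (a "token")."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)           # clip into the supported range
        frac = (a - low) / (high - low)      # position in [0, 1]
        tokens.append(min(int(frac * bins), bins - 1))
    return tokens


def undiscretize(tokens, low=-1.0, high=1.0, bins=256):
    """Map bin indices back to the center value of each bin."""
    width = (high - low) / bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical 3-dimensional action, e.g. two joint deltas and a gripper command.
action = [0.1234, -0.5, 1.0]
tokens = discretize(action)
recovered = undiscretize(tokens)
max_error = max(abs(a - r) for a, r in zip(action, recovered))
```

The round trip can lose up to half a bin width per dimension. That quantization error is the kind of limitation a continuous action decoder is meant to avoid.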

The Power of Diverse Datasets

📊 The RT-X project was a significant effort to create diverse, cross-embodiment robotic datasets, collecting data from 34 labs and 22 robot types.
🚀 This large-scale data enabled general-purpose robotic models to outperform specialized models in their own domains, a breakthrough not previously seen in robotics.
✅ The blessing of generality observed with foundation models in other fields (like LLMs) began to manifest in robotics through these diverse datasets.

Advancements in VLA Architecture

🛠️ Second-generation VLAs move beyond simple adaptation by adding a dedicated "motor cortex" or action decoder to the LLM backbone.
⚙️ This component specializes in producing smooth, continuous, and dexterous actions, often trained with methods like diffusion or flow matching, allowing for more sophisticated control.
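
As a rough illustration of the flow-matching idea mentioned above, the sketch below builds one training example on a straight-line path from noise to a target action, then generates an action by integrating a velocity field with Euler steps. The linear path, the time convention (noise at t = 0, action at t = 1), and every name here are assumptions for illustration. The `oracle_velocity` function is a stand-in for a trained network, not how any particular model is implemented.

```python
import random


def flow_matching_pair(action, rng):
    """Return (x_t, t, target_velocity) for a straight noise-to-action path.

    A network v(x_t, t) would be regressed onto target_velocity,
    typically with a squared-error loss.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, target_velocity


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
action = [0.3, -0.7, 0.05]  # hypothetical continuous action
x_t, t, target_velocity = flow_matching_pair(action, rng)


def oracle_velocity(x, t):
    # Stand-in for a trained network: the exact field for a single target action.
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]


start = [rng.gauss(0.0, 1.0) for _ in action]
sample = euler_sample(oracle_velocity, start)
```

Because the output is a real-valued vector rather than a bin index, nothing is quantized, which is why this style of decoder suits smooth, dexterous motion.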

π0: Complex Task Execution

🤖 The π0 ("pi-zero") model exemplifies second-generation VLAs, pre-trained on 10,000 hours of diverse robotic data.
✨ It can perform intricate tasks like folding laundry and exhibits emergent behavior, such as correcting its own mistakes when encountering unfamiliar states.
🧩 π0's ability to recover from perturbations highlights the power of large datasets combined with a pre-training and post-training recipe.

π0.5: Reasoning and Verbal Instruction

🧠 The π0.5 model extends VLA capabilities by incorporating an internal reasoning mechanism for long-horizon tasks, like cleaning a bedroom in an unfamiliar house.
💬 It uses a chain-of-thought reasoning step, generating internal language commands before executing actions.
🗣️ π0.5 can improve through verbal instructions, allowing humans to coach the robot with language, reducing the need for direct action supervision.
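
The "reason in language, then act" loop described above can be sketched as a two-stage control loop. Everything below is a toy: `predict_subtask` and `predict_actions` are hypothetical stand-ins for the model's language step and action decoder, and the dictionary observation is an invented placeholder for camera input.

```python
def predict_subtask(observation, instruction):
    """Hypothetical reasoning step: emit the next language command, or "done"."""
    if observation["clutter"]:
        return f"pick up the {observation['clutter'][0]}"
    return "done"


def predict_actions(observation, subtask, chunk_size=3):
    """Hypothetical action decoder: return a short chunk of low-level actions."""
    return [f"{subtask} :: step {i}" for i in range(chunk_size)]


def run_episode(observation, instruction, max_steps=10):
    """Alternate between choosing a language subtask and executing its actions."""
    observation = {"clutter": list(observation["clutter"])}  # avoid mutating input
    log = []
    for _ in range(max_steps):
        subtask = predict_subtask(observation, instruction)
        if subtask == "done":
            break
        actions = predict_actions(observation, subtask)
        log.append((subtask, len(actions)))
        observation["clutter"].pop(0)  # toy environment: the item is now tidied
    return log


log = run_episode({"clutter": ["pillow", "shirt"]}, "clean the bedroom")
```

Because the intermediate command is plain language, a human could in principle supply or correct it directly, which is one way to read the verbal-coaching point.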

Future of Robot Foundation Models

📈 While showing promising generalization across platforms and environments, current models are still trained primarily through imitation.
🎯 Future work needs to focus on optimizing for task success, robustness, and speed, alongside more sophisticated planning and generalization capabilities.
💻 Both the π0 and π0.5 models are available with a PyTorch port for researchers to experiment with.

What’s Discussed

Robot Foundation Models, Vision-Language-Action (VLA) models, RT-2 model, Multimodal LLMs, Robotic Control, Cross-embodiment Datasets, General-purpose Robotic Models, Second-generation VLAs, Action Decoders, π0 Model, Pre-training, π0.5 Model, Reasoning Mechanisms, Verbal Instructions, Generalization
