Actuate 2025: Robotic Foundation Models with Liyiming Ke & Sergey Levine

[HPP] Sergey Levine · November 12, 2025 · 25 min

The Promise of Generalist Robotic Models

💡 Large Language Models (LLMs) draw their power from generality: they can apply web-scale knowledge to a wide range of problems.
🎯 The core goal is to extend this generalist capability to robotics, enabling a single model to control any robot for any task.
🔑 Early Vision-Language-Action (VLA) models adapted vision-language models (VLMs, i.e. multimodal LLMs) for robotic control by framing each task as a question-answering problem.
✨ These first-generation VLAs could execute simple cognitive tasks and follow out-of-distribution instructions, showing initial promise.
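
The question-answering framing above can be made concrete with a small sketch. It assumes, as some first-generation VLAs did, that each continuous action dimension is discretized into a fixed number of bins so the model can emit the action as ordinary text tokens. The bin count, the prompt wording, and the function names here are illustrative assumptions, not details given in the talk.

```python
import numpy as np

NUM_BINS = 256  # assumed number of discrete bins per action dimension


def actions_to_tokens(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to integer bin indices in [0, num_bins - 1]."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    scaled = (action - low) / (high - low)  # each entry now lies in [0, 1]
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)


def tokens_to_actions(tokens, low, high, num_bins=NUM_BINS):
    """Map bin indices back to the centre of each bin."""
    return low + (np.asarray(tokens) + 0.5) / num_bins * (high - low)


def build_prompt(instruction):
    """Hypothetical question template; the model's answer would be the action tokens."""
    return f"Q: What action should the robot take to {instruction}? A:"


def format_answer(tokens):
    """Render bin indices as the text string the model is trained to produce."""
    return " ".join(str(int(t)) for t in tokens)
```

The round trip loses at most half a bin width per dimension, which hints at why this token-based formulation is a limiting factor for fine, high-frequency control.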

Data Scaling and Generalization

📊 The RT-X project highlighted the importance of aggregating data from many different robots and academic labs to overcome data limitations.
🚀 This research provided evidence that generalist models can surpass specialist models in robotics, mirroring trends seen with LLMs.
✅ Training on cross-embodiment data significantly improved the models' instruction-following ability and their success rates across tasks.
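
One practical question when aggregating data from many robots is how often to sample from each source, since a few large datasets can otherwise dominate training. The sketch below shows one common heuristic, weighting each dataset by its size raised to a power `alpha`. The summary does not say which weighting the speakers used, so the heuristic, the dataset names, and the sizes are all assumptions.

```python
import random


def mixture_weights(sizes, alpha=0.5):
    """Return sampling probabilities proportional to size ** alpha.

    alpha = 1.0 samples proportionally to dataset size;
    alpha = 0.0 samples every dataset equally often.
    """
    powered = {name: n ** alpha for name, n in sizes.items()}
    total = sum(powered.values())
    return {name: w / total for name, w in powered.items()}


def sample_sources(sizes, k, alpha=0.5, seed=0):
    """Draw k dataset names according to the mixture weights."""
    rng = random.Random(seed)
    weights = mixture_weights(sizes, alpha)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


# Hypothetical episode counts for three robot embodiments.
example_sizes = {"single_arm": 400, "bimanual": 100, "mobile_base": 25}
example_weights = mixture_weights(example_sizes)
```

With `alpha=0.5`, the 16x size gap between the largest and smallest example datasets shrinks to a 4x gap in sampling probability.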

Evolving VLA Architectures

⚠️ Initial VLAs were constrained by directly adapting VLM architectures, which were not optimized for high-frequency dexterous control in robotics.
🛠️ Second-generation VLAs, such as π0, introduced a specialized "motor cortex" module designed for continuous, high-frequency action generation.
🧠 The π0.5 model uses a pre-training and post-training paradigm: it is pre-trained on diverse data, then post-trained (fine-tuned) for high-level reasoning and complex command execution.
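
The summary does not describe how the "motor cortex" module produces continuous actions. One widely used approach for this kind of module is flow matching, where a chunk of future actions starts as Gaussian noise and is moved toward a valid action sequence by integrating a learned velocity field over a few steps. The sketch below illustrates only the sampling loop, under that assumption. `oracle_velocity` is a stand-in for the learned network (it is given the target directly, which a real model would not be), and the chunk shape and step count are illustrative.

```python
import numpy as np


def oracle_velocity(x, t, target):
    """Stand-in for a learned velocity network.

    For a straight-line path from noise to the target, the velocity that
    reaches the target at t = 1 is (target - x) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def sample_action_chunk(target, num_steps=10, seed=0):
    """Generate a continuous action chunk by Euler integration from t = 0 to t = 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        x = x + dt * oracle_velocity(x, t, target)
    return x


# Hypothetical chunk: 50 future timesteps of a 7-dimensional action.
example_target = np.full((50, 7), 0.3)
example_chunk = sample_action_chunk(example_target)
```

Because the whole chunk is produced as real-valued numbers in a handful of integration steps, this style of decoder avoids the discretization error of token-based action outputs and can emit many timesteps per model call.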

Scaling Data Collection and Real-World Application

🤖 Teleoperation is a key method for collecting high-quality, complex, and dexterous task data from human operators.
📈 Data collection rapidly scaled from 3,800 hours to over 10,000 hours, expanding to include mobile manipulation in diverse real-world home environments.
🏡 Operating in uncontrolled real-world settings (e.g., homes with varied layouts) drastically increases data diversity and task complexity, pushing model capabilities.
🔄 A two-stage training approach, combining diverse pre-training with task-relevant post-training, is critical for achieving generalization to unseen environments.

Demonstrating Long-Horizon Autonomy

🌟 The π0.5 model exhibits the first signs of generalization to new, unfamiliar environments, such as cleaning a kitchen in a never-before-seen home.
🔍 Research suggests that data from approximately 100 distinct homes may be sufficient to achieve robust generalization in test environments.
💡 Pre-training with a diverse mix of data, even if not directly related to the target task (e.g., non-mobile data for mobile tasks), is essential for superior performance and generalization.
💬 The models support natural language interaction, allowing them to decompose high-level commands into sub-commands, facilitating human-robot collaboration and prompt engineering for complex tasks.
⏳ π0.5 marks a significant milestone, extending autonomous operation from 1-2 minutes to as long as 10 minutes on challenging, long-horizon tasks.
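
The command decomposition described above can be sketched as a simple two-level loop: a high-level step turns a command into sub-commands, and a low-level step executes each one in turn. Everything here is hypothetical scaffolding. In the real system both levels would be model calls, and the function names, the lookup table, and the stop-on-failure policy are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def run_long_horizon(
    command: str,
    decompose: Callable[[str], List[str]],
    execute: Callable[[str], bool],
) -> List[Tuple[str, bool]]:
    """Run sub-commands in order and stop at the first failure.

    Returns a log of (sub_command, succeeded) pairs.
    """
    log = []
    for sub_command in decompose(command):
        succeeded = execute(sub_command)
        log.append((sub_command, succeeded))
        if not succeeded:
            break
    return log


# Hypothetical stand-ins for the high-level and low-level model calls.
EXAMPLE_PLANS = {
    "clean the kitchen": ["put dishes in the sink", "wipe the counter", "close the cabinets"],
}


def lookup_decompose(command: str) -> List[str]:
    """Return a canned plan, or treat an unknown command as a single step."""
    return EXAMPLE_PLANS.get(command, [command])
```

Keeping the sub-commands as plain language is what lets a person step in mid-task, for example by replacing or reordering a sub-command instead of retraining the model.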

Topics

Robotic Foundation Models · Large Language Models (LLMs) · Vision-Language-Action (VLA) Models · Multimodal LLMs (VLMs) · Robotic Control · Data Aggregation · Generalist AI Models · Cross-Embodiment Data · Pre-training and Post-training · Teleoperation · Mobile Manipulation · Long-Horizon Autonomy · Natural Language Interaction · AI Generalization
