Actuate 2025: Robotic Foundation Models with Liyiming Ke & Sergey Levine
[HPP] Sergey Levine · November 12, 2025 · 25 min
The Promise of Generalist Robotic Models
- Large Language Models (LLMs) demonstrate power through their generality and ability to leverage web-scale knowledge for diverse problems.
- The core goal is to extend this generalist capability to robotics, enabling a single model to control any robot for any task.
- Early Vision-Language-Action (VLA) models adapted multimodal LLMs (VLMs) for robotic control by framing tasks as a question-answering problem.
- These first-generation VLAs could execute simple cognitive tasks and follow out-of-distribution instructions, showing initial promise.
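One way to picture the question-answering framing above is to pose the instruction as a question and have the model answer with an action written as text. The sketch below is illustrative only: the 256-bin discretization, the [-1, 1] action range, and the helper names `build_prompt`, `encode_action`, and `decode_action` are assumptions, not details given in the talk summary.

```python
def build_prompt(instruction):
    """Phrase a robot instruction as a question for a VLM (hypothetical template)."""
    return f"Q: What action should the robot take to {instruction}? A:"


def encode_action(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer bin, joined as text."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        # Scale to [0, bins - 1] and round to the nearest bin index.
        index = round((clipped - low) / (high - low) * (bins - 1))
        tokens.append(str(index))
    return " ".join(tokens)


def decode_action(text, low=-1.0, high=1.0, bins=256):
    """Invert encode_action: turn the answer string back into continuous values."""
    return [low + int(tok) / (bins - 1) * (high - low) for tok in text.split()]


if __name__ == "__main__":
    print(build_prompt("pick up the cup"))
    answer = encode_action([0.3, -0.7, 1.0])
    print(answer, decode_action(answer))
```

Decoding recovers each value only to within half a bin width, which hints at why a text-style output is a coarse fit for fine motor control.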
Data Scaling and Generalization
- The RT-X project highlighted the importance of aggregating data from many different robots and academic labs to overcome data limitations.
- This research provided evidence that generalist models can surpass specialist models in robotics, mirroring trends seen with LLMs.
- Utilizing cross-embodiment data significantly enhanced the models' instruction following capabilities and success rates for various tasks.
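Pooling data from different robots raises a practical question: their action vectors have different lengths. A minimal sketch of one common workaround, padding every action to a shared width and keeping a mask of the valid dimensions, is shown below. The record layout and the function names `pad_action` and `pool_datasets` are hypothetical; the summary does not say how the aggregated datasets were actually unified.

```python
def pad_action(action, max_dim):
    """Pad an action vector with zeros to max_dim and return (padded, mask)."""
    if len(action) > max_dim:
        raise ValueError("action has more dimensions than max_dim")
    pad = max_dim - len(action)
    padded = list(action) + [0.0] * pad
    mask = [1] * len(action) + [0] * pad  # 1 marks a real action dimension
    return padded, mask


def pool_datasets(datasets, max_dim):
    """Merge per-robot episode lists into one list with a shared action width."""
    pooled = []
    for robot, episodes in datasets.items():
        for episode in episodes:
            padded, mask = pad_action(episode["action"], max_dim)
            pooled.append({
                "robot": robot,
                "instruction": episode["instruction"],
                "action": padded,
                "mask": mask,
            })
    return pooled


if __name__ == "__main__":
    datasets = {
        "single_arm": [{"instruction": "open drawer", "action": [0.1, 0.2, 0.3]}],
        "mobile_base": [{"instruction": "go to sink", "action": [0.5, -0.5]}],
    }
    for record in pool_datasets(datasets, max_dim=4):
        print(record)
```

The mask lets a training loss ignore the padded dimensions, so a robot with fewer joints is not penalized for outputs it does not have.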
Evolving VLA Architectures
- Initial VLAs were constrained by directly adapting VLM architectures, which were not optimized for high-frequency dexterous control in robotics.
- Second-generation VLAs, such as PI0, introduced a specialized "motor cortex" module designed for continuous, high-frequency action generation.
- The PI05 model employs a sophisticated pre-training and post-training paradigm, leveraging diverse data for pre-training and fine-tuning for high-level reasoning and complex command execution.
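To make the idea of high-frequency action generation concrete, the sketch below shows a control loop in which a slow model call returns a short chunk of continuous actions that are then executed one per control tick. Action chunking is an assumption used here for illustration; the summary does not describe how the "motor cortex" module works internally, and `run_control_loop` and the toy policy are hypothetical.

```python
from collections import deque


def run_control_loop(policy, observations, chunk_size):
    """Execute one action per observation, querying the policy only when the
    current chunk of actions has been used up.

    Returns the executed actions and the number of policy calls made.
    """
    queue = deque()
    executed = []
    calls = 0
    for obs in observations:
        if not queue:
            # Slow step: ask the model for the next chunk of continuous actions.
            queue.extend(policy(obs, chunk_size))
            calls += 1
        # Fast step: send the next queued action to the robot.
        executed.append(queue.popleft())
    return executed, calls


if __name__ == "__main__":
    # Toy policy: the "action" is just the observation plus an offset.
    toy_policy = lambda obs, n: [obs + 0.1 * i for i in range(n)]
    actions, calls = run_control_loop(toy_policy, [0.0] * 10, chunk_size=4)
    print(len(actions), "actions from", calls, "policy calls")
```

With a chunk size of 4, ten control ticks need only three model calls, which is the kind of decoupling that lets a large model drive a fast control loop.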
Scaling Data Collection and Real-World Application
- Teleoperation is a key method for collecting high-quality, complex, and dexterous task data from human operators.
- Data collection rapidly scaled from 3,800 hours to over 10,000 hours, expanding to include mobile manipulation in diverse real-world home environments.
- Operating in uncontrolled real-world settings (e.g., homes with varied layouts) drastically increases data diversity and task complexity, pushing model capabilities.
- A two-stage training approach, combining diverse pre-training with task-relevant post-training, is critical for achieving generalization to unseen environments.
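The two-stage recipe above can be pictured as two sampling mixtures: a broad one for pre-training and a narrow, task-relevant one for post-training. The sketch below is a minimal illustration; the source names and weights in `PRETRAIN_MIX` and `POSTTRAIN_MIX` are invented for the example and are not figures from the talk.

```python
import random

# Hypothetical mixtures: weights are illustrative, not reported values.
PRETRAIN_MIX = {
    "cross_embodiment": 0.4,
    "non_mobile_manipulation": 0.4,
    "mobile_manipulation_homes": 0.2,
}
POSTTRAIN_MIX = {
    "mobile_manipulation_homes": 1.0,
}


def sample_batch(mixture, batch_size, rng):
    """Draw the data source for each example in a batch according to mixture weights."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    print("pre-training batch: ", sample_batch(PRETRAIN_MIX, 8, rng))
    print("post-training batch:", sample_batch(POSTTRAIN_MIX, 8, rng))
```

Keeping the mixtures as plain dictionaries makes it easy to test how much off-task data the pre-training stage can absorb before the post-training stage narrows the focus.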
Demonstrating Long-Horizon Autonomy
- The PI05 model exhibits the first signs of generalization to new, unfamiliar environments, such as cleaning a kitchen in a never-before-seen home.
- Research suggests that data from approximately 100 distinct homes may be sufficient to achieve robust generalization in test environments.
- Pre-training with a diverse mix of data, even if not directly related to the target task (e.g., non-mobile data for mobile tasks), is essential for superior performance and generalization.
- The models support natural language interaction, allowing them to decompose high-level commands into sub-commands, facilitating human-robot collaboration and prompt engineering for complex tasks.
- PI05 marks a significant milestone, extending autonomous operation from 1-2 minutes to up to 10 minutes for challenging, long-horizon tasks.
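The decomposition of a high-level command into sub-commands, mentioned above, can be sketched as a simple two-level loop: a planner proposes sub-commands and a low-level controller executes them in order. In the example the planner is a hard-coded lookup and the executor is a stub; both stand in for model calls and are hypothetical, as is the function name `run_task`.

```python
def run_task(command, decompose, execute):
    """Run a high-level command as a sequence of sub-commands.

    decompose(command) returns a list of sub-command strings;
    execute(sub_command) returns True on success. Stops at the first failure.
    """
    log = []
    for sub_command in decompose(command):
        succeeded = execute(sub_command)
        log.append((sub_command, succeeded))
        if not succeeded:
            break
    return log


if __name__ == "__main__":
    # Stub planner: a fixed lookup standing in for a language-model call.
    plans = {
        "clean the kitchen": [
            "put the dishes in the sink",
            "wipe the counter",
            "throw away the trash",
        ],
    }
    planner = lambda command: plans.get(command, [command])
    executor = lambda sub_command: True  # stub: pretend every step succeeds

    for step, ok in run_task("clean the kitchen", planner, executor):
        print(f"{'done' if ok else 'failed'}: {step}")
```

Because the sub-commands are plain text, a person can inspect or override them, which is where the human-robot collaboration and prompt engineering come in.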
What's Discussed
- Robotic Foundation Models
- Large Language Models (LLMs)
- Vision-Language-Action (VLA) Models
- Multimodal LLMs (VLMs)
- Robotic Control
- Data Aggregation
- Generalist AI Models
- Cross-Embodiment Data
- Pre-training and Post-training
- Teleoperation
- Mobile Manipulation
- Long-Horizon Autonomy
- Natural Language Interaction
- AI Generalization