Stanford Seminar - Generalization through Task Representations with Foundation Models

[HPP] Percy Liang · July 14, 2025 · 28 min

The Challenge of Robotic Generalization

  • 🎯 The north star in robotics is building autonomous robots for unstructured environments that follow natural language commands.
  • ⚠️ Despite progress in visuomotor policies and open-world deployment, task-level generalization remains an unsolved problem.
  • 🧠 A fundamental question is "What is a task and how to represent it?", especially for complex, interdependent household manipulation tasks.

Evolving Task Representations

  • 💬 Early work used natural language with LLMs for planning, but robotic tasks are inherently spatial, not just textual.
  • 💡 3D value maps (VoxPoser) ground tasks in 3D space, using foundation models to generate code that constructs voxel-based value maps for goals and obstacles.
  • 🚀 4D space-time representations (ReKep) define tasks as sequences of relational keypoint constraints, enabling spatial and temporal composition, closed-loop control, and backtracking for complex actions like pouring.
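
The voxel-based value map idea above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the grid size, the cost formula, the function names `build_value_map` and `greedy_path`, and the greedy descent itself. In the approach described in the talk, a foundation model generates the code that builds such maps; here the map is written by hand, and this is not the authors' implementation.

```python
from itertools import product


def build_value_map(size, goal, obstacles, obstacle_radius=1.5, obstacle_penalty=10.0):
    """Return {voxel: cost}. Cost is distance to the goal plus a penalty near obstacles.

    Hypothetical cost design: lower is better, so the goal voxel has cost 0.
    """
    value_map = {}
    for voxel in product(range(size), repeat=3):
        cost = sum((a - b) ** 2 for a, b in zip(voxel, goal)) ** 0.5
        for obs in obstacles:
            dist = sum((a - b) ** 2 for a, b in zip(voxel, obs)) ** 0.5
            if dist <= obstacle_radius:
                cost += obstacle_penalty
        value_map[voxel] = cost
    return value_map


def greedy_path(value_map, start, max_steps=100):
    """Step to the lowest-cost neighboring voxel until no neighbor improves."""
    path = [start]
    current = start
    for _ in range(max_steps):
        neighbors = [
            tuple(c + d for c, d in zip(current, delta))
            for delta in product((-1, 0, 1), repeat=3)
            if delta != (0, 0, 0)
        ]
        neighbors = [n for n in neighbors if n in value_map]
        best = min(neighbors, key=value_map.get)
        if value_map[best] >= value_map[current]:
            break  # local minimum reached (ideally the goal)
        current = best
        path.append(current)
    return path


if __name__ == "__main__":
    vmap = build_value_map(size=8, goal=(7, 7, 7), obstacles=[(4, 4, 4)])
    print(greedy_path(vmap, start=(0, 0, 0)))
```

Greedy descent is only a stand-in for a real motion planner: it can stall in a local minimum when obstacles block every improving neighbor, which is one reason a practical system would pair the value map with a proper planner and closed-loop replanning.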

Affordance and Demonstration-Based Learning

  • 🔍 Affordance-based representations identify action possibilities at the pixel level, using VLMs to brainstorm tasks and match them to object regions for autonomous data collection.
  • 🌱 This approach allows a single model to generalize to unseen objects, poses, and task instructions in real-world scenes.
  • Demonstration-based learning leverages foundation models to decompose human interaction trajectories into fine-grained, composable data for both spatial and temporal dimensions, enabling generalization across varying task parameters.
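
The pixel-level affordance idea in this section can be sketched as picking the image location that a model scores highest for a given task. The heatmap values, the `min_score` threshold, and the function name `best_affordance_pixel` below are hypothetical; in the described approach the scores would come from a learned model, with VLMs proposing the tasks.

```python
def best_affordance_pixel(heatmap, min_score=0.5):
    """Return (row, col, score) of the highest-scoring pixel.

    Returns None when the heatmap is empty or no pixel reaches min_score,
    i.e. when the task does not appear to be afforded anywhere in the image.
    """
    best = None
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if best is None or score > best[2]:
                best = (r, c, score)
    if best is None or best[2] < min_score:
        return None
    return best


if __name__ == "__main__":
    # Hypothetical 3x3 affordance scores for an instruction such as "open the drawer".
    heatmap = [
        [0.1, 0.2, 0.1],
        [0.3, 0.9, 0.4],
        [0.1, 0.2, 0.1],
    ]
    print(best_affordance_pixel(heatmap))
```

The threshold gives the robot a way to decline a task when no region of the scene supports it, rather than always acting on the least-bad pixel.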

A New Paradigm for Robotic Intelligence

  • 📈 An alternative view proposes leveraging foundation models to provide task-specific knowledge (like a reward function) for semantic understanding.
  • 🌍 This combines with task-agnostic world modeling from interactions in the physical world, aiming for robots that understand and act with purpose and generality.
  • 🛠️ The discussion highlights planning and optimization as a way to derive actions from environmental models and objectives, complementing data-driven prediction.
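
Deriving actions from an environment model and an objective, as the last point describes, can be illustrated with random-shooting model-predictive control. This is a toy sketch under loud assumptions: the 1D `dynamics`, the distance-based `reward`, and all parameter values are hypothetical stand-ins for a learned world model and a foundation-model-provided objective, and random shooting is a generic planner chosen for brevity, not the method from the talk.

```python
import random


def dynamics(state, action):
    """Hypothetical task-agnostic world model: a 1D point that moves by the action."""
    return state + action


def reward(state, goal):
    """Hypothetical task-specific objective: negative distance to the goal."""
    return -abs(goal - state)


def plan(state, goal, rng, horizon=3, num_samples=200):
    """Sample random action sequences, roll each out in the model, keep the best."""
    best_seq, best_return = None, float("-inf")
    for _ in range(num_samples):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s = dynamics(s, a)
            total += reward(s, goal)
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq


def run(state, goal, steps=20, seed=0):
    """Closed loop: replan at every step and execute only the first planned action."""
    rng = random.Random(seed)
    for _ in range(steps):
        action = plan(state, goal, rng)[0]
        state = dynamics(state, action)
    return state


if __name__ == "__main__":
    print(run(state=0.0, goal=3.0))
```

No action labels or demonstrations are used here: the behavior comes entirely from searching over the model's predictions for sequences that score well under the objective, which is the contrast with data-driven action prediction that the discussion draws.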

Topics Discussed

Generalization · Task Representations · Foundation Models · Robotic Manipulation · Vision Language Models (VLMs) · Large Language Models (LLMs) · 3D Value Maps · Keypoint Relations · Affordance Learning · Model-Based Planning · Imitation Learning · World Modeling · Zero-Shot Generalization · Motion Planning · Semantic Understanding