Stanford Seminar - Generalization through Task Representations with Foundation Models

[HPP] Percy Liang · July 14, 2025 · 28 min

The Challenge of Robotic Generalization

🎯 The north star in robotics is building autonomous robots for unstructured environments that follow natural language commands.
⚠️ Despite progress in visuomotor policies and open-world deployment, task-level generalization remains an unsolved problem.
🧠 A fundamental question is "What is a task, and how should it be represented?", especially for complex, interdependent household manipulation tasks.

Evolving Task Representations

💬 Early work used natural language with LLMs for planning, but robotic tasks are inherently spatial, not just textual.
💡 3D value maps (VoxPoser) ground tasks in 3D space, using foundation models to generate code that constructs voxel-based value maps for goals and obstacles.
🚀 4D space-time representations (ReKep) define tasks as sequences of relational keypoint constraints, enabling spatial and temporal composition, closed-loop control, and backtracking for complex actions like pouring.
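
To make the voxel value map idea concrete, here is a minimal sketch, not the method from the talk. It assumes a hand-written cost (distance to a goal voxel plus a fixed penalty near obstacles) in place of code generated by a foundation model, and a simple greedy descent in place of a real motion planner. The names `build_value_map` and `greedy_path`, the grid size, and the penalty values are all hypothetical.

```python
import itertools
import math


def build_value_map(size, goal, obstacles, obstacle_radius=1.5, obstacle_cost=10.0):
    """Return a dict mapping each voxel (x, y, z) to a cost (lower is better)."""
    value_map = {}
    for voxel in itertools.product(range(size), repeat=3):
        cost = math.dist(voxel, goal)  # attraction: cost grows with distance to the goal
        for obstacle in obstacles:
            if math.dist(voxel, obstacle) <= obstacle_radius:
                cost += obstacle_cost  # repulsion: fixed penalty near an obstacle
        value_map[voxel] = cost
    return value_map


def greedy_path(value_map, start, max_steps=100):
    """Step to the lowest-cost neighbouring voxel until no neighbour improves."""
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    path = [start]
    current = start
    for _ in range(max_steps):
        neighbours = [tuple(c + d for c, d in zip(current, o)) for o in offsets]
        neighbours = [n for n in neighbours if n in value_map]  # stay inside the grid
        best = min(neighbours, key=value_map.get)
        if value_map[best] >= value_map[current]:
            break  # local minimum reached
        current = best
        path.append(current)
    return path


# Assumed toy scene: an 8x8x8 grid, one goal voxel, one obstacle voxel.
value_map = build_value_map(size=8, goal=(7, 7, 7), obstacles=[(4, 4, 4)])
path = greedy_path(value_map, start=(0, 0, 0))
```

Greedy descent can stall in a local minimum when obstacles are larger or the penalty is shaped differently, which is one reason a practical system would pair the value map with a proper planner.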

Affordance and Demonstration-Based Learning

🔍 Affordance-based representations identify action possibilities at the pixel level, using VLMs to brainstorm tasks and match them to object regions for autonomous data collection.
🌱 This approach allows a single model to generalize to unseen objects, poses, and task instructions in real-world scenes.
✅ Demonstration-based learning leverages foundation models to decompose human interaction trajectories into fine-grained, composable data for both spatial and temporal dimensions, enabling generalization across varying task parameters.

A New Paradigm for Robotic Intelligence

📈 An alternative view proposes leveraging foundation models to provide task-specific knowledge (like a reward function) for semantic understanding.
🌍 This combines with task-agnostic world modeling from interactions in the physical world, aiming for robots that understand and act with purpose and generality.
🛠️ The discussion highlights planning and optimization as a way to derive actions from environmental models and objectives, complementing data-driven prediction.
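
The combination of a task-specific objective, a task-agnostic world model, and planning can be illustrated with a small random-shooting planner. This is a sketch under strong assumptions, not the approach presented in the talk: `dynamics` is a hypothetical stand-in for a learned world model (a 2D point that moves by the commanded displacement), and `reward` is a hypothetical stand-in for a reward that a foundation model might supply (negative squared distance to a goal).

```python
import random


def dynamics(state, action):
    """Hypothetical world model: a 2D point moves by the commanded displacement."""
    return (state[0] + action[0], state[1] + action[1])


def reward(state, goal=(1.0, 1.0)):
    """Hypothetical task reward: negative squared distance to the goal."""
    return -((state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2)


def plan(state, horizon=5, num_samples=200, max_step=0.2, rng=random):
    """Sample action sequences, roll each out through the model, and
    return the first action of the sequence with the highest total reward."""
    best_return, best_action = float("-inf"), (0.0, 0.0)
    for _ in range(num_samples):
        actions = [
            (rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
            for _ in range(horizon)
        ]
        rollout_state, total = state, 0.0
        for action in actions:
            rollout_state = dynamics(rollout_state, action)
            total += reward(rollout_state)
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action


# Closed loop: replan at every step and execute only the first planned action.
rng = random.Random(0)
state = (0.0, 0.0)
for _ in range(30):
    state = dynamics(state, plan(state, rng=rng))
final_distance = (-reward(state)) ** 0.5
```

Only `reward` encodes the task here. Swapping in a different objective changes the behaviour without touching the model or the planner, which is the separation the summary describes.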

What’s Discussed

Generalization, Task Representations, Foundation Models, Robotic Manipulation, Vision Language Models (VLMs), Large Language Models (LLMs), 3D Value Maps, Keypoint Relations, Affordance Learning, Model-Based Planning, Imitation Learning, World Modeling, Zero-Shot Generalization, Motion Planning, Semantic Understanding
