AI Scaling: From Tensor Programs to System 2 Reasoning with GDPO & GRPO

[HPP] Greg Yang · January 29, 2026 · 18 min
30 connections · 40 entities in this video

The Evolution of AI Intelligence

  • 💡 AI is transitioning from empirical heuristics (alchemy) to a rigorous mathematical science (engineering), moving beyond guessing what works.
  • 🚀 This deep dive explores three layers: tensor program foundations, frontier architectures, and advanced reasoning models.

Tensor Programs and Scaling Laws

  • 🧠 Tensors are viewed as multilinear maps (functions) that describe data flow, crucial for understanding scaling and hardware like TPUs.
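  • 🔑 Maximal Update Parametrization (μP), developed in Greg Yang's Tensor Programs series, addresses the "lazy regime" that standard parametrizations can fall into at large width, where hidden features barely change during training. μP scales initialization and learning rates with width so that every layer keeps learning features.
  • ✅ μP enables hyperparameter transfer: settings such as the learning rate tuned on a small model stay near-optimal on much wider ones (the video extends this to trillion-parameter models), so the expensive tuning sweep can be run at small scale, transforming AI economics.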
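
The hyperparameter transfer described above can be sketched as a width-scaling rule. This is a minimal illustration, assuming the scaling commonly cited for Adam under μP (input weights and biases keep the base learning rate; hidden and output weights scale it by 1/width multiplier). The function names and parameter groups are hypothetical, not the API of any library, and the real recipe also adjusts initialization and output multipliers.

```python
def mup_adam_learning_rates(base_lr, base_width, width):
    """Scale per-group Adam learning rates from a tuned base model to a wider one.

    Assumption (simplified): under muP with Adam, input weights and biases keep
    the base learning rate, while hidden and output weight matrices use
    base_lr / width_mult, where width_mult = width / base_width.
    """
    width_mult = width / base_width
    return {
        "input_and_biases": base_lr,
        "hidden": base_lr / width_mult,
        "output": base_lr / width_mult,
    }


def hidden_init_std(fan_in):
    """Initial standard deviation for hidden weights: variance 1 / fan_in."""
    return fan_in ** -0.5


# Tune once on a small proxy model, then reuse the same base_lr at larger width.
small = mup_adam_learning_rates(base_lr=1e-3, base_width=256, width=256)
large = mup_adam_learning_rates(base_lr=1e-3, base_width=256, width=4096)
print(small)
print(large)
```

The point of the sketch is that only `base_lr` is tuned; the per-group rates for the wide model follow from the width ratio rather than from a new sweep.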

Frontier Architectures: DeepSeek V3

  • 🧩 DeepSeek V3 uses a Mixture of Experts (MoE) design, activating only a small subset (about 37 billion) of its 671 billion parameters per token for efficiency.
  • 💡 It balances expert load without an auxiliary loss: a per-expert bias is added to the routing scores and adjusted dynamically, so underused experts get selected more often without a balancing loss term that would compete with the main training objective.
  • 💾 Multi-head Latent Attention (MLA) addresses the KV cache memory bottleneck by compressing keys and values into a compact latent vector, like storing CliffsNotes instead of the full text.
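
The bias-based load balancing described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: routing scores are positive affinities, the bias affects only which experts are selected (gate weights still come from the raw scores), and the bias moves by a fixed step after each batch. The names `route`, `update_bias`, and `step_size` are hypothetical.

```python
def route(scores, bias, top_k):
    """Select top_k experts by (score + bias); weight them by raw scores only."""
    ranked = sorted(
        range(len(scores)),
        key=lambda e: scores[e] + bias[e],
        reverse=True,
    )
    chosen = ranked[:top_k]
    total = sum(scores[e] for e in chosen)
    # The bias steers selection but never enters the gate weights.
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    target = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > target:
            new_bias.append(b - step_size)
        elif n < target:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias


# One token, three experts, two selected per token.
gates_before = route([0.9, 0.8, 0.1], [0.0, 0.0, 0.0], top_k=2)
# Suppose expert 0 handled 6 tokens in the last batch, expert 1 two, expert 2 one.
bias = update_bias([0.0, 0.0, 0.0], load=[6, 2, 1], step_size=0.1)
print(gates_before, bias)
```

Because the correction lives in the selection step rather than in the loss, no balancing gradient is mixed into the language-modeling gradient.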

Advanced Reasoning with Reinforcement Learning

  • 🎯 The goal is to move beyond next-token prediction to "System 2" intelligence, where models "think before they speak" using autonomous reasoning.
  • ❌ Traditional Proximal Policy Optimization (PPO) is expensive: it requires a separate "critic" (value) model, typically about as large as the main model, to estimate how good each output is.
  • 📈 Group Relative Policy Optimization (GRPO) eliminates the critic by grading each output on a curve against a group of alternatives sampled for the same prompt, significantly reducing memory.
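
The "grading on a curve" step can be sketched as follows. This is a minimal illustration of the group-relative advantage only, assuming a population standard deviation and a small `eps` for numerical safety; the clipped policy-gradient objective and KL penalty that GRPO also uses are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

The group mean plays the role of the baseline that the critic would otherwise supply, which is why no value model needs to be kept in memory.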

GDPO: Multi-Objective Reasoning

  • ⚠️ GRPO suffers from reward advantage collapse in multi-objective tasks: when several rewards are summed before normalization, outputs with different reward combinations but the same total receive the same advantage, hiding signals that are crucial for learning.
  • 🔥 Group Reward Decoupled Normalization Policy Optimization (GDPO) fixes this by applying the "grading curve" to each reward individually before summing them.
  • 📊 GDPO provides a higher-resolution signal, enabling models to balance complex, conflicting goals and achieve significant accuracy gains (e.g., 6.3% on the AIME benchmark).
  • 🚀 This approach leads to process-oriented intelligence, where models simulate thought processes by generating internal reasoning steps.
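
The difference between the two normalization orders can be sketched with a toy group. This is an illustration only, assuming two hypothetical rewards (correctness and format) with equal weight; any reward weighting or final batch-level normalization that a full GDPO implementation may apply is left out.

```python
from statistics import mean, pstdev


def standardize(values, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def summed_then_normalized(reward_columns):
    """GRPO-style: add the rewards per rollout, then normalize the totals."""
    totals = [sum(rollout) for rollout in zip(*reward_columns)]
    return standardize(totals)


def normalized_then_summed(reward_columns):
    """GDPO-style: normalize each reward within the group, then add them."""
    normalized = [standardize(column) for column in reward_columns]
    return [sum(rollout) for rollout in zip(*normalized)]


# Four rollouts for one prompt; each list holds one reward across the group.
correctness = [1.0, 0.0, 0.0, 0.0]
formatting = [0.0, 1.0, 1.0, 0.0]

summed = summed_then_normalized([correctness, formatting])
decoupled = normalized_then_summed([correctness, formatting])
print(summed)
print(decoupled)
```

In this toy group the first three rollouts all total 1.0, so the summed version gives them identical advantages, while the decoupled version credits the one correct rollout more than the two that only satisfied the format reward.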
Topics · 15 themes

What’s Discussed

System 2 Thinking · Tensor Programs · Reasoning Architectures · GDPO (Group Reward Decoupled Normalization Policy Optimization) · GRPO (Group Relative Policy Optimization) · Reinforcement Learning · Maximal Update Parametrization (μP) · DeepSeek V3 · Mixture of Experts (MoE) · Multi-head Latent Attention (MLA) · AI Scaling · Neural Networks · Hyperparameter Transferability · Reward Advantage Collapse · Process-oriented Intelligence