AI Scaling: From Tensor Programs to System 2 Reasoning with GDPO & GRPO
[HPP] Greg Yang · January 29, 2026 · 18 min
The Evolution of AI Intelligence
- 💡 AI is transitioning from empirical heuristics (alchemy) to a rigorous mathematical science (engineering), moving beyond guessing what works.
- 🚀 This deep dive explores three layers: tensor program foundations, frontier architectures, and advanced reasoning models.
Tensor Programs and Scaling Laws
- 🧠 Tensors are viewed as multilinear maps (functions) that describe data flow, crucial for understanding scaling and hardware like TPUs.
- 🔑 Maximal Update Parametrization (μP), detailed in Greg Yang's Tensor Programs series, addresses the "lazy regime" problem of standard parameterization, in which very wide networks learn few new features; μP is designed so that every layer keeps learning features as width grows.
- ✅ μP enables transferability of hyperparameters: settings tuned on small models carry over to massive, even trillion-parameter, models with little or no retuning, transforming AI economics.
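The hyperparameter transfer described above can be illustrated with a minimal sketch. It assumes the Adam optimizer and covers only hidden (matrix-like) layers, where μP is commonly summarized as scaling the learning rate by 1/width and the initialization standard deviation by 1/sqrt(width). The function name `mup_hidden_layer_settings` and the base values are hypothetical, and input and output layers follow different rules that are not shown here.

```python
import math

def mup_hidden_layer_settings(width, base_width, base_lr, base_init_std):
    """Rescale hidden-layer Adam settings tuned at base_width to a new width.

    Assumption: only hidden (matrix-like) weights are handled here.
    """
    ratio = width / base_width  # how many times wider the target model is
    return {
        "lr": base_lr / ratio,                         # learning rate shrinks as 1/width
        "init_std": base_init_std / math.sqrt(ratio),  # init std shrinks as 1/sqrt(width)
    }

# Hypothetical values tuned on a small proxy model of width 256.
small = mup_hidden_layer_settings(256, 256, base_lr=1e-3, base_init_std=0.02)
large = mup_hidden_layer_settings(4096, 256, base_lr=1e-3, base_init_std=0.02)
print("small:", small)
print("large:", large)
```

The point of the sketch is that only the small model is tuned; the large model's settings follow from the width ratio rather than from a second sweep.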
Frontier Architectures: DeepSeek V3
- 🧩 DeepSeek V3 utilizes Mixture of Experts (MoE), activating only a small subset (about 37 billion) of its 671 billion parameters per token for efficiency.
- 💡 It employs auxiliary loss-free load balancing for MoE, using a dynamic bias term to encourage expert utilization without performance penalties.
- 💾 Multi-head Latent Attention (MLA) addresses the KV cache memory bottleneck by compressing keys and values into a latent vector, like storing CliffsNotes instead of the full text.
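The bias-based load balancing mentioned above can be sketched as a toy simulation. This is an illustration under stated assumptions, not DeepSeek's implementation: the function names, the four-expert setup, the score distribution, and the step size are all hypothetical. The key idea it tries to capture is that the bias affects only which experts are selected, while the gate weights still come from the raw scores, and that the bias is nudged down for overloaded experts and up for underloaded ones.

```python
import random

def route_top_k(scores, bias, k):
    """Pick k experts by (score + bias); gate weights use the raw scores only."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}

def update_bias(bias, load, step):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

def simulate(num_steps, tokens_per_step=64, step=0.005, seed=0):
    """Toy router in which expert 0 starts out strongly preferred (assumed scores)."""
    rng = random.Random(seed)
    base = [0.9, 0.5, 0.5, 0.5]  # hypothetical mean affinity per expert
    bias = [0.0] * len(base)
    history = []
    for _ in range(num_steps):
        load = [0] * len(base)
        for _ in range(tokens_per_step):
            scores = [s + rng.uniform(-0.1, 0.1) for s in base]
            for expert in route_top_k(scores, bias, k=1):
                load[expert] += 1
        history.append(load)
        bias = update_bias(bias, load, step)  # no auxiliary loss term is involved
    return history, bias

history, final_bias = simulate(300)
print("first step load:", history[0])
print("last step load: ", history[-1])
print("final bias:     ", final_bias)
```

Because the correction lives in the routing bias rather than in the training loss, the balancing pressure does not add a competing gradient to the language-modeling objective, which is the motivation the summary attributes to this design.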
Advanced Reasoning with Reinforcement Learning
- 🎯 The goal is to move beyond next-token prediction to "System 2" intelligence, where models "think before they speak" using autonomous reasoning.
- ❌ Traditional Proximal Policy Optimization (PPO) is expensive, requiring a large "critic" model to grade the main model's output.
- 📈 Group Relative Policy Optimization (GRPO) eliminates the critic by grading outputs on a curve against a group of alternatives, significantly reducing memory.
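The "grading on a curve" step in GRPO can be sketched in a few lines. This minimal example assumes one scalar reward per sampled answer and uses the population standard deviation with a small epsilon for stability; the function name and the reward values are illustrative, and the policy-gradient update that consumes these advantages is not shown.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt: two judged correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every answer gets the same reward, no answer stands out and all advantages are zero.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

The group mean plays the role that the critic's value estimate plays in PPO, which is why no separate critic model has to be held in memory.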
GDPO: Multi-Objective Reasoning
- ⚠️ GRPO suffers from reward advantage collapse in multi-objective tasks: summing the rewards before normalizing lets different reward combinations collapse into the same advantage, hiding crucial signals for learning.
- 🔥 Group Reward Decoupled Normalization Policy Optimization (GDPO) fixes this by applying the "grading curve" to each reward individually before summing them.
- 📊 GDPO provides a high-resolution signal, enabling models to balance complex, conflicting goals and achieve significant accuracy gains (e.g., 6.3% on the AIME benchmark).
- 🚀 This approach leads to process-oriented intelligence, where models simulate thought processes by generating internal reasoning steps.
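The difference between the two normalization orders can be seen in a small numerical sketch. It assumes two hypothetical reward channels ("correctness" and "format") and equal weighting, and it omits any further normalization or weighting that a full GDPO implementation may apply after the per-reward step. The function names are illustrative.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Group-wise z-score: subtract the group mean, divide by the group std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def grpo_advantages(reward_channels):
    """Sum the rewards per rollout first, then normalize the totals."""
    totals = [sum(per_rollout) for per_rollout in zip(*reward_channels)]
    return normalize(totals)

def gdpo_advantages(reward_channels):
    """Normalize each reward channel separately, then sum the normalized values."""
    normalized = [normalize(channel) for channel in reward_channels]
    return [sum(per_rollout) for per_rollout in zip(*normalized)]

# Four rollouts for one prompt (hypothetical rewards).
correctness = [1.0, 0.0, 0.0, 0.0]  # only rollout 0 is correct
formatting = [0.0, 1.0, 1.0, 0.0]   # rollouts 1 and 2 follow the format

summed_first = grpo_advantages([correctness, formatting])
decoupled = gdpo_advantages([correctness, formatting])
print("sum then normalize:", [round(a, 3) for a in summed_first])
print("normalize then sum:", [round(a, 3) for a in decoupled])
```

In this example the first three rollouts all have a total reward of 1, so summing first gives them identical advantages even though only one of them is correct. Normalizing each channel first gives the rare correct rollout a larger advantage than the two that only satisfy the format.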
Topics Discussed (15 themes)
System 2 Thinking · Tensor Programs · Reasoning Architectures · GDPO (Group Reward Decoupled Normalization Policy Optimization) · GRPO (Group Relative Policy Optimization) · Reinforcement Learning · Maximal Update Parametrization (μP) · DeepSeek V3 · Mixture of Experts (MoE) · Multi-head Latent Attention (MLA) · AI Scaling · Neural Networks · Hyperparameter Transferability · Reward Advantage Collapse · Process-oriented Intelligence