AI Scaling: From Tensor Programs to System 2 Reasoning with GDPO & GRPO

[HPP] Greg Yang · January 29, 2026 · 18 min

The Evolution of AI Intelligence

💡 AI is transitioning from empirical heuristics (alchemy) to a rigorous mathematical science (engineering), moving beyond guessing what works.
🚀 This deep dive explores three layers: tensor program foundations, frontier architectures, and advanced reasoning models.

Tensor Programs and Scaling Laws

🧠 Tensors are viewed as multilinear maps (functions) that describe data flow, crucial for understanding scaling and hardware like TPUs.
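🔑 Maximal Update Parametrization (μP), detailed in Greg Yang's Tensor Programs series, addresses a weakness of standard parametrization: as width grows, the network drifts toward a "lazy" (kernel-like) regime in which its features barely change during training. μP rescales initialization and learning rates so that every layer keeps learning features at any width.
✅ μP enables hyperparameter transfer: settings such as the learning rate tuned on a small model stay close to optimal on a much wider one, so large models (the video cites trillion-parameter scale) can be tuned at a fraction of the cost.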
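
To make the transfer idea concrete, here is a minimal sketch of width-aware learning rates. It assumes one commonly cited Adam-style rule (input-layer learning rate held fixed, hidden and output weight learning rates shrinking in proportion to width); the function name `mup_adam_lrs` and the example widths are hypothetical and not taken from the video.

```python
def mup_adam_lrs(base_lr, base_width, width):
    """Per-layer Adam learning rates under an assumed muP-style rule.

    base_lr is the rate tuned on a small proxy model of width base_width.
    The scaling below is an illustrative assumption, not a full muP recipe
    (initialization and output multipliers are omitted).
    """
    width_mult = width / base_width
    return {
        "input": base_lr,                # input weights: unchanged with width
        "hidden": base_lr / width_mult,  # hidden weights: shrink as width grows
        "output": base_lr / width_mult,  # output weights: shrink as width grows
    }


if __name__ == "__main__":
    # Tune once at width 256, then reuse the same base_lr at width 4096.
    for width in (256, 1024, 4096):
        print(width, mup_adam_lrs(1e-3, base_width=256, width=width))
```

The point of the design is that `base_lr` is searched only once on the small model; the wider models derive their rates from it rather than repeating the search.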

Frontier Architectures: DeepSeek V3

🧩 DeepSeek V3 uses Mixture of Experts (MoE), activating only a small subset (about 37 billion) of its 671 billion parameters per token for efficiency.
💡 It employs auxiliary-loss-free load balancing for MoE: a per-expert bias term, adjusted during training, steers routing toward underused experts without adding a balancing loss that could hurt model quality.
💾 Multi-head Latent Attention (MLA) addresses the KV cache memory bottleneck by compressing keys and values into a low-dimensional latent vector, like storing CliffsNotes instead of the full text.
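
The bias-based load balancing described above can be sketched in a few lines. This is a toy illustration, assuming the bias affects only which experts are selected while the gate weights come from the unbiased scores; the names `route_top_k`, `update_bias`, and `step_size`, and all the numbers, are hypothetical.

```python
def route_top_k(scores, bias, k):
    """Select k experts by biased score; weight them by the raw scores."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e],
                    reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    # Gate weights ignore the bias, so balancing does not distort the mixture.
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step_size)
        elif n < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias


if __name__ == "__main__":
    scores = [0.9, 0.5, 0.4]  # positive affinity scores for three experts
    print(route_top_k(scores, [0.0, 0.0, 0.0], k=2))   # picks experts 0 and 1
    print(route_top_k(scores, [-0.6, 0.0, 0.0], k=2))  # picks experts 1 and 2
    print(update_bias([0.0, 0.0, 0.0], load=[4, 1, 1], step_size=0.1))
```

Because no balancing term enters the loss, the gradient signal stays focused on the language-modeling objective; the bias is adjusted outside of backpropagation.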

Advanced Reasoning with Reinforcement Learning

🎯 The goal is to move beyond next-token prediction to "System 2" intelligence, where models "think before they speak" using autonomous reasoning.
❌ Traditional Proximal Policy Optimization (PPO) is expensive: it trains a separate "critic" (value) model, often as large as the main model, to score the main model's outputs.
📈 Group Relative Policy Optimization (GRPO) eliminates the critic by grading each output on a curve against a group of alternatives sampled for the same prompt, significantly reducing memory use.
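
The "grading on a curve" step can be sketched as a group-relative advantage: each sampled answer's reward is compared with the mean and spread of its own group, so no critic is needed. This is a minimal sketch; the function name `grpo_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four answers to the same prompt; 1.0 means correct, 0.0 means wrong.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Answers above the group average receive a positive advantage and are reinforced; those below it are discouraged. If every answer in a group earns the same reward, all advantages are zero and the group contributes no learning signal.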

GDPO: Multi-Objective Reasoning

⚠️ GRPO suffers from reward advantage collapse in multi-objective tasks: summing the rewards before normalizing lets different reward combinations produce the same advantage, hiding signals the model needs to learn from.
🔥 Group Reward Decoupled Normalization Policy Optimization (GDPO) fixes this by applying the "grading curve" to each reward individually before summing them.
📊 GDPO provides a higher-resolution signal, enabling models to balance complex, conflicting goals and achieve accuracy gains (e.g., 6.3% on the AIME benchmark).
🚀 This approach leads to process-oriented intelligence, where models simulate thought processes by generating internal reasoning steps.
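
A small numeric example shows the collapse and the decoupled fix. This is a toy sketch under stated assumptions: two rewards per answer (say, correctness and format), groups of two answers, and population standard deviation; the helper names are hypothetical, and any further normalization a full GDPO implementation may apply is omitted.

```python
import statistics


def group_normalize(values, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def summed_then_normalized(reward_vectors):
    """GRPO-style: add the rewards per answer, then normalize the totals."""
    return group_normalize([sum(vec) for vec in reward_vectors])


def normalized_then_summed(reward_vectors):
    """GDPO-style: normalize each reward across the group, then add."""
    per_objective = [group_normalize(col) for col in zip(*reward_vectors)]
    return [sum(parts) for parts in zip(*per_objective)]


if __name__ == "__main__":
    group_a = [(0, 0), (1, 0)]  # best answer satisfies one objective
    group_b = [(0, 0), (1, 1)]  # best answer satisfies both objectives
    for name, group in (("A", group_a), ("B", group_b)):
        print(name,
              summed_then_normalized(group),
              normalized_then_summed(group))
```

Summing first gives both groups roughly the same advantages (about -1 and +1), so the answer that satisfies both objectives is rewarded no more than the one that satisfies a single objective. Normalizing each reward first gives group B about twice the advantage magnitude of group A, which preserves that distinction.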

Topics · 15 themes

What’s Discussed

System 2 Thinking · Tensor Programs · Reasoning Architectures · GDPO (Group Reward Decoupled Normalization Policy Optimization) · GRPO (Group Relative Policy Optimization) · Reinforcement Learning · Maximal Update Parametrization (μP) · DeepSeek V3 · Mixture of Experts (MoE) · Multi-head Latent Attention (MLA) · AI Scaling · Neural Networks · Hyperparameter Transferability · Reward Advantage Collapse · Process-oriented Intelligence
