GDPO: Decoupling Rewards for Stable Multi-Reward RL in LLMs

[HPP] Yejin Choi · January 10, 2026 · 13 min

The Challenge of Multi-Reward RL

🧠 Large Language Models (LLMs) must satisfy multiple, often conflicting objectives at once, such as accuracy, politeness, output format, and length constraints.
⚠️ Standard reinforcement learning (RL) methods struggle when models are presented with diverse and sometimes contradictory goals.

GRPO's Reward Signal Collapse

📉 The previous method, Group Relative Policy Optimization (GRPO), combines all reward scores before calculating the 'advantage value' (the teaching signal).
💡 Summing first leads to reward signal collapse: rollouts with distinct reward combinations can end up with identical advantage values, so the model receives a weak training signal.
❌ Consequences include suboptimal convergence (slow, inaccurate learning) and even early training failure, where models stop improving.
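
The collapse described above can be illustrated with a minimal sketch. It assumes a GRPO-style advantage computed by summing every reward for a rollout and then standardizing the totals within the group. The function names, the two binary rewards (accuracy, format), and the use of the population standard deviation are illustrative assumptions, not details taken from the video.

```python
from statistics import mean, pstdev

def group_normalize(values, eps=1e-8):
    """Standardize scores within one group of rollouts (zero mean, unit spread)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population std; eps guards against division by zero
    return [(v - mu) / (sigma + eps) for v in values]

def summed_reward_advantages(reward_rows):
    """Sum all rewards per rollout first, then normalize the totals across the group."""
    totals = [sum(row) for row in reward_rows]
    return group_normalize(totals)

# Hypothetical groups of two rollouts, each scored on (accuracy, format) with 0 or 1.
group_one = [(0, 0), (1, 0)]   # second rollout satisfies one reward
group_both = [(0, 0), (1, 1)]  # second rollout satisfies both rewards

adv_one = summed_reward_advantages(group_one)
adv_both = summed_reward_advantages(group_both)

print([round(a, 3) for a in adv_one])   # [-1.0, 1.0]
print([round(a, 3) for a in adv_both])  # [-1.0, 1.0]
```

In this toy case, a rollout that satisfies one reward and a rollout that satisfies both receive the same advantage, which is the kind of lost resolution the summary refers to.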

GDPO's Decoupled Normalization

✅ Group reward-Decoupled Normalization Policy Optimization (GDPO) resolves this by calculating the advantage value for each individual reward separately.
🚀 Only after each reward has been normalized on its own are the separate advantage values summed, which preserves the resolution of the reward signal and gives a more precise teaching signal.
✨ The model can then tell apart nuanced differences between good and great attempts.
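
A minimal sketch of the decoupled computation follows, assuming each reward is standardized across the group on its own and the per-reward advantages are then summed, as described above. The function names, the two binary rewards (accuracy, format), and the population standard deviation are illustrative assumptions. Any further steps the actual method may apply are not shown here.

```python
from statistics import mean, pstdev

def group_normalize(values, eps=1e-8):
    """Standardize scores within one group of rollouts (zero mean, unit spread)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population std; eps guards against division by zero
    return [(v - mu) / (sigma + eps) for v in values]

def decoupled_advantages(reward_rows):
    """Normalize each reward separately across the group, then sum per rollout."""
    columns = list(zip(*reward_rows))                       # one column per reward
    per_reward = [group_normalize(col) for col in columns]  # decoupled normalization
    return [sum(vals) for vals in zip(*per_reward)]         # combine after normalizing

# Hypothetical groups of two rollouts, each scored on (accuracy, format) with 0 or 1.
group_one = [(0, 0), (1, 0)]   # second rollout satisfies one reward
group_both = [(0, 0), (1, 1)]  # second rollout satisfies both rewards

adv_one = decoupled_advantages(group_one)
adv_both = decoupled_advantages(group_both)

print([round(a, 3) for a in adv_one])   # [-1.0, 1.0]
print([round(a, 3) for a in adv_both])  # [-2.0, 2.0]
```

Here the rollout that satisfies both rewards ends up with roughly twice the advantage of the rollout that satisfies only one, so the two outcomes stay distinguishable. A reward with no variation inside a group contributes zero advantage in this sketch.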

GDPO's Performance Advantages

📊 In tool calling tasks, GDPO significantly improved overall accuracy and achieved over 4% better format compliance compared to GRPO.
📈 For math reasoning problems with conflicting rewards (accuracy vs. length), GRPO showed training collapse, while GDPO maintained stability, continuously improving accuracy and reducing

Topics · 15 themes

What’s Discussed

Large Language Models (LLMs), Reinforcement Learning (RL), Multi-reward optimization, Group Relative Policy Optimization (GRPO), Group reward-Decoupled Normalization Policy Optimization (GDPO), Reward signal collapse, Advantage value, Decoupled normalization, Training stability, Tool calling, Math reasoning, Coding reasoning, Easy goal takeover, Conditioned reward, Accuracy
