GDPO: Decoupling Rewards for Stable Multi-Reward RL in LLMs

[HPP] Yejin Choi · January 10, 2026 · 13 min

The Challenge of Multi-Reward RL

  • 🧠 Large Language Models (LLMs) need to follow multiple, often conflicting, rules simultaneously, such as accuracy, politeness, specific formats, and length constraints.
  • ⚠️ Standard reinforcement learning (RL) methods struggle when models are presented with diverse and sometimes contradictory goals.

GRPO's Reward Signal Collapse

  • 📉 The previous method, Group Relative Policy Optimization (GRPO), combines all reward scores before calculating the 'advantage value' (the teaching signal).
  • 💡 This process leads to reward signal collapse, where distinct outcomes appear identical to the AI, resulting in a weak training signal.
  • ❌ Consequences include suboptimal convergence (slow, inaccurate learning) and even early training failure, where models stop improving.
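
A minimal sketch of the collapse described above, assuming two binary rewards (accuracy and format), groups of two rollouts, and a population standard deviation with a small `eps`. The function name `summed_then_normalized` and the toy groups are hypothetical illustrations, not code from the video.

```python
from statistics import mean, pstdev

def summed_then_normalized(rewards, eps=1e-8):
    """GRPO-style sketch: sum each rollout's rewards, then normalize within the group."""
    totals = [sum(r) for r in rewards]
    mu, sigma = mean(totals), pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]

# Each tuple is one rollout: (accuracy reward, format reward), both binary.
group_a = [(0, 0), (1, 0)]  # best rollout satisfies only one reward
group_b = [(0, 0), (1, 1)]  # best rollout satisfies both rewards

adv_a = summed_then_normalized(group_a)
adv_b = summed_then_normalized(group_b)

# Both groups yield approximately [-1, 1]: the two outcomes look the same.
print([round(a, 6) for a in adv_a], [round(b, 6) for b in adv_b])
```

In this toy case a rollout that earns one reward and a rollout that earns both receive the same advantage, which is the loss of resolution the bullets refer to.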

GDPO's Decoupled Normalization

  • ✅ Group reward-Decoupled Normalization Policy Optimization (GDPO) resolves this by calculating the advantage value for each individual reward separately.
  • 🚀 Only after individual normalization are these separate advantage values summed, preserving the resolution and providing a clear, precise teaching signal.
  • ✨ This approach gives the model a crystal clear message, allowing it to understand nuanced differences between good and great attempts.
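
The decoupled idea can be sketched as follows, assuming two binary rewards per rollout and a population standard deviation with a small `eps`. The function name `decoupled_then_summed` and the toy groups are hypothetical, and any further rescaling the actual method may apply after summation is not shown.

```python
from statistics import mean, pstdev

def decoupled_then_summed(rewards, eps=1e-8):
    """GDPO-style sketch: normalize each reward within the group, then sum."""
    columns = list(zip(*rewards))  # one tuple of values per reward type
    normalized = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        # A reward that is constant across the group contributes zero.
        normalized.append([(v - mu) / (sigma + eps) for v in col])
    # Add up the per-reward advantages for each rollout.
    return [sum(vals) for vals in zip(*normalized)]

# Each tuple is one rollout: (accuracy reward, format reward), both binary.
group_a = [(0, 0), (1, 0)]  # best rollout satisfies only one reward
group_b = [(0, 0), (1, 1)]  # best rollout satisfies both rewards

adv_a = decoupled_then_summed(group_a)  # approximately [-1, 1]
adv_b = decoupled_then_summed(group_b)  # approximately [-2, 2]
print([round(a, 6) for a in adv_a], [round(b, 6) for b in adv_b])
```

Here the rollout that satisfies both rewards gets a larger advantage than the one that satisfies a single reward, so a "good" attempt and a "great" attempt remain distinguishable.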

GDPO's Performance Advantages

  • 📊 In tool calling tasks, GDPO significantly improved overall accuracy and achieved over 4% better format compliance compared to GRPO.
  • 📈 For math reasoning problems with conflicting rewards (accuracy vs. length), GRPO showed training collapse, while GDPO maintained stability, continuously improving accuracy and reducing

What's Discussed

  • Large Language Models (LLMs)
  • Reinforcement Learning (RL)
  • Multi-reward optimization
  • Group Relative Policy Optimization (GRPO)
  • Group reward-Decoupled Normalization Policy Optimization (GDPO)
  • Reward signal collapse
  • Advantage value
  • Decoupled normalization
  • Training stability
  • Tool calling
  • Math reasoning
  • Coding reasoning
  • Easy goal takeover
  • Conditioned reward
  • Accuracy