Large Reasoning Models: Thinking Capabilities and Performance Limitations
[HPP] Samy Bengio · July 21, 2025 · 15 min
Investigating LRM Thinking Capabilities
- The study asks whether Large Reasoning Models (LRMs) genuinely think, or whether their performance is due to pre-training knowledge or data contamination.
- Unlike standard Large Language Models (LLMs), LRMs generate detailed reasoning traces before providing a final answer.
Puzzle-Based Evaluation Methodology
- Researchers used puzzle-based thinking problems (Tower of Hanoi, Checkers Jumping, River Crossing, and Blocks World) to avoid pre-training data contamination.
- These puzzles allow precise control of compositional complexity by varying parameters such as the number of disks or tokens.
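To make the complexity parameter concrete, here is a minimal sketch of how a Tower of Hanoi answer could be checked move by move. The function names `is_valid_solution` and `min_moves` and the peg/move encoding are illustrative assumptions, not details taken from the study.

```python
def is_valid_solution(num_disks, moves):
    """Return True if `moves` legally transfers all disks from peg 0 to peg 2.

    Each move is a (source_peg, target_peg) pair. Disks are integers, and a
    larger number means a larger disk. This encoding is an assumption made
    for illustration only.
    """
    # Peg 0 starts with all disks, largest at the bottom of the stack.
    pegs = [list(range(num_disks, 0, -1)), [], []]
    for source, target in moves:
        if source not in (0, 1, 2) or target not in (0, 1, 2):
            return False  # unknown peg
        if not pegs[source]:
            return False  # nothing to move from an empty peg
        disk = pegs[source][-1]
        if pegs[target] and pegs[target][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[target].append(pegs[source].pop())
    # Solved only if every disk ends up on peg 2 in the original order.
    return pegs[2] == list(range(num_disks, 0, -1))


def min_moves(num_disks):
    """Length of the shortest solution, which grows as 2^n - 1."""
    return 2 ** num_disks - 1
```

Because the shortest solution doubles in length with each added disk, the number of disks acts as a single knob for difficulty while the rules stay the same.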
Three Performance Regimes
- Low-complexity tasks: Standard LLMs surprisingly outperform LRMs, which use significantly more computational effort (tokens) due to overthinking.
- Medium-complexity tasks: LRMs demonstrate a clear advantage, with their additional thinking processes leading to better accuracy.
- High-complexity tasks: Both LLMs and LRMs experience a complete accuracy collapse, failing to solve problems effectively.
Reasoning Effort and Overthinking
- For low-complexity problems, LRMs exhibit overthinking, using excessive tokens and often exploring incorrect alternatives even after finding a correct path.
- Counter-intuitively, once problem complexity passes a certain point, LRMs stop spending more tokens: reasoning effort declines even though the token budget is adequate.
- In failed high-complexity cases, LRMs tend to fixate on early wrong answers, wasting their token budget without reaching correct solutions.
Algorithmic Understanding Limitations
- A critical finding is that providing the explicit algorithmic solution as input does not improve LRM performance for high-complexity problems.
- This suggests LRMs struggle with generalizable algorithmic thinking and cannot consistently execute algorithms for specific problem instances.
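For a sense of how compact such an explicit procedure can be, the sketch below shows the standard textbook recursion for Tower of Hanoi. It is offered only as an illustration: the name `hanoi_moves` and the peg numbering are assumptions, and this is not claimed to be the exact algorithm text given to the models in the study.

```python
def hanoi_moves(num_disks, source=0, target=2, spare=1):
    """Yield (source_peg, target_peg) moves that solve Tower of Hanoi.

    Standard recursion: move the top n-1 disks out of the way, move the
    largest disk to the target, then move the n-1 disks on top of it.
    Peg numbering (0, 1, 2) is an illustrative assumption.
    """
    if num_disks == 0:
        return
    yield from hanoi_moves(num_disks - 1, source, spare, target)
    yield (source, target)
    yield from hanoi_moves(num_disks - 1, spare, target, source)


if __name__ == "__main__":
    # Print the three-move solution for two disks.
    print(list(hanoi_moves(2)))
```

The recursion itself is three lines, but the move sequence it produces has length 2^n - 1, so following it for many disks means carrying out a long sequence without a single error.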
Topics
Large Reasoning Models (LRMs), Large Language Models (LLMs), Puzzle-based problems, Problem complexity, Reasoning traces, Algorithmic thinking, Performance regimes, Data contamination, Token budget, Overthinking, Accuracy collapse, Tower of Hanoi, Checkers Jumping, River Crossing