Large Reasoning Models: Thinking Capabilities and Performance Limitations

[HPP] Samy Bengio · July 21, 2025 · 15 min

Investigating LRM Thinking Capabilities

💡 The study asks whether Large Reasoning Models (LRMs) genuinely reason, or whether their performance stems from pre-training knowledge or data contamination.
🧠 Unlike standard Large Language Models (LLMs), LRMs generate detailed reasoning traces before providing a final answer.

Puzzle-Based Evaluation Methodology

🎯 Researchers used puzzle-based reasoning problems (Tower of Hanoi, Checkers Jumping, River Crossing, and Blocks World) to avoid pre-training data contamination.
🔬 These puzzles allow for precise manipulation of compositional complexity by varying parameters like the number of disks or tokens.
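
To make the idea of a tunable complexity parameter concrete, here is a minimal sketch for Tower of Hanoi, where the only knob is the number of disks. It is not the study's evaluation harness: the names `min_moves` and `is_valid_solution`, the 0-2 peg numbering, and the `(from_peg, to_peg)` move format are all assumptions made for illustration.

```python
def min_moves(n_disks: int) -> int:
    """Length of the shortest Tower of Hanoi solution: 2^n - 1 moves."""
    return 2 ** n_disks - 1


def is_valid_solution(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Simulate (from_peg, to_peg) moves and check every disk ends on peg 2.

    Hypothetical checker: pegs are numbered 0..2, all disks start on peg 0.
    """
    # Disk sizes n..1, largest at the bottom (start of the list).
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or not pegs[src]:
            return False  # unknown peg, or moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))


if __name__ == "__main__":
    # Each extra disk roughly doubles the required solution length.
    for n in range(1, 11):
        print(n, min_moves(n))
```

Because a proposed move sequence can be replayed step by step, a checker of this kind can score a whole solution and also locate the first illegal move, which is what makes such puzzles convenient for controlled evaluation.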

Three Performance Regimes

📊 Low-complexity tasks: Standard LLMs surprisingly outperform LRMs, which use significantly more computational effort (tokens) due to overthinking.
📈 Medium-complexity tasks: LRMs demonstrate a clear advantage, with their additional thinking processes leading to better accuracy.
⚠️ High-complexity tasks: Both LLMs and LRMs experience a complete accuracy collapse, failing to solve problems effectively.

Reasoning Effort and Overthinking

📉 For low-complexity problems, LRMs exhibit overthinking, using excessive tokens and often exploring incorrect alternatives even after finding a correct path.
🧠 Counter-intuitively, once problem complexity passes a certain point, LRMs stop spending more tokens: reasoning effort declines even though the token budget is still adequate.
🚫 In failed high-complexity cases, LRMs tend to fixate on early wrong answers, wasting their token budget without reaching correct solutions.

Algorithmic Understanding Limitations

❌ A critical finding is that providing the explicit algorithmic solution as input does not improve LRM performance for high-complexity problems.
🧩 This suggests LRMs struggle with generalizable algorithmic thinking and cannot consistently execute algorithms for specific problem instances.
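
For a sense of what "the explicit algorithmic solution" can look like, the textbook recursive procedure for Tower of Hanoi is sketched below. This summary does not give the exact algorithm or prompt wording supplied to the models, so treat the function `hanoi`, its peg numbering, and its output format as assumptions for illustration only.

```python
def hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the shortest move list that transfers n disks from src to dst.

    Textbook recursion (illustrative, not the study's prompt):
      1. move the top n-1 disks from src to aux,
      2. move the largest disk from src to dst,
      3. move the n-1 disks from aux to dst.
    """
    if n == 0:
        return []
    return (
        hanoi(n - 1, src, dst, aux)    # step 1: dst serves as the spare peg
        + [(src, dst)]                 # step 2
        + hanoi(n - 1, aux, src, dst)  # step 3: src serves as the spare peg
    )


if __name__ == "__main__":
    print(hanoi(2))
    print(len(hanoi(10)))
```

The procedure is only a few lines long, yet the move list it produces doubles with each added disk, so carrying it out faithfully means a long run of steps with no errors. That is consistent with the finding above that having the algorithm in hand does not by itself prevent failure on larger instances.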

What’s Discussed

Large Reasoning Models (LRMs) · Large Language Models (LLMs) · Puzzle-based problems · Problem complexity · Reasoning traces · Algorithmic thinking · Performance regimes · Data contamination · Token budget · Overthinking · Accuracy collapse · Tower of Hanoi · Checkers Jumping · River Crossing
