GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

[HPP] Jerry Tworek · October 21, 2025 · 38 min



Introducing the GDPval Benchmark

💡 The GDPval benchmark evaluates AI model performance on real-world economically valuable tasks, moving beyond traditional academic tests that often feel detached from practical utility.
🎯 It covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP.
🔑 Tasks are built from the actual work of industry professionals averaging 14 years of experience. They are designed to be complex, long-horizon, and multimodal, and they require working with a range of professional file formats.

Methodology and Quality Control

🔬 The benchmark emphasizes realism and representative breadth, contrasting with domain-specific or abstract reasoning evaluations by focusing on tasks that produce real deliverables.
✅ Tasks carry real economic value: the gold subset averages $398.46 in estimated dollar value per task and requires significant human effort (7 hours on average, up to 100 hours).
🛠️ Quality control is rigorous, involving an average of five human reviews per task and a model-in-the-loop screening process to ensure adherence to high industry standards.
📊 Model outputs are graded using blinded expert pairwise comparisons, considering subjective factors like document structure, professional style, and aesthetics, in addition to objective correctness.
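
To make the grading metric concrete, here is a minimal sketch of how a "wins plus ties" rate could be computed from blinded pairwise verdicts. The verdict labels ("model", "human", "tie") and the function name are assumptions chosen for illustration; they are not taken from the benchmark's released grader.

```python
from collections import Counter

# Hypothetical verdict labels: which deliverable the blinded expert preferred.
VALID_VERDICTS = {"model", "human", "tie"}


def win_or_tie_rate(verdicts):
    """Return the fraction of comparisons the model won or tied."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    # Ties count toward the model, matching the "wins plus ties" convention.
    return (counts["model"] + counts["tie"]) / len(verdicts)


if __name__ == "__main__":
    # Made-up verdicts, not benchmark data.
    sample = ["model", "human", "tie", "human", "model"]
    print(f"win-or-tie rate: {win_or_tie_rate(sample):.1%}")
```

Counting ties toward the model means a rate of 50% corresponds to rough parity with the human expert, which is how the win rates reported here should be read.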

Key Performance Results

📈 Claude Opus 4.1 was the top-performing model, with a 47.6% win rate (wins plus ties) against human experts, just short of the 50% mark that would indicate parity.
🚀 The study observed a roughly linear trajectory of capability improvement for OpenAI frontier models over time on the GDPval gold subset, suggesting predictable progress.
🧠 Models exhibit divergent strengths: Claude Opus 4.1 excels in subjective factors like aesthetics and multimodal file types, while GPT-5 High demonstrates superior objective accuracy and complex instruction following.
⚠️ Win rate falls as tasks get longer: models perform best on tasks that take experts under 2 hours (56% win rate) and struggle significantly on longer, more complex ones.

Efficiency Gains and Failure Analysis

💰 Applying a "try N times, then fix it" workflow, GPT-5 High showed potential for a 1.39x speed improvement and a 1.63x cost improvement compared to unaided human experts, even with significant human review time.
🧩 The primary failure mode for most models is instruction following (35-40% of losses), while GPT-5 High's main issue was formatting errors, suggesting challenges in rigorous execution rather than fundamental reasoning.
🛡️ Critically, only 2.7% of GPT-5 High's failures were rated catastrophic, implying that most human review time is spent correcting minor to moderate issues, making the human-supervised workflow viable.
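
The "try N times, then fix it" workflow can be reasoned about with a simple expected-time model. The sketch below is a simplified assumption, not necessarily the formula used in the study: each attempt costs the model's run time plus an expert's review time, an attempt is accepted with probability equal to the win rate, and after the allowed attempts fail the expert does the task unaided. All parameter names and the example numbers are hypothetical.

```python
def expected_time_try_n_then_fix(win_rate, model_time, review_time,
                                 human_time, max_attempts):
    """Expected hours per task under a simplified try-N-then-fix model."""
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must be between 0 and 1")
    if max_attempts < 0:
        raise ValueError("max_attempts must be non-negative")
    fail = 1.0 - win_rate
    expected = 0.0
    reach = 1.0  # probability that the next attempt is needed at all
    for _ in range(max_attempts):
        # Every attempt that is reached costs one model run and one review.
        expected += reach * (model_time + review_time)
        reach *= fail
    # If every attempt was rejected, the expert completes the task unaided.
    expected += reach * human_time
    return expected


def speedup(win_rate, model_time, review_time, human_time, max_attempts):
    """Ratio of unaided expert time to expected time with the workflow."""
    assisted = expected_time_try_n_then_fix(
        win_rate, model_time, review_time, human_time, max_attempts)
    return human_time / assisted


if __name__ == "__main__":
    # Illustrative numbers only, not figures from the study.
    for n in (1, 2, 3):
        print(n, round(speedup(0.5, 0.1, 0.9, 8.0, n), 2))
```

The same structure applies to cost by substituting dollar amounts for hours. It also shows why a low rate of catastrophic failures matters: the fallback term stays bounded by the unaided expert time only if a rejected attempt does no further damage.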

Levers for Improvement and Limitations

✨ Reasoning effort and scaffolding/prompt tuning significantly improve model performance; simple self-check instructions dramatically increased GPT-5 High's self-inspection rate from 15% to 97%.
💬 The study highlights that models still struggle with under-contextualized or ambiguous tasks, demonstrating a need for better handling of real-world messy requests and interactive clarification.
🚧 Current limitations include the initial dataset size and a focus on self-contained digital knowledge work, excluding tasks requiring physical interaction, proprietary knowledge, or complex synchronous communication.
🌐 By open-sourcing the gold subset and an experimental automated grader, GDPval provides a clear, attributable leading indicator of AI's economic relevance, fostering future research and benchmarking efforts.


Topics · 15 themes

What’s Discussed

GDPval benchmark · AI model performance · Economically valuable tasks · Real-world applications · US Bureau of Labor Statistics · US GDP · Frontier models · Multimodal tasks · Blinded expert comparison · Automated grading · Instruction following · Prompt tuning · Scaffolding · Economic impact · Productivity shifts
