BrowseComp: A Benchmark for Web Browsing Agents

[HPP] Hyung Won Chung · November 1, 2025 · 8 min

Understanding BrowseComp

💡 BrowseComp is a benchmark designed to test the ability of browsing agents to navigate the web and find information.
🎯 It evaluates an agent's capacity for reasoning, factuality, persistent internet navigation, and creative problem-solving for complex questions.
🧩 The benchmark comprises 1,266 challenging questions that require finding hard-to-locate, entangled information, with answers that are short and easily verifiable.

Benchmark Design & Difficulty

🔍 Questions are specifically crafted to be difficult: a human should not be able to solve them within 10 minutes, and advanced models such as GPT-4o and an early version of Deep Research should fail on them as well.
🚫 Answers are intentionally absent from the top results of five simple Google searches, so finding them demands deeper exploration.
📚 The dataset spans diverse domains, including TV shows, movies, science, technology, art, history, sports, music, video games, geography, and politics.

Human Performance on BrowseComp

⚠️ Human trainers found the benchmark extremely challenging, giving up on 70.8% of the questions after two hours of effort.
⏳ Only about 29% of questions were solved by human trainers, and many of these required more than two hours to complete.
✅ For the questions that were solved, there was a high agreement of 86% between trainer and reference answers, indicating consistency when solutions were found.

LLM Performance & Calibration

📊 GPT-4o achieved 0.6% accuracy, while OpenAI Deep Research performed significantly better with 51.5% accuracy on a single sample.
📈 Deep Research exhibited a high calibration error, meaning the model was often overconfident in its incorrect answers.
🧠 Despite the high calibration error, the model's confidence is still informative: when several answers are sampled for a question, the one it assigns the highest probability is frequently correct.
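The summary does not say how calibration error was computed. As a rough illustration, the sketch below uses a common binned formulation (expected calibration error): predictions are grouped by stated confidence, and each group's average confidence is compared with its observed accuracy. The function name, the bin count, and the sample numbers are assumptions made for this example, not details from the talk.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per answer (hypothetical input format).
    correct:     booleans, True where the answer matched the reference.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Illustrative data only: four answers all stated at 90% confidence, one correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, False, False, False])
```

An overconfident model like the one in the toy data has a large gap (stated 0.9 versus observed 0.25), which is the pattern the bullet above describes.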

Improving Accuracy with Sampling

🚀 By sampling multiple answers per question (e.g., up to 64 samples) and keeping the one to which the model assigns the highest probability (best-of-N), Deep Research's accuracy can reach 77-78%.
💡 This method suggests that while the model may struggle with expressing calibrated certainty, it often assigns higher probabilities to correct answers.
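Best-of-N selection as described above can be sketched in a few lines. This is a minimal illustration that assumes each sample arrives as an `(answer, confidence)` pair; the function name `best_of_n` and the sample data are hypothetical and are not taken from the benchmark's evaluation code.

```python
def best_of_n(samples):
    """Return the answer whose self-reported confidence is highest.

    samples: list of (answer, confidence) pairs, one per sampled attempt.
    Ties go to the earliest sample, which is how max() behaves.
    """
    if not samples:
        raise ValueError("need at least one sample")
    answer, _ = max(samples, key=lambda pair: pair[1])
    return answer


# Illustrative data only: three attempts at the same question.
attempts = [("Paris", 0.40), ("Lyon", 0.85), ("Paris", 0.55)]
chosen = best_of_n(attempts)
```

Note that this rule relies only on the ranking of confidences across samples, not on their absolute values, which is why it can help even when the model's stated confidence is poorly calibrated.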

What’s Discussed

Browsing agents, BrowseComp benchmark, Web browsing, Large language models, Internet navigation, Factuality, Problem-solving, GPT-4, OpenAI Deep Research, Calibration error, Accuracy, Multiple sampling, Probability scores, Seed entities, Diverse domains