Scale Can’t Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

[HPP] Percy Liang · December 2, 2025 · 58 min


The Challenge of Vision-Language Reasoning

💡 Vision-Language Models (VLMs), despite high performance on various benchmarks, still struggle with basic reasoning capabilities.
🎯 For example, models such as LLaVA fail to identify simple spatial relationships, such as a mug being under a table.
🔑 This persistent issue suggests a fundamental problem beyond just model architecture or pre-training tasks.

Understanding Reporting Bias

🧠 The core hypothesis is that VLM limitations stem from reporting bias in their training data, where human captioners omit tacit information.
📚 Drawing from linguistics and pragmatics (e.g., Gricean maxims like quantity, relevance, and manner), people naturally avoid stating obvious or redundant information.
🚫 This leads to a significant underrepresentation of spatial, temporal, counting, and negation language in web-scale image-text corpora.

Impact on Model Performance

📊 Analysis of datasets such as LAION confirms that reasoning-related language occurs very rarely, far less often than common descriptive terms.
📉 Contrastive models (e.g., CLIP) perform poorly on these reasoning tasks, often ignoring negations entirely, while generative models (e.g., LLaVA, Molmo) do better but still fall short of human accuracy.
📈 Unlike recognition tasks, scaling data or model parameters does not significantly improve reasoning capabilities; the human behaviors causing reporting bias persist regardless of scale.
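
The kind of corpus analysis described above can be approximated with a simple keyword count. The following is a minimal sketch, not the method used in the talk: the keyword lists, the function names, and the four sample captions are all assumptions made for illustration, and a real study would need a far more careful vocabulary and a real caption corpus.

```python
import re
from collections import Counter

# Hypothetical keyword lists (an assumption); the talk summary does not
# specify which terms were counted for each reasoning category.
REASONING_TERMS = {
    "spatial": ["under", "above", "below", "behind", "left of", "right of", "next to"],
    "temporal": ["before", "after", "while", "then"],
    "counting": ["two", "three", "four", "five", "several"],
    "negation": ["no", "not", "without", "none"],
}


def compile_patterns(terms_by_category):
    """Build one case-insensitive, whole-word regex per category."""
    patterns = {}
    for category, terms in terms_by_category.items():
        alternation = "|".join(re.escape(term) for term in terms)
        patterns[category] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def category_rates(captions, terms_by_category=REASONING_TERMS):
    """Return the fraction of captions that mention each category at least once."""
    patterns = compile_patterns(terms_by_category)
    hits = Counter()
    for caption in captions:
        for category, pattern in patterns.items():
            if pattern.search(caption):
                hits[category] += 1
    total = len(captions)
    return {c: (hits[c] / total if total else 0.0) for c in terms_by_category}


# Made-up captions, written in the terse style of web alt-text.
captions = [
    "A white mug on a wooden table",
    "Golden retriever puppy playing in the park",
    "Two cats sleeping next to a laptop",
    "Street with no cars at night",
]
rates = category_rates(captions)
print(rates)
```

Whole-word matching matters here: without the `\b` boundaries, "no" would match inside unrelated words and inflate the negation rate.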

The Solution: Intentional Data Curation

✅ A controlled study demonstrates that clear and specific annotator instructions can successfully elicit the missing reasoning-related information.
🛠️ By explicitly asking annotators to describe counts, positions, temporal events, and negations, the presence of these concepts in captions increases dramatically.
🚀 The amount of data collected through such intentional curation is sufficient for models to learn these underlying reasoning tasks, highlighting the critical role of data quality over sheer scale for VLM development.
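
To make the idea of explicit annotator instructions concrete, here is a small sketch that assembles an instruction sheet from per-category prompts. The prompt wording, the `CATEGORY_PROMPTS` table, and the `build_instructions` helper are hypothetical; the summary does not reproduce the instructions actually given to annotators in the study.

```python
# Hypothetical prompts, one per reasoning category named in the summary.
# The wording is an assumption, not the study's actual instructions.
CATEGORY_PROMPTS = {
    "counting": "State how many of each main object appear.",
    "spatial": "Describe where objects are relative to each other (e.g. under, behind, left of).",
    "temporal": "Describe what appears to have just happened or is about to happen.",
    "negation": "Mention something notably absent that a viewer might expect to see.",
}


def build_instructions(categories):
    """Return a numbered instruction sheet covering the requested categories."""
    unknown = [c for c in categories if c not in CATEGORY_PROMPTS]
    if unknown:
        raise ValueError(f"Unknown categories: {unknown}")
    lines = ["Write a caption for the image. In your caption:"]
    for i, category in enumerate(categories, start=1):
        lines.append(f"{i}. {CATEGORY_PROMPTS[category]}")
    return "\n".join(lines)


instructions = build_instructions(["counting", "spatial", "negation"])
print(instructions)
```

Keeping one prompt per category makes it straightforward to run a controlled comparison, for example collecting captions with and without a given prompt and measuring how often that category's language appears.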


What’s Discussed

Vision-Language Models (VLMs), Reporting Bias, Training Data, Reasoning Capabilities, Linguistics, Pragmatics, Gricean Maxims, Spatial Reasoning, Temporal Reasoning, Counting, Negation, Contrastive Models, Generative Models, Annotator Instructions, Data Curation
