Scale Can’t Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

[HPP] Percy Liang · December 2, 2025 · 58 min

The Challenge of Vision-Language Reasoning

  • 💡 Vision-Language Models (VLMs) score well on many benchmarks yet still struggle with basic reasoning.
  • 🎯 For example, models like LLaVA fail to correctly identify simple spatial relationships, such as a mug being under a table.
  • 🔑 This persistent issue suggests a fundamental problem beyond just model architecture or pre-training tasks.

Understanding Reporting Bias

  • 🧠 The core hypothesis is that VLM limitations stem from reporting bias in their training data, where human captioners omit tacit information.
  • 📚 Drawing from linguistics and pragmatics (e.g., Gricean maxims like quantity, relevance, and manner), people naturally avoid stating obvious or redundant information.
  • 🚫 This leads to a significant underrepresentation of spatial, temporal, counting, and negation language in web-scale image-text corpora.

Impact on Model Performance

  • 📊 Analysis of datasets like LAION confirms that reasoning-related language occurs very rarely, far less often than common descriptive terms.
  • 📉 Contrastive models (e.g., CLIP) perform poorly on these reasoning tasks, often ignoring negations entirely, while generative models (e.g., LLaVA, Molmo) show improvement but still fall short of human accuracy.
  • 📈 Unlike recognition tasks, scaling data or model parameters does not significantly improve reasoning capabilities; the human behaviors causing reporting bias persist regardless of scale.
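
The corpus analysis described above can be approximated with a simple keyword count. The following is a minimal sketch, not the method used in the talk: the term lists in `REASONING_TERMS` and the sample captions are hypothetical and chosen only for illustration, and a real analysis would need far more careful lexicons.

```python
import re
from collections import Counter

# Hypothetical keyword lists (assumption): the talk's actual lexicons are not given.
REASONING_TERMS = {
    "spatial": ["under", "above", "behind", "left of", "right of"],
    "temporal": ["before", "after", "while"],
    "counting": ["two", "three", "four", "five"],
    "negation": ["no", "not", "without"],
}


def category_rates(captions):
    """Return the fraction of captions containing at least one term per category."""
    hits = Counter()
    for caption in captions:
        text = caption.lower()
        for category, terms in REASONING_TERMS.items():
            # Word boundaries keep "no" from matching inside words like "snow".
            if any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms):
                hits[category] += 1
    total = len(captions)
    return {c: (hits[c] / total if total else 0.0) for c in REASONING_TERMS}


# Made-up captions in the terse style typical of web alt-text.
captions = [
    "A red mug",
    "Dog playing in the park",
    "A mug under a table",
    "Two cats on a sofa",
]
rates = category_rates(captions)
print(rates)
```

Run over a large caption corpus, a count like this gives a rough per-category rate that can be compared against the frequency of ordinary descriptive words.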

The Solution: Intentional Data Curation

  • ✅ A controlled study demonstrates that clear and specific annotator instructions can successfully elicit the missing reasoning-related information.
  • 🛠️ By explicitly asking annotators to describe counts, positions, temporal events, and negations, the presence of these concepts in captions increases dramatically.
  • 🚀 The amount of data collected through such intentional curation is sufficient for models to learn these underlying reasoning tasks, highlighting the critical role of data quality over sheer scale for VLM development.
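
One way to picture the explicit instructions described above is as a small template that adds one request per reasoning category. This is a hedged sketch only: the wording in `CATEGORY_PROMPTS` and the `build_instructions` helper are hypothetical and are not the instructions used in the study.

```python
# Hypothetical prompt wording (assumption), one entry per category named in the summary.
CATEGORY_PROMPTS = {
    "counting": "State how many of each object are visible.",
    "spatial": "Describe where objects are relative to each other.",
    "temporal": "Describe what appears to have happened before or what may happen next.",
    "negation": "Mention notable things that are absent from the scene.",
}


def build_instructions(categories):
    """Assemble numbered annotator instructions for the requested categories."""
    lines = ["Write a caption for the image. In addition to the main content:"]
    for index, category in enumerate(categories, start=1):
        lines.append(f"{index}. {CATEGORY_PROMPTS[category]}")
    return "\n".join(lines)


instructions = build_instructions(["counting", "spatial", "temporal", "negation"])
print(instructions)
```

Keeping each category as a separate numbered request makes it easy to vary which categories are asked for and to check afterward whether the matching language appears in the collected captions.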

What’s Discussed

Vision-Language Models (VLMs) · Reporting Bias · Training Data · Reasoning Capabilities · Linguistics · Pragmatics · Gricean Maxims · Spatial Reasoning · Temporal Reasoning · Counting · Negation · Contrastive Models · Generative Models · Annotator Instructions · Data Curation