
Understanding Context Rot in Long Context Models with Kelly Hong

Jason Liu · November 18, 2025 · 51 min · 217 views

The Problem of Context Rot

  • 💡 Context rot is the phenomenon where long-context models perform non-uniformly across input lengths: reliability can drop as more context is added.
  • ⚠️ This is not a new issue, but rather a quantification of qualitative observations users have made with chatbots and coding assistants, where output quality degrades over time.
  • 🎯 Frontier models are often marketed with massive context windows (e.g., 1 million tokens), implying consistent performance across the whole window; this research challenges that assumption.

Limitations of Benchmarks and Tasks

  • 🔬 The "Needle in a Haystack" benchmark, commonly used to validate long context performance, is often too simplistic.
  • 🔑 It relies heavily on lexical matching, where exact keywords can be found, rather than deeper semantic understanding.
  • 🧠 Tasks requiring semantic matching or inference show a significant performance drop as context length increases, indicating models struggle with nuanced understanding.
  • 📊 Real-world scenarios, like analyzing financial reports, often involve queries that require inferring connections (e.g., "overseas expansion") rather than simple keyword matches.
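
The lexical-versus-semantic distinction above can be made concrete with a rough word-overlap score. The following is a minimal sketch, not the method used in the research: the stopword list, the function names, and the example sentences are all illustrative assumptions. A needle that shares keywords with the question scores high, while a needle that only implies the answer scores near zero, which is the harder case for a model.

```python
import re

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "and", "about", "did", "in", "of", "on",
             "the", "to", "we", "what", "will"}


def content_words(text: str) -> set[str]:
    """Lowercase the text and return its non-stopword tokens."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def lexical_overlap(question: str, needle: str) -> float:
    """Jaccard overlap between the content words of question and needle."""
    q, n = content_words(question), content_words(needle)
    if not q or not n:
        return 0.0
    return len(q & n) / len(q | n)


# Hypothetical example sentences, invented for illustration.
question = "What did the company say about overseas expansion?"
lexical_needle = "The company plans overseas expansion next year."
semantic_needle = "We will open new offices in Germany and Japan."

lexical_score = lexical_overlap(question, lexical_needle)
semantic_score = lexical_overlap(question, semantic_needle)
```

Here `lexical_score` is well above `semantic_score`, even though both needles answer the question. A benchmark built only from high-overlap needles would therefore say little about queries that require inference.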

Impact of Distractors and Structure

  • ⚠️ Adding distractors (information that is similar to the correct answer but not the same) further degrades model performance, especially as context length grows.
  • 📉 Models tend to either abstain from answering or confidently provide incorrect answers when faced with distractors.
  • 🧩 Counterintuitively, models performed better on randomly shuffled context than on coherent essays, suggesting that models do not necessarily process structure the way human readers do.
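
To test these effects on your own data, a haystack builder along the following lines may help. This is a sketch under stated assumptions: `build_haystack` and its parameters are hypothetical names, and the actual experimental setup discussed in the video may differ. It places a needle at a chosen relative depth, optionally scatters distractors, and optionally shuffles the filler sentences to remove coherent structure.

```python
import random
from typing import Sequence


def build_haystack(
    filler: Sequence[str],
    needle: str,
    distractors: Sequence[str] = (),
    depth: float = 0.5,
    shuffle: bool = False,
    seed: int = 0,
) -> str:
    """Assemble a test context from filler sentences, a needle, and distractors.

    depth is the relative needle position: 0.0 is the start, 1.0 is the end.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    sentences = list(filler)
    if shuffle:
        # Destroy the logical flow of the filler text.
        rng.shuffle(sentences)
    for distractor in distractors:
        # Scatter each distractor at a random position.
        sentences.insert(rng.randint(0, len(sentences)), distractor)
    # Insert the needle last so that depth is measured against the final text.
    sentences.insert(round(depth * len(sentences)), needle)
    return " ".join(sentences)
```

Varying `depth`, the number of distractors, and `shuffle` independently while holding the needle fixed makes it possible to attribute a drop in accuracy to one factor at a time.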

Challenges with Long Context and Compaction

  • 💬 In conversational memory tasks, models perform significantly worse when given the full context than when given only the relevant information.
  • 🔠 Even on simple tasks like text replication, performance degrades with long inputs, producing unexpected behaviors such as refusals or random output.
  • ⚙️ Current context compaction methods, like summarization in coding assistants, are often naive and can lead to loss of critical information.
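
One way to act on the conversational-memory finding is to pass the model only the turns relevant to the current question. The sketch below uses plain word overlap as a stand-in for relevance; that choice, along with the name `select_relevant_turns`, is an assumption made for illustration. A real system would more likely use embeddings or a vector database, and word overlap will miss exactly the semantic matches discussed earlier.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_relevant_turns(turns: list[str], question: str, k: int = 3) -> list[str]:
    """Keep the k turns sharing the most words with the question.

    The selected turns are returned in their original conversation order.
    """
    q = _words(question)
    ranked = sorted(
        enumerate(turns),
        key=lambda item: len(q & _words(item[1])),
        reverse=True,
    )[:k]
    keep = sorted(index for index, _ in ranked)
    return [turns[i] for i in keep]
```

Unlike summarization, this kind of selection keeps the retained turns verbatim, so any information loss comes from what is dropped rather than from paraphrasing.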

Context Engineering and Evaluation

  • 🛠️ Context engineering is critical, involving how context is managed and presented to the model, and is highly use-case dependent.
  • 🧩 Strategies like breaking tasks into sub-agents or using vector databases for semantic retrieval can help condense context.
  • 📈 Evaluation on specific data and use cases is more important than adopting generic best practices, as model performance varies significantly.
  • 🔍 Debugging each step of an agent's process, rather than just looking at the final output, is crucial for understanding and improving performance.
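
Evaluating on your own data can start with something as simple as bucketing accuracy by input length. The following is a minimal sketch with several assumptions: `model_fn` is a hypothetical callable wrapping whatever model you use, whitespace splitting is only a rough proxy for token count, and substring matching is a crude correctness check.

```python
from collections import defaultdict
from typing import Callable, Iterable


def accuracy_by_length(
    model_fn: Callable[[str], str],
    cases: Iterable[tuple[str, str]],
    bucket_size: int = 1000,
) -> dict[int, float]:
    """Return accuracy per input-length bucket.

    cases holds (prompt, expected_answer) pairs. Buckets are keyed by their
    lower bound in whitespace-separated words.
    """
    hits: dict[int, int] = defaultdict(int)
    totals: dict[int, int] = defaultdict(int)
    for prompt, expected in cases:
        n_words = len(prompt.split())  # rough proxy for tokens
        bucket = (n_words // bucket_size) * bucket_size
        totals[bucket] += 1
        # Crude check: the expected answer appears in the model output.
        if expected.lower() in model_fn(prompt).lower():
            hits[bucket] += 1
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

If accuracy falls from one bucket to the next on the same kind of question, that is a sign of context rot for that model and task, and a prompt to inspect intermediate steps rather than only the final answers.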

Topics Discussed

Context Rot, Long Context Models, LLM Performance, Needle in a Haystack Benchmark, Lexical Matching, Semantic Matching, Distractors, Context Engineering, Vector Databases, Retrieval Augmented Generation (RAG), Model Evaluation, Agentic Workflows, Context Compaction