Titans: Learning to Memorize at Test Time (Paper Analysis)

Yannic Kilcher · December 14, 2025 · 32 min

Addressing Context Window Limitations

💡 Traditional Transformers are limited to a fixed context window because the cost of attention grows quadratically with sequence length.
🧠 Recurrent models (RNNs, LSTMs) compress data into a fixed-size hidden state, but have been largely surpassed by attention-based models.
🎯 Early attempts like Transformer-XL used a compressed hidden state from previous chunks to extend context, blending RNN-like state passing with Transformer attention.
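
The chunk-plus-carried-state idea in the last point can be sketched in a few lines of NumPy. This is a simplified illustration, not Transformer-XL itself: there are no learned projections, no relative position encodings, and only a single layer, so the "hidden states" are just the inputs. The function name and the `chunk_len` and `mem_len` parameters are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def chunked_attention_with_cache(X, chunk_len, mem_len):
    """Causal self-attention over chunks, with a cache of earlier states.

    X is (T, d). Each chunk attends to itself (causally) and to the last
    `mem_len` states carried over from earlier chunks.
    """
    T, d = X.shape
    cache = np.zeros((0, d))  # carried state, empty at the start
    outputs = []
    for start in range(0, T, chunk_len):
        chunk = X[start:start + chunk_len]
        context = np.concatenate([cache, chunk], axis=0)
        m, c = cache.shape[0], chunk.shape[0]
        scores = chunk @ context.T / np.sqrt(d)  # (c, m + c)
        # Position i of the chunk sees the whole cache and chunk positions <= i.
        future = np.arange(m + c)[None, :] > (m + np.arange(c))[:, None]
        scores[future] = -np.inf
        outputs.append(softmax(scores) @ context)
        # Keep only the most recent `mem_len` states for the next chunk.
        cache = context[max(0, context.shape[0] - mem_len):]
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
out = chunked_attention_with_cache(X, chunk_len=4, mem_len=4)
```

Each score matrix has at most `chunk_len` rows and `mem_len + chunk_len` columns, so for fixed sizes the total cost grows linearly with the sequence length, at the price of forgetting anything older than the cache.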

Linear Transformers and Their Critique

⚡ Linear Transformers aim to overcome the quadratic cost by replacing the softmax with a kernel function, so that key-value products can be accumulated into a running state and the cost grows linearly with sequence length.
⚠️ The paper argues that linear transformers are not competitive because they compress data into a matrix-valued state, similar to RNNs, which is insufficient for very long contexts.
💬 The speaker expresses skepticism about this critique, suggesting the issue might be the choice of kernel function rather than the concept of compression itself.
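
To make the "matrix-valued state" concrete, here is a minimal NumPy sketch of causal linear attention computed two ways: as a recurrence over a running matrix state, and in the equivalent quadratic form. The feature map `elu(x) + 1` is an assumption chosen for illustration (any positive feature map would do here), and the function names are hypothetical.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: strictly positive, so the normalizer below never hits zero.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(Q, K, V):
    """Process tokens one at a time with a fixed-size state."""
    T, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))  # matrix-valued state: sum of phi(k) v^T
    z = np.zeros(d)         # normalizer state: sum of phi(k)
    out = np.zeros((T, d_v))
    for t in range(T):
        phi_k = feature_map(K[t])
        S += np.outer(phi_k, V[t])
        z += phi_k
        phi_q = feature_map(Q[t])
        out[t] = (phi_q @ S) / (phi_q @ z)
    return out

def linear_attention_quadratic(Q, K, V):
    """Same result, computed with an explicit T x T score matrix."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    scores = np.tril(phi_q @ phi_k.T)  # causal mask: keep s <= t
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
recurrent = linear_attention_recurrent(Q, K, V)
quadratic = linear_attention_quadratic(Q, K, V)
```

The recurrent version never stores more than `S` and `z`, no matter how long the sequence is. That fixed-size state is exactly what the paper's critique targets, and the choice of `feature_map` is the knob the speaker suggests may matter more.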

Introducing Titans: Neural Memory

🚀 The "Titans" architecture proposes a neural network as memory that learns to memorize and compress data at test time.
💡 This neural memory provides information about the distant past, extending beyond the current context window.
🧠 The memory acts as a function where a query is input, and relevant past data is retrieved for current inference.

How Neural Memory Functions

🛠️ The memory is a two-layer neural network (MLP) whose parameters are updated during test time.
✅ It is trained to associate keys with values, meaning if a query is similar to a key, it should output the corresponding value.
📈 Updates are driven by a "surprise" signal derived from a loss on new inputs, and the speaker identifies the resulting update rule as essentially gradient descent with momentum.
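
The three points above can be sketched as a small NumPy class: a two-layer MLP whose weights are updated at inference time by gradient descent with momentum on a squared key-to-value error. This is an illustrative sketch under stated assumptions, not the paper's implementation. The class name, the tanh activation, the squared-error loss, and the `lr` and `momentum` values are all hypothetical, and any gating or decay terms the paper may use are left out.

```python
import numpy as np

class NeuralMemory:
    """Two-layer MLP used as an associative memory, updated at test time."""

    def __init__(self, dim, hidden, lr=0.01, momentum=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, hidden))
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, dim))
        self.lr, self.momentum = lr, momentum
        # Momentum buffers: the running "surprise" for each weight matrix.
        self.S1 = np.zeros_like(self.W1)
        self.S2 = np.zeros_like(self.W2)

    def read(self, query):
        # Retrieval is just a forward pass through the MLP.
        return np.tanh(query @ self.W1) @ self.W2

    def loss(self, key, value):
        return float(np.sum((self.read(key) - value) ** 2))

    def write(self, key, value):
        # Forward pass.
        h = np.tanh(key @ self.W1)
        err = h @ self.W2 - value
        # Backward pass for the loss ||read(key) - value||^2.
        grad_W2 = 2.0 * np.outer(h, err)
        grad_h = 2.0 * (self.W2 @ err)
        grad_W1 = np.outer(key, grad_h * (1.0 - h ** 2))
        # Gradient descent with momentum.
        self.S1 = self.momentum * self.S1 - self.lr * grad_W1
        self.S2 = self.momentum * self.S2 - self.lr * grad_W2
        self.W1 += self.S1
        self.W2 += self.S2

rng = np.random.default_rng(1)
key = rng.normal(size=8)
key /= np.linalg.norm(key)
value = rng.normal(size=8)
value /= np.linalg.norm(value)

memory = NeuralMemory(dim=8, hidden=16)
before = memory.loss(key, value)
for _ in range(100):
    memory.write(key, value)
after = memory.loss(key, value)
```

Repeatedly writing the same pair is only a demonstration that the association gets stored. In a streaming setting each incoming token would contribute one `write`, and the momentum buffers would carry information from earlier tokens into later updates.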

Critique of Terminology and Approach

🗣️ The speaker notes that some concepts, like "surprise" for gradient descent or "persistent memory" for learned parameters, are presented with marketing-oriented language rather than direct technical terms.
🧩 The distinction between "vector-valued" and "neural network-based" memory is questioned, as the speaker argues that arbitrary non-linearity in storing/retrieving can make them functionally equivalent.
💬 The paper's emphasis on the novelty of neural network memory might be overstated, as complex retrieval/storage mechanisms can achieve similar results with simpler memory structures.

Overall Assessment of Titans

👏 The core idea of memorizing at test time is deemed highly valuable and necessary for future models to handle extremely long contexts.
📊 Titans models demonstrate strong performance and scale effectively to context windows larger than 2 million tokens, outperforming Transformers and modern linear recurrent models.
✨ Despite some critiques of presentation, the paper is considered good and points towards a crucial direction for advancing language models.

Topics Discussed

Titans Architecture, Neural Memory, Context Window Limitations, Transformers, Linear Transformers, Memorization at Test Time, Gradient Descent with Momentum, Attention Mechanisms, Recurrent Neural Networks, Kernel Functions, Persistent Memory, Language Modeling, Matrix-valued State, Key-Value Associations, Multi-Layer Perceptron