Titans: Learning to Memorize at Test Time (Paper Analysis)
[HPP] Yannic Kilcher · December 14, 2025 · 32 min
Addressing Context Window Limitations
- Traditional Transformers operate over a fixed context window because the cost of self-attention grows quadratically with sequence length.
- Recurrent models (RNNs, LSTMs) compress data into a fixed-size hidden state, but have been largely surpassed by attention-based models.
- Early attempts like Transformer-XL reused hidden states from previous chunks to extend context, blending RNN-like state passing with Transformer attention.
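The quadratic growth mentioned above can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the 2 bytes per score (fp16) is an assumption, not a figure from the video. Real implementations may avoid materializing the full score matrix, but the number of query-key scores still grows with the square of the sequence length.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of query-key scores full self-attention computes per head."""
    return seq_len * seq_len


def score_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Bytes for one head's score matrix if it were stored (fp16 assumed)."""
    return attention_score_count(seq_len) * bytes_per_score


# Doubling the sequence length quadruples the work.
for n in (1_024, 32_768, 1_048_576):
    gib = score_matrix_bytes(n) / 2**30
    print(f"{n:>9} tokens: {attention_score_count(n):.3e} scores, {gib:,.3f} GiB per head")
```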
Linear Transformers and Their Critique
- Linear Transformers aim to overcome the quadratic cost by replacing the softmax with a kernel feature map, so that keys and values can be accumulated into a running sum at linear cost.
- The paper argues that linear transformers are not competitive because they compress data into a matrix-valued state, similar to RNNs, which is insufficient for very long contexts.
- The speaker expresses skepticism about this critique, suggesting the issue might be the choice of kernel function rather than the concept of compression itself.
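A minimal sketch of the linear accumulation described above, in plain Python. The feature map `elu(x) + 1` is one common choice from the linear attention literature and is an assumption here, as are all function names. The running `state` is the matrix-valued memory the paper criticizes; the quadratic reference is included only to show that both routes compute the same result for this kernel.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, which keeps every feature positive (assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_attention(queries, keys, values):
    """Causal kernel attention using a running matrix state: O(n) in sequence length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = dot(fq, norm)
        outputs.append(
            [dot(fq, [state[i][j] for i in range(d_k)]) / denom for j in range(d_v)]
        )
    return outputs


def quadratic_reference(queries, keys, values):
    """Same kernel, computed the O(n^2) way by revisiting every past token."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        weights = [dot(phi(q), phi(k)) for k in keys[: t + 1]]
        total = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(d_v)]
        )
    return outputs


queries = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.9]]
keys = [[0.1, 0.4], [-1.2, 0.8], [0.6, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
recurrent = linear_attention(queries, keys, values)
reference = quadratic_reference(queries, keys, values)
```

Whatever the sequence length, everything the model remembers has to fit in `state` and `norm`, which is the compression the paper's critique targets.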
Introducing Titans: Neural Memory
- The "Titans" architecture proposes a neural network as memory that learns to memorize and compress data at test time.
- This neural memory provides information about the distant past, extending beyond the current context window.
- The memory acts as a function where a query is input, and relevant past data is retrieved for current inference.
How Neural Memory Functions
- The memory is a two-layer neural network (MLP) whose parameters are updated during test time.
- It is trained to associate keys with values, meaning if a query is similar to a key, it should output the corresponding value.
- Updates are driven by a "surprise" signal derived from the gradient of the key-value association loss, which the speaker identifies as essentially gradient descent with momentum.
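The mechanism in the bullets above can be sketched as a tiny MLP that takes one gradient step with momentum per incoming key-value pair. This is an illustrative sketch under stated assumptions, not the paper's implementation: the class name, the tanh activation, the hidden size, and the fixed learning rate and momentum are all hypothetical choices, keys and values are passed in directly rather than projected from token embeddings, and any forgetting or weight-decay term is left out.

```python
import math
import random


class NeuralMemory:
    """Tiny two-layer MLP whose weights are updated at inference time (sketch)."""

    def __init__(self, dim, hidden, lr=0.05, momentum=0.5, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]
        self.s1 = [[0.0] * dim for _ in range(hidden)]  # momentum buffer for w1
        self.s2 = [[0.0] * hidden for _ in range(dim)]  # momentum buffer for w2
        self.lr, self.momentum = lr, momentum

    def _forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        out = [sum(w * hj for w, hj in zip(row, h)) for row in self.w2]
        return h, out

    def retrieve(self, query):
        """Read from memory: feed a query in, get the associated value out."""
        return self._forward(query)[1]

    def loss(self, key, value):
        """Squared error between the recalled value and the target value."""
        return sum((o - v) ** 2 for o, v in zip(self.retrieve(key), value))

    def memorize(self, key, value):
        """One test-time step: gradient of the recall loss, folded into momentum."""
        h, out = self._forward(key)
        err = [2.0 * (o - v) for o, v in zip(out, value)]  # dL/d_out
        dh = [sum(err[i] * self.w2[i][j] for i in range(len(err))) for j in range(len(h))]
        dpre = [dh[j] * (1.0 - h[j] ** 2) for j in range(len(h))]  # back through tanh
        for i, row in enumerate(self.w2):
            for j in range(len(row)):
                self.s2[i][j] = self.momentum * self.s2[i][j] - self.lr * err[i] * h[j]
                row[j] += self.s2[i][j]
        for j, row in enumerate(self.w1):
            for m in range(len(row)):
                self.s1[j][m] = self.momentum * self.s1[j][m] - self.lr * dpre[j] * key[m]
                row[m] += self.s1[j][m]


memory = NeuralMemory(dim=2, hidden=4)
key, value = [1.0, 0.0], [0.5, -0.5]
before = memory.loss(key, value)
for _ in range(50):  # the same association appears repeatedly in the stream
    memory.memorize(key, value)
after = memory.loss(key, value)
print(f"recall loss before: {before:.4f}, after: {after:.4f}")
```

A large gradient means the memory predicted the value poorly, which is what the "surprise" label refers to; the momentum buffers carry that signal across subsequent tokens.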
Critique of Terminology and Approach
- The speaker notes that some concepts, like "surprise" for gradient descent or "persistent memory" for learned parameters, are presented with marketing-oriented language rather than direct technical terms.
- The distinction between "vector-valued" and "neural network-based" memory is questioned, as the speaker argues that arbitrary non-linearity in storing/retrieving can make them functionally equivalent.
- The paper's emphasis on the novelty of neural network memory might be overstated, as complex retrieval/storage mechanisms can achieve similar results with simpler memory structures.
Overall Assessment of Titans
- The core idea of memorizing at test time is deemed highly valuable and necessary for future models to handle extremely long contexts.
- Titans models demonstrate strong performance and effective scaling to context windows larger than 2 million tokens, outperforming Transformers and modern linear recurrent models.
- Despite some critiques of presentation, the paper is considered good and points towards a crucial direction for advancing language models.