ALiBi: Attention with Linear Biases for Transformer Extrapolation
[HPP] Ashish Vaswani · January 1, 2026 · 11 min
The Challenge of Long Sequence Training
- 💡 Training models with longer sequences is computationally expensive and complex.
- 🎯 The goal is to extrapolate to longer sequences at inference time, beyond what was seen during training.
- ⚠️ Traditional position methods, such as sinusoidal and rotary embeddings, do not extrapolate well to sequences longer than those seen during training.
How ALiBi Works
- 🧠 ALiBi (Attention with Linear Biases) is introduced as a simpler and more efficient positional method.
- 🚫 It does not add positional embeddings to word embeddings directly.
- 🔑 Instead, ALiBi adds a negative penalty to each query-key attention score, proportional to the distance between the query and the key.
- ✨ This bias is static and non-learned, introducing no additional parameters or runtime overhead.
- 🚀 ALiBi incorporates an inductive bias towards recency, meaning closer keys have a greater impact on the query.
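The mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the slope schedule (a geometric sequence starting at 2^(-8/n) for n heads) follows the ALiBi paper's power-of-two case, and causal (left-to-right) attention is assumed.

```python
import math

def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/n_heads).

    Assumption: n_heads is a power of two (the simple case in the ALiBi paper).
    """
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / n_heads)
    return [ratio ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias for one head: -slope * (i - j) for keys j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, slope):
    """Add the static ALiBi bias to raw query-key scores, then softmax each row."""
    bias = alibi_bias(len(scores), slope)
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]

# With all raw scores equal, the bias alone decides the weights:
# each query attends most strongly to the nearest (most recent) key.
uniform_scores = [[0.0] * 3 for _ in range(3)]
weights = biased_attention_weights(uniform_scores, alibi_slopes(8)[0])
```

Because the bias depends only on positions and a fixed per-head slope, it can be computed once and reused; nothing in it is learned. Heads with larger slopes penalize distance more sharply, while heads with small slopes can still attend far back.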
Performance and Extrapolation Benefits
- 📈 ALiBi models demonstrate superior extrapolation capabilities for longer inference sequences.
- 📊 They achieve lower perplexity compared to sinusoidal and rotary position embeddings when extrapolating beyond trained sequence lengths.
- ⚡ A 1.3-billion-parameter ALiBi model trained on sequences of 1024 tokens achieved the same perplexity as a sinusoidal model trained on sequences of 2048 tokens, while training 11% faster and using 11% less memory.
- ✅ ALiBi outperforms other strong positional methods, including the T5 bias method, which suffers from slow training and inference speeds.
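For readers unfamiliar with the metric used in these comparisons, perplexity is the exponential of the average negative log-probability a model assigns to each token; lower is better. The sketch below assumes natural-log probabilities are already available per token, and the function name is hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options.
example = perplexity([math.log(0.25)] * 4)
```

Extrapolation is typically measured by evaluating this quantity on inputs longer than the training length and checking whether it stays flat or rises.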
Efficiency and Popularity
- 🛠️ ALiBi introduces no additional parameters and requires only a negligible amount of extra memory (e.g., 100 MB for 1024-length sequences).
- 🏆 ALiBi models are shown to outperform sinusoidal models even when trained on shorter sequences, offering improved quality when extrapolating.
- 🌐 The ALiBi method has become very popular and is now used in several large language models to encode positional signals.
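As a rough back-of-the-envelope check on the memory point above, the sketch below estimates the size of a fully materialized bias tensor with one L × L matrix per head. The head count and element size are illustrative assumptions, not figures from the video, and the result is not meant to reproduce the quoted 100 MB number.

```python
def bias_tensor_bytes(n_heads, seq_len, bytes_per_element=4):
    """Bytes needed to store one seq_len x seq_len bias matrix per head.

    Assumption: the bias is stored densely (e.g. 4 bytes per element for fp32).
    """
    return n_heads * seq_len * seq_len * bytes_per_element

# Hypothetical configuration: 16 heads, 1024-token sequences, fp32.
size = bias_tensor_bytes(n_heads=16, seq_len=1024)
print(f"{size / 2**20:.0f} MiB")  # prints "64 MiB"
```

The estimate grows quadratically with sequence length but does not depend on batch size or depth if the same tensor is shared across layers, which is consistent with the overhead being small relative to the model itself.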
What’s Discussed
Attention with Linear Biases (ALiBi), Transformer models, Positional embeddings, Sequence extrapolation, Query-key attention, Perplexity, Sinusoidal embeddings, Rotary position embeddings, Inductive bias, Recency bias, Large language models, WikiText-103, Training efficiency, Memory usage, T5 bias method