ALiBi: Attention with Linear Biases for Transformer Extrapolation

[HPP] Ashish Vaswani · January 1, 2026 · 11 min

The Challenge of Long Sequence Training

  • 💡 Training models with longer sequences is computationally expensive and complex.
  • 🎯 The goal is to extrapolate to longer sequences at inference time, beyond what was seen during training.
  • ⚠️ Traditional position embeddings, such as sinusoidal and rotary methods, are insufficient for efficient extrapolation.

How ALiBi Works

  • 🧠 ALiBi (Attention with Linear Biases) is introduced as a simpler and more efficient positional method.
  • 🚫 It does not add positional embeddings to word embeddings directly.
  • 🔑 Instead, ALiBi adds a negative bias to each query-key attention score, with a magnitude proportional to the distance between the query and the key.
  • ✨ This bias is static and non-learned, introducing no additional parameters or runtime overhead.
  • 🚀 ALiBi incorporates an inductive bias towards recency, meaning closer keys have a greater impact on the query.
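
The mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the attention is causal, and the per-head slopes follow the geometric sequence used in the ALiBi paper for power-of-two head counts (1/2, 1/4, ..., 1/256 for 8 heads), a detail this summary does not mention.

```python
import math

def alibi_slopes(num_heads):
    # Assumption: geometric sequence 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper
    # (exact for power-of-two head counts). Fixed, not learned.
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j) for keys at or before the query (j <= i).
    # Future keys (j > i) get -inf, which acts as the causal mask.
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0 for masked entries.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, slope):
    # scores: square matrix of raw query-key dot products for one head.
    # The bias is added to the scores before the softmax; nothing is added
    # to the word embeddings themselves.
    n = len(scores)
    bias = alibi_bias(n, slope)
    return [softmax([scores[i][j] + bias[i][j] for j in range(n)]) for i in range(n)]

# With identical raw scores, the bias alone makes nearer keys weigh more.
weights = biased_attention_weights([[0.0] * 3 for _ in range(3)], slope=0.5)
```

Because the bias depends only on the distance `i - j`, the same function produces a valid bias matrix for any `seq_len`, including lengths longer than those used in training, with no new parameters.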

Performance and Extrapolation Benefits

  • 📈 ALiBi models demonstrate superior extrapolation capabilities for longer inference sequences.
  • 📊 They achieve lower perplexity compared to sinusoidal and rotary position embeddings when extrapolating beyond trained sequence lengths.
  • ⚡ A 1.3-billion-parameter ALiBi model trained on sequences of 1024 tokens achieved the same perplexity as a sinusoidal model trained on sequences of 2048 tokens, while training 11% faster and using 11% less memory.
  • ✅ ALiBi outperforms other strong positional methods, including the T5 bias method, which suffers from slow training and inference speeds.
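
For readers unfamiliar with the metric used in these comparisons, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch, assuming natural-log probabilities and a hypothetical helper name:

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: natural-log probability the model assigned to each
    # actual next token in the evaluation text.
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)

# A model that assigns probability 0.25 to every correct token is as
# uncertain as a uniform choice among four options.
example = perplexity([math.log(0.25)] * 4)
```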

Efficiency and Popularity

  • 🛠️ ALiBi introduces no additional parameters and requires only a negligible amount of extra memory (e.g., 100 MB for 1024-length sequences).
  • 🏆 ALiBi models outperform sinusoidal models even when trained on shorter sequences, giving better quality when extrapolating.
  • 🌐 The ALiBi method has become very popular and is now used in several large language models to encode positional signals.

Topics · 15 themes

What’s Discussed

Attention with Linear Biases (ALiBi) · Transformer models · Positional embeddings · Sequence extrapolation · Query-key attention · Perplexity · Sinusoidal embeddings · Rotary position embeddings · Inductive bias · Recency bias · Large language models · WikiText-103 · Training efficiency · Memory usage · T5 bias method