ALiBi: Attention with Linear Biases for Transformer Extrapolation

[HPP] Ashish Vaswani · January 1, 2026 · 11 min

The Challenge of Long Sequence Training

💡 Training models with longer sequences is computationally expensive and complex.
🎯 The goal is to extrapolate to longer sequences at inference time, beyond what was seen during training.
⚠️ Traditional position embeddings, such as sinusoidal and rotary methods, are insufficient for efficient extrapolation.

How ALiBi Works

🧠 ALiBi (Attention with Linear Biases) is introduced as a simpler and more efficient positional method.
🚫 It does not add positional embeddings to the word embeddings at all.
🔑 Instead, ALiBi biases query-key attention scores with a negative penalty that is proportional to their distance.
✨ This bias is static and non-learned, introducing no additional parameters or runtime overhead.
🚀 ALiBi incorporates an inductive bias towards recency, meaning closer keys have a greater impact on the query.
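
The mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the attention is causal, and the per-head slope schedule (a geometric sequence starting at 2^(-8/n) for n heads) is not covered in this summary and is assumed here from the published ALiBi description for power-of-two head counts.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence (assumes num_heads is a power of two).

    For 8 heads this yields 1/2, 1/4, ..., 1/256.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Static bias for one head: -slope * (i - j) for keys j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, slope):
    """Add the ALiBi bias to raw query-key scores, then normalize each query row."""
    bias = alibi_bias(len(scores), slope)
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]


# With identical raw scores, the bias alone makes nearer keys weigh more.
weights = attention_weights([[0.0] * 3 for _ in range(3)], slope=0.5)
```

Because the bias depends only on the distance between query and key, the same function can build a matrix for any `seq_len`, including lengths longer than those used in training.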

Performance and Extrapolation Benefits

📈 ALiBi models demonstrate superior extrapolation capabilities for longer inference sequences.
📊 They achieve lower perplexity compared to sinusoidal and rotary position embeddings when extrapolating beyond trained sequence lengths.
⚡ A 1.3-billion-parameter ALiBi model trained on 1024-token sequences matched, when evaluated on 2048-token inputs, the perplexity of a sinusoidal model trained on 2048-token sequences, while training 11% faster and using 11% less memory.
✅ ALiBi outperforms other strong positional methods, including the T5 bias method, which suffers from slow training and inference speeds.
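
The comparisons above are stated in terms of perplexity, where lower is better. As a reminder of what that number measures, here is a minimal sketch assuming the model's natural-log probability for each target token is already available; the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that gives every target token probability 0.25 is as uncertain
# as a uniform choice among four options.
uniform_four = perplexity([math.log(0.25)] * 4)
```

An extrapolation test computes this quantity on inputs longer than the training length and checks whether it stays flat or degrades.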

Efficiency and Popularity

🛠️ ALiBi introduces no additional parameters and requires only a negligible amount of extra memory (e.g., 100 MB for 1024-length sequences).
🏆 ALiBi models are shown to be better than sinusoidal models even when trained on shorter sequences (fewer tokens per input), offering improved quality for extrapolation.
🌐 The ALiBi method has become very popular and is now used in several large language models to encode positional signals.
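
The extra memory mentioned above comes from storing the bias itself. A back-of-the-envelope sketch, assuming the bias is materialized as one seq_len × seq_len matrix per head in 32-bit floats; the head count of 16 is a hypothetical value chosen for illustration and is not given in this summary.

```python
def bias_memory_bytes(num_heads, seq_len, bytes_per_value=4):
    """Bytes needed to hold one seq_len x seq_len bias matrix per attention head."""
    return num_heads * seq_len * seq_len * bytes_per_value


# Hypothetical configuration: 16 heads, 1024-token sequences, float32 values.
mib = bias_memory_bytes(16, 1024) / 2**20
```

The cost grows with the square of the sequence length but does not depend on the number of layers if a single copy of the bias is shared across them, which is an assumption of this estimate.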

What’s Discussed

Attention with Linear Biases (ALiBi), Transformer models, Positional embeddings, Sequence extrapolation, Query-key attention, Perplexity, Sinusoidal embeddings, Rotary position embeddings, Inductive bias, Recency bias, Large language models, WikiText-103, Training efficiency, Memory usage, T5 bias method