Token-Level Data Filtering: Preventing Dangerous AI Capabilities in Pre-training

[HPP] Alec Radford · February 14, 2026 · 24 min


Challenges of Traditional AI Safety

⚠️ Post-training methods like RLHF and SFT only suppress the output of dangerous knowledge; they do not remove it from the model's knowledge base.
💥 Machine unlearning techniques (e.g., RMU) are fragile and can be easily bypassed by adversarial fine-tuning, causing forgotten knowledge to "rebound."
🧠 Large models' distributed knowledge storage and strong generalization make surgical removal of dangerous information post-training extremely difficult.
🗑️ Document-level data filtering is imprecise: it deletes valuable benign data along with the dangerous content, and it misses dangerous tokens embedded within otherwise benign documents.
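
A toy sketch can make the document-level problem concrete. The corpus, the per-token labels, and the 0.5 threshold below are all invented for illustration and do not come from the video; the point is only that a whole-document decision both discards benign tokens and lets some dangerous ones through.

```python
# Toy corpus: each document is a list of (token, is_dangerous) pairs.
# Labels and threshold are illustrative assumptions, not real data.
corpus = [
    [("cells", False), ("divide", False), ("by", False), ("mitosis", False)],
    [("the", False), ("toxin", True), ("dose", True),
     ("is", False), ("listed", False), ("below", False)],
    [("synthesize", True), ("the", True), ("agent", True), ("carefully", False)],
]

def document_filter(docs, threshold=0.5):
    """Keep or drop each document as a whole, based on its dangerous-token fraction."""
    kept, dropped = [], []
    for doc in docs:
        fraction = sum(flag for _, flag in doc) / len(doc)
        (dropped if fraction > threshold else kept).append(doc)
    return kept, dropped

kept, dropped = document_filter(corpus)

# Benign tokens thrown away because their document was dropped.
benign_lost = sum(1 for doc in dropped for _, flag in doc if not flag)
# Dangerous tokens that survive because their document was mostly benign.
dangerous_kept = sum(1 for doc in kept for _, flag in doc if flag)

print(f"benign tokens lost: {benign_lost}, dangerous tokens kept: {dangerous_kept}")
```

Raising or lowering the threshold trades one failure for the other; it cannot remove both at document granularity.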

Introducing Token-Level Data Filtering

💡 The core innovation is to precisely filter data at the token level during pre-training, rather than at the document level.
🎭 Loss Masking allows the model to see all tokens for context but prevents it from learning from dangerous tokens by ignoring their gradient contributions.
✂️ Direct Removal replaces dangerous tokens with a special hidden placeholder, completely cutting off the model's exposure to them.
✅ This approach achieves a Pareto improvement, effectively weakening dangerous capabilities without negatively impacting benign ones.
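
The two filtering modes described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation from the video: the token log-probabilities are made-up numbers standing in for model outputs, the `dangerous` flags are assumed to come from some upstream classifier, and the placeholder name `<hidden>` is hypothetical.

```python
import math

HIDDEN = "<hidden>"  # hypothetical placeholder name, chosen for this sketch

def masked_loss(token_log_probs, dangerous):
    """Mean negative log-likelihood over non-dangerous positions only.

    Dangerous positions get weight 0, so they add nothing to the loss
    (and therefore nothing to its gradient), while the tokens themselves
    can still be present in the input as context.
    """
    weights = [0.0 if flag else 1.0 for flag in dangerous]
    total = sum(w * -lp for w, lp in zip(weights, token_log_probs))
    count = sum(weights)
    return total / count if count else 0.0

def remove_tokens(tokens, dangerous):
    """Direct removal: swap each dangerous token for the placeholder."""
    return [HIDDEN if flag else tok for tok, flag in zip(tokens, dangerous)]

tokens = ["the", "compound", "X", "binds", "receptors"]
dangerous = [False, True, True, False, False]
# Stand-in log-probabilities the model assigned to each target token.
log_probs = [math.log(p) for p in (0.5, 0.1, 0.2, 0.25, 0.5)]

loss = masked_loss(log_probs, dangerous)
removed = remove_tokens(tokens, dangerous)
print(round(loss, 4), removed)
```

In a real training framework the same effect is usually obtained by giving masked positions zero weight in the per-token loss. Whether placeholder positions are also excluded from the loss under direct removal is not stated in the summary.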

Experimental Validation and Impact

🔬 Experiments successfully demonstrated precise capability cutting, removing medical knowledge while preserving biological knowledge.
🚀 The effectiveness of token-level filtering scales exponentially with model size, increasing attack costs 7000-fold for 1.8-billion-parameter models.
💪 Token filtering proved 13 times more robust to adversarial fine-tuning than state-of-the-art RMU techniques for large models.
📊 Models trained with token filtering showed near-random performance on medical tests but maintained high accuracy on biological and general intelligence tasks.

Technical Implementation Details

🔑 High-quality seed labels are generated by using sparse autoencoders to detect independent concepts, which are then classified by models like Claude Sonnet 4.
🧠 A small, bidirectional language model classifier (224M parameters) is trained to accurately identify dangerous tokens, leveraging context for precision.
🧩 A two-stage filtering process combines a document-level classifier for initial coarse screening with the token-level classifier for fine-grained precision.
🌱 The method exhibits weak-to-strong generalization, allowing high-performance token-level classifiers to be trained effectively even with noisy or coarse-grained labels.
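
The two-stage process can be read as a cheap document-level gate followed by a token-level pass on the documents that fail it. The sketch below follows that reading, which is an assumption: the summary does not say exactly how the stages are combined. The function `two_stage_filter`, both thresholds, and the keyword-based stand-in classifiers are hypothetical; a real system would plug in trained models such as the 224M-parameter bidirectional classifier mentioned above.

```python
from typing import Callable, List

def two_stage_filter(
    docs: List[List[str]],
    doc_score: Callable[[List[str]], float],
    token_scores: Callable[[List[str]], List[float]],
    doc_threshold: float = 0.5,
    token_threshold: float = 0.5,
) -> List[List[bool]]:
    """Return one boolean mask per document (True = filter this token)."""
    masks = []
    for doc in docs:
        if doc_score(doc) < doc_threshold:
            # Stage 1 clears the document, so the token classifier is skipped.
            masks.append([False] * len(doc))
        else:
            # Stage 2: fine-grained, per-token decision.
            masks.append([s >= token_threshold for s in token_scores(doc)])
    return masks

# Stand-in classifiers based on a keyword set, for illustration only.
RISKY = {"toxin", "pathogen"}

def toy_doc_score(doc: List[str]) -> float:
    return 1.0 if any(tok in RISKY for tok in doc) else 0.0

def toy_token_scores(doc: List[str]) -> List[float]:
    return [1.0 if tok in RISKY else 0.0 for tok in doc]

docs = [["cells", "divide"], ["the", "toxin", "binds"]]
masks = two_stage_filter(docs, toy_doc_score, toy_token_scores)
print(masks)
```

The resulting masks can feed either filtering mode: as loss weights for loss masking, or as the positions to replace under direct removal.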

Reshaping AI Safety Paradigms

🎯 This research shifts AI safety from reactive post-training fixes to proactive pre-training filtering, enabling "surgical" capability shaping.
🧬 It allows for "gene editing" of AI capabilities, precisely controlling what knowledge the AI can learn from its inception.
🌐 The approach offers a path to build inherently harmless large models, preventing them from ever acquiring dangerous knowledge.
🔮 Future work will explore unsupervised labeling, cross-language filtering, and addressing knowledge acquired through external tools.


What’s Discussed

Token-level data filtering, AI safety, Pre-training, Large language models, Machine unlearning, Adversarial fine-tuning, Sparse autoencoders, Bidirectional language models, Weak-to-strong generalization, Pareto improvement, Loss masking, Direct removal, Knowledge fragments, Computational costs, Model capabilities
