Token-Level Data Filtering: Preventing Dangerous AI Capabilities in Pre-training
[HPP] Alec Radford · February 14, 2026 · 24 min
Challenges of Traditional AI Safety
- ⚠️ Post-training methods like RLHF and SFT only suppress the output of dangerous knowledge; they do not remove it from the model's knowledge base.
- 💥 Machine unlearning techniques (e.g., RMU) are fragile and can be easily bypassed by adversarial fine-tuning, causing forgotten knowledge to "rebound."
- 🧠 Large models' distributed knowledge storage and strong generalization make surgical removal of dangerous information post-training extremely difficult.
- 🗑️ Document-level data filtering is inefficient, leading to the deletion of valuable benign data and failing to catch dangerous tokens embedded within benign documents.
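Both failure modes of document-level filtering can be seen in a toy sketch. Everything here is illustrative: the `FLAGGED` word set, the `doc_level_filter` helper, and the 0.3 threshold are hypothetical stand-ins, not the classifier or settings discussed in the video.

```python
# Toy illustration: a whole document is dropped when its share of flagged
# tokens exceeds a threshold. FLAGGED and the threshold are hypothetical.
FLAGGED = {"dosage", "toxin", "pathogen"}

def flagged_fraction(tokens):
    """Share of tokens in the document that are flagged."""
    return sum(t in FLAGGED for t in tokens) / len(tokens)

def doc_level_filter(docs, threshold=0.3):
    """Keep or drop whole documents; return (kept, dropped)."""
    kept, dropped = [], []
    for doc in docs:
        (dropped if flagged_fraction(doc) > threshold else kept).append(doc)
    return kept, dropped

docs = [
    # Mostly flagged: dropped, taking its benign tokens with it.
    ["toxin", "dosage", "the", "cell", "pathogen"],
    # Mostly benign: kept, so its one flagged token leaks through.
    ["the", "cell", "divides", "by", "mitosis", "toxin", "aside"],
]
kept, dropped = doc_level_filter(docs)

# Benign tokens thrown away with dropped documents.
benign_lost = sum(t not in FLAGGED for d in dropped for t in d)
# Flagged tokens that survive inside kept documents.
flagged_leaked = sum(t in FLAGGED for d in kept for t in d)
print(benign_lost, flagged_leaked)
```

In this toy case the first document is dropped along with two benign tokens, and the second is kept along with one flagged token. No single threshold avoids both problems, because the decision is made per document rather than per token.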
Introducing Token-Level Data Filtering
- 💡 The core innovation is to precisely filter data at the token level during pre-training, rather than at the document level.
- 🎭 Loss Masking allows the model to see all tokens for context but prevents it from learning from dangerous tokens by ignoring their gradient contributions.
- ✂️ Direct Removal replaces dangerous tokens with a special "hidden" placeholder, completely cutting off the model's exposure to them.
- ✅ This approach achieves a Pareto improvement, effectively weakening dangerous capabilities without negatively impacting benign ones.
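The two strategies above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation from the research: the per-token log-probabilities are made up, the `<hidden>` string is a hypothetical spelling of the placeholder, and `masked_loss` and `direct_removal` are names invented for this example.

```python
import math

HIDDEN = "<hidden>"  # hypothetical spelling of the placeholder token

def masked_loss(token_log_probs, is_flagged):
    """Mean negative log-likelihood over unflagged tokens only.

    Flagged positions are left out of the loss, so they would contribute
    no gradient, even though the tokens stay in the input as context.
    """
    kept = [-lp for lp, flagged in zip(token_log_probs, is_flagged) if not flagged]
    return sum(kept) / len(kept) if kept else 0.0

def direct_removal(tokens, is_flagged):
    """Replace each flagged token with the placeholder before training."""
    return [HIDDEN if flagged else tok for tok, flagged in zip(tokens, is_flagged)]

tokens = ["the", "drug", "dose", "is", "high"]
is_flagged = [False, True, True, False, False]
# Made-up model log-probabilities for each target token.
log_probs = [math.log(0.5), math.log(0.1), math.log(0.2),
             math.log(0.5), math.log(0.25)]

loss = masked_loss(log_probs, is_flagged)     # averages 3 unflagged tokens
cleaned = direct_removal(tokens, is_flagged)  # flagged tokens replaced
print(round(loss, 4), cleaned)
```

The difference is where the intervention happens: loss masking changes the training objective and leaves the input untouched, while direct removal changes the input itself so the model never reads the flagged tokens.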
Experimental Validation and Impact
- 🔬 Experiments demonstrated precise capability removal: medical knowledge was removed while biological knowledge was preserved.
- 🚀 The effectiveness of token-level filtering scales exponentially with model size, increasing attack costs 7,000-fold for 1.8-billion-parameter models.
- 💪 Token filtering proved 13 times more robust to adversarial fine-tuning than state-of-the-art RMU techniques for large models.
- 📊 Models trained with token filtering showed near-random performance on medical tests but maintained high accuracy on biological and general intelligence tasks.
Technical Implementation Details
- 🔑 High-quality seed labels are generated by using sparse autoencoders to isolate individual concepts, which are then classified by a model such as Claude Sonnet 4.
- 🧠 A small, bidirectional language model classifier (224M parameters) is trained to accurately identify dangerous tokens, leveraging context for precision.
- 🧩 A two-stage filtering process combines a document-level classifier for initial coarse screening with the token-level classifier for fine-grained precision.
- 🌱 The method exhibits weak-to-strong generalization, allowing high-performance token-level classifiers to be trained effectively even with noisy or coarse-grained labels.
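The two-stage process described above might be wired together roughly as follows. This is a sketch under assumptions: the keyword-based scorers are toy stand-ins for the trained document-level and token-level classifiers, the thresholds are arbitrary, and the choice to skip token scoring for documents below the document threshold is one plausible reading of "coarse screening" rather than a confirmed detail.

```python
KEYWORDS = {"drug", "dose"}  # hypothetical stand-in for learned classifiers

def toy_doc_score(tokens):
    """Coarse score: 1.0 if the document mentions any keyword, else 0.0."""
    return 1.0 if any(t in KEYWORDS for t in tokens) else 0.0

def toy_token_score(tokens):
    """Fine score per token. A real classifier would also use context."""
    return [1.0 if t in KEYWORDS else 0.0 for t in tokens]

def two_stage_filter(docs, doc_score, token_score,
                     doc_threshold=0.5, token_threshold=0.5):
    """Return one boolean mask per document (True = flagged token).

    Stage 1: a cheap document-level score decides whether a document
    needs a closer look. Stage 2: only those documents are scored
    token by token.
    """
    masks = []
    for tokens in docs:
        if doc_score(tokens) < doc_threshold:
            # Assumed behaviour: low-scoring documents pass through unmasked.
            masks.append([False] * len(tokens))
        else:
            masks.append([s >= token_threshold for s in token_score(tokens)])
    return masks

docs = [
    ["cells", "divide"],
    ["the", "drug", "dose", "is", "high"],
]
masks = two_stage_filter(docs, toy_doc_score, toy_token_score)
print(masks)
```

The resulting masks are the kind of per-token labels that loss masking or direct removal would consume. Running the expensive token-level pass only on documents that clear the first stage is what keeps the cost manageable on a large corpus.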
Reshaping AI Safety Paradigms
- 🎯 This research shifts AI safety from reactive post-processing to proactive pre-training filtering, enabling "surgical" capability shaping.
- 🧬 It allows for "gene editing" of AI capabilities, precisely controlling what knowledge the AI can learn from its inception.
- 🌐 The approach offers a path to build inherently harmless large models, preventing them from ever acquiring dangerous knowledge.
- 🔮 Future work will explore unsupervised labeling, cross-language filtering, and addressing knowledge acquired through external tools.
Topics
Token-level data filtering · AI safety · Pre-training · Large language models · Machine unlearning · Adversarial fine-tuning · Sparse autoencoders · Bidirectional language models · Weak-to-strong generalization · Pareto improvement · Loss masking · Direct removal · Knowledge fragments · Computational costs · Model capabilities