Token-Level Data Filtering: Preventing Dangerous AI Capabilities in Pre-training

[HPP] Alec Radford · February 14, 2026 · 24 min

Challenges of Traditional AI Safety

  • ⚠️ Post-training methods such as RLHF and SFT only suppress the output of dangerous knowledge; they do not remove it from the model's knowledge base.
  • 💥 Machine unlearning techniques (e.g., RMU) are fragile and can be easily bypassed by adversarial fine-tuning, causing forgotten knowledge to "rebound."
  • 🧠 Large models' distributed knowledge storage and strong generalization make surgical removal of dangerous information post-training extremely difficult.
  • 🗑️ Document-level data filtering is coarse: it deletes valuable benign data along with the dangerous content, and it misses dangerous tokens embedded within otherwise benign documents.
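
The two failure modes of document-level filtering can be made concrete with a small count. The sketch below is only an illustration: the per-token labels, the toy corpus, and the 0.5 threshold are assumptions, and `document_filter_errors` is a hypothetical helper, not something described in the video.

```python
def document_filter_errors(docs_flags, threshold=0.5):
    """Count the two kinds of error made by a whole-document filter.

    docs_flags: one list of booleans per document (True = dangerous token).
    A document is dropped when its share of flagged tokens reaches `threshold`.
    Returns (benign tokens thrown away, dangerous tokens left in the data).
    """
    benign_lost = 0
    dangerous_kept = 0
    for flags in docs_flags:
        if not flags:
            continue  # skip empty documents
        n_flagged = sum(flags)
        if n_flagged / len(flags) >= threshold:
            # Whole document dropped: its benign tokens go with it.
            benign_lost += len(flags) - n_flagged
        else:
            # Whole document kept: its flagged tokens stay in the corpus.
            dangerous_kept += n_flagged
    return benign_lost, dangerous_kept


# Toy corpus (assumed labels): one mostly dangerous document, one mostly benign.
toy_corpus = [
    [True, True, True, False],
    [False, False, False, True],
]
print(document_filter_errors(toy_corpus))
```

With perfect token-level labels, both counts would be zero, because each token is kept or dropped on its own.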

Introducing Token-Level Data Filtering

  • 💡 The core innovation is to precisely filter data at the token level during pre-training, rather than at the document level.
  • 🎭 Loss Masking lets the model see all tokens as context but prevents it from learning from dangerous tokens: they are excluded from the loss, so they contribute no gradient.
  • ✂️ Direct Removal replaces dangerous tokens with a special hidden placeholder, completely cutting off the model's exposure to them.
  • ✅ This approach achieves a Pareto improvement: it weakens dangerous capabilities without degrading benign ones.
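
The two filtering modes above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the method as implemented in the research: the token labels and model probabilities are toy values, the placeholder string `<hidden>` is a made-up name, and whether placeholder positions are also left out of the loss is not stated in the summary.

```python
import math

PLACEHOLDER = "<hidden>"  # hypothetical placeholder string, not named in the source


def masked_loss(token_log_probs, flagged):
    """Mean negative log-likelihood over unflagged positions only.

    Flagged tokens stay in the input as context but are left out of the
    loss, so they contribute no gradient.
    """
    kept = [-lp for lp, is_flagged in zip(token_log_probs, flagged) if not is_flagged]
    return sum(kept) / len(kept) if kept else 0.0


def remove_flagged(tokens, flagged, placeholder=PLACEHOLDER):
    """Replace each flagged token with a placeholder before training."""
    return [placeholder if is_flagged else tok for tok, is_flagged in zip(tokens, flagged)]


tokens = ["the", "dose", "is", "safe"]
flagged = [False, True, False, False]  # toy labels, assumed for illustration
log_probs = [math.log(p) for p in (0.5, 0.1, 0.25, 0.5)]  # toy model outputs

print(masked_loss(log_probs, flagged))   # loss ignores the flagged position
print(remove_flagged(tokens, flagged))   # flagged token never reaches the model
```

The design difference is exposure: with loss masking the flagged token is still visible when predicting later tokens, while with removal the model never sees it at all.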

Experimental Validation and Impact

  • 🔬 Experiments successfully demonstrated precise capability cutting, removing medical knowledge while preserving biological knowledge.
  • 🚀 The effectiveness of token-level filtering scales exponentially with model size, raising attack costs 7,000-fold for 1.8-billion-parameter models.
  • 💪 Token filtering proved 13 times more robust to adversarial fine-tuning than state-of-the-art RMU techniques for large models.
  • 📊 Models trained with token filtering showed near-random performance on medical tests but maintained high accuracy on biological and general intelligence tasks.

Technical Implementation Details

  • 🔑 High-quality seed labels are generated with sparse autoencoders, which isolate independent concepts; a model such as Claude Sonnet 4 then classifies those concepts.
  • 🧠 A small, bidirectional language model classifier (224M parameters) is trained to accurately identify dangerous tokens, leveraging context for precision.
  • 🧩 A two-stage filtering process combines a document-level classifier for initial coarse screening with the token-level classifier for fine-grained precision.
  • 🌱 The method exhibits weak-to-strong generalization, allowing high-performance token-level classifiers to be trained effectively even with noisy or coarse-grained labels.
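
The two-stage process described above can be sketched as a coarse document gate followed by per-token flagging. Everything concrete here is an assumption made for illustration: the function `two_stage_flags`, both thresholds, and the keyword-based toy scorers, which stand in for the trained document-level and token-level classifiers. Unlike the bidirectional classifier in the source, the toy token scorer uses no context.

```python
from typing import Callable, List


def two_stage_flags(
    tokens: List[str],
    doc_scorer: Callable[[List[str]], float],
    token_scorer: Callable[[List[str]], List[float]],
    doc_threshold: float = 0.25,
    token_threshold: float = 0.5,
) -> List[bool]:
    """Return one flag per token (True = filter this token)."""
    # Stage 1: coarse screen. Documents scored as benign skip stage 2 entirely.
    if doc_scorer(tokens) < doc_threshold:
        return [False] * len(tokens)
    # Stage 2: fine-grained flags from the token-level scorer.
    return [score >= token_threshold for score in token_scorer(tokens)]


# Toy stand-ins for the two classifiers (hypothetical word lists).
STRONG_TERMS = {"antibiotic", "dosage"}
TOKEN_TERMS = {"antibiotic", "dosage", "culture"}


def toy_doc_scorer(tokens: List[str]) -> float:
    return sum(t in STRONG_TERMS for t in tokens) / max(len(tokens), 1)


def toy_token_scorer(tokens: List[str]) -> List[float]:
    return [1.0 if t in TOKEN_TERMS else 0.0 for t in tokens]


benign = "cell culture grows overnight".split()
medical = "antibiotic dosage for culture".split()
print(two_stage_flags(benign, toy_doc_scorer, toy_token_scorer))
print(two_stage_flags(medical, toy_doc_scorer, toy_token_scorer))
```

In this toy setup the document gate keeps an ambiguous word such as "culture" from being flagged in a benign document, while the same word is flagged inside a document that passes the coarse screen.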

Reshaping AI Safety Paradigms

  • 🎯 This research shifts AI safety from reactive post-processing to proactive pre-training filtering, enabling "surgical" capability shaping.
  • 🧬 It allows for "gene editing" of AI capabilities, precisely controlling what knowledge the AI can learn from its inception.
  • 🌐 The approach offers a path to build inherently harmless large models, preventing them from ever acquiring dangerous knowledge.
  • 🔮 Future work will explore unsupervised labeling, cross-language filtering, and addressing knowledge acquired through external tools.

Topics

Token-level data filtering, AI safety, Pre-training, Large language models, Machine unlearning, Adversarial fine-tuning, Sparse autoencoders, Bidirectional language models, Weak-to-strong generalization, Pareto improvement, Loss masking, Direct removal, Knowledge fragments, Computational costs, Model capabilities