Rigorous Evaluation of LLM Pretraining Optimizers: Debunking Speedup Claims

[HPP] Percy Liang · January 22, 2026 · 14 min

The Pretraining Optimization Debate

💡 Pre-training dominates the cost of building Large Language Models (LLMs), consuming over 95% of total project costs for models like DeepSeek V3, which makes the choice of pre-training optimizer a major debate in machine learning.
🚀 While AdamW has been the default optimizer, many new algorithms like Muon, Sophia, and SOAP claim 1.4x to 2x speedups.
⚠️ Despite these claims, the new optimizers have not been widely adopted, which suggests industry-wide skepticism.

Methodological Flaws & Rigorous Evaluation

🔍 Previous evaluations suffered from two critical flaws: unequal or insufficient hyperparameter tuning and limited evaluation setups (small models, low data regimes).
🎯 A systematic study compared 11 optimizers across four Llama 2 model scales (0.1B to 1.2B parameters) and varied data regimes (1x to 8x Chinchilla optimum).
🛠️ Hyperparameters were tuned with a rigorous three-phase coordinate descent. The tuning revealed that a learning rate adjustment alone could give AdamW a 2x speedup over a weakly tuned baseline.
🔑 Optimal hyperparameters are highly specific and non-transferable between optimizers, making fixed settings or blind transfers unfair.
⏳ Intermediate checkpoints are misleading; evaluations must be performed at the target training budget as optimizer rankings can flip during training.
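
As a rough illustration of the tuning idea above, here is a minimal coordinate descent sketch: one hyperparameter is varied at a time while the others stay fixed, and sweeps repeat until nothing improves. It does not reproduce the study's three phases, which are not detailed here. The function `toy_final_loss`, the grid values, and the hyperparameter names are all hypothetical stand-ins; in practice `evaluate` would train a model to the full target budget and return its final validation loss.

```python
import math
from typing import Callable, Dict, List


def coordinate_descent(
    evaluate: Callable[[Dict[str, float]], float],
    grid: Dict[str, List[float]],
    start: Dict[str, float],
    max_sweeps: int = 5,
) -> Dict[str, float]:
    """Tune one hyperparameter at a time, holding the others fixed."""
    best = dict(start)
    best_loss = evaluate(best)
    for _ in range(max_sweeps):
        improved = False
        for name, candidates in grid.items():
            for value in candidates:
                trial = {**best, name: value}
                loss = evaluate(trial)
                if loss < best_loss:
                    best, best_loss, improved = trial, loss, True
        if not improved:
            break  # a full sweep changed nothing, so stop
    return best


def toy_final_loss(hp: Dict[str, float]) -> float:
    # Hypothetical stand-in for "train to the target budget, return
    # the final validation loss". The shape is invented for illustration.
    return (
        (math.log10(hp["lr"]) + 2.5) ** 2
        + 10 * (hp["weight_decay"] - 0.1) ** 2
        + 0.1 * (hp["warmup_steps"] / 1000 - 2) ** 2
    )


grid = {
    "lr": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "weight_decay": [0.0, 0.05, 0.1, 0.2],
    "warmup_steps": [500, 1000, 2000, 4000],
}
start = {"lr": 1e-4, "weight_decay": 0.0, "warmup_steps": 500}

tuned = coordinate_descent(toy_final_loss, grid, start)
print(tuned)
```

Because each optimizer runs its own search from its own starting point, the tuned settings are not expected to carry over from one optimizer to another, which is the fairness concern raised above.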

Key Findings on Optimizer Performance

📈 Against a properly tuned AdamW baseline, the highest observed speedup for any alternative optimizer was capped at 1.4x.
📉 This speedup decays dramatically with model size, reducing from 1.3x for 0.1B models to a mere 1.1x for 1.2B parameter models in high data regimes.
🔮 A fitted scaling law suggests that at frontier scales (e.g., 7B parameters), some advanced optimizers like Muon might lead to a higher final validation loss than a well-tuned AdamW.
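
The summary does not spell out how a speedup is measured, so the sketch below assumes one common definition: the token budget a tuned baseline would need to match the candidate's final loss, divided by the budget the candidate actually used. All numbers are synthetic and are not results from the study; `tokens_to_reach` and `speedup` are hypothetical helper names.

```python
import math
from typing import List, Tuple


def tokens_to_reach(baseline: List[Tuple[float, float]], target_loss: float) -> float:
    """Interpolate, linearly in log-tokens, the budget at which the baseline
    reaches target_loss.

    baseline: (token_budget, final_loss) pairs sorted by increasing budget,
    with strictly decreasing loss. Each pair is assumed to come from a
    separate run trained to that full budget, not an intermediate checkpoint.
    """
    for (t0, l0), (t1, l1) in zip(baseline, baseline[1:]):
        if l1 <= target_loss <= l0:
            frac = (l0 - target_loss) / (l0 - l1)
            return math.exp(math.log(t0) + frac * (math.log(t1) - math.log(t0)))
    raise ValueError("target loss is outside the measured baseline range")


def speedup(
    baseline: List[Tuple[float, float]],
    candidate_tokens: float,
    candidate_loss: float,
) -> float:
    """Baseline tokens needed to match the candidate, per candidate token."""
    return tokens_to_reach(baseline, candidate_loss) / candidate_tokens


# Synthetic baseline runs: (tokens, final validation loss).
adamw_runs = [(1e9, 3.50), (2e9, 3.40), (4e9, 3.30)]

# Synthetic candidate: trained on 2e9 tokens, final loss 3.35.
s = speedup(adamw_runs, candidate_tokens=2e9, candidate_loss=3.35)
print(f"speedup = {s:.2f}x")
```

Under this definition the reported speedup depends directly on how well the baseline runs were tuned: a weak baseline shifts its loss curve upward and inflates the ratio.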

Scaling Challenges and Structural Insights

🧩 Optimizers are categorized into scalar-based (e.g., AdamW, Lion) and matrix-based (e.g., Muon, SOAP, Kron), with matrix methods leveraging structural information for gradient preconditioning.
⚡ Matrix-based methods showed 1.3x gains at smaller scales (under 520M parameters) but their efficiency gain is limited by increased computational overhead as model size grows.
📊 The optimal optimizer depends on the data regime: Muon excels in data-limited settings (1x-4x Chinchilla), while SOAP and Kron perform better in data-dense or overtrained settings (8x-16x Chinchilla) due to second-order momentum.
✅ Optimizer choice primarily affects training speed, not the fundamental generalization properties or final structural outcome of the model.
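
To make the scalar versus matrix distinction concrete, the sketch below contrasts an entrywise sign direction with a direction that uses the whole weight matrix at once by orthogonalizing the momentum. This is an illustration under assumptions, not any optimizer's reference implementation: the sign rule is only Lion-like, and the orthogonalization uses the classic cubic Newton-Schulz iteration with many steps, swapped in for simplicity. Muon implementations commonly use a tuned polynomial iteration with only a few steps. Function names are hypothetical.

```python
import numpy as np


def sign_direction(momentum: np.ndarray) -> np.ndarray:
    """Scalar-based direction: every entry is treated independently."""
    return np.sign(momentum)


def orthogonalized_direction(momentum: np.ndarray, steps: int = 40) -> np.ndarray:
    """Matrix-based direction: approximate U @ V^T from the SVD
    momentum = U @ diag(S) @ V^T, using cubic Newton-Schulz iterations."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which is inside the iteration's convergence region.
    x = momentum / (np.linalg.norm(momentum) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


rng = np.random.default_rng(0)
momentum = rng.standard_normal((4, 6))  # stand-in for one layer's momentum

scalar_dir = sign_direction(momentum)
matrix_dir = orthogonalized_direction(momentum)

# Rows of the matrix direction are (numerically) orthonormal.
print(np.round(matrix_dir @ matrix_dir.T, 3))
```

The extra matrix multiplications in the second function are a small-scale picture of the computational overhead that matrix-based methods pay on every step.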

Operational Takeaways & Future Directions

💡 The most significant and safest efficiency gain comes from rigorous hyperparameter tuning of existing AdamW baselines, often negating the need for complex migrations.
⚠️ Matrix-based optimizers offer modest, decaying advantages that may not justify migration costs for large models.
🎯 Always evaluate at the target training budget to avoid misleading rankings from early-stage checkpoints.
🚀 The key open problem is designing optimizers with stable efficiency gains that do not diminish at trillion-parameter scales.

Topics

Large Language Models (LLMs), Pre-training Optimization, AdamW Optimizer, Hyperparameter Tuning, Optimizer Evaluation, Model Scaling, Data Regimes, Coordinate Descent, Learning Rate Decay, Scalar-Based Optimizers, Matrix-Based Optimizers, Computational Overhead, Generalization Properties, Scaling Laws, Deep Learning Optimization