Annus Mirabilis: A Year of Explosive Progress in LLMs with Benjamin Feuer

[HPP] Ludwig Schmidt · January 28, 2026 · 1h 33min

The "Annus Mirabilis" of LLMs

💡 The term "Annus Mirabilis" (year of miracles), long associated with Isaac Newton's breakthroughs, now describes the explosive progress in Large Language Models (LLMs) since ChatGPT's public debut.
🚀 LLMs have evolved rapidly, moving from "stochastic parrots" to smooth, quasi-agentic systems that rarely surprise users unless pushed to out-of-distribution tasks.
🧠 This rapid advancement has forced skeptics to repeatedly shift their goalposts regarding LLM capabilities.

Challenges with LLM Judges and Benchmarks

🎯 Current LLM alignment progress is often measured using LLM judges and benchmarks like AlpacaEval and MT-Bench, which score outputs based on pairwise comparisons.
⚠️ A key issue is that LLM judges implicitly reweight explicit judgment criteria, prioritizing aspects like style over conciseness, even when they "understand" all criteria.
📊 Experiments show that LLMs can severely penalize stylistic infractions (e.g., sarcasm leading to a 96% score decrease) more than factual errors, highlighting a bias in evaluation.
🧩 The numeric scales used by LLM judges are unstable and unpredictable, with different judges scoring outputs inconsistently and exhibiting "quirky artifacts of behavior."
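
To make the pairwise scoring described above concrete, here is a minimal sketch of how per-prompt judge verdicts might be aggregated into a single win rate. The verdict labels ("candidate", "baseline", "tie"), the tie-as-half-win rule, and the sample data are assumptions made for illustration; they are not taken from the talk or from any specific benchmark's implementation.

```python
from collections import Counter


def win_rate(verdicts):
    """Return the fraction of comparisons won by the candidate model.

    Each verdict is one of "candidate", "baseline", or "tie" (hypothetical
    labels). A tie is counted as half a win, which is an assumption here.
    """
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    return (counts["candidate"] + 0.5 * counts["tie"]) / total


# Hypothetical judge verdicts for five prompts (illustrative data only).
verdicts = ["candidate", "baseline", "candidate", "tie", "candidate"]
print(win_rate(verdicts))
```

A single aggregate like this hides which criteria drove each verdict, which is one reason the implicit reweighting described above is hard to spot from benchmark scores alone.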

Mitigating Data Contamination

📈 Static benchmarks are prone to contamination, where training data inadvertently includes test answers, leading to inflated performance metrics that don't reflect true generalization.
🔍 Contamination is complex, ranging from literal copy-pasting to rephrased answers, and standard detection methods like n-gram decontamination are often insufficient.
🔄 Dynamic benchmarks, such as LiveBench, address contamination by regularly updating test prompts and questions over time, ensuring models are evaluated on novel challenges.
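
The limits of n-gram decontamination mentioned above can be illustrated with a small sketch. The function names, whitespace tokenization, and example strings below are hypothetical and deliberately simplified; real decontamination pipelines differ in tokenization, n-gram length, and matching thresholds.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_doc, test_item, n=8):
    """Flag a training document that shares any n-gram with a test item."""
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))


# Illustrative strings only; n=5 is chosen to suit these short examples.
test_item = "what is the capital city of france"
verbatim = "quiz: what is the capital city of france answer paris"
rephrased = "which city serves as france's capital"

print(is_contaminated(verbatim, test_item, n=5))   # literal copy is caught
print(is_contaminated(rephrased, test_item, n=5))  # rephrasing slips through
```

The rephrased document carries the same question yet shares no 5-gram with the test item, so an exact-match filter of this kind would not flag it.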

Enhancing LLMs with Synthetic Data

🌱 Historically, models trained on synthetic data underperformed those trained on human-authored data, but recent work like TabPFN has challenged this notion.
🚀 Research on fine-tuning LLMs with synthetic data, using large datasets like WildChat, shows that the choice of the data-generating model (DGM) is highly impactful.
✅ Models fine-tuned with strong DGMs demonstrate significant improvements in instruction following and stylistic aspects, which are highly valued by LLM judges.
📊 While synthetic data improves stylistic and instruction-following capabilities, distilling from a strong DGM does not necessarily transfer its benchmark performance on hard topics.

Future Directions and Open Questions

💡 Key future challenges include ensuring LLMs can ground responses in provided context, especially with long and complex inputs like tables.
⚖️ Developing alignment strategies for domains with multifaceted ground truth, such as law or medicine, remains difficult due to the complexity of human reasoning in these areas.
🌐 There's a critical need for fully open-source LLMs with transparent data, especially concerning the use of copyrighted data for training high-performing models.
🎨 The speaker expresses skepticism about benchmarking creativity in "wicked problems" like filmmaking, arguing that true creativity involves breaking rules and transforming perception, which is hard to quantify.

Topics · 15 themes

What’s Discussed

Large Language Models (LLMs) · LLM Alignment · Benchmarking · LLM Judges · Data Contamination · Synthetic Data · Supervised Fine-Tuning (SFT) · Dynamic Benchmarks · Instruction Following · Open Source Models · Creativity · Implicit Bias · Reasoning Models · Ground Truth · Data Scaling
