Annus Mirabilis: A Year of Explosive Progress in LLMs with Benjamin Feuer
[HPP] Ludwig Schmidt · January 28, 2026 · 1h 33min
The "Annus Mirabilis" of LLMs
- The term "Annus Mirabilis" (year of miracles), often associated with Isaac Newton's breakthroughs, is used here to describe the explosive progress in Large Language Models (LLMs) since ChatGPT's public debut.
- LLMs have evolved rapidly, moving from "stochastic parrots" to smooth, quasi-agentic systems that rarely surprise users unless pushed to out-of-distribution tasks.
- This rapid advancement has forced skeptics to repeatedly shift their goalposts regarding LLM capabilities.
Challenges with LLM Judges and Benchmarks
- Progress in LLM alignment is often measured with LLM judges and benchmarks such as AlpacaEval and MT-Bench, which score outputs through pairwise comparisons.
- A key issue is that LLM judges implicitly reweight the explicit judgment criteria they are given, prioritizing some aspects (such as style) over others (such as conciseness), even when they "understand" all of the criteria.
- Experiments show that LLM judges can penalize stylistic infractions more severely than factual errors (for example, sarcasm led to a 96% score decrease), which points to a bias in evaluation.
- The numeric scales used by LLM judges are unstable and unpredictable: different judges score the same outputs inconsistently and exhibit "quirky artifacts of behavior."
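As a rough illustration of pairwise scoring, the sketch below computes a win rate for a candidate model against a baseline. The `Judge` signature, the order-swapping step, and the toy `length_biased_judge` are assumptions made for this example; they are not the actual AlpacaEval or MT-Bench implementation.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_win_rate(
    prompts: Sequence[str],
    candidate: Sequence[str],
    baseline: Sequence[str],
    judge: Judge,
) -> float:
    """Fraction of comparisons won by the candidate, averaged over both presentation orders."""
    if not prompts:
        raise ValueError("need at least one prompt")
    wins = 0.0
    for prompt, cand, base in zip(prompts, candidate, baseline):
        first = judge(prompt, cand, base) == "A"   # candidate shown first
        second = judge(prompt, base, cand) == "B"  # candidate shown second
        wins += (first + second) / 2
    return wins / len(prompts)


def length_biased_judge(prompt: str, a: str, b: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer response, and the first one on ties."""
    return "A" if len(a) >= len(b) else "B"


if __name__ == "__main__":
    prompts = ["q1", "q2", "q3"]
    candidate = ["a long detailed answer", "x", "same"]
    baseline = ["short", "longer", "four"]
    print(pairwise_win_rate(prompts, candidate, baseline, length_biased_judge))
```

The toy judge never looks at the prompt or at correctness, so any win rate it produces reflects only length and position. That is an exaggerated version of the implicit reweighting described above.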
Mitigating Data Contamination
- Static benchmarks are prone to contamination, where training data inadvertently includes test answers, inflating performance metrics so that they no longer reflect true generalization.
- Contamination is complex, ranging from literal copy-pasting to rephrased answers, and standard detection methods such as n-gram decontamination are often insufficient.
- Dynamic benchmarks such as LiveBench address contamination by regularly updating their test prompts and questions, so that models are evaluated on novel challenges.
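A minimal sketch of n-gram decontamination, assuming whitespace tokenization and an 8-gram window (both choices are illustrative, not taken from the talk). It flags a verbatim copy of a test item but misses a paraphrase, which is the weakness noted above.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_doc: str, test_item: str, n: int = 8) -> bool:
    """Flag a training document that shares at least one n-gram with a test item."""
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))


# Toy strings invented for this example.
test_item = "what is the capital of france the answer is paris"
verbatim = "forum post: what is the capital of france the answer is paris thanks"
paraphrase = "paris has long served as the capital city of the french republic"

if __name__ == "__main__":
    print(is_contaminated(verbatim, test_item))    # literal copy
    print(is_contaminated(paraphrase, test_item))  # rephrased answer
```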
Enhancing LLMs with Synthetic Data
- Historically, models trained on synthetic data underperformed those trained on human-authored data, but recent work such as TabPFN has challenged this notion.
- Research on fine-tuning LLMs with synthetic data, using large datasets such as WildChat, shows that the choice of the data-generating model (DGM) is highly impactful.
- Models fine-tuned with strong DGMs show significant improvements in instruction following and in stylistic aspects, both of which LLM judges value highly.
- While synthetic data improves stylistic and instruction-following capabilities, distillation does not necessarily transfer the DGM's benchmark performance on hard topics to the fine-tuned model.
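One way to picture the setup is to hold the prompts fixed and regenerate the responses with different DGMs, so that any downstream difference is attributable to the DGM. The sketch below assumes that design; `build_sft_dataset` and both toy DGMs are hypothetical names, and a real pipeline would call an actual model instead.

```python
from typing import Callable, Dict, List


def build_sft_dataset(prompts: List[str], dgm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Pair each prompt with a response produced by the given data-generating model."""
    return [{"prompt": p, "response": dgm(p)} for p in prompts]


def terse_dgm(prompt: str) -> str:
    """Toy stand-in for a weak DGM: ignores the prompt entirely."""
    return "Yes."


def structured_dgm(prompt: str) -> str:
    """Toy stand-in for a stronger DGM: produces a response that follows the prompt."""
    return f"Here is a step-by-step answer to: {prompt}"


# Same prompts, one dataset per DGM; only the responses differ.
prompts = ["Summarize this table.", "List three risks of data contamination."]
datasets = {
    name: build_sft_dataset(prompts, dgm)
    for name, dgm in {"terse": terse_dgm, "structured": structured_dgm}.items()
}
```

Each dataset would then be used to fine-tune a copy of the same base model, and the resulting models compared under the same judge or benchmark.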
Future Directions and Open Questions
- Key future challenges include ensuring that LLMs can ground their responses in the provided context, especially with long and complex inputs such as tables.
- Developing alignment strategies for domains with multifaceted ground truth, such as law or medicine, remains difficult because of the complexity of human reasoning in these areas.
- There is a critical need for fully open-source LLMs with transparent data, especially given concerns about the use of copyrighted data to train high-performing models.
- The speaker is skeptical about benchmarking creativity in "wicked problems" such as filmmaking, arguing that true creativity involves breaking rules and transforming perception, which is hard to quantify.
Topics
Large Language Models (LLMs), LLM Alignment, Benchmarking, LLM Judges, Data Contamination, Synthetic Data, Supervised Fine-Tuning (SFT), Dynamic Benchmarks, Instruction Following, Open Source Models, Creativity, Implicit Bias, Reasoning Models, Ground Truth, Data Scaling