AI Auditing and Evaluation: Beyond Benchmarking with Inioluwa Deborah Raji

[HPP] Timnit Gebru · January 22, 2026 · 1h 3min

The Evolving Role of AI Evaluation

💡 AI systems are increasingly deployed in critical domains like criminal justice and healthcare, leading to real human costs when they fail.
📌 Historically, consumer protection movements have driven product safety evaluations, a model relevant to today's AI landscape.
🚀 AI evaluation has shifted from merely ranking algorithms to playing a broader role in product deployment, documentation, and legal evidence.

Auditing for Accountability

🎯 The AI auditing process involves identifying harms, evaluating against standards, communicating results, advocating for change, and ensuring legal accountability through consequences.
🔬 The Gender Shades project demonstrated how evaluating facial recognition bias led to significant changes in documentation, procurement, and legal actions.
📈 Postmarket surveillance methods, inspired by vaccine adverse event reporting, can statistically identify disproportionately harmed subgroups in AI systems.
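The postmarket surveillance idea above can be sketched as a disproportionality check, a common signal-detection approach in adverse event reporting. The sketch below computes a proportional reporting ratio (PRR) per subgroup: the share of a subgroup's incident reports that involve a given harm, divided by the same share among all other subgroups. The function name, the report format, and the toy counts are assumptions made for illustration; they are not taken from the talk.

```python
from collections import Counter


def proportional_reporting_ratio(reports, subgroup, harm):
    """Return the PRR for one (subgroup, harm) pair, or None if undefined.

    reports: iterable of (subgroup, harm_type) tuples, one per incident report.
    """
    counts = Counter()
    for group, harm_type in reports:
        counts[(group == subgroup, harm_type == harm)] += 1

    a = counts[(True, True)]    # subgroup reports with this harm
    b = counts[(True, False)]   # subgroup reports with other harms
    c = counts[(False, True)]   # other groups' reports with this harm
    d = counts[(False, False)]  # other groups' reports with other harms

    if a + b == 0 or c == 0:
        return None  # ratio is undefined without reports on both sides
    return (a / (a + b)) / (c / (c + d))


# Hypothetical incident reports: (subgroup, harm type).
toy_reports = (
    [("group_a", "misidentification")] * 30
    + [("group_a", "other")] * 70
    + [("group_b", "misidentification")] * 10
    + [("group_b", "other")] * 90
)

prr = proportional_reporting_ratio(toy_reports, "group_a", "misidentification")
print(f"PRR for group_a: {prr:.2f}")
```

A PRR well above 1 flags a subgroup whose reports are disproportionately about one harm. It is a screening signal rather than proof of harm, since report counts depend on who files reports in the first place.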

Beyond Traditional Benchmarking

⚠️ Traditional AI benchmarks often lack construct validity: they do not measure what they claim to, so scores fail to reflect real-world performance or generalizability.
🧠 Realistic evaluations, like using patient notes for clinical LLMs, reveal that models perform differently and often worse than in idealized test scenarios.
🧩 AI deployments should be viewed as policy interventions, where experiment design choices significantly impact user responsiveness and the accurate measurement of causal effects.
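If a deployment is treated as a policy intervention, one simple way to measure its causal effect is a randomized experiment analyzed with a difference in means. The sketch below is a minimal version under stated assumptions: units were randomly assigned to the AI tool or to a control condition, outcomes are numeric, and a normal approximation is acceptable for the interval. The function name and the toy outcomes are hypothetical and do not come from the talk.

```python
import math
from statistics import mean, variance


def difference_in_means(treated, control, z=1.96):
    """Estimate the average treatment effect and an approximate 95% interval.

    treated, control: lists of numeric outcomes, each with at least two values.
    Assumes random assignment, so the difference in means is an unbiased estimate.
    """
    effect = mean(treated) - mean(control)
    # Standard error from the two sample variances (unequal variances allowed).
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return effect, (effect - z * se, effect + z * se)


# Hypothetical binary outcomes (1 = decision-maker followed the recommendation).
with_tool = [1, 1, 0, 1, 1, 0, 1, 1]
without_tool = [0, 1, 0, 0, 1, 0, 1, 0]

effect, (low, high) = difference_in_means(with_tool, without_tool)
print(f"estimated effect: {effect:.3f}, interval: ({low:.3f}, {high:.3f})")
```

With samples this small the interval is wide, which illustrates the point in the text: design choices such as sample size and assignment determine how precisely an effect can be measured.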

Operationalizing AI Audits

🛠️ Policy engagement is crucial for turning new evaluation methods into working AI audits, particularly on questions of data access and standard setting.
✅ Model cards and other documentation practices are becoming essential for clinical decision support tools and government AI use, ensuring transparency and accountability.
🌐 Multi-disciplinary collaboration and AI Safety Institutes are vital for developing the necessary technical and institutional infrastructure for safe and widespread AI adoption.
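The model card practice mentioned above can be sketched as a small structured record plus a completeness check. The fields below are a simplified subset of what model cards typically report (intended use, evaluation data, disaggregated metrics, caveats). The class, field names, and example values are assumptions for illustration, not a schema required by any regulator or described in the talk.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ModelCard:
    """Hypothetical minimal model card; real cards carry more sections."""
    model_name: str
    intended_use: str
    evaluation_data: str
    metrics_by_subgroup: dict = field(default_factory=dict)
    caveats: str = ""


def missing_sections(card):
    """List the fields left empty, in declaration order."""
    return [name for name, value in asdict(card).items() if not value]


# Example card with placeholder values and two sections still unfilled.
draft = ModelCard(
    model_name="example-clinical-model",
    intended_use="Decision support only; not for autonomous diagnosis.",
    evaluation_data="Held-out patient notes (placeholder description).",
)

print(missing_sections(draft))
```

A check like this could run before release so that a card with empty subgroup metrics or caveats is flagged rather than published.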

Topics (15 themes)

What’s Discussed

Algorithmic auditing, Machine learning evaluation, Benchmarking paradigm, Facial recognition systems, Large Language Models (LLMs), Consumer protection, Postmarket surveillance, AI incidents, Policy evaluation, Construct validity, Experiment design, Judge responsiveness, Causal inference, Model cards, AI Safety Institutes
