
Yoshua Bengio - Disentangling Agency & Predictive Power Without Solving ELK [Alignment Workshop]

[HPP] Yoshua Bengio · February 18, 2026 · 30 min

Building Trustworthy AI

  • 💡 The Scientist AI research program aims to create highly trustworthy machines that deeply understand the world.
  • 🎯 This approach seeks to disentangle an AI's understanding of the world from its goals and desires, which are intertwined in human intelligence.
  • 🧠 The goal is to develop AI predictors that approximate Bayesian posteriors without developing agency or preferences for specific outcomes.

Addressing Implicit Goals & Agency

  • ⚠️ Current AI systems, particularly those using Reinforcement Learning or pre-training, can develop implicit, unchosen goals that might be harmful.
  • πŸ”‘ Agency is defined as the robust achievement of goals despite randomness or adversaries, and dangerous agency involves intentions to harm.
  • πŸ“Š The speaker claims that agentic predictors occupy an exponentially small volume in the space of all possible predictors, making their accidental emergence during training unlikely.

The Truthification Pipeline

  • 🛠️ A "truthification pipeline" transforms training data by using distinct syntax for verified facts (e.g., "X is true") versus claims made by people (e.g., "someone wrote X").
  • ✅ This separation allows the AI to understand what is trustworthy and enables users to query what the AI genuinely believes, rather than what a human might say.
  • 🔍 This mechanism is crucial for ensuring the AI is not agentic and for distinguishing between factual reality and communication acts.
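
The data transformation described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `Record` type, the `truthify` function, and the exact output templates are assumptions modeled on the two example phrasings in the summary ("X is true" and "someone wrote X").

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One training statement plus hypothetical provenance metadata."""
    text: str
    verified: bool                 # True only if independently verified
    source: Optional[str] = None   # who made the claim, if known


def truthify(record: Record) -> str:
    """Render a record with distinct syntax for facts versus claims."""
    if record.verified:
        # Verified fact: asserted directly as true.
        return f'"{record.text}" is true'
    # Unverified: recorded only as a communication act by some speaker.
    speaker = record.source or "someone"
    return f'{speaker} wrote "{record.text}"'


corpus = [
    Record("water is made of hydrogen and oxygen", verified=True),
    Record("this supplement cures everything", verified=False),
    Record("the meeting is at noon", verified=False, source="Alice"),
]
transformed = [truthify(r) for r in corpus]
for line in transformed:
    print(line)
```

Under this scheme a model trained on the transformed text can learn that "someone wrote X" does not imply X, while a query phrased in the "is true" syntax asks for the model's own belief.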

Epistemic Correctness & ELK

  • 💡 When the Scientist AI issues a high-confidence claim, it is designed to be "epistemically correct," meaning it does not lie, even if it might withhold some knowledge.
  • 🎯 This provides sufficient safety guarantees, even if it doesn't fully solve the challenge of Eliciting Latent Knowledge (ELK).
  • 📈 Uncertain statements, such as future harm, can be reformulated as certain statements about probabilities (e.g., "the probability of harm is X"), making them amenable to this trustworthy prediction.
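
The reformulation in the last bullet can be illustrated with a small worked example. Whether harm occurs is uncertain, but the statement "the probability of harm is X" has a definite value once the beliefs are fixed, obtained by averaging over hypotheses: P(harm) = Σ_h P(h | data) · P(harm | h). The hypotheses, the numbers, and the function names below are assumptions for illustration only.

```python
from fractions import Fraction


def harm_probability(posterior, harm_given_hypothesis):
    """Marginalize over hypotheses: sum of P(h | data) * P(harm | h)."""
    return sum(posterior[h] * harm_given_hypothesis[h] for h in posterior)


def as_probability_claim(event, probability):
    """Turn an uncertain event into a definite claim about its probability."""
    return f"the probability of {event} is {probability}"


# Hypothetical beliefs about a system after seeing some data.
posterior = {"sound_design": Fraction(3, 4), "flawed_design": Fraction(1, 4)}
harm_given_hypothesis = {
    "sound_design": Fraction(1, 100),
    "flawed_design": Fraction(2, 5),
}

p_harm = harm_probability(posterior, harm_given_hypothesis)
claim = as_probability_claim("harm", p_harm)
print(claim)
```

The event "harm" remains uncertain, but the resulting claim is a statement that can be asserted or checked as true or false.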

Training & Future Implications

  • 🚫 Training must ensure no two-way interaction between the AI and the world (e.g., no online Reinforcement Learning) to prevent the AI from influencing its environment.
  • πŸš€ This research program is the foundation for LawZero, a new nonprofit dedicated to developing this type of AI.
  • 🧩 The Scientist AI serves as a critical building block or "guardrail" for constructing broader, safe agentic systems, with humans ultimately defining ethical boundaries.
Topics · 13 themes

What's Discussed

Scientist AI, AI Agency, Predictive Power, Eliciting Latent Knowledge (ELK), Bayesian Posterior, Truthification Pipeline, Latent Variables, Epistemic Correctness, Implicit Goals, Reinforcement Learning, LawZero, Superintelligence, Compositional Generalization