Benchmarking Model Performance on Pandemic-Threat Viruses with Sarah Gurev

[HPP] Debora Marks · October 29, 2025 · 1h 1min

Challenges in Viral Prediction

  • 🦠 Viruses pose a significant threat due to rapid evolution and adaptability, making accurate mutation effect prediction crucial.
  • 🔬 While machine learning and sequence data offer promise, viruses present unique biological and informational constraints that challenge existing models.

Evolutionary & Alignment-Based Models

  • 💡 Alignment-based models learn from evolutionary sequences (e.g., pre-2020 coronaviruses) to predict mutation effects, considering site-independent, pairwise, and higher-order interactions.
  • 🧬 The EVEscape model combines evolutionary sequences with structural and biophysical information to predict antibody escape mutations, demonstrating its ability to forecast future SARS-CoV-2 variants more effectively than some experimental methods.
  • 💉 These models can aid in vaccine design by predicting likely future variants, helping to create vaccines that offer long-term protection against evolving viruses like SARS-CoV-2 and influenza.
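
As a rough illustration of the simplest case mentioned above, the site-independent model, the sketch below scores a mutation as the log-odds of the mutant residue versus the wild-type residue in one alignment column. The toy alignment, the pseudocount, and the function names are assumptions made for this example; it is not the implementation used in the talk, and it ignores sequence reweighting and the pairwise or higher-order terms.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def site_frequencies(alignment, pseudocount=1.0):
    """Per-column amino acid frequencies with additive smoothing; gaps are ignored."""
    length = len(alignment[0])
    freqs = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        freqs.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return freqs

def mutation_score(freqs, wild_type, position, mutant):
    """Log-odds of the mutant vs. the wild-type residue at a 0-based position."""
    wt_residue = wild_type[position]
    return math.log(freqs[position][mutant]) - math.log(freqs[position][wt_residue])

# Hypothetical four-sequence alignment, for illustration only.
toy_alignment = ["MKVL", "MKIL", "MRVL", "MKVL"]
freqs = site_frequencies(toy_alignment)
seen_in_evolution = mutation_score(freqs, "MKVL", 2, "I")   # I occurs at this site
never_observed = mutation_score(freqs, "MKVL", 2, "W")      # W never occurs here
```

A substitution that already appears in related sequences scores higher (less deleterious) than one that never appears, which is the basic signal that richer alignment-based models build on.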

Protein Language Models & Data Constraints

  • 🧠 Protein language models (PLMs), while state-of-the-art for many protein tasks, often perform poorly for viruses due to underrepresentation of viral sequences in training datasets like UniRef.
  • 📊 Clustering methods (e.g., UniRef90, UniRef50) disproportionately reduce the number of viral sequences, leading to less effective training data for viral prediction.
  • 📈 For viruses, larger PLMs (more parameters) continue to improve performance, suggesting they compensate for the lack of specific training data by generalizing from non-viral information.
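
The clustering point can be pictured with a toy greedy clustering by pairwise identity. This is only a sketch: it is not the procedure UniRef uses, the sequences are made up, and the idea that viral entries are often many near-identical variants of one protein is an assumption added here to show why identity thresholds could collapse them more than other families.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold):
    """Keep a sequence as a new representative only if its identity to every
    representative chosen so far is below the threshold."""
    reps = []
    for s in seqs:
        if all(identity(s, r) < threshold for r in reps):
            reps.append(s)
    return reps

# Hypothetical data: close variants of one protein vs. unrelated sequences.
near_identical_variants = ["MKVLAAGTWY", "MKVLAAGTWF", "MKILAAGTWY", "MKVLSAGTWY"]
unrelated_sequences = ["MKVLAAGTWY", "QRSTPLNDEH", "GGHWCCFYIV", "DENPTSRKQA"]

variants_kept = len(greedy_cluster(near_identical_variants, 0.9))
unrelated_kept = len(greedy_cluster(unrelated_sequences, 0.9))
```

In this toy setting the four variants collapse to a single representative at a 90% threshold, while the four unrelated sequences all survive.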

Improving Model Performance

  • 🏗️ The EVEREST framework (Evolutionary Variant Effect prediction with Reliability ESTimation) was introduced to systematically assess model performance and reliability for viral mutational fitness prediction.
  • 🧬 Structural information can significantly enhance PLM performance for viruses, particularly for stability assays, by allowing models to learn from remote homologs with high structural similarity despite low sequence identity.
  • Alignment relevance, focusing on sequences with high identity to the query, is a more effective strategy for selecting alignments than simply using deeper alignments, which can dilute signal with irrelevant sequences.
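
A minimal sketch of the alignment-relevance idea: keep only aligned sequences whose identity to the query meets a cutoff, rather than keeping everything a deep search returns. The cutoff values, the toy alignment, and the function names are assumptions for illustration; the talk summary does not specify thresholds.

```python
def identity_to_query(query, seq):
    """Fraction of non-gap query positions at which seq matches the query."""
    pairs = [(q, s) for q, s in zip(query, seq) if q != "-"]
    return sum(q == s for q, s in pairs) / len(pairs)

def filter_alignment(query, alignment, min_identity):
    """Return the aligned sequences with identity to the query >= min_identity."""
    return [s for s in alignment if identity_to_query(query, s) >= min_identity]

# Hypothetical aligned sequences; the first row is the query itself.
query = "MKVLAAGT"
alignment = ["MKVLAAGT", "MKILAAGT", "MRILSAGT", "QRSTPLND"]

relevant = filter_alignment(query, alignment, min_identity=0.6)
strict = filter_alignment(query, alignment, min_identity=0.8)
```

Raising the cutoff trades alignment depth for relevance, so in practice the threshold would need to be tuned per protein.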

Reliability and Applications

  • 🔑 Reliability metrics, such as sequence diversity for alignment models and pseudo-perplexity for PLMs, can predict model performance and indicate when a model's predictions for a new virus are trustworthy.
  • ⚠️ Evaluation of 40 WHO-prioritized pandemic-threat viruses revealed that current models fail to reliably predict mutation effects for over half of them, highlighting critical gaps.
  • 🚀 The findings offer actionable recommendations for improving viral mutation effect prediction, supporting pandemic preparedness, vaccine design, and objective assessment of biosecurity risks.
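
For reference, pseudo-perplexity is commonly computed as the exponential of the average negative log-probability a masked language model assigns to each true residue when that position is masked. The sketch below assumes those per-position log-probabilities have already been obtained from some PLM; the input values are invented, and no specific model API is implied.

```python
import math

def pseudo_perplexity(masked_log_probs):
    """exp of the mean negative log-probability of the true residue,
    taken over positions that were masked one at a time."""
    return math.exp(-sum(masked_log_probs) / len(masked_log_probs))

# Hypothetical per-position log-probabilities for a 5-residue sequence.
confident = [math.log(0.9)] * 5          # model strongly expects each residue
uninformed = [math.log(1 / 20)] * 5      # uniform guess over 20 amino acids

ppl_confident = pseudo_perplexity(confident)
ppl_uninformed = pseudo_perplexity(uninformed)
```

A value near 1 means the model finds the sequence unsurprising, while a value near 20 matches uniform guessing over the amino acid alphabet.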

What’s Discussed

Viral Evolution, Machine Learning, Deep Mutational Scanning, Protein Language Models, Antibody Escape, Vaccine Design, SARS-CoV-2, Influenza, Sequence Alignment, Structural Information, Reliability Estimation, Pandemic Preparedness, Biosecurity Risk, WHO Priority Viruses, Variational Autoencoders