[CS] Kyle Hippe MS Presentation/Jul 7, 2025
via cs
cs at mailman.cs.uchicago.edu
Mon Jun 30 09:35:13 CDT 2025
This is an announcement of Kyle Hippe's MS Presentation
===============================================
Candidate: Kyle Hippe
Date: Monday, July 07, 2025
Time: 1 pm CDT
Remote Location: https://uchicago.zoom.us/j/94061576096?pwd=AOcgJ0UPnvsryUr4VL6ahbKcGPexC1.1
Location: JCL 298
Title: Bias in the Branches: A Benchmark on Phylogenetic Representation Gaps in Language Models
Abstract: Protein and genome language models are trained on corpora that are highly taxonomically imbalanced, yet benchmarks seldom quantify how that imbalance affects the representations they learn. To address this, we introduce a novel benchmark with both amino acid and nucleotide-level sequences that systematically pairs representatives from common and rare taxonomic families across all domains of life. By maintaining identical label distributions between paired sets, we isolate taxonomic effects and test three biological tasks: homology detection, function annotation, and structural fold classification. Our analysis reveals significant bias patterns within and across domains, with performance differences of up to 24% between common and rare taxa that vary systematically by task type. We demonstrate directional asymmetry in cross-taxonomic transfer: models trained on rare taxa generalize better to common taxa for homology and function, while structure classification shows the reverse pattern. These results indicate that model architecture and dataset curation are more influential than parameter count for taxonomic generalization. This work provides both metrics and insights for developing more inclusive and generalizable foundation models for computational biology.
Advisor: Rick Stevens
Committee Members: Rick Stevens, Ian Foster, Arvind Ramanathan, and Aly Khan