Revealing bias in antibody language models through systematic training data processing with OAS-explore
Abstract
Antibody language models (LMs) trained on immune receptor sequences have been applied to diverse immunological tasks such as humanization and prediction of antigen specificity. While promising, these models are often trained on datasets with limited donor diversity, raising concerns that biases in the training data may hinder their generalizability. To quantify the impact of biased training data, we introduce an open-source processing pipeline for the 2.4 billion unpaired antibody sequences in the Observed Antibody Space (OAS) database, enabling customizable filtering and balanced sampling by donor, species, chain type and other metadata. Analysis of OAS revealed that 13 individuals contribute over 70% of human antibody sequences. Using our pipeline, we trained 17 RoBERTa antibody LMs on datasets of different compositions. Models failed to generalize across chain types and showed limited transfer between human and mouse repertoires. Both individual- and batch-specific effects influenced model performance, and expanding donor diversity did not improve generalization to unseen individuals from unseen publications.
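The abstract describes a pipeline supporting balanced sampling of OAS by donor, species, chain type and other metadata. A minimal sketch of such per-group capping with pandas is shown below; the column names (`subject`, `sequence`) and the function are illustrative, not the pipeline's actual API.

```python
import pandas as pd

def balanced_sample(df, group_cols, n_per_group, seed=0):
    """Draw at most n_per_group rows from each metadata group,
    so that no single donor (or species, chain type, ...) dominates."""
    return (
        df.groupby(group_cols, group_keys=False)
          .apply(lambda g: g.sample(min(len(g), n_per_group), random_state=seed))
          .reset_index(drop=True)
    )

# Toy metadata table; real OAS data units carry richer fields
# (species, chain, subject, study, ...) — names here are illustrative.
meta = pd.DataFrame({
    "sequence": [f"seq{i}" for i in range(6)],
    "subject":  ["HIP-1", "HIP-1", "HIP-1", "HIP-2", "HIP-2", "HIP-3"],
})

sample = balanced_sample(meta, ["subject"], n_per_group=1)
# One sequence per donor, regardless of how many each donor contributed.
```

Capping per group rather than sampling uniformly over sequences is what prevents a handful of deeply sequenced individuals from dominating the training set.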
Results
Figure: Comparison of models trained on mouse, human and mixed datasets. (a) MLM loss on test sets of 100k sequences for models trained with varying species and chain composition; y-axis: training data origin; x-axis: test data origin. (b) Average AA-likelihoods for sequences representing 5% of a mouse bone marrow repertoire from [16]; sequences are colored by chain type; x-axis specifies the model's training data composition.
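The average AA-likelihood reported in panel (b) can be understood as the mean probability a model assigns to each observed residue. A small numpy sketch, assuming per-position amino-acid distributions have already been obtained (e.g. by masking one position at a time and softmaxing the MLM logits); the function name and amino-acid ordering are illustrative.

```python
import numpy as np

# 20 canonical amino acids; index into the model's output distribution.
AA = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {a: i for i, a in enumerate(AA)}

def avg_aa_likelihood(sequence, probs):
    """Mean model probability of the observed residue at each position.
    `probs` is a (len(sequence), 20) array of per-position amino-acid
    distributions."""
    idx = [AA_IDX[a] for a in sequence]
    return float(np.mean(probs[np.arange(len(sequence)), idx]))

# Toy example: a uniform model assigns every residue probability 1/20.
seq = "QVQLV"
uniform = np.full((len(seq), 20), 1 / 20)
avg_aa_likelihood(seq, uniform)  # ≈ 0.05
```

A model well matched to a repertoire assigns higher average likelihoods to its sequences, which is why the metric separates in-distribution from out-of-distribution chain types and species.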
Figure: Impact of training data diversity on model performance. (a) MLM loss of models trained on sequences from 1 individual (HIP-1, HIP-2, HIP-3), 3 individuals (Soto-All), or 630 individuals (OAS-wo-Soto), evaluated on test sets corresponding to each training configuration. (b) Average MLM loss on sequences from held-out individuals; Subject-237, -1009, -1212, and -1848 are from vaccine studies by the same research group. (c) Average humanization scores across 25 antibodies.
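Evaluating on held-out individuals, as in panel (b), requires splitting at the donor level so that no individual's sequences appear in both train and test. A minimal sketch of such a grouped split; the record schema and function are illustrative.

```python
import random

def split_by_donor(records, test_donor_frac=0.2, seed=0):
    """Split sequence records so each donor's sequences land entirely in
    train or test, preventing individual-specific leakage across splits."""
    donors = sorted({r["subject"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(donors)
    n_test = max(1, int(len(donors) * test_donor_frac))
    test_donors = set(donors[:n_test])
    train = [r for r in records if r["subject"] not in test_donors]
    test = [r for r in records if r["subject"] in test_donors]
    return train, test

# Toy records: three donors with unequal sequence counts.
records = [{"subject": s, "sequence": f"seq{i}"}
           for i, s in enumerate(["HIP-1"] * 3 + ["HIP-2"] * 2 + ["HIP-3"])]
train, test = split_by_donor(records)
```

A sequence-level random split would leak individual-specific repertoire features between train and test, masking exactly the individual- and batch-specific effects the paper measures.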
BibTeX
@article{OAS-explore,
  title={Revealing bias in antibody language models through systematic training data processing with OAS-explore},
  author={Wiona Sophie Glänzer and Sai T. Reddy and Alexander Yermanos},
  journal={2nd Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences at NeurIPS 2025},
  year={2025},
  url={https://openreview.net/pdf?id=JkKS5vvWLd}
}