Portfolio

Hi! I'm Wiona, a computational biologist researching language models for immunology.

GitHub Hugging Face LinkedIn Startup Game

Antibody language models (LMs) trained on immune receptor sequences have been applied to diverse immunological tasks such as humanization and prediction of antigen specificity. While promising, these models are often trained on datasets with limited donor diversity, raising concerns that biases in the training data may hinder their generalizability. To quantify the impact of biased training data, we introduce an open-source processing pipeline for the 2.4 billion unpaired antibody sequences in the Observed Antibody Space (OAS) database, enabling customizable filtering and balanced sampling by donor, species, chain type and other metadata. Analysis of OAS revealed that 13 individuals contribute over 70% of human antibody sequences. Using our pipeline, we trained 17 RoBERTa antibody LMs on datasets of different compositions. Models failed to generalize across chain types and showed limited transfer between human and mouse repertoires. Both individual- and batch-specific effects influenced model performance, and expanding donor diversity did not improve generalization to unseen individuals from unseen publications.

Learn more

Antibodies are one of the most important proteins of the adaptive immune system. They specifically recognize and bind to antigens to block their function, remove them from the bloodstream, or mark pathogens for immune cells. To recognize a variety of antigens, antibodies exhibit an immense genetic diversity of over 10¹³ unique sequences in humans and evolve within each organism in response to antigen exposure. Understanding the mechanisms behind the generation of antibody sequences is fundamental to illuminating the function of our immune system and leveraging it for the treatment of disease.

For my Master's thesis I trained antibody language models on data from the Observed Antibody Space (OAS) database and compared the performance when training data from different species is used.

Glioblastoma is a hard-to-treat malignant brain tumor characterized by high cellular diversity. Previous research has identified expression meta-modules that define cellular states within Glioblastoma and cells that appear to be in transition between these states. In this project, I tested whether trajectory inference methods, developed to analyze cell differentiation processes, could be utilized to identify trajectories connecting these cell states. While some reservations remain about the validity of this approach, the tested methods successfully identified the expected trajectories and revealed gene expression changes along them. These findings suggest that trajectory inference techniques might offer valuable insights into Glioblastoma’s cellular heterogeneity.

Record-seq is a recently developed technology that enables bacterial cells to record their own transcriptional activity into DNA and preserve it over time. Record-seq readout consists of deep sequencing and data interpretation using a dedicated computational analysis pipeline. In this project made improvements to and conducted experiments with the Record-seq data analysis pipeline. First, I worked on the primary analysis pipeline, augmenting its capability to detect recorded sequences from reads using updated approximate string matching methods. Next, I tested several methods for normalizing counts distributions during secondary analysis to apply differential expression analysis tools developed for RNA sequencing. Two new normalization methods, namely plasmid normalization, and percentile normalization were implemented.