Challenges in predicting protein-protein interactions of understudied viruses: Arenavirus-human interactions.
Harshita Sahni, Sarah Michelle Crotzer, Juston Moore, Steven S Branda, Trilce Estrada, S Gnanakaran
Abstract
Understanding protein-protein interactions (PPIs) between viruses and host organisms is crucial for uncovering infection mechanisms and identifying potential therapeutic targets. Generalizing PPI predictive models to understudied viruses remains a significant challenge. In this work, we use arenavirus-human PPIs to illustrate the difficulties of model generalization, which are compounded by a lack of both positive and negative data. We employ a transfer learning approach to investigate arenavirus-human PPIs, using models trained on better-studied virus-human and human-human PPIs. Additionally, we curate and assess four types of negative sampling datasets to evaluate their impact on model performance. Although the overall accuracies (93-99%) and AUPRC scores (0.8-0.9) appear promising, further analysis indicates that these performance metrics can be misleading due to data leakage, data bias, and overfitting, especially for under-represented viral proteins. We reveal these gaps and assess the impact of data imbalance using standard k-fold cross-validation and independent blind testing with a balanced dataset, with accuracy dropping below 50%. We propose a viral protein-specific evaluation framework that categorizes viral proteins into majority and minority classes based on their representation in the dataset, enabling comparison of model performance across these groups using balanced accuracies. This framework offers a more robust evaluation of model generalizability, addressing biases inherent in standard evaluation techniques and paving the way for more reliable PPI prediction models for understudied viruses.
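The viral protein-specific evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the count threshold used to separate majority from minority viral proteins, and the toy counts, labels, and predictions below are all assumptions made for illustration. Balanced accuracy is computed here as the mean of per-class recall.

```python
from collections import Counter, defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def group_balanced_accuracy(viral_ids, y_true, y_pred, train_counts, threshold):
    """Score test pairs separately for majority and minority viral proteins.

    A viral protein is treated as 'majority' when it appears at least
    `threshold` times in the training data (a hypothetical rule; the
    source does not state the cutoff it uses).
    """
    groups = defaultdict(lambda: ([], []))
    for vid, yt, yp in zip(viral_ids, y_true, y_pred):
        name = "majority" if train_counts.get(vid, 0) >= threshold else "minority"
        groups[name][0].append(yt)
        groups[name][1].append(yp)
    return {name: balanced_accuracy(t, p) for name, (t, p) in groups.items()}


# Toy data, invented for illustration only (not taken from the study).
train_counts = Counter({"NP": 50, "Z": 40, "GPC": 3, "L": 2})
viral_ids = ["NP", "NP", "Z", "Z", "GPC", "GPC", "L", "L"]
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]  # the model misses every minority positive

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
result = group_balanced_accuracy(viral_ids, y_true, y_pred, train_counts, threshold=10)
print(overall_accuracy, result)
```

In this toy case the pooled accuracy is 0.75, while the minority group scores a balanced accuracy of only 0.5, which is the kind of gap a single pooled metric can hide.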