Accurate diagnosis of hemoglobinopathies with machine learning based on high-throughput proteomics.
Shaodong Wei, Annelaura Bach Nielsen, Jens Helby, Lylia Drici, Christine Rasmussen, Juanjuan Wang, Matthias Mann, Jesper Petersen, Nicolai J Wewer Albrechtsen, Andreas Glenthøj
Abstract
Hemoglobinopathies, such as sickle cell disease and thalassemias, impose a substantial global burden, particularly in endemic regions. Current diagnostic methods, such as high-performance liquid chromatography (HPLC), capillary electrophoresis, and genetic testing, can be time-consuming, expensive, or unable to detect all variants. This study introduces a novel diagnostic framework that combines high-throughput proteomics with machine learning to address these challenges. We processed red blood cells, whole blood, and plasma samples from 82 individuals (development cohort) and 45 individuals (validation cohort) with structural hemoglobin variants (hemoglobin S, hemoglobin C, hemoglobin D, and hemoglobin E) or β-thalassemia trait, as confirmed by standard clinical testing. Tryptic peptides were analyzed using data-independent acquisition mass spectrometry, and random forest classifiers were trained to identify structural variants or β-thalassemia trait. Model performance was evaluated across 100 Monte Carlo cross-validations. For structural variants, the classifier achieved an area under the receiver-operating characteristic curve (AUC) of 1.000 and a prediction accuracy of 99.9% in the validation cohort when benchmarked against standard testing with HPLC and Sanger sequencing (the gold standard). For β-thalassemia trait, the mean AUC was 1.000 and the prediction accuracy was 96.9% in the validation cohort; a single peptide alone yielded 92% accuracy in a simple decision tree. This high-throughput proteomics approach offers a rapid, scalable, and potentially cost-effective alternative to existing diagnostic workflows, requiring minimal sample preparation and reducing manual interpretation. By combining peptide-level data with machine learning, it enables precise classification of hemoglobinopathies and demonstrates a compelling path toward routine clinical evaluation of hereditary anemias.
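The evaluation scheme named in the abstract, repeated random train/test splits (Monte Carlo cross-validation) around a classifier, can be illustrated with a minimal sketch. This is not the authors' pipeline: the study trained random forests on many peptides, whereas the sketch below substitutes a single-threshold decision tree (a "stump") on one feature, loosely echoing the single-peptide result. The data are synthetic, and the function names (`fit_stump`, `predict_stump`, `monte_carlo_cv`, `make_synthetic_cohort`), the 30% test fraction, and the intensity values are all assumptions made for illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 decision tree on one feature.

    Returns (threshold, sign): sign=1 predicts class 1 when x > threshold,
    sign=-1 predicts class 1 when x <= threshold.
    """
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best = None
    for t in candidates:
        for sign in (1, -1):
            preds = [int((x > t) == (sign == 1)) for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    return best[1], best[2]


def predict_stump(model, xs):
    """Apply a fitted (threshold, sign) stump to a list of feature values."""
    t, sign = model
    return [int((x > t) == (sign == 1)) for x in xs]


def monte_carlo_cv(xs, ys, n_splits=100, test_fraction=0.3, seed=0):
    """Repeat random train/test splits and return the test accuracy of each."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(1, int(len(xs) * test_fraction))
    accuracies = []
    for _ in range(n_splits):
        rng.shuffle(indices)  # a fresh random partition for every repetition
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        model = fit_stump([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = predict_stump(model, [xs[i] for i in test_idx])
        correct = sum(p == ys[i] for p, i in zip(preds, test_idx))
        accuracies.append(correct / n_test)
    return accuracies


def make_synthetic_cohort(n=82, seed=1):
    """Synthetic stand-in data (NOT from the study): one hypothetical
    log-intensity per sample, lower on average for label 1."""
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        label = i % 2
        centre = 8.0 if label else 10.0
        xs.append(rng.gauss(centre, 0.8))
        ys.append(label)
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_synthetic_cohort()
    accs = monte_carlo_cv(xs, ys, n_splits=100)
    print(f"mean accuracy over {len(accs)} splits: {sum(accs) / len(accs):.3f}")
```

Because each repetition draws its own random split, the spread of the per-split accuracies gives a rough sense of how sensitive the estimate is to the choice of training samples. Swapping the stump for a random forest over a full peptide matrix would keep the same outer loop.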