Greater value add from electronic health records than polygenic risk scores for predicting myocardial infarction in machine learning.
Monica Isgut, Andrew Hornback, Han Bao, Yiting Sun, Katherine Choi, Blake J Anderson, Shriprasad R Deshpande, Anthony C Chang, May D Wang
Abstract
BACKGROUND: Polygenic risk scores (PRSs) are increasingly being used to predict disease risk from genetic data. While promising in research, their clinical utility remains uncertain, especially when they are combined with non-genetic (NG) data such as lab results, physical measurements, and diagnostic history. Myocardial infarction (MI), a leading cause of morbidity and mortality, is a key use case for assessing the incremental value of PRSs in risk models.
METHODS: Using UK Biobank data, we evaluated the added value of PRSs for 10-year MI risk prediction. We trained models with NG data alone and in combination with PRSs, varying model complexity and the NG feature space. Two modeling frameworks were used: logistic regression and a neural network. NG data were defined using two feature sets: NG1, which included established MI risk factors from structured fields; and NG2, a high-dimensional dataset derived from millions of diagnostic codes across five linked UK Biobank electronic health record (EHR) datasets, combined with NG1 features. NG2 was generated using a deep representation learning approach that produced low-dimensional embeddings capturing latent medical concepts and disease co-occurrence patterns. Each model was trained with and without PRSs and evaluated using metrics such as the area under the receiver operating characteristic (ROC) curve (AUC).
RESULTS: PRSs add minimal predictive value when used alone. In contrast, diagnostic data from EHRs significantly improve performance. The best results are achieved using a multimodal neural network combining NG1, NG2, and PRSs.
CONCLUSIONS: PRSs provide limited standalone utility for MI prediction compared to detailed diagnostic data. Their clinical value likely lies in integration with EHR-based models. Future work should focus on multimodal approaches that contextualize PRS information within broader clinical data.
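The evaluation design described in METHODS, training the same model with and without the PRS and comparing AUC, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic (not UK Biobank), the two "NG" columns and the effect sizes are hypothetical, and the hand-written logistic regression stands in for whatever implementation the authors used. It shows only the shape of the comparison, not the study's results.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney U) identity, with average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average 1-based rank of the tied group
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def fit_logistic(X, y, lr=0.5, epochs=200):
    """Plain batch gradient descent on the logistic loss (illustrative only)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_data(n, rng):
    """Synthetic cohort: two hypothetical NG features and one PRS column."""
    X, y = [], []
    for _ in range(n):
        ng_a, ng_b, prs = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        # Hypothetical effect sizes; the PRS is given a deliberately small weight.
        p = sigmoid(-2.0 + 1.0 * ng_a + 0.8 * ng_b + 0.3 * prs)
        X.append([ng_a, ng_b, prs])
        y.append(1 if rng.random() < p else 0)
    return X, y


def compare_models(n=2000, seed=0):
    """Return (AUC with NG features only, AUC with NG features plus PRS)."""
    rng = random.Random(seed)
    X, y = make_data(n, rng)
    split = int(0.7 * n)
    results = []
    for cols in ([0, 1], [0, 1, 2]):  # NG only, then NG + PRS
        X_train = [[row[c] for c in cols] for row in X[:split]]
        X_test = [[row[c] for c in cols] for row in X[split:]]
        w, b = fit_logistic(X_train, y[:split])
        scores = [b + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X_test]
        results.append(auc(scores, y[split:]))
    return tuple(results)


auc_ng, auc_full = compare_models()
print(f"NG only: {auc_ng:.3f}  NG + PRS: {auc_full:.3f}  delta: {auc_full - auc_ng:+.3f}")
```

The quantity of interest is the difference between the two AUCs on held-out data. On a real cohort that difference would need a confidence interval (for example from bootstrap resampling) before being read as evidence of added value.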