BRCAGenie: A machine learning-driven 43-gene polygenic risk score model for precision prediction of breast cancer survival.
Jia Wei Lee, Ashley Jun Wei Lim, Chengyi Wang, Jun Hao Neo, Lee Jin Lim, Samuel S Chong, Caroline G Lee
Abstract
Open Access
BACKGROUND: Breast cancer is one of the most prevalent malignancies globally and imposes a substantial disease burden. Its inherent heterogeneity complicates prognosis and treatment, underscoring the need for accurate survival prediction models to guide personalised clinical decision-making. Existing gene expression-based survival prediction models for breast cancer have largely focused on predefined subsets of genes, such as biologically relevant, differentially expressed or protein-coding genes, potentially overlooking critical but uncharacterised genomic contributors. To look beyond well-characterised biological pathways and differential gene expression, this study adopts an unbiased, hypothesis-free approach: machine learning-based survival analysis is conducted on the entire transcriptome, rather than on restricted gene subsets, to identify a novel gene signature and develop a robust and accurate risk score model for breast cancer survival prediction.
METHODS: Clinical and transcriptome data from The Cancer Genome Atlas Breast Cancer (TCGA-BRCA) cohort were analysed. Feature selection was conducted using univariate Cox regression, LASSO Cox regression and stepwise selection. The selected gene signature was then used to construct a risk score model, whose performance was evaluated using time-dependent AUC (AUC(t)), Kaplan-Meier curves (with the log-rank test), Harrell's C-index and calibration plots. The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) and Gene Expression Omnibus (GSE96058) cohorts served as external validation sets.
RESULTS: We present Breast Cancer Genie (BRCAGenie), a 43-gene polygenic risk score model that predicts breast cancer survival reasonably well from gene expression, achieving three-year and five-year AUC(t) scores of 0.736 and 0.751, respectively, in the unseen TCGA test set, with significant separation of the Kaplan-Meier curves. When age at diagnosis was included, the three-year and five-year AUC(t) scores of the combined BRCAGenie model improved to 0.812 and 0.814, respectively, in the unseen test set. The identified signature shows promising biological relevance: many of its genes are supported by existing cancer literature, while others represent potentially novel biomarkers. To facilitate clinical application, the BRCAGenie calculator was developed: https://jiaweilee.shinyapps.io/breast_cancer_survival_43/ .
CONCLUSIONS: An accurate polygenic risk score model for breast cancer survival prediction enables robust prognostic stratification for personalised clinical decision-making. Through our unbiased, hypothesis-free approach, we have identified novel breast cancer prognostic biomarkers with the potential to be clinically validated and translated into relevant diagnostics and therapeutics.
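For readers unfamiliar with the evaluation terms in the METHODS, the following is a minimal Python sketch of how a Cox-style polygenic risk score and Harrell's C-index are commonly computed. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the score formula, so the usual linear form (sum of coefficient times expression) is assumed, and the gene names, coefficients, patient data and median cutoff are all hypothetical. Pairs with tied survival times are simply skipped, which is a simplification.

```python
from itertools import combinations
from statistics import median


def risk_score(expression, coefficients):
    """Linear risk score: sum of coefficient * expression over signature genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def harrell_c_index(times, events, scores):
    """Share of comparable patient pairs in which the patient with the
    shorter observed survival also has the higher risk score.
    A pair is comparable only if the shorter time is an observed event."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore tied survival times
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the order is unknown
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0
        elif scores[a] == scores[b]:
            concordant += 0.5  # tied scores count as half concordant
    return concordant / comparable if comparable else float("nan")


# Hypothetical 3-gene signature (the real model uses 43 genes).
coefficients = {"GENE_A": 0.8, "GENE_B": -0.5, "GENE_C": 0.3}

# Hypothetical patients: expression values, follow-up in months, event flag.
patients = [
    ({"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 0.5}, 12, True),
    ({"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": 1.0}, 60, False),
    ({"GENE_A": 1.5, "GENE_B": 0.5, "GENE_C": 2.0}, 20, True),
    ({"GENE_A": 0.5, "GENE_B": 1.5, "GENE_C": 0.0}, 48, True),
]

scores = [risk_score(expr, coefficients) for expr, _, _ in patients]
times = [t for _, t, _ in patients]
events = [e for _, _, e in patients]

# Assumed median cutoff for splitting patients into risk groups,
# as would be needed before plotting Kaplan-Meier curves.
cutoff = median(scores)
groups = ["high" if s > cutoff else "low" for s in scores]

c_index = harrell_c_index(times, events, scores)
print(groups, round(c_index, 3))
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking of survival times by risk score. The time-dependent AUC(t) reported in the abstract is a related but distinct measure that assesses discrimination at a fixed time horizon, such as three or five years.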