Precision in prediction: tailoring machine learning models for breast cancer missense variants pathogenicity prediction.
Rahaf M Ahmad, Noura AlDhaheri, Mohd Saberi Mohamad, Bassam R Ali
Abstract
Accurate classification of genetic variants is critical for precision medicine, particularly for hereditary diseases such as breast cancer. However, widely used tools like MutPred and Combined Annotation Dependent Depletion (CADD) offer genome-wide pathogenicity predictions that often overlook disease-specific variant behavior, limiting their clinical utility. This study addresses that gap by training and benchmarking nine machine learning (ML) models, including ensemble and baseline classifiers, on a breast cancer gene-specific dataset rich in conservation scores, functional annotations, and allele frequency features. Among all models, the Extra Trees model achieved the highest performance, with an accuracy of 0.999 and a 95% confidence interval of (0.998-1.000). Recursive feature elimination identified the most informative genomic features, enhancing model efficiency. To ensure clinical transparency, we applied interpretability techniques including Local Interpretable Model-Agnostic Explanations and permutation feature importance, which highlighted the key drivers of each prediction. The calibration curve further confirmed the reliability of predicted probabilities, supporting their potential use in clinical decision-making. On an independent ClinGen dataset, Extra Trees achieved 99.1% accuracy and outperformed widely used predictors, confirming its robustness and clinical applicability. This is the first comprehensive benchmarking study to apply ML models specifically to breast cancer-related missense variants using disease-gene-specific training data and integrated interpretability. Our results show that disease-specific ML approaches outperform general predictors, offering improved reliability, transparency, and relevance to clinical genomics.
By bridging the gap between broad genome-wide tools and tailored clinical prediction, this study lays the foundation for implementing ML-driven pathogenicity prediction in breast cancer diagnostics and precision medicine, with potential expansion to other disease contexts.
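
Permutation feature importance, one of the interpretability techniques named above, can be illustrated with a minimal sketch. It is not the study's implementation: the data are synthetic, the names `toy_model`, `accuracy`, and `permutation_importance` are hypothetical, and a simple threshold rule stands in for the trained Extra Trees model. The idea is to shuffle one feature column at a time and record how far accuracy falls from its baseline; a large drop suggests the model relies on that feature.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(1 for row, label in zip(X, y) if model(row) == label)
    return correct / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Hypothetical toy data (an assumption, not the study's dataset):
# feature 0 plays the role of an informative score, feature 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]


def toy_model(row):
    """Stand-in classifier: a fixed threshold on feature 0 only."""
    return 1 if row[0] > 0.5 else 0


baseline, importances = permutation_importance(toy_model, X, y)
print("baseline accuracy:", baseline)
print("importance per feature:", importances)
```

In this toy setting the noise feature receives an importance of zero because the stand-in model never reads it, while shuffling the informative feature removes most of the model's accuracy. With a real classifier the same procedure would normally be run on held-out data.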