Integrating GWAS and machine learning for disease risk prediction in the Taiwanese Hakka population.
Jing-Hong Xiao, Hsiao-Yen Kang, Li-Ching Wu, Tien Hsu, Chin-Pyng Wu, Li-Jen Su
Abstract
Introduction: Genome-wide association studies (GWAS) have identified numerous loci associated with complex diseases, yet their predictive power in small or genetically homogeneous populations remains limited. Integrating machine learning with GWAS offers a path to improve risk prediction and uncover functional variants relevant to precision medicine.
Methods: DNA samples from Taiwanese Hakka individuals with type 2 diabetes, hypertension, and eye diseases were analyzed. After standard quality control, 295,589 SNPs were retained. Fourteen machine-learning algorithms were evaluated using SNPs selected through traditional GWAS filtering and refined via wrapper-based feature selection with a best-first search algorithm. Model performance was assessed by internal cross-validation and external validation using Taiwan Biobank data, and functional annotation was conducted through GTEx v10 cis-eQTL analysis.
Results: Predictive models relying solely on significant GWAS SNPs achieved moderate internal accuracy but limited generalizability. Incorporating feature-selected SNPs markedly improved performance: the Random Forest model achieved accuracies above 88% in cross-validation and above 85% in external validation, confirmed by 1,000× bootstrap resampling. eQTL analysis identified functional associations such as rs12121653-KDM5B and rs12121653-MGAT4EP, implicating pathways involved in metabolic and mitochondrial regulation.
Discussion: These findings demonstrate that integrating GWAS with machine-learning-based feature selection enables the construction of robust, population-specific disease risk models. Given the small sample size of the discovery cohort (n = 96), all predictive results should be interpreted as exploratory. We employed stringent cross-validation and 1,000× bootstrap resampling to reduce overfitting, and genomic control metrics (QQ plots and λGC values) were evaluated to ensure no major test statistic inflation. Independent large-scale validation will still be required. The approach effectively captures additive and interaction-driven genetic components and provides a scalable framework for applying precision medicine to underrepresented or isolated populations.
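The genomic control check mentioned in the Discussion can be illustrated with a minimal sketch. It assumes the conventional definition of λGC (the median observed 1-df chi-square statistic divided by its expected median under the null, about 0.455) and two-sided association p-values as input. The function name `genomic_inflation` is hypothetical, and this is not necessarily how the authors computed their values.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Estimate lambda_GC from two-sided GWAS p-values (each in (0, 1]).

    Hypothetical helper for illustration; assumes the conventional
    median-based definition of the genomic inflation factor.
    """
    std_normal = NormalDist()
    # A two-sided p-value maps to a 1-df chi-square statistic via chi2 = z^2,
    # where z is the standard normal quantile at p / 2.
    chi2_stats = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    # Expected median of a 1-df chi-square under the null (about 0.4549).
    expected_median = std_normal.inv_cdf(0.75) ** 2
    return median(chi2_stats) / expected_median


# Illustrative inputs only: evenly spaced p-values mimic a null distribution.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
# Squaring shifts the p-values toward zero, mimicking inflated statistics.
inflated = [p ** 2 for p in null_like]

lambda_null = genomic_inflation(null_like)
lambda_inflated = genomic_inflation(inflated)
```

Under this definition, a value close to 1 suggests little systematic inflation of the test statistics, while values well above 1 would point to population stratification or other confounding.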