What machine learning teaches us about depression prediction across the life course: An exploratory comparison of predictive models.
Rafael Geurgas, Saul J Newman, Evelina T Akimova, Katherine N Thompson, Robbee Wedow
Abstract
Identifying individuals at risk for depression early is important for preventing long-term mental health problems. However, variability in the severity, duration, and triggers of depression complicates prediction. This study explores whether machine learning (ML) models can outperform traditional methods, such as Logistic Regression, in predicting self-reported depressive symptoms and clinical depression during adolescence and adulthood. We applied five models of varying complexity (Logistic Regression, Decision Tree, XGBoost, Support Vector Machine, and Neural Network) to data from a nationally representative U.S. longitudinal study that tracked participants for 20 years. The models were trained on early-life predictors (ages 12-18) from Wave I covering environmental factors (family, school, health), together with genetic predispositions (polygenic scores) from Wave IV. Models were evaluated on their ability to predict depressive symptoms and clinical diagnoses in both adolescence and adulthood. Across all five models, XGBoost performed best, with a ROC-AUC 0.02 higher than that of the benchmark Logistic Regression model. This improvement is slight: overall, Logistic Regression performed about as well as many of the ML models. Early-life data showed strong predictive value for depressive symptoms and clinical diagnoses in adolescence and adulthood, highlighting adolescence as a critical period. Polygenic scores did not add predictive power when combined with environmental data. Feature importance analyses identified self-perception and physical health as key predictors of depressive symptoms, while trauma and life-changing events were more influential for clinical depression.
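To make the comparison metric concrete, the following is a minimal sketch of how ROC-AUC can be computed for two models and the difference between them reported. It is not the authors' code: the helper `roc_auc`, the labels, and both score lists are hypothetical toy values chosen only for illustration, and the gap they produce is unrelated to the 0.02 difference reported above.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 1 = depressed, 0 = not depressed.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical predicted risks from a benchmark and a candidate model.
baseline_scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.6, 0.7, 0.9]
candidate_scores = [0.1, 0.3, 0.35, 0.6, 0.4, 0.65, 0.7, 0.9]

auc_baseline = roc_auc(labels, baseline_scores)
auc_candidate = roc_auc(labels, candidate_scores)
delta = auc_candidate - auc_baseline

print(f"baseline ROC-AUC:  {auc_baseline:.4f}")
print(f"candidate ROC-AUC: {auc_candidate:.4f}")
print(f"difference:        {delta:+.4f}")
```

Because ROC-AUC depends only on how cases are ranked, a small difference between two models means they order individuals by risk in nearly the same way, which is one way to read the finding that Logistic Regression performs about as well as the more complex models.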