Development and validation of an interpretable machine learning model for prediction of occult lymph node metastasis in clinical stage T1 lung adenocarcinoma.
Yuxing Chen, Jiahui Jin, Yihan Mao, Chengkai Zhou, Qingpeng Zeng, Jun Zhao
Abstract
Background: Accurate prediction of occult lymph node metastasis (OLNM) is crucial for optimizing treatment strategy in early-stage lung adenocarcinoma (LUAD). This study aimed to develop and validate a machine learning (ML) model integrating multimodal data for the individualized prediction of OLNM.
Methods: A retrospective cohort of 12,679 patients with clinical T1N0 LUAD (≤3 cm) was identified from a single institution. After propensity score matching (PSM) to address class imbalance, a balanced cohort of 614 patients (307 with OLNM, 307 without) was used for model development. Univariable and multivariable logistic regression identified independent predictors. Eight ML models were trained on 80% of the data using these predictors and evaluated on a held-out 20% validation set. The optimal model was selected based on the area under the receiver operating characteristic (ROC) curve (AUC), accuracy, calibration, and decision curve analysis (DCA). SHapley Additive exPlanations (SHAP) were used for model interpretation.
Results: Multivariable analysis identified consolidation-to-tumor ratio (CTR), tumor stage (T stage), histologic type, grade, spread through air spaces (STAS) status, and epidermal growth factor receptor (EGFR) mutation as independent predictors. Among all algorithms, the random forest model achieved the best performance, with an AUC of 0.981 on the training set and 0.934 on the validation set. It also demonstrated excellent calibration and provided the highest net benefit across a wide range of threshold probabilities on DCA. SHAP analysis confirmed the dominant role of grade, T stage, and CTR in predicting OLNM and revealed clinically relevant feature interactions.
Conclusions: We developed an interpretable ML model that accurately predicts the risk of OLNM using readily available data. This tool facilitates personalized surgical decision-making, potentially guiding the extent of lymph node dissection (LND) to avoid overtreatment in low-risk patients while ensuring adequate staging and resection in high-risk individuals.
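To make the two headline evaluation quantities concrete, the following is a minimal sketch of how AUC and the DCA net benefit are conventionally computed. It is not the authors' code. The function names and the eight-patient toy data are hypothetical and purely illustrative. The net benefit uses the standard definition, TP/n − (FP/n) · p_t/(1 − p_t), where p_t is the threshold probability.

```python
# Illustrative sketch only: toy labels and risks are invented, not study data.

def auc(y_true, y_prob):
    """AUC as the probability that a random positive case is ranked
    above a random negative case (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients with predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in DCA: intervene on every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical balanced toy cohort (1 = OLNM present, 0 = absent).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

print("AUC:", auc(y_true, y_prob))
print("Net benefit (model, p_t=0.5):", net_benefit(y_true, y_prob, 0.5))
print("Net benefit (treat all, p_t=0.5):", net_benefit_treat_all(y_true, 0.5))
```

A decision curve is obtained by repeating the net benefit calculation over a grid of thresholds and comparing the model against the treat-all and treat-none (net benefit of zero) strategies.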