Biomarker-based and interpretable machine learning framework for predicting pathological stage in gastric cancer: A retrospective analysis.
Guanmo Liu, Sen Yang, Jie Li, Zicheng Zheng, Chenggang Zhang, Yixuan He, Yihua Wang, Weiming Kang, Xin Ye
Abstract
Background: Accurate preoperative staging of gastric cancer (GC) is essential for guiding treatment strategies. However, reliable noninvasive tools for distinguishing early-stage from advanced-stage GC remain limited.

Methods: This retrospective study enrolled 434 patients with GC. Eleven supervised machine learning algorithms were trained on preoperative laboratory parameters and engineered ratio features capturing inflammatory, metabolic, and tumor-related profiles. CatBoost showed the best performance and was selected for SHapley Additive exPlanations (SHAP)-based interpretation. A forward feature selection strategy identified an optimal nine-feature panel. Model performance was evaluated by area under the receiver operating characteristic curve (AUC), accuracy, precision, recall, and F1-score, and robustness was assessed with repeated 10-fold cross-validation and 1000 bootstrap iterations.

Results: Of the 434 patients, 251 (57.8%) had stage I disease and 183 (42.2%) had stage II-III disease. Incorporating biologically informed ratio features significantly enhanced model performance; CatBoost's AUC improved from 0.802 to 0.981. SHAP-based selection yielded a compact, interpretable nine-feature model. The final CatBoost model achieved a mean AUC of 0.9499 (95% confidence interval (CI): 0.9421-0.9570), with high consistency across cross-validation folds. SHAP analysis identified uric acid (UA) and activated partial thromboplastin time (APTT) as key predictors, and interaction analysis revealed stable multivariate relationships, supporting the model's biological plausibility.

Conclusions: We developed a robust, interpretable machine learning model for GC staging using routine blood tests and derived ratio features. The model demonstrated excellent discrimination, interpretability, and clinical utility, offering a practical tool for personalized risk stratification and treatment planning.
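The Methods mention evaluating AUC over 1000 bootstrap iterations. The abstract does not give the exact procedure, so the following is only a minimal sketch of one common approach, a percentile bootstrap of the AUC, written in plain Python. The function names (`auc_score`, `bootstrap_auc_ci`), the 95% percentile interval, and the synthetic labels and scores are illustrative assumptions, not the authors' code or data. Only the class sizes (251 stage I, 183 stage II-III) are taken from the abstract.

```python
import random


def auc_score(labels, scores):
    """AUC from the Mann-Whitney rank-sum statistic, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (mean AUC, lower bound, upper bound) from a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # AUC is undefined unless both classes are present
            aucs.append(auc_score(y, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return sum(aucs) / n_boot, lower, upper


if __name__ == "__main__":
    # Synthetic stand-in for model outputs (assumption): 0 = stage I, 1 = stage II-III.
    rng = random.Random(42)
    labels = [0] * 251 + [1] * 183
    scores = [rng.gauss(0.3 + 0.4 * y, 0.15) for y in labels]
    mean_auc, lower, upper = bootstrap_auc_ci(labels, scores)
    print(f"mean AUC = {mean_auc:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

In practice the scores would be a fitted model's predicted probabilities on held-out data. The rank-based AUC is used here only to keep the sketch free of third-party dependencies.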