Machine learning models for predicting survival in patients of hepatocellular carcinoma with second primary malignancy.
Yu Nie, Lu Nie, Boyu Li, Yuanyuan Wang, Liyuan Wang, Xingyan Lv, Runjie Sun, Mengting Xia, Ruiyang Wang, Xing Cui
Abstract
Background: Hepatocellular carcinoma (HCC) is a major cause of cancer mortality, and an increasing number of long-term survivors develop second primary malignancies (SPMs). Reliable risk prediction in this heterogeneous population remains challenging, and it is unclear whether modern machine learning methods offer better prognostic accuracy than conventional models. This study aimed to develop and validate machine learning models to improve prognostic prediction for HCC patients with SPMs.
Methods: A total of 1,580 HCC patients with an SPM were extracted from the Surveillance, Epidemiology, and End Results (SEER) database and randomly divided into a training group and a test group at a ratio of 7:3. Prognostic models were developed using random survival forest (RSF), DeepSurv, and the Cox proportional hazards (CoxPH) model. Discrimination was assessed with the concordance index (C-index). Discrimination, calibration, and clinical utility at 1, 2, and 3 years were further evaluated using the area under the receiver operating characteristic curve (AUROC), calibration plots, and decision curve analysis (DCA), respectively. The optimal model was identified by comparing overall performance across models, and patients were stratified by the risk scores it generated. The best-performing model was interpreted globally with SHapley Additive exPlanations (SHAP) plots, while individual patient prognosis was predicted and interpreted using local SHAP plots and personalized survival curves.
Results: Among the three models, the RSF model performed best, with a C-index of 0.730. It also surpassed the other two models in calibration and clinical applicability. Based on the RSF model, patients were categorized into high-risk (risk score > 86.17), intermediate-risk (56.32 ≤ risk score ≤ 86.17), and low-risk (risk score < 56.32) groups. SHAP analysis of the RSF model identified surgery as the most important variable, followed by age, tumor (T) stage, tumor size, and SPM. For individual prognosis prediction, three patients were randomly selected, and the local SHAP plots were consistent with the prediction for each patient.
Conclusions: The RSF model outperforms the CoxPH and DeepSurv models in predicting the prognosis of HCC patients with an SPM, and it provides individualized prediction and interpretation, which may facilitate personalized medicine.
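To make two of the quantities above concrete, the following minimal Python sketch shows how a C-index can be computed for right-censored data and how the reported cutoffs (56.32 and 86.17) map a risk score to a risk group. It is an illustration only, not the authors' implementation: the function names `harrell_c_index` and `risk_group` and the toy data are hypothetical, the C-index shown is a simplified Harrell-style version that skips pairs with tied follow-up times, and it assumes that a higher score means a worse prognosis.

```python
from itertools import combinations

# Cutoffs reported in the abstract for the RSF risk score.
LOW_CUTOFF = 56.32
HIGH_CUTOFF = 86.17


def risk_group(score: float) -> str:
    """Map a risk score to the three groups described in the abstract."""
    if score > HIGH_CUTOFF:
        return "high"
    if score >= LOW_CUTOFF:
        return "intermediate"
    return "low"


def harrell_c_index(times, events, scores) -> float:
    """Simplified Harrell's C-index for right-censored data.

    times  : observed follow-up time per patient
    events : 1 if death was observed, 0 if the patient was censored
    scores : predicted risk (assumed: higher means shorter survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed
        # event; pairs with tied times are skipped in this sketch.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0   # earlier death had the higher risk score
        elif scores[a] == scores[b]:
            concordant += 0.5   # tied scores count as half concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Hypothetical toy cohort (not data from the study).
    times = [5, 12, 20, 30]
    events = [1, 1, 0, 1]
    scores = [90.0, 50.0, 60.0, 40.0]
    print(harrell_c_index(times, events, scores))
    print([risk_group(s) for s in scores])
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported value of 0.730 means that, among comparable patient pairs, the model ranked the patient who died earlier as higher risk about 73% of the time.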