Development of a screening model for APL using cell population data and deep learning-extracted WBC scattergram features.
Qi Cai, Bo Ye, Wenbo Zheng, Shihong Zhang, Jingxian Zhang, Yimin Shen, Donglan Yao, Huihui Zhang, Zhixi Huang, Jian Hu, Yushuai Ma, Jianbiao Wang, Yong Wang
Abstract
BACKGROUND: Acute promyelocytic leukemia (APL), a high-risk subtype of acute myeloid leukemia, requires rapid diagnosis upon hospital admission to reduce early mortality. Current diagnostic approaches rely on time-consuming genetic testing or morphological expertise, which makes them particularly difficult to apply in resource-limited settings. This study introduces a machine learning approach that uses routine laboratory data to raise suspicion of APL immediately, offering a new diagnostic option for under-resourced hospitals.
METHODS: We developed a two-stage machine learning model using multi-center retrospective data. The cohort included 94 confirmed APL patients (2020-2024) from three tertiary hospitals, with an external validation set (n = 541) from an independent center. Using four VGG-16 networks, we extracted APL-specific 3D scatterplot features from the DIFF and WNB channels of routine blood tests. These features were then fed into an optimized random forest classifier-scatterplot (RFC-S) model, refined via recursive feature elimination and threshold tuning.
RESULTS: The RFC-S model achieved near-perfect discrimination, with an AUC of 0.9893 in the test set and 0.9979 in external validation. It maintained 98.15% sensitivity and 95.52% specificity, outperforming conventional methods. SHAP analysis confirmed that key scattergram-derived features (e.g., N_APL_Ratio_YZ) drove predictions. Critically, the model requires no additional tests, making it deployable even in low-resource clinics.
CONCLUSIONS: The RFC-S model represents an innovative approach to APL screening that combines deep learning-derived scattergram features with routine blood parameters. This two-stage methodology achieves high diagnostic accuracy (AUC > 0.98) while maintaining computational efficiency. Importantly, because the model uses existing laboratory data and requires no additional tests, it is particularly valuable for resource-constrained settings where access to genetic testing or hematological expertise may be limited. Our findings suggest that this approach could serve as a practical tool for early APL identification, potentially reducing diagnostic delays in diverse clinical environments.
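The threshold tuning mentioned in the Methods, and the sensitivity and specificity reported in the Results, can be illustrated with a minimal sketch. The abstract does not state which tuning criterion was used, so the rule below (choose the cutoff with the highest specificity among those that meet a minimum sensitivity) is an assumption, as are the function names `sensitivity_specificity` and `tune_threshold`. The labels and scores are invented toy values, not study data.

```python
from typing import Optional, Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when score >= threshold means 'APL suspected'."""
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def tune_threshold(
    y_true: Sequence[int], scores: Sequence[float], min_sensitivity: float = 0.95
) -> Optional[Tuple[float, float, float]]:
    """Hypothetical tuning rule (an assumption, not the authors' procedure):
    among candidate cutoffs that reach min_sensitivity, keep the one with the
    highest specificity. Returns (threshold, sensitivity, specificity) or None."""
    best = None
    for candidate in sorted(set(scores)):
        sens, spec = sensitivity_specificity(y_true, scores, candidate)
        if sens >= min_sensitivity and (best is None or spec > best[2]):
            best = (candidate, sens, spec)
    return best


# Toy example: 1 = APL, 0 = non-APL; scores stand in for classifier probabilities.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1, 0.05]
chosen = tune_threshold(toy_labels, toy_scores, min_sensitivity=1.0)
print(chosen)
```

For a screening model, fixing a sensitivity floor first reflects the priority of not missing APL cases; other criteria, such as maximizing the Youden index, would be equally plausible readings of "threshold tuning".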