Assessment of a Grad-CAM interpretable deep learning model for HAPE diagnosis: performance and pitfalls in severity stratification from chest radiographs.
Ya Yang, Hongmei Yu, Qijie Xiang, Jie Wu, Jianhao Li, Feizhou Du, Yonglin Yang, Peng Wang
Abstract
OBJECTIVES: To investigate the feasibility of a deep learning model, built with a transfer learning approach, for recognizing high-altitude pulmonary edema (HAPE) on chest X-ray images, and to explore its capability for assessing severity.
STUDY DESIGN: Retrospective study.
METHODS: A multi-source dataset was used. The pretraining set was derived from the ARXIV_V5_CHESTXRAY database (3,923 images, including 2,303 with edema labels). The primary HAPE-specific training set comprised radiographs from the 950th Hospital of the Chinese People's Liberation Army (1,003 HAPE cases and 702 normal controls; 2007-2023). An external validation set was constructed from recent cases (Jan-Dec 2023) from two hospitals (679 HAPE cases and 436 normal controls), with strict patient separation. We implemented a multi-stage pipeline: (1) a DeepLabV3_ResNet-50 model was trained for lung segmentation on a subset of the pretraining set; (2) MobileNet_V2 and VGG19 architectures were pretrained for general pulmonary edema severity grading on the ARXIV_V5_CHESTXRAY dataset; (3) these models were then fine-tuned on the HAPE-specific training set.
RESULTS: The segmentation model achieved a Dice coefficient of 99.03%. The binary classification model (VGG19) for edema detection achieved a validation AUC of 0.950. The multi-class models (MobileNet_V2) achieved macro-average AUCs of 0.92 (3-class) and 0.89 (4-class). The model performed well in identifying normal radiographs (class 0) and severe edema (class 3), with sensitivities of 0.91 and 0.88, respectively. However, performance was critically low for the intermediate grades (classes 1 and 2), with sensitivities of 0.16 and 0.37, respectively.
CONCLUSIONS: Transfer learning from general to HAPE-specific edema data produced a model that accurately segments lungs and reliably differentiates severe HAPE from normal cases. However, its failure to reliably identify intermediate grades underscores the challenges of domain shift and fine-grained radiographic assessment. This work highlights both the promise and the pitfalls of using heterogeneous datasets for rare disease diagnosis.
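For readers less familiar with the reported metrics, the following is a minimal sketch of how a Dice coefficient and per-class sensitivities are commonly computed. It is not the authors' code: the function names `dice_coefficient` and `per_class_sensitivity` and the toy inputs are hypothetical, chosen only for illustration, and none of the values are study data. It also shows why per-class sensitivity is worth reporting alongside a macro-average AUC, since a strong summary figure can coexist with weak recall on individual grades.

```python
from typing import Dict, List, Sequence


def dice_coefficient(pred: Sequence[Sequence[int]],
                     truth: Sequence[Sequence[int]]) -> float:
    """Dice = 2 * |overlap| / (|pred| + |truth|) for two binary masks of equal shape."""
    overlap = 0
    pred_sum = 0
    truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            overlap += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            truth_sum += 1 if t else 0
    if pred_sum + truth_sum == 0:
        # Both masks empty: treated here as perfect agreement (a convention, not a rule).
        return 1.0
    return 2.0 * overlap / (pred_sum + truth_sum)


def per_class_sensitivity(y_true: List[int], y_pred: List[int],
                          n_classes: int) -> Dict[int, float]:
    """Sensitivity (recall) per class: correct predictions / all true members of that class."""
    result: Dict[int, float] = {}
    for c in range(n_classes):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        result[c] = hits / total if total else float("nan")
    return result


if __name__ == "__main__":
    # Toy masks and labels, invented for illustration only.
    pred_mask = [[1, 1, 0], [0, 1, 0]]
    true_mask = [[1, 1, 0], [0, 0, 0]]
    print(dice_coefficient(pred_mask, true_mask))

    y_true = [0, 0, 1, 1, 2, 2, 3, 3]  # severity grades 0-3
    y_pred = [0, 0, 0, 2, 2, 3, 3, 3]
    print(per_class_sensitivity(y_true, y_pred, n_classes=4))
```

In the toy labels, the extreme grades (0 and 3) are all recovered while the intermediate grades are frequently pushed to a neighbouring class, which is the same qualitative pattern the abstract describes.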