Investigating object detection errors in endoscopic imaging of esophageal SCC and dysplasia through precision-recall analysis.
Li-Jen Chang, Kun-Hua Lee, Arvind Mukundan, Riya Karmakar, Achmad Bauravindah, Tsung-Hsien Chen, Chien-Wei Huang, Hsiang-Chen Wang
Abstract
Introduction: Esophageal squamous cell carcinoma (ESCC) is difficult to detect early on white-light imaging (WLI) endoscopy because lesions are subtle and artifacts (such as glare, bubbles, text, and tools) mimic pathology. Methods: This study benchmarked five object detectors, namely two You Only Look Once models (YOLOv5, YOLOv8), Faster Region-based Convolutional Neural Network (Faster R-CNN), Single Shot MultiBox Detector (SSD), and Real-time Detection Transformer (RT-DETR), on a WLI dataset using harmonized training (from scratch, 150 epochs, identical hyperparameters) and two label configurations: a 4-label configuration of major categories (SCC, Dysplasia, Bleeding, Inflammation) and an 11-label configuration that includes artifact classes. Evaluation used macro precision, recall, and F1 at an IoU threshold of 0.50 on a fixed 310-image test set. Results: Incorporating artifact classes improved overall macro metrics, with YOLOv5 and YOLOv8 providing the strongest performance in the 11-label configuration. However, class-wise findings revealed persistent recall limitations for early disease. In the 11-label analysis, Dysplasia recall remained low (YOLOv5: 88/201, 43.8%; YOLOv8: 82/201, 40.8%), and SCC recall was only moderate (YOLOv5: 25/44, 56.8%; YOLOv8: 24/44, 54.5%). Confusion analyses showed that errors were dominated by non-detections ("background") rather than misclassification as benign or artifact labels, while approximately one in five lesion predictions was a spurious unmatched false positive, implicating both sensitivity and specificity constraints. Discussion: These results indicate that labeling artifacts reduces non-lesion confusion but does not, by itself, recover subtle early lesions. Limitations include single-center, WLI-only data and training from scratch. Future work should prioritize endoscopy-specific pretraining, explicit artifact suppression or joint segmentation, and external validation.
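The evaluation described above (macro precision, recall, and F1 at IoU 0.50, with unmatched ground-truth boxes counted as "background" misses) can be illustrated with a minimal Python sketch. The abstract does not specify the matching procedure, so the greedy, score-ordered, same-class matching used here is an assumption, as are all function names (`iou`, `match_counts`, `macro_prf`) and the box format `(x1, y1, x2, y2)`; this is not the authors' evaluation code.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_counts(gts, preds, thr=0.5):
    """Count per-class [TP, FP, FN] for one image.

    gts:   list of (label, box)
    preds: list of (label, box, score)
    Assumption: greedy matching in descending score order, same class only.
    """
    counts = defaultdict(lambda: [0, 0, 0])
    used = set()
    for label, box, _score in sorted(preds, key=lambda p: -p[2]):
        best, best_iou = None, thr
        for i, (gt_label, gt_box) in enumerate(gts):
            if i in used or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is None:
            counts[label][1] += 1  # spurious, unmatched prediction (FP)
        else:
            used.add(best)
            counts[label][0] += 1  # matched prediction (TP)
    for i, (gt_label, _box) in enumerate(gts):
        if i not in used:
            counts[gt_label][2] += 1  # non-detection, i.e. "background" (FN)
    return counts


def macro_prf(counts):
    """Unweighted mean of per-class precision, recall, and F1."""
    ps, rs, fs = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p)
        rs.append(r)
        fs.append(f)
    n = len(ps)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n


# Toy example with made-up boxes (not data from the study).
gts = [("SCC", (0, 0, 10, 10)), ("Dysplasia", (20, 20, 30, 30))]
preds = [("SCC", (0, 0, 10, 9), 0.9), ("SCC", (50, 50, 60, 60), 0.4)]
counts = match_counts(gts, preds)
macro_p, macro_r, macro_f1 = macro_prf(counts)
```

Under the same definition of recall, TP / (TP + FN), the reported class-wise figures follow directly from the counts: 88 of 201 Dysplasia instances detected is 43.8%, and 25 of 44 SCC instances is 56.8%.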