Clinical validation of AI-assisted contouring in prostate radiation therapy treatment planning: Highlighting automation bias and the need for standardized quality assurance.
Najmeh Arjmandi, Ahmed Reza Sebzari, Fatemeh Molaei, Saeid Rezaei, Maryam Rezaie-Yazdi, Malihe Rezaie-Yazdi
Abstract
PURPOSE: This study evaluated the impact of a commercial AI-assisted contouring tool on intra- and inter-observer variability in prostate radiation therapy and assessed the dosimetric consequences of geometric contour differences. METHODS: Two experienced radiation oncologists independently delineated the clinical target volume (CTV) and organs at risk (OARs) for prostate cancer patients. Manual contours (Cman) and AI-generated contours (CAI) were compared with adjusted AI contours (CAI,adj). A consensus reference (Cref) served as the benchmark. To evaluate clinical impact, treatment plans were recalculated and replanned on each contour set under identical beam geometries to assess dose-volume histogram (DVH) parameters. RESULTS: AI-assisted contouring significantly improved both intra- and inter-observer agreement. In the inter-observer analysis, the Dice similarity coefficient (DSC) for the CTV increased from 0.78 (± 0.11) for Cman to 0.89 (± 0.09) for CAI,adj. Similarly, in the intra-observer analysis, both oncologists showed significantly higher DSCs for CAI,adj than for Cman. Geometric comparison against Cref showed that although adjustments to CAI improved accuracy, the adjusted contours generally did not surpass Cman for the CTV and rectum. Dosimetric analyses demonstrated that, under fixed plan geometry, both Cman and CAI,adj contours yielded lower planning target volume (PTV) D95% values than Cref, whereas after replanning, all plans met institutional criteria with no clinically significant differences among contour sets. CONCLUSION: AI-assisted contouring in prostate radiotherapy reduced intra- and inter-observer variability and improved contouring consistency. However, CAI,adj did not consistently surpass Cman, especially for the CTV and rectum, where automation bias or selective clinical acceptance may have influenced edits. Fixed-plan recalculations revealed dose differences arising from minor geometric deviations.
These findings underscore the importance of structured quality assurance (QA) and human oversight to mitigate automation bias while leveraging AI's efficiency. The single-institution design with two oncologists and one AI software limits generalizability, underscoring the need for multi-observer validation.
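For readers unfamiliar with the agreement metric reported above, the Dice similarity coefficient between two contours A and B is conventionally defined as DSC = 2|A ∩ B| / (|A| + |B|). The following is a minimal illustrative sketch, not the study's implementation: it assumes each contour has already been rasterized into a set of voxel indices, and the names `dice_similarity`, `box`, `c_man`, and `c_ai_adj` are hypothetical, as is the toy geometry.

```python
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def dice_similarity(a: Iterable[Voxel], b: Iterable[Voxel]) -> float:
    """Return DSC = 2|A ∩ B| / (|A| + |B|) for two voxel-index sets."""
    set_a: Set[Voxel] = set(a)
    set_b: Set[Voxel] = set(b)
    total = len(set_a) + len(set_b)
    if total == 0:
        # Assumption: two empty contours are treated as perfect agreement.
        return 1.0
    return 2.0 * len(set_a & set_b) / total


def box(x0: int, x1: int, y0: int, y1: int, z0: int, z1: int) -> Set[Voxel]:
    """Build a toy rectangular 'contour' as a set of voxel indices."""
    return {
        (x, y, z)
        for x in range(x0, x1)
        for y in range(y0, y1)
        for z in range(z0, z1)
    }


# Toy data (hypothetical): two 10x10x10 volumes offset by one voxel in x.
c_man = box(0, 10, 0, 10, 0, 10)
c_ai_adj = box(1, 11, 0, 10, 0, 10)

print(f"DSC = {dice_similarity(c_man, c_ai_adj):.2f}")
```

In this toy case the two volumes share 900 of their 1000 voxels each, so a one-voxel shift lowers the DSC to 0.90. A clinical workflow would instead derive the voxel masks from the exported structure sets.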