Uncertainty-guided test-time optimization for personalizing segmentation models in longitudinal medical imaging.
Jaehee Chun, Austin Castelo, McKell Woodland, Caleb O'Connor, Mais Al Taie, Mohamed Eltaher, Aashish Gupta, Bilel Daoud, Shanli Ding, Jeddy Bennett, Anirban Maitra, Matthew A Firpo, Kimberly Kirkwood, Eugene J Koay, Kristy K Brock
Abstract
BACKGROUND: Accurate and consistent image segmentation across longitudinal scans is essential in many clinical applications, including surveillance, treatment monitoring, and adaptive interventions. While personalized model adaptation using patient-specific prior scans has shown promise, current approaches typically rely on fixed training durations and lack mechanisms to determine optimal stopping points on a per-patient basis, particularly in the absence of validation labels. PURPOSE: We propose an uncertainty-guided test-time optimization (TTO) framework that dynamically adjusts the personalization duration for each patient using a validation-free stopping criterion based on predictive uncertainty. METHODS: Our framework personalizes a generalized segmentation model using patient-specific prior imaging and selects the optimal checkpoint based on the minimum voxel-wise predictive uncertainty, estimated via Monte Carlo Dropout (TTO-MCD) or Deep Ensembling (TTO-DE). We evaluated the approach on three datasets: 214 pancreas (CT) scans, 243 liver (CT) scans, and 175 head-and-neck tumor (MRI) scans, each containing a subset of patients with paired longitudinal scans to enable patient-specific personalization. Each patient's follow-up scan was held out for testing. As a baseline, we implemented a fixed-epoch personalization strategy (Pre-TTO) using a fivefold cross-test design to emulate deployable model selection without test label leakage. RESULTS: TTO methods consistently outperformed the Pre-TTO and unpersonalized baselines across standard metrics, including the Dice Similarity Coefficient (DSC), 95th percentile Hausdorff Distance (HD95), Mean Surface Distance (MSD), and the proposed LogPenalty Score (LPS), which provides a bounded, interpretable scale that jointly reflects volumetric and boundary fidelity.
Paired t-tests confirmed statistically significant improvements for pancreas and liver datasets (p < 0.05), while favorable trends were observed in the head-and-neck dataset despite greater anatomical variability. Both TTO-MCD and TTO-DE achieved near-optimal performance without requiring access to labels at test time. CONCLUSION: Uncertainty-guided TTO provides a robust, validation-free strategy for optimizing patient-specific segmentation models in longitudinal medical imaging. By tailoring personalization based on predictive uncertainty, our method improves segmentation quality across a range of imaging modalities and anatomical targets. This framework supports broad clinical deployment of personalized AI and motivates future extensions to contextual integration and multi-label segmentation. Code is publicly available at https://github.com/jchun-ai/uncertainty-tto.
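The validation-free stopping criterion described in METHODS can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes binary segmentation, uses the mean voxel-wise predictive entropy as the uncertainty measure, and assumes that each personalization checkpoint has already produced T stochastic foreground-probability maps for the follow-up scan (for example from Monte Carlo Dropout passes or ensemble members). The function names `mean_predictive_entropy` and `select_checkpoint` are hypothetical, and the abstract does not state which uncertainty measure or aggregation the paper actually uses.

```python
import numpy as np


def mean_predictive_entropy(prob_samples, eps=1e-8):
    """Mean voxel-wise binary entropy of the averaged prediction.

    prob_samples: array of shape (T, ...) holding foreground probabilities
    from T stochastic forward passes (assumed input format).
    """
    # Average the T samples to get the predictive probability per voxel.
    p = np.clip(np.mean(prob_samples, axis=0), eps, 1.0 - eps)
    # Binary entropy per voxel, in nats.
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    # Aggregate to a single scalar per checkpoint (assumed: plain mean).
    return float(entropy.mean())


def select_checkpoint(samples_per_checkpoint):
    """Return the index of the checkpoint with minimum mean uncertainty."""
    scores = [mean_predictive_entropy(s) for s in samples_per_checkpoint]
    return int(np.argmin(scores)), scores


# Toy example: three checkpoints, T=4 samples each, on a 2x2x2 volume.
toy_samples = [
    np.full((4, 2, 2, 2), 0.5),  # maximally uncertain predictions
    np.full((4, 2, 2, 2), 0.9),  # confident predictions
    np.full((4, 2, 2, 2), 0.7),  # moderately confident predictions
]
best_index, uncertainty_scores = select_checkpoint(toy_samples)
print(best_index, uncertainty_scores)
```

In this toy case the second checkpoint (index 1) is selected because its predictions are the most confident. No labels for the follow-up scan are needed at any point, which is the property the abstract highlights.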