A large language model for clinical outcome adjudication from telephone follow-up interviews: a secondary analysis of a multicenter randomized clinical trial.
Zhao Shi, Bingqian Wu, Bin Hu, Jian Zhong, Zezhong Li, Fandong Zhang, Zijian Chen, Chun Yang, Bangjun Guo, Qinmei Xu, Huimin Pang, Han Wang, Yueyan Wang, Jinlong Zhao, Jing Xu
Abstract
Automated adjudication of clinical outcomes from telephone follow-ups is crucial for reducing workload and increasing data quality in large-scale trials. Here, we show that a domain-specific large language model (Fu-LLM) effectively automates the preadjudication of key clinical events, including death, hospitalization, and medication use, based on 1,046 vignettes of follow-up telephone interviews conducted across three centers in a randomized clinical trial (China CT-FFR Study 3). Fu-LLM outperforms not only state-of-the-art general-purpose LLMs (e.g., GPT-3.5-turbo, GPT-4o, DeepSeek-v3, Claude 3.5-Sonnet, and Gemini-2.0-Pro) and conventional machine learning models (Support Vector Machine), but also human adjudicators in an in silico human-model comparative study. It also shows greater robustness than different versions of GPT-4 in temporal drift tests. Our findings demonstrate that Fu-LLM can significantly streamline outcome identification in clinical trials, offering a scalable and accurate tool for automating labour-intensive adjudication processes.