Annals of surgery open : perspectives of surgical history, education, and clinical approaches

Leveraging Large Language Models to Evaluate the Quality of Narrative Feedback for Surgery Residents in Competency-Based Medical Education.

Benjamin Y M Kwan, Zier Zhou, Nick Rogoza, Nikoo Aghaei, Ingrid de Vries, Tessa Hanmore, Boris Zevin

Published: 2025. DOI: 10.1097/AS9.0000000000000608

Abstract

Open Access

Objective: This study aimed to investigate large language model (LLM) performance in evaluating narrative feedback quality in entrustable professional activity (EPA) assessments within a Surgical Foundations program.

Background: Transitioning to competency-based medical education (CBME) has increased the volume of narrative feedback for surgery residents. However, evaluating narrative feedback quality is time-consuming because it requires manual review by human raters. LLMs show potential for automating this process.

Methods: An existing dataset of 2229 deidentified comments from EPA assessments for surgery residents in an academic program (2017-2022) was analyzed using generative pre-trained transformer (GPT)-3.5-turbo-1106 and GPT-4-1106-preview. LLM-generated scores were compared to Quality of Assessment for Learning (QuAL) scores assigned by human raters. F1 score was the primary metric for model accuracy. Performance improvements were measured for each LLM by comparing F1 scores across different prompting techniques and fine-tuning strategies against baseline performance.

Results: GPT-3.5 and GPT-4 performance varied significantly across prompting techniques due to differences in model architecture. GPT-4 achieved the highest F1 scores for Suggestion (0.901) and Connection (0.882) but underperformed in the Evidence dimension (0.554) of the QuAL score. Fine-tuning was not available for GPT-4 during the study; fine-tuned GPT-3.5, however, showed improved performance, with high F1 scores for Evidence (0.827), Suggestion (0.949), and Connection (0.933).

Conclusions: Fine-tuned GPT-3.5 demonstrated strong potential for automating the evaluation of narrative feedback quality for surgery residents. However, LLM performance depends on the task and on how well the task structure aligns with the LLM architecture. LLM use in CBME may facilitate continuous quality improvement by providing faculty with automated feedback on their feedback.
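
To make the primary metric concrete, the following is a minimal sketch of how an F1 score could be computed when comparing LLM-assigned labels against human-assigned labels for a single QuAL dimension. The abstract does not state how F1 was averaged across classes, so macro-averaging is an assumption here. The function names (`f1_for_label`, `macro_f1`) and the example labels are hypothetical illustrations and are not taken from the study's data or code.

```python
def f1_for_label(human, model, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h == label and m == label)
    fp = sum(1 for h, m in pairs if h != label and m == label)
    fn = sum(1 for h, m in pairs if h == label and m != label)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(human, model):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    if len(human) != len(model):
        raise ValueError("human and model label lists must be the same length")
    labels = sorted(set(human) | set(model))
    return sum(f1_for_label(human, model, lab) for lab in labels) / len(labels)


# Hypothetical binary labels for one dimension (e.g. Suggestion: 0 = absent, 1 = present).
human_scores = [1, 0, 1, 1, 0, 1]
model_scores = [1, 0, 0, 1, 0, 1]

print(round(macro_f1(human_scores, model_scores), 3))  # prints 0.829
```

In this toy example the model misses one of the four comments that the human rater marked as containing a suggestion, which lowers recall for class 1 and precision for class 0. The same function would apply unchanged to a multi-level dimension such as Evidence, since it iterates over whatever label values appear in the data.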