Feasibility evaluation of large language models in anesthesia-specific post-operative care instructions for total knee arthroplasty.
Dhruv Nagesh, Donald P Keating, Raghu V Divakaruni, Bryan G Beutel
Abstract
Objective: Large language models (LLMs) are increasingly applied in medicine, but their role in peri-operative education is underexplored. This pilot feasibility study compared four LLMs in producing post-operative care instructions for total knee arthroplasty (TKA). Methods: OpenAI GPT-4o, Claude 3.7 Sonnet, DeepSeek R1, and Gemini 2.0 Flash generated instructions from a standardized prompt. Outputs were scored (0 = does not meet, 1 = partially meets, 2 = fully meets) for accuracy, clarity, relevance, consistency, and readability. Accuracy was benchmarked against ERAS, ASA guidelines, and UpToDate. Readability was assessed using Flesch-Kincaid indices. Results: Within this limited sample, Claude, GPT-4o, and DeepSeek R1 demonstrated higher observed accuracy than Gemini, with Claude and GPT-4o showing full alignment with reference standards. Clarity scores were comparable across models. All achieved high relevance and internal consistency. Readability varied, with Gemini generating less readable text and GPT-4o and DeepSeek R1 producing more accessible content. Conclusion: LLMs can generate accurate, relevant, and consistent instructions, supporting their potential use in anesthesia education. Attention to readability and plain-language prompting may further enhance clinical utility. Innovation: This study provides one of the first anesthesia-specific evaluations of multiple LLMs, showing feasibility and opportunities for AI-driven patient communication.
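For readers unfamiliar with the Flesch-Kincaid indices named in the Methods, the following is a minimal Python sketch of the two standard formulas (Flesch Reading Ease and Flesch-Kincaid Grade Level). It is illustrative only: the abstract does not state which tool the authors used, and the function names and the vowel-group syllable counter here are assumptions. Dedicated readability tools may count syllables differently and report slightly different values.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not a dictionary lookup):
    count groups of consecutive vowels and drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Standard published coefficients for both indices.
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level


if __name__ == "__main__":
    # Hypothetical instruction text, not taken from any model output in the study.
    sample = "Keep your knee raised. Take your pain pills with food."
    ease, grade = flesch_scores(sample)
    print(f"Reading ease: {ease:.1f}, grade level: {grade:.1f}")
```

Higher Reading Ease and lower Grade Level both indicate more accessible text, which is the sense in which the Results describe some outputs as "more accessible."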