JAMIA Open

Exploring the potential of large language models for assessing medication adherence to the ESC heart failure guidelines.

Noman Dormosh, Machteld Boonstra, Ameen Abu-Hanna, Folkert W Asselbergs, Iacer Calixto

Published: 2025
DOI: 10.1093/jamiaopen/ooaf155

Abstract

Open Access

Objective: To evaluate large language models (LLMs) for automating the assessment of clinician adherence to the ESC heart failure pharmacotherapy guidelines.

Materials and Methods: We used electronic health record (EHR) data from hospitalized heart failure patients. The task was to assess whether discharge medications followed the guidelines. We labeled each record as one of: (1) all recommended medications are present and target doses achieved; (2) all recommended medications are present, but target doses not achieved; (3) one or more recommended medications are missing. We evaluated three general-domain (GLM-4-9B-Chat, Llama3-8B-Instruct, Mistral-7B-Instruct-v0.2) and three medical-specific (Med42-v2-8B, Llama-3-8B-UltraMedical, OpenBioLLM-8B) open-source LLMs under three prompt settings (zero-shot, few-shot, and Chain-of-Thought). We fine-tuned the models on preference data synthesized from our EHR data, using Odds Ratio Preference Optimization (ORPO), a monolithic preference optimization method that needs no reference model. We performed a learning curve analysis to determine how much training data each model needed to reach peak performance. We assessed LLM performance using the macro F1 score.

Results: We included data from 1,141 patients. Adherence to both recommended medications and target doses was 5.3%. Without fine-tuning, all LLMs scored F1 < 0.40 across most prompt settings (baseline F1 = 0.333). After fine-tuning, four LLMs scored F1 ≥ 0.90; the other two, Llama3-8B-Instruct and OpenBioLLM-8B, scored F1 = 0.794 and 0.787, respectively. GLM-4-9B-Chat reached peak performance with 40% of the training data, while Mistral-7B-Instruct-v0.2 required 50%. The other models needed more data.

Conclusion: Task-specific fine-tuning of LLMs is necessary for optimal performance on this task, and the choice of LLM to fine-tune matters. Without fine-tuning, both general-domain and medical-specific LLMs performed close to random guessing, revealing key limitations in their adaptability to specialized tasks. Medical-specific LLMs showed no clear advantage over general-domain LLMs.
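
The three-category labeling and the macro F1 metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `label_record` and `macro_f1`, the dictionary layout, and the drug-class keys and dose values in the usage lines are hypothetical placeholders, since the abstract does not list the recommended medications or their target doses.

```python
def label_record(discharge_doses, target_doses):
    """Assign one of the abstract's three categories to a discharge record.

    discharge_doses: dict mapping drug class -> prescribed daily dose
    target_doses:    dict mapping each recommended drug class -> target dose
    (Both layouts are assumptions made for this illustration.)
    """
    # Category 3: one or more recommended medications are missing.
    if any(drug not in discharge_doses for drug in target_doses):
        return 3
    # Category 1: all recommended medications present at target dose.
    if all(discharge_doses[d] >= target_doses[d] for d in target_doses):
        return 1
    # Category 2: all present, but at least one is below target dose.
    return 2


def macro_f1(y_true, y_pred, labels=(1, 2, 3)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for k in labels:
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
        fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical targets: placeholder class names and doses, not from the paper.
targets = {"class_a": 10.0, "class_b": 50.0}
labels_true = [
    label_record({"class_a": 10.0, "class_b": 50.0}, targets),  # category 1
    label_record({"class_a": 5.0, "class_b": 50.0}, targets),   # category 2
    label_record({"class_a": 10.0}, targets),                   # category 3
]
print(labels_true, macro_f1(labels_true, [1, 2, 2]))
```

Because macro F1 averages the per-class scores without weighting by class size, the rare fully adherent category counts as much as the common ones, which is relevant when only 5.3% of cases fall in category 1.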