Translational Vision Science & Technology

Keywords: Humans; Choroid Neoplasms; Male; Female; Cross-Sectional Studies

Assessing the Clinical Utility of Multimodal Large Language Models in the Diagnosis and Management of Pigmented Choroidal Lesions.

Nehal Nailesh Mehta, Evan Walker, Elena Flester, Gillian Folk, Akshay Agnihotri, Ines D Nagel, Melanie Tran, Michael H Goldbaum, Shyamanga Borooah, Nathan L Scott

Published: 2025

DOI: 10.1167/tvst.14.10.13

Abstract

Open Access

Purpose: To evaluate the diagnostic and treatment recommendation performance of multimodal large language models (MLLMs) in identifying and classifying retinal lesions as choroidal nevus or melanoma, and to compare their performance with that of expert human graders.

Methods: This retrospective cross-sectional study included 48 eyes from 47 patients diagnosed with either choroidal nevus or melanoma. Patient demographics and clinical characteristics, including age, sex, ethnicity, best-corrected visual acuity (BCVA), and symptoms, were documented. Color fundus, autofluorescence, optical coherence tomography, and B-scan images were collected. The ocular images and patient characteristics were presented to ChatGPT 4.0, Gemini Advanced 1.5 Pro, and Perplexity Pro. Responses were recorded and compared with the clinical diagnoses and treatment recommendations made by two expert human graders. Diagnostic and treatment agreement, accuracy, sensitivity, and specificity were analyzed.

Results: Gemini consistently outperformed ChatGPT and Perplexity across diagnostic and treatment prompts. Model performance was highest for prompts requesting treatment recommendations with clinical information, where Gemini achieved the highest accuracy (0.725), followed by Perplexity (0.647) and ChatGPT (0.314). Performance was lowest for prompts requiring strict clinical criteria, with all models showing poor sensitivity. Both human graders outperformed all MLLMs in accuracy and sensitivity on most prompts (P < 0.005). Providing demographic or clinical data did not improve accuracy for any model except Gemini.

Conclusions: Human graders outperform current MLLMs, which show only moderate ability to diagnose choroidal nevi or melanoma from imaging.

Translational Relevance: This study highlights the limitations and potential of MLLMs in aiding the diagnosis and treatment of choroidal lesions.
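The accuracy, sensitivity, and specificity named in the Methods are standard quantities derived from a 2x2 table of predicted versus reference labels. The following is a minimal Python sketch of that calculation, not the authors' analysis code. It assumes melanoma is treated as the positive class, which the abstract does not state. The function name `diagnostic_metrics` and the labels in the example are hypothetical and are not study data.

```python
from typing import Dict, Sequence


def diagnostic_metrics(
    truth: Sequence[str],
    predicted: Sequence[str],
    positive: str = "melanoma",  # assumption: melanoma is the positive class
) -> Dict[str, float]:
    """Compute accuracy, sensitivity, and specificity for a binary diagnosis."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t == positive and p == positive:
            tp += 1  # positive case correctly identified
        elif t == positive:
            fn += 1  # positive case missed
        elif p == positive:
            fp += 1  # negative case labeled positive
        else:
            tn += 1  # negative case correctly identified

    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Illustrative, made-up labels (NOT data from the study).
truth = ["melanoma", "melanoma", "nevus", "nevus", "nevus"]
predicted = ["melanoma", "nevus", "nevus", "nevus", "melanoma"]
metrics = diagnostic_metrics(truth, predicted)
print(metrics)
```

With these toy labels, one of two melanomas is detected and two of three nevi are correctly classified, so a model can have reasonable accuracy while still showing the poor sensitivity described in the Results.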