International journal of retina and vitreous

Evaluation of the artificial intelligence chatbots in frequently asked questions about retinitis pigmentosa: a comparative analysis between ChatGPT-4 and Gemini-2.0.

Özlem Biçer, Esra Şahlı

Published: 2025. DOI: 10.1186/s40942-025-00772-4

Abstract

Open Access

BACKGROUND: To evaluate the accuracy and readability of answers to common retinitis pigmentosa (RP) questions from the popular generative artificial intelligence (AI) chatbots ChatGPT-4 and Gemini-2.0.
METHODS: In March 2025, frequently asked questions about RP were entered into the Google search tool, and the websites appearing on the first search page were selected for inclusion in the study. ChatGPT-4 and Gemini-2.0 were then prompted to generate responses about RP in both standard and simplified formats. To generate the simplified response, the following request was added to the prompt: 'Please provide a response suitable for the average American adult, at a sixth-grade comprehension level.' The chatbots' responses to 30 questions about RP frequently asked by patients were evaluated by two ophthalmologists using a five-point Likert scale (scores from 1 to 5). Additionally, eight readability indices were calculated with an online calculator, Readabilityformulas.com, to assess the ease of comprehension of each answer: Average Reading Level Consensus Calculator (ARLC), Automated Readability Index (ARI), Flesch Reading Ease (FRE), Gunning Fog Index (GFOG), Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CL), Simple Measure of Gobbledygook (SMOG), and FORCAST Readability Formula (FRF).
RESULTS: No significant difference in accuracy was found between the two chatbots for either standard or simplified responses (p = 0.557 and p = 0.090, respectively). Almost all readability indices indicated that standard responses require a higher level of education for comprehension than simplified responses. Gemini-2.0 standard responses were more readable than ChatGPT-4 standard responses according to ARI, GFOG, and FRF scores (p = 0.014, p = 0.040, and p = 0.001, respectively), whereas Gemini-2.0 simplified responses were more readable than ChatGPT-4 simplified responses only according to FRF scores (p = 0.016).
CONCLUSIONS: This study shows that ChatGPT-4 and Gemini-2.0 can give patients access to comprehensive and accurate information about RP, tailored to their educational level.
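
NOTE: As an illustration of how two of the readability indices named in the abstract (FRE and FKGL) are defined, the following minimal Python sketch applies their standard published formulas. It is not the procedure used in the study, which relied on Readabilityformulas.com. The function names and the word, sentence, and syllable counting heuristics in `count_text` are assumptions made for illustration only, so scores from this sketch may differ from those of any online calculator.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease (FRE): higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (FKGL): approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_text(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) for a text.

    Assumption: rough heuristics only. Words are runs of letters and
    apostrophes, sentences are split on ., ! or ?, and syllables are
    counted as groups of consecutive vowels (at least one per word).
    """
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return len(words), len(sentences), syllables


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the study's responses.
    sample = "Retinitis pigmentosa is an inherited eye disease. It affects the retina."
    w, s, syl = count_text(sample)
    print("FRE:", round(flesch_reading_ease(w, s, syl), 1))
    print("FKGL:", round(flesch_kincaid_grade(w, s, syl), 1))
```

Both formulas depend only on average sentence length and average syllables per word, so shorter sentences and shorter words raise FRE and lower FKGL. This matches the direction of change the abstract reports for the simplified responses.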