Evaluating the diagnostic accuracy of vision language models for neuroradiological image interpretation.
Aymen Meddeb, Ida Rangus, Paolo Pagano, Insaf Dkhil, Soumaya Jelassi, Keno Bressem, Michael Scheel, Mike P Wattjes, Sonia Nagi, Laurent Pierot, Sebastien Soize
Abstract
This study evaluates the diagnostic performance of commercial and open-source Vision-Language Models (VLMs) in neuroradiological image interpretation, using a dataset of 100 brain and spine cases from Radiopaedia. Five VLMs (Gemini 2.0, OpenAI o1, Llama 3.2 90b, Qwen 2.5, Grok-2-vision) were compared to expert neuroradiologists in generating differential diagnoses based on brief clinical presentations and imaging. Neuroradiologists achieved a mean accuracy of 86.2%, whereas the best-performing VLM (Gemini 2.0) reached 35%. Considering the top three differentials improved VLM accuracy only marginally, and it remained inferior to that of human experts. Clinical harm analysis revealed frequent diagnostic risks, primarily treatment delays, with harmful outputs in up to 45% of cases. Error analysis showed consistent failure modes, including incorrect anatomical localization, inaccurate imaging descriptions, and hallucinated findings. These results highlight the current limitations of VLMs and underscore the importance of expert oversight in neuroradiological diagnosis.
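The distinction between accuracy on the leading diagnosis and accuracy over the top three differentials can be illustrated with a minimal sketch. The function name `top_k_accuracy`, the case-insensitive exact string match, and the toy cases are all assumptions made for illustration; the abstract does not state how the study matched model outputs to reference diagnoses, and none of the values below are data from the study.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_differentials: Sequence[Sequence[str]],
    reference_diagnoses: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose reference diagnosis is among the first k differentials."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_differentials) != len(reference_diagnoses):
        raise ValueError("one ranked list is required per reference diagnosis")
    if not reference_diagnoses:
        raise ValueError("at least one case is required")

    hits = 0
    for ranked, reference in zip(ranked_differentials, reference_diagnoses):
        # Assumption: case-insensitive exact match. A real evaluation would
        # more plausibly rely on expert adjudication of each answer.
        top_k = [diagnosis.strip().lower() for diagnosis in ranked[:k]]
        if reference.strip().lower() in top_k:
            hits += 1
    return hits / len(reference_diagnoses)


# Hypothetical toy cases for illustration only; not data from the study.
toy_outputs = [
    ["glioblastoma", "metastasis", "abscess"],
    ["meningioma", "schwannoma", "metastasis"],
    ["multiple sclerosis", "neuromyelitis optica", "ADEM"],
    ["ependymoma", "astrocytoma", "hemangioblastoma"],
]
toy_references = ["glioblastoma", "schwannoma", "ADEM", "cavernoma"]

top1 = top_k_accuracy(toy_outputs, toy_references, k=1)
top3 = top_k_accuracy(toy_outputs, toy_references, k=3)
print(f"top-1 accuracy: {top1:.2f}, top-3 accuracy: {top3:.2f}")
```

In this toy example, only the first case is correct on the leading diagnosis, while the second and third are rescued by the wider top-three window, so top-three accuracy can never be lower than top-one accuracy.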