medRxiv : the preprint server for health sciences

Automated Seizure Classification Using Multimodal Large Language Models.

Lina Zhang, Richard Jiang, Tonmoy Monsoor, Jessica N Pasqua, Colin M McCrimmon, Prateik Sinha, Kartik Sharma, Muayad Alzuabi, Victor Morales, Hailey M Miranda, Chaya Manjeshwar, Vwani Roychowdhury, Rajarshi Mazumder

Published: 2025. doi: 10.1101/2025.10.07.25337538

Abstract

Open Access

Objective: Accurately distinguishing epileptic seizures (ES) from nonepileptic seizures (NES) is a significant clinical challenge that typically requires resource-intensive inpatient video-EEG monitoring. Here, we developed a novel method based on multimodal large language models (MLLMs) that automatically extracts semiological features from videos of seizure events and then classifies the events as ES or NES.
Methods: Ninety videos of ES and NES events from 29 patients were obtained from an epilepsy monitoring unit at a large academic hospital. Events were labeled as ES or NES based on expert evaluation of video-EEG recordings and simultaneously annotated with 24 clinically relevant semiological features. We implemented an MLLM framework that integrates open-source vision-language models (VLMs) and audio-language models (ALMs) to analyze the videos and associated audio tracks and automatically extract these 24 features. The performance of the MLLM-based feature extraction was evaluated against expert annotations. These features were subsequently used to train several classifiers, including K-Nearest Neighbors (KNN), XGBoost, and Deep Factorization Machine, to differentiate ES from NES. Model performance was evaluated using leave-one-patient-out (LOPO) cross-validation.
Results: Using KNN, expert-annotated semiological features achieved a precision of 0.97, recall of 0.97, F1-score of 0.97, and AUC of 0.99, establishing an upper bound on ES/NES classification performance. The MLLM pipeline achieved an overall mean recall of 0.71, mean accuracy of 0.58, and mean F1-score of 0.51 for semiological feature extraction compared with expert annotations. The best-performing KNN model (k=7) using MLLM-extracted features achieved a precision of 0.77, recall of 0.76, F1-score of 0.76, and AUC of 0.76 in classifying ES versus NES, correctly identifying 68 out of 90 events.
Conclusion: We demonstrate the feasibility of using MLLMs to automatically extract clinically relevant semiological features from seizure videos and classify ES versus NES. MLLM-based feature extraction and classification offer a promising, clinically interpretable approach to aid the diagnosis of epilepsy from videos.
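
The leave-one-patient-out (LOPO) evaluation named in the Methods can be illustrated with a minimal sketch. This is not the authors' code. It assumes binary semiological features, Hamming distance, and majority-vote KNN; the function names and the six-event toy dataset are hypothetical, and the abstract does not state which distance metric or feature encoding was used.

```python
from collections import Counter

def hamming(a, b):
    """Count positions where two binary feature vectors differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training events nearest to the query."""
    ranked = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def lopo_predictions(X, y, patients, k=3):
    """Leave-one-patient-out: each event is predicted using only
    events recorded from other patients."""
    preds = [None] * len(X)
    for held_out in sorted(set(patients)):
        train = [i for i, p in enumerate(patients) if p != held_out]
        test = [i for i, p in enumerate(patients) if p == held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in test:
            preds[i] = knn_predict(train_X, train_y, X[i], k)
    return preds

# Synthetic toy data, NOT from the study: 4 binary features per event,
# two events for each of three hypothetical patients.
patients = ["p1", "p1", "p2", "p2", "p3", "p3"]
X = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]
y = ["ES", "NES", "ES", "NES", "ES", "NES"]

preds = lopo_predictions(X, y, patients, k=3)
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(preds, accuracy)
```

Grouping the folds by patient rather than by event keeps events from the same patient out of both the training and test sets at once, which would otherwise inflate the scores. In scikit-learn, `LeaveOneGroupOut` produces the same kind of split.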