SpeechCARE: dynamic multimodal modeling for cognitive screening in diverse linguistic and speech task contexts.
Hossein Azadmaleki, Yasaman Haghbin, Sina Rashidi, Mohammad Javad Momeni Nezhad, Ali Zolnour, Maryam Zolnoori
Abstract
SpeechCARE is a multimodal transformer pipeline designed to detect cognitive impairment from brief speech recordings through multiclass classification of Alzheimer's Disease and Related Dementias (ADRD), Mild Cognitive Impairment (MCI), and healthy controls. It integrates an advanced preprocessing pipeline that includes LLM-based audio anomaly detection, speech-task identification, noise reduction, and transcription. Its core architecture fuses mHuBERT (acoustic) and mGTE (linguistic) embeddings with demographic information using a novel Adaptive Gating Fusion mechanism. In addition, a specialized encoding component further processes the mHuBERT outputs to capture global temporal patterns across segmented audio, addressing a key limitation of speech transformers in modeling long-range dependencies in extended recordings. Trained on the National Institute on Aging's (NIA) PREPARE challenge dataset (1655 participants in English, Spanish, and Mandarin), SpeechCARE achieved an average F1-score of 72.11% on the held-out test set (n = 412), earning a special recognition award from the NIA. Threshold optimization improved MCI recall. While fairness analysis showed moderate disparities (particularly for Spanish speakers), the model demonstrated strong multilingual generalizability. SpeechCARE complements blood-based biomarkers by capturing functional speech deficits, supporting early, scalable detection.
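To make the idea of gated fusion concrete, the following is a minimal sketch of one common way to combine several modality embeddings with learned gates. It is not the Adaptive Gating Fusion mechanism itself, whose details the abstract does not give. The function name `gated_fusion`, the per-modality gate vectors, the softmax over modalities, and the assumption that all embeddings have already been projected to a shared dimension are illustrative choices only.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(
    embeddings: Dict[str, Vector],
    gate_weights: Dict[str, Vector],
) -> Tuple[Vector, Dict[str, float]]:
    """Fuse modality embeddings with softmax gates (hypothetical sketch).

    Each modality gets a scalar score (dot product of its embedding with
    its own gate vector); the scores are normalized across modalities and
    used as weights in a sum of the embeddings.
    """
    names = list(embeddings)
    dim = len(embeddings[names[0]])
    if any(len(embeddings[n]) != dim for n in names):
        raise ValueError("all embeddings must share one dimension")

    # One scalar relevance score per modality.
    scores = [
        sum(w * x for w, x in zip(gate_weights[n], embeddings[n]))
        for n in names
    ]
    gates = softmax(scores)

    # Gate-weighted sum of the modality embeddings.
    fused = [
        sum(g * embeddings[n][i] for g, n in zip(gates, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, gates))


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real ones would come from the
    # acoustic encoder, the text encoder, and a demographic projection.
    emb = {
        "acoustic": [1.0, 0.0],
        "linguistic": [0.0, 1.0],
        "demographic": [1.0, 1.0],
    }
    # Assumed gate parameters; in a trained model these are learned.
    w = {
        "acoustic": [0.5, 0.0],
        "linguistic": [0.0, 1.0],
        "demographic": [0.1, 0.1],
    }
    fused, gates = gated_fusion(emb, w)
    print(gates)
    print(fused)
```

In this sketch the gates always sum to one, so they can be read as the relative contribution of each modality to a given prediction. With all gate vectors set to zero, the fusion reduces to a plain average of the embeddings.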