SBT-Net: a tri-cue guided multimodal fusion framework for depression recognition.
Yujie Huo, Weng Howe Chan, Ahmad Najmi Bin Amerhaider Nuar, Hongyu Gao
Abstract
Early detection of depression is vital for public health, yet current multimodal methods often struggle with modality incompleteness, semantic inconsistency, and temporal fluctuation of emotion. To address these issues, this paper proposes SBT-Net, a novel Semantic-Bias-Trend guided framework for robust depression detection from audio and text data. The model integrates three modules: a semantically guided cross-modal gating (SGCMG) mechanism that dynamically filters effective modality features based on global semantic cues, a bias-guided tensor product attention (BG-TPA) mechanism that enhances fine-grained fusion and alignment between modalities, and an emotion trend modeling (ETM) module that captures the temporal evolution of depressive emotional states. We evaluate SBT-Net on two widely adopted benchmark datasets: DAIC-WOZ, which contains 189 interview sessions, and EATD-Corpus, which comprises 162 conversational samples. Experimental results show that SBT-Net achieves strong performance across multiple metrics, including 93.0% accuracy, an F1 score of 0.93, and a recall of 0.92, all of which surpass competitive baselines. Ablation studies further validate the individual and synergistic contributions of each proposed module. These findings highlight the potential of integrating semantic guidance, bias-aware fusion, and emotional trend modeling to advance multimodal depression detection. The source code can be found at https://github.com/ghy-yhg/SBT-Net .
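To make the idea of gating modality features by a global semantic cue more concrete, the following is a minimal illustrative sketch and not the SGCMG module itself. The function name `semantic_gate_fuse`, the dot-product scoring, the normalization of the two gates, and all input values are assumptions introduced here for illustration; the actual formulation is defined in the paper and the linked repository.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def semantic_gate_fuse(audio, text, semantic):
    """Hypothetical gate (an assumption, not the paper's SGCMG):
    score each modality by its agreement with a global semantic
    vector, then mix the two feature vectors with the normalized scores.
    """
    g_audio = sigmoid(dot(audio, semantic))
    g_text = sigmoid(dot(text, semantic))
    total = g_audio + g_text
    w_audio = g_audio / total  # share of the fused vector taken from audio
    w_text = g_text / total    # share of the fused vector taken from text
    fused = [w_audio * a + w_text * t for a, t in zip(audio, text)]
    return fused, w_audio, w_text


# Toy feature vectors (made-up values, for illustration only).
audio = [0.2, -0.1, 0.4]
text = [0.5, 0.3, -0.2]
semantic = [1.0, 0.5, 0.0]

fused, w_audio, w_text = semantic_gate_fuse(audio, text, semantic)
print(fused, w_audio, w_text)
```

In this toy setup the text vector aligns more closely with the semantic cue than the audio vector does, so it receives the larger weight. A learned gate would replace the fixed dot-product scoring with trainable projections.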