DAST-GAN: An Adversarial Learning Multimodal Sentiment Analysis Model Based on Dynamic Attention and Spatio-Temporal Fusion.
Wenlong Tan, Bo Zhang
Abstract
Multimodal sentiment analysis (MSA) faces two critical challenges: modeling cross-modal correlations in temporally unaligned sequences, and maintaining robust performance when modalities are partially or entirely missing. Recent approaches have made progress, but many still struggle to model dynamic contextual dependencies and to handle noisy or incomplete inputs. This paper proposes DAST-GAN (Dynamic Adaptive SpatioTemporal Transformer with GAN Enhancement), a framework that addresses these limitations through three complementary components. First, a Dynamic Attention Module (DAM) learns to weight cross-modal features adaptively according to utterance-level semantic context, enabling more nuanced fusion. Second, a unified spatio-temporal attention (STA) mechanism simultaneously captures temporal coherence within each modality and spatial correlations across modalities. Third, a GAN-based adversarial training strategy improves representation robustness by learning to produce features from incomplete data that are indistinguishable from those derived from complete data. This is complemented by a dual-path optimization that aligns features at both the low (reconstruction) level and the high (semantic) level. Extensive experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets show that DAST-GAN achieves highly competitive results. Notably, compared with a range of strong baseline methods, it reduces the Mean Absolute Error (MAE) by 2.6% on CMU-MOSI (complete modality) and by 3.76% on CMU-MOSEI (incomplete modality), demonstrating both accuracy and robustness. Ablation studies validate the complementary effectiveness of the DAM, STA, and GAN components.
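The idea of context-dependent modality weighting described for the DAM can be illustrated with a minimal sketch. It is not the authors' implementation: the scaled dot-product scoring rule, the function name dynamic_fusion, and the toy feature vectors are all assumptions chosen for illustration, and the abstract does not specify how the DAM actually computes its weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_fusion(features, context):
    """Fuse per-modality feature vectors with context-dependent weights.

    features: dict mapping a modality name to a feature vector (list of floats)
    context:  utterance-level context vector of the same length
    Returns (weights, fused), where weights maps modality name -> weight.
    """
    names = list(features)
    dim = len(context)
    # Assumed scoring rule: scaled dot product between each modality and the context.
    scores = [
        sum(f * c for f, c in zip(features[name], context)) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality vectors, one coordinate at a time.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy, made-up inputs for illustration only (not taken from any dataset).
features = {
    "text": [1.0, 0.0, 0.5, 0.0],
    "audio": [0.0, 1.0, 0.0, 0.5],
    "vision": [0.0, 0.0, 0.0, 0.0],  # hypothetical uninformative input
}
context = [2.0, 0.0, 1.0, 0.0]

weights, fused = dynamic_fusion(features, context)
print(weights)
print(fused)
```

In this toy setup the text vector aligns most closely with the context, so it receives the largest weight, while the audio and all-zero vision vectors score equally and share the remainder. A learned module would replace the fixed dot product with trainable projections.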