IEEE Transactions on Pattern Analysis and Machine Intelligence
${\text{CA}^{2}\text{ST}}$: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition.
Jongseo Lee, Joohyun Chang, Dongho Lee, Jinwoo Choi
Published: 2025
DOI: 10.1109/TPAMI.2025.3628653
Abstract
We propose Cross-Attention in Audio, Space, and Time ($\text{CA}^{2}\text{ST}$), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a bal…