IEEE Transactions on Pattern Analysis and Machine Intelligence

${\text{CA}^{2}\text{ST}}$: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition.

Jongseo Lee, Joohyun Chang, Dongho Lee, Jinwoo Choi

Published: 2025

DOI: 10.1109/TPAMI.2025.3628653

Abstract

We propose Cross-Attention in Audio, Space, and Time (${\text{CA}^{2}\text{ST}}$), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a bal…
