Clustered Federated Spatio-Temporal Graph Attention Networks for Skeleton-Based Action Recognition.
Tao Yu, Sandro Pinto, Tiago Gomes, Adriano Tavares, Hao Xu
Abstract
Federated learning (FL) for skeleton-based action recognition remains underexplored, particularly under strong client heterogeneity, where standard FedAvg tends to cause client drift and unstable convergence. We introduce Clustered Federated Spatio-Temporal Graph Attention Networks (CF-STGAT), a clustered FL framework that leverages attention-derived spatio-temporal statistics from local STGAT models to dynamically group clients and performs attention-weighted inter-cluster fusion that gently aligns cluster models. Concretely, the server periodically extracts multi-head parameter-based attention descriptors, normalizes and projects them via PCA, and applies K-means to form clusters; a global reference is then computed by attention-similarity weighting and used to regularize each cluster model with a lightweight fusion step. On NTU RGB+D 60/120 (NTU 60/120), CF-STGAT consistently outperforms strong FL baselines with the STGAT backbone, yielding absolute top-1 gains of +0.84/+4.09 (NTU 60, X-Sub/X-Setup) and +7.98/+4.18 (NTU 120, X-Sub/X-Setup) over FedAvg, alongside smoother per-client trajectories and lower terminal test loss. Ablations indicate that attention-guided clustering and inter-cluster fusion are complementary: clustering reduces within-group variance, whereas fusion limits cross-cluster divergence. The approach keeps local training unchanged and adds only server-side statistics and clustering.
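
The server-side step described above (normalize descriptors, project with PCA, cluster with K-means, then fuse cluster models toward a similarity-weighted global reference) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the flattened weight vectors, the farthest-point K-means initialization, the softmax over mean cosine similarity between cluster centroids, and the `fusion_rate` parameter are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def standardize(x, eps=1e-8):
    """Z-score each descriptor dimension across clients."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def pca_project(x, n_components):
    """Project rows of x onto the top principal components (via SVD)."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:n_components].T


def assign(x, centers):
    """Index of the nearest center for every row of x."""
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(x, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization
    (a simplification chosen here for reproducibility)."""
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = assign(x, centers)
        new = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return assign(x, centers), centers


def cluster_and_fuse(descriptors, client_weights, k,
                     n_components=2, fusion_rate=0.1):
    """descriptors: (n_clients, d) attention descriptors (hypothetical layout).
    client_weights: (n_clients, p) flattened model parameters (hypothetical)."""
    z = pca_project(standardize(descriptors), n_components)
    labels, centers = kmeans(z, k)
    if any(not np.any(labels == j) for j in range(k)):
        raise ValueError("empty cluster; this sketch does not handle that case")
    # Cluster model: plain average of member weights (assumed aggregation).
    cluster_models = np.stack(
        [client_weights[labels == j].mean(axis=0) for j in range(k)])
    # Assumed similarity weighting: softmax of each centroid's mean cosine
    # similarity to all centroids.
    unit = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
    scores = (unit @ unit.T).mean(axis=1)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    global_ref = w @ cluster_models
    # Lightweight fusion: pull each cluster model slightly toward the reference.
    fused = (1.0 - fusion_rate) * cluster_models + fusion_rate * global_ref
    return labels, cluster_models, fused, global_ref
```

In this sketch, a small `fusion_rate` keeps each cluster model specialized while shrinking its distance to the global reference by a fixed factor each round, which is one simple way to read "gently align".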