This paper presents Ske2Grid, a new representation learning framework for improved skeleton-based action recognition. In Ske2Grid, we define a regular convolution operation upon a novel grid representation of the human skeleton, a compact image-like grid patch constructed and learned through three novel designs, namely the graph-node index transform (GIT), the up-sampling transform (UPT), and a progressive learning strategy (PLS). We build networks upon prevailing graph convolution networks and conduct experiments on six mainstream skeleton-based action recognition datasets. Experiments show that our Ske2Grid significantly outperforms existing GCN-based solutions under different benchmark settings, without bells and whistles.
@inproceedings{cai2023ske2grid,
  author    = {Cai, Dongqi and Kang, Yangyuxuan and Yao, Anbang and Chen, Yurong},
  title     = {Ske2Grid: Skeleton-to-Grid Representation Learning for Action Recognition},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  url       = {https://openreview.net/pdf?id=SQtp4uUByd}
}
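The graph-node index transform described above maps per-joint skeleton features onto a compact image-like grid patch so that a regular 2D convolution can be applied. A minimal numpy sketch of that scatter step is below; the function name, grid size, and the fixed joint-to-cell assignment are illustrative assumptions (in Ske2Grid the assignment is learned, not fixed).

```python
import numpy as np

def graph_node_index_transform(joint_feats, grid_shape, assignment):
    """Scatter per-joint features onto an image-like grid patch.

    joint_feats: (N, C) array, one feature vector per skeleton joint.
    grid_shape:  (H, W) with H * W >= N.
    assignment:  length-N list mapping joint i -> flat grid cell index.
                 Fixed here for illustration; Ske2Grid learns this mapping.
    """
    H, W = grid_shape
    C = joint_feats.shape[1]
    grid = np.zeros((H, W, C))
    for joint, cell in enumerate(assignment):
        grid[cell // W, cell % W] = joint_feats[joint]
    return grid

# Toy example: 6 joints with 2-dim features placed on a 2x3 grid.
feats = np.arange(12, dtype=float).reshape(6, 2)
grid = graph_node_index_transform(feats, (2, 3), [3, 0, 4, 1, 5, 2])
```

Once the joints occupy a dense grid, standard convolutional layers can consume it directly, which is what distinguishes this representation from adjacency-based graph convolution.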
Convolutional Neural Networks (CNNs) have been the dominant model for video action recognition. Due to their huge memory and compute demands, popular action recognition networks need to be trained with small batch sizes, which makes learning discriminative spatial-temporal representations for videos a challenging problem. In this paper, we present Dynamic Normalization and Relay (DNR), an improved normalization design, to augment the spatial-temporal representation learning of any deep action recognition model, adapted to small-batch training settings. DNR introduces two dynamic normalization relay modules to exploit cross-temporal and cross-layer feature distribution dependencies for estimating accurate layer-wise normalization parameters. These two DNR modules are instantiated as a light-weight recurrent structure conditioned on the current input features, with normalization parameters estimated from neighboring-frame features at the same layer or from whole-clip features at the preceding layers. Experimental results show that DNR brings large performance improvements to the baselines, achieving over 4.4% absolute margins in top-1 accuracy without training bells and whistles. More experiments on 3D backbones and several latest 2D spatial-temporal networks further validate its effectiveness.
@inproceedings{cai2021dynamic,
  author    = {Cai, Dongqi and Yao, Anbang and Chen, Yurong},
  title     = {Dynamic Normalization and Relay for Video Action Recognition},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2021},
  url       = {https://proceedings.neurips.cc/paper/2021/file/5bd529d5b07b647a8863cf71e98d651a-Paper.pdf}
}
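The cross-temporal idea behind DNR, estimating a frame's normalization statistics with help from neighboring frames rather than from the (small) batch alone, can be sketched in a few lines of numpy. This is a simplified stand-in: the fixed `momentum` blend below replaces the paper's light-weight recurrent module that predicts the blending from the input features, and the function name and signature are assumptions.

```python
import numpy as np

def dnr_normalize(frames, momentum=0.5, eps=1e-5):
    """Simplified cross-temporal normalization relay.

    frames: (T, C, H, W) clip. For each frame t, per-channel statistics
    are blended with those relayed from frame t-1 before normalizing.
    In DNR proper, the blend is predicted by a recurrent module
    conditioned on the current input features, not fixed.
    """
    T, C, H, W = frames.shape
    out = np.empty_like(frames, dtype=float)
    mu_relay, var_relay = None, None
    for t in range(T):
        x = frames[t]
        mu = x.mean(axis=(1, 2))      # (C,) per-channel mean
        var = x.var(axis=(1, 2))      # (C,) per-channel variance
        if mu_relay is not None:      # relay statistics from frame t-1
            mu = momentum * mu + (1 - momentum) * mu_relay
            var = momentum * var + (1 - momentum) * var_relay
        mu_relay, var_relay = mu, var
        out[t] = (x - mu[:, None, None]) / np.sqrt(var[:, None, None] + eps)
    return out

# Toy clip: 2 frames, 3 channels, 2x2 spatial resolution.
clip = np.arange(24, dtype=float).reshape(2, 3, 2, 2)
normed = dnr_normalize(clip)
```

Because the statistics come from temporal neighbors within the same clip, the estimate stays stable even when the batch dimension is too small for reliable batch normalization.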