3DInAction: Understanding Human Actions in 3D Point Clouds

CVPR 2024 Highlight
Yizhak Ben-Shabat¹,², Oren Shrout¹, Stephen Gould²

¹Technion   ²Australian National University

Actions with highlighted t-patches (DFAUST).

Abstract

We propose a novel method for 3D point cloud action recognition. Understanding human actions in RGB videos has been widely studied in recent years; however, its 3D point cloud counterpart remains under-explored. This is mostly due to the inherent limitations of the point cloud data modality---lack of structure, permutation invariance, and a varying number of points---which make it difficult to learn a spatio-temporal representation. To address these limitations, we propose the 3DInAction pipeline that first estimates patches moving in time (t-patches) as a key building block, alongside a hierarchical architecture that learns an informative spatio-temporal representation. We show that our method achieves improved performance on existing datasets, including DFAUST and IKEA ASM.


3DInAction Pipeline


Given a sequence of point clouds, a set of t-patches is extracted. The t-patches are fed into a neural network that outputs an embedding vector. This is done hierarchically until, finally, the global t-patch vectors are pooled into a per-frame point cloud embedding, which is then fed into a classifier to output an action prediction per frame.
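
To make the hierarchy concrete, here is a minimal PyTorch sketch of this flow, assuming a simple shared MLP and max pooling; the module names, feature sizes, and class count are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class TPatchEncoder(nn.Module):
    """Embeds t-patches: (T, K, 3) trajectories of K points over T frames."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, tpatches):              # (B, P, T, K, 3)
        feats = self.mlp(tpatches)            # per-point features (B, P, T, K, dim)
        return feats.max(dim=3).values        # pool over the K points -> (B, P, T, dim)

class FramewiseClassifier(nn.Module):
    def __init__(self, dim=128, n_classes=14):
        super().__init__()
        self.encoder = TPatchEncoder(dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, tpatches):              # (B, P, T, K, 3)
        feats = self.encoder(tpatches)        # per-t-patch embeddings (B, P, T, dim)
        frame = feats.max(dim=1).values       # pool over t-patches -> per-frame (B, T, dim)
        return self.head(frame)               # per-frame action logits (B, T, n_classes)

logits = FramewiseClassifier()(torch.randn(2, 32, 8, 16, 3))
print(logits.shape)                           # torch.Size([2, 8, 14])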

t-patch Extraction

Starting from an origin point, we iteratively find the nearest neighbours in the next frame to construct the t-patch subset (non-black points).
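
A short NumPy/SciPy sketch of this procedure follows; tracking the patch by propagating its mean centre, and the neighbourhood size k, are our simplifying assumptions rather than the paper's exact scheme.

import numpy as np
from scipy.spatial import cKDTree

def extract_tpatch(frames, origin_idx, k=16):
    """Track a local patch through a list of (N, 3) point cloud frames.

    Starting from an origin point in frame 0, take its k nearest neighbours,
    then iteratively follow the patch centre into each subsequent frame.
    Returns a (T, k, 3) array of patch points.
    """
    centre = frames[0][origin_idx]
    patch = []
    for pts in frames:
        tree = cKDTree(pts)
        _, nbr_idx = tree.query(centre, k=k)  # k nearest neighbours of the centre
        pts_k = pts[nbr_idx]
        patch.append(pts_k)
        centre = pts_k.mean(axis=0)           # propagate the centre to the next frame
    return np.stack(patch)                    # (T, k, 3)

frames = [np.random.rand(1024, 3) for _ in range(8)]
print(extract_tpatch(frames, origin_idx=0).shape)  # (8, 16, 3)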

GradCAM Visualization

By extending the GradCAM algorithm to our 3DInAction pipeline, we obtain a score per point in each t-patch. The score is proportional to the point's influence on classifying the frame as a given target class. The results show that, as expected, our approach learns meaningful representations, since the most prominent regions are the ones with informative motion. For example, in the jumping jacks action (top row) the hands are most prominent as they make a large and distinct motion.
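
The sketch below shows one plausible way to adapt GradCAM to per-point scores, reusing the FramewiseClassifier sketch above; the choice of layer (the per-point features) and the gradient reduction are our assumptions, not the paper's exact formulation.

import torch

def pointwise_gradcam(model, tpatches, target_class):
    """Return a relevance score per point, per t-patch, per frame."""
    feats = model.encoder.mlp(tpatches)       # per-point features (B, P, T, K, dim)
    frame_logits = model.head(feats.max(dim=3).values.max(dim=1).values)
    score = frame_logits[..., target_class].sum()
    grads = torch.autograd.grad(score, feats)[0]
    # Average gradients over patches, time, and points to get channel weights,
    # then weight the features and keep only positive evidence (as in Grad-CAM).
    weights = grads.mean(dim=(1, 2, 3), keepdim=True)
    return torch.relu((weights * feats).sum(dim=-1))  # (B, P, T, K) per-point scores

cam = pointwise_gradcam(FramewiseClassifier(), torch.randn(2, 32, 8, 16, 3), target_class=3)
print(cam.shape)                              # torch.Size([2, 32, 8, 16])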

IKEA ASM dataset t-patches

IKEA ASM results

The image above shows the flip table action for the TV Bench assembly, visualizing the RGB image (top) and the 3D point cloud with t-patches (bottom). t-patches are highlighted in color: blue is on the moving TV Bench assembly, maroon on the moving person's arm, teal on the static table surface, and green on the colorful static carpet.

Acknowledgements

This project received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 893465. We also thank the NVIDIA Academic Hardware Grant Program for providing an A5000 GPU.

BibTeX

@article{benshabat2023tpatches,
  title={3DInAction: Understanding Human Actions in 3D Point Clouds},
  author={Ben-Shabat, Yizhak and Shrout, Oren and Gould, Stephen},
  journal={arXiv preprint arXiv:2303.06346},
  year={2023}
}