Is Appearance Free Action Recognition Possible?

Filip Ilic, Thomas Pock, Richard Wildes 2022 European Conference on Computer Vision

Code Paper

Star

Baseball Pitch

Pommel Horse

Apply Eye Makeup

Basketball

HulaHoop

❮ ❯

Can Neural Networks recognize actions solely from the temporal dynamics? We investigate state-of-the-art action recognition networks with respect to their performance on a dataset where no static information is present. The video shows random noise animated with motion extracted from real human motion (the UCF101 dataset). We find that optical flow, or a similarly descriptive representation, is not an emergent property of any of the tested action recognition architectures.

Motivation

Related work has shown often that static biases in action recognition datasets and therefore also in the trained models exist. A simple experiment demonstrates that singleframe classification performance of a ResNet50 on UCF101 frames yields 65.6% Top-1 accuracy.

Action Recognition is sometimes possible from a single frame; Is this a problem?

The Appearance Free Dataset (AFD) disentangles static and dynamic video components. In particular, no single frame contains any static discriminatory information in AFD; the action is only encoded in the temporal dimension.

Examples of Appearance Free Videos

Jumping Jacks

Table Tennis

Weightlifting

❮ ❯

Psychophysical Study

AFD101 is based on UCF101. We provide code and other necessary tools to generate other appearance free videos.

We also perform a user study, showing that human participants can detect the motion from the animated noise patter, and perform much better than current networks. The top row shows human performance on “normal” videos, and the bottom row on our “noisy” videos.

Usestudy: Regular Videos (Top) vs Appearance Free (Bottom)
Our Userstudy shows that people can identify actions from anmiated noise.

Userstudy

It’s also possible to start with different noise (e.g. sparse) and still descern moving shapes.

Animated Sparse Noise

Using sparse noise, we can create videos that evoke a familiarity, at least in spirit, to the perceptual organization experiments of simple movements performed by psychologist Gunnar Johansson [Ref].

Visual perception of biological motion; Johansson Gunnar (1973)

Video modified from: [Credit]

Proposed Architecture

One of our contributions is to show how to build a (new) Two-Stream architecture in a pricipled manner, to encourage a robust modeling of the temporal dimension. We call it Explicit-Two-Stream, as one stream explicitly models temporal correlations through the use of RAFT blocks.

Results

This design choice for our network allows for a rich dynamics representation that outperforms other methods on Appearance Free Data. We show that it closes the gap to human performance on Appearance Free Data significantly.

Cite

@InProceedings{ilic22appearancefree,
title={Is Appearance Free Action Recognition Possible?},
author={Ilic, Filip and Pock, Thomas and Wildes, Richard P.},
booktitle={European Conference on Computer Vision (ECCV)},
month={October},
year={2022},
}

Is Appearance Free Action Recognition Possible?

Motivation

Psychophysical Study

Proposed Architecture

Cite

Further Reading