Abstract:
In this study, we investigate the problem of automatic action recognition and classification of
videos. First, we present a convolutional neural network architecture, which takes both motion
and static information as inputs in a single stream. We show the network is able to treat motion
and static information as different feature maps and extract features off them, even though
stacked together. By our results, we justify the use of optic flows as the raw information of
motion. We demonstrate that our network is able to surpass state-of-the-art hand-engineered
feature methods. Furthermore, the effect of providing static information to the network, in the
task of action recognition, is also studied and compared here. Then, a novel pipeline is proposed,
in order to recognize complex actions. A complex activity is a temporal composition of
subevents, and a sub-event typically consists of several low level micro-actions, such as body
movement, done by different actors. Extracting these micro actions explicitly is beneficial for
complex activity recognition due to actor selectivity, higher discriminative power, and motion
clutter suppression. Moreover, considering both static and motion features is vital for activity
recognition. However, how to control the contribution from each feature domain optimally still
remains uninvestigated. In this work, we extract motion features in micro level, preserving the
actor identity, to later obtain a high-level motion descriptor using a probabilistic model. Furthermore,
we propose two novel schemas for combining static and motion features: Cholesky
transformation based and entropy based. The former allows to control the contribution ratio
precisely, while the latter uses the optimal ratio mathematically. The ratio given by an entropy
based method matches well with the experimental values obtained by a Choleksy transformation
based method. This analysis also provides the ability to characterize a dataset, according
to its richness in motion information. Finally, we study the effectiveness of modeling the temporal
evolution of sub-event using an LSTM network. Experimental results demonstrate that
the proposed technique outperforms state- of-the-art, when tested against two popular datasets.