Abstract:
We investigate the problem of automatic action recognition and classification in videos. In this paper, we present a convolutional neural network architecture that takes both
motion and static information as inputs in a single stream. We show that the network is able to treat motion and static information as distinct feature maps and extract features
from them, even though they are stacked together. We trained and tested our network on the YouTube dataset. Our network surpasses state-of-the-art hand-engineered feature methods.
Furthermore, we studied and compared the effect of providing static information to the network for the task of action recognition. Our results justify the use of optical flow
as the raw representation of motion and also show the importance of static information in the context of action recognition.
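The single-stream idea described above can be illustrated with a minimal NumPy sketch: optical flow fields and a static frame are stacked along the channel axis to form one input volume. All dimensions and the channel layout here are assumptions for illustration only; the paper's actual configuration may differ.

```python
import numpy as np

# Hypothetical dimensions (not from the paper): a 224x224 input with
# 10 optical flow fields plus one static grayscale frame.
H, W = 224, 224
num_flow_pairs = 10  # flow fields from consecutive frame pairs

# Static information: one grayscale frame, shape (H, W, 1)
static = np.random.rand(H, W, 1).astype(np.float32)

# Motion information: horizontal and vertical flow components per pair,
# shape (H, W, 2 * num_flow_pairs)
flows = np.random.rand(H, W, 2 * num_flow_pairs).astype(np.float32)

# Stack motion and static channels into a single-stream input volume,
# which a CNN then processes as one multi-channel image.
single_stream_input = np.concatenate([flows, static], axis=-1)
print(single_stream_input.shape)  # (224, 224, 21)
```

With this layout, the network's first convolutional layer sees motion and static information as separate channels of one tensor, rather than as two independent streams.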