Abstract:
The bag-of-features (BoF) approach to
human action classification assigns spatio-temporal
features to the visual words of a codebook. The
space-time interest point (STIP) detector captures the
temporal extent of features, allowing fast and slow
movements to be distinguished. This study compares
the performance of action classification on KTH
videos using the STIP detector combined with
histograms of oriented gradients (HOG) and
histograms of optical flow (HOF) descriptors. The
extracted descriptors are clustered with the K-means
algorithm, and the resulting feature sets are classified
with two classifiers: nearest neighbour (NN) and
support vector machine (SVM). In addition, this study
compares action-specific and global codebooks in the
BoF framework.
Furthermore, less discriminative visual words are
removed from the initially constructed codebook,
using a likelihood-ratio measure, to yield a compact
codebook. Test results show that STIP with HOF
descriptors performs better than with HOG, and that a
simple linear SVM outperforms the NN classifier.
Action-specific codebooks, when merged together,
also outperform a globally constructed codebook for
action classification on videos.
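As a rough illustration only (not the study's implementation, and using toy 2-D points in place of real STIP/HOG/HOF descriptors), the BoF pipeline summarised above can be sketched as: cluster local descriptors into a codebook with K-means, quantise each video's descriptors into a normalised codeword histogram, then classify with a nearest-neighbour rule. All names below (`kmeans`, `bof_histogram`, `toy_clip`, `classify_nn`) are hypothetical.

```python
import random
from math import dist

random.seed(0)

def kmeans(points, k, iters=25):
    """Plain Lloyd's algorithm: cluster descriptors into k codewords."""
    centers = random.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            buckets[nearest].append(p)
        for i, b in enumerate(buckets):
            if b:  # keep the old center if a cluster empties out
                centers[i] = tuple(sum(x) / len(b) for x in zip(*b))
    return centers

def bof_histogram(descriptors, centers):
    """Quantise descriptors against the codebook; return a normalised histogram."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[min(range(len(centers)), key=lambda c: dist(d, centers[c]))] += 1
    total = sum(hist)
    return [h / total for h in hist]

def toy_clip(cx, cy, n=30):
    """Hypothetical stand-in for STIP+HOG/HOF: n noisy 2-D descriptors near (cx, cy)."""
    return [(random.gauss(cx, 1.0), random.gauss(cy, 1.0)) for _ in range(n)]

# Two toy action classes with well-separated descriptor distributions.
train = [("walk", toy_clip(0, 0)), ("walk", toy_clip(0, 0)),
         ("run", toy_clip(10, 10)), ("run", toy_clip(10, 10))]

codebook = kmeans([d for _, clip in train for d in clip], k=2)
model = [(label, bof_histogram(clip, codebook)) for label, clip in train]

def classify_nn(clip):
    """1-NN over BoF histograms (Euclidean distance)."""
    h = bof_histogram(clip, codebook)
    return min(model, key=lambda lh: dist(h, lh[1]))[0]

print(classify_nn(toy_clip(10, 10)))  # expected to print "run" for these toy classes
```

The study additionally compares an SVM against this NN rule and prunes weak codewords by a likelihood-ratio measure; both steps sit on top of the same histogram representation shown here.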