dc.description.abstract |
The retrieval of temporal digital visual data, either by a text or visual query, requires
automatic interpretation, which includes high-level annotation by object detection and
recognition for text query-based retrieval and low-level abstraction for visual query-based
retrieval.
Both the accuracy and the speed of the interpretation become crucial
factors in real-world applications, due to the high density of visual data. This study has
focused on reducing the complexity of visual data efficiently by dimensionality
reduction techniques for the detection and recognition of objects in videos for both
textual annotation and visual query-based video frame retrieval. The study contributes
three approaches: a novel visual feature descriptor based on colour dithering, the
Salient Dither Pattern Feature (SDPF); a novel object segmentation method based on
the proposed feature descriptor, Refining Superpixel and Histogram of oriented optical
flow Clustering (RSHC); and a novel self-supervised local descriptor, Network-in-Network
with Restricted
Boltzmann Machine (NIN-RBM). The experimental results show that SDPF is rotation and
scale invariant and computationally efficient, yet achieves object recognition accuracy
similar to that of state-of-the-art methods with minimal supervision.
The results further show that RSHC successfully uses SDPF to segment individual
objects accurately from only a very shallow history of motion. Furthermore,
according to the results, NIN-RBM achieves state-of-the-art correspondence
matching performance compared with existing deep-learned self-supervised binary
descriptors, while keeping computation time to a minimum. The overall results support
the conclusion that RSHC can accurately segment objects in a video and that
SDPF can then be used to recognize the segmented objects. Moreover,
NIN-RBM can be used to reliably and rapidly retrieve video frames related to any
visual query. Since NIN-RBM is a local descriptor, it can also be used to locate
high-level objects and estimate their poses precisely, to improve the detail of the
semantics retrieved from video data. |
en_US |