Abstract:
Video summarization (VS) has been one of the most active research and development
fields since the late 2000s, driven by the growth of social media and the internet and
the demand for concise, meaningful summaries of large-scale videos. Although VS has
been advanced through several traditional, non-ML techniques as well as ML-based
techniques, generating accurate and relevant summaries from video remains a
limitation. To overcome this, different techniques have been attempted, including
vision-based and NLP-based approaches. Inspired by Transformer networks from NLP,
researchers are seeking to integrate such sequence-based learning algorithms into the
video domain to perform spatiotemporal feature extraction. Beyond these VS
implementations, one extension of VS has received rapidly growing attention: text-based
video summarization (TVS), which generates summaries of a video in text format.
The evolution of VS towards TVS has not been a straightforward journey, as many
obstacles have had to be overcome using unsupervised learning (UL), reinforcement
learning (RL), and supervised learning (SL) based frameworks. Among the
state-of-the-art (SOTA) methods in TVS, Transformer-based methods stand out,
alongside T5-based NLP frameworks. Since this area is still in its infancy, many open
questions and issues remain to be explored. In particular, the attention-based
sequence modelling of the learning algorithm must be carefully designed to achieve
the best accuracy improvements, with the ultimate goal of applying all such
improvements in a real-time application. To deliver these improvements, a novel
standalone method with the simplest possible network layout, applicable to embedded
devices, should be introduced. This is where the Simple Rule-based Machine Learning
Network to Text-based Video Summarization (SiRuML-TVS) is unveiled.
Although the network takes a single large-scale video as input and produces a single
meaningful description of that video as output, the high-level network layout
comprises three ML modules: Video Recognition, Object Detection, and finally Text
Generation. Each module is evaluated against different criteria; however, the
end-to-end network is evaluated on a single metric. Different combinations of these
modules affect the performance of the entire pipeline; however, the combination of
Transformers and CNNs provides the best trade-off between accuracy and inference
cost. This offers the prospect of deploying the proposed method on an edge device,
thus bridging the gap between theoretical explanation and practical implementation.
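The three-module pipeline described above can be illustrated with a minimal sketch.
The module names, interfaces, and placeholder outputs below are illustrative
assumptions only, not the thesis implementation, which the abstract describes as
combining Transformers and CNNs:

from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    """A single decoded video frame (placeholder for raw pixel data)."""
    pixels: bytes

def recognize_actions(frames: List[Frame]) -> List[str]:
    """Stage 1 (hypothetical): video recognition, e.g. a Transformer-based
    sequence model over frames. Returns a fixed placeholder prediction."""
    return ["walking"]

def detect_objects(frames: List[Frame]) -> List[str]:
    """Stage 2 (hypothetical): per-frame CNN object detection.
    Returns fixed placeholder detections."""
    return ["person", "dog"]

def generate_text(actions: List[str], objects: List[str]) -> str:
    """Stage 3 (hypothetical): rule-based fusion of detections and actions
    into a text summary, standing in for a T5-style generator."""
    return f"A {' and a '.join(objects)} are {actions[0]}."

def summarize(frames: List[Frame]) -> str:
    """End-to-end pipeline: single video input, single text summary output,
    mirroring the single-input/single-output layout described in the abstract."""
    return generate_text(recognize_actions(frames), detect_objects(frames))

if __name__ == "__main__":
    # Prints: "A person and a dog are walking."
    print(summarize([Frame(pixels=b"")]))

Keeping each stage behind a plain function boundary, as sketched here, is one way the
modules could be swapped independently, which matches the abstract's observation that
different module combinations affect the pipeline's accuracy/inference trade-off.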
Citation:
Sugathadasa, U.K.H.A. (2022). Summarization of large-scale videos to text format using supervised based simple rule - based machine learning models [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21478