Abstract:
Text summarization plays an important role in natural language understanding
and information retrieval. Automatic text summarization is currently receiving
growing attention because it saves time, efficiently and effectively, in decision
making, even in day-to-day life. Researchers have presented many approaches, both
statistical and machine learning based; statistical approaches give little
consideration to semantics when forming a summary, and most machine learning
approaches are language independent. At present, neural network models attract more
attention than traditional approaches. Only a few statistical approaches have been
presented for Tamil text summarization, and they involve little natural language
processing. The primary objective of this research is to propose a methodology for
summarizing Tamil sports news that automatically creates an extractive summary of
the news data using Natural Language Processing (NLP) and a generative stochastic
artificial neural network.
Sports news gathered from different sources is given as input to the system, which
extracts the most relevant sentences and presents them as an extractive summary of
the input text. The input passes through six sub-processes: pre-processing, feature
extraction, feature vector matrix generation, feature enhancement, sentence
extraction and summary generation. In pre-processing, the sentences are first
tokenized, a set of stop words is then removed from the tokenized output, and finally
the named entities in the text, such as person names, location names, dates and
numerals, are tagged. Feature extraction follows: sentence position, sentence
position relative to the paragraph, number of named entities, term frequency and
inverse document frequency, and number of numerals are used to generate a score for
each sentence in the text.
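
As a rough illustration of the feature extraction step, the sketch below scores each sentence on four of the listed features. It is a minimal sketch under stated assumptions, not the thesis implementation: the function name `sentence_features`, the `named_entities` lookup set, the position formula, whitespace tokenization (a stand-in for a proper Tamil tokenizer) and the use of sentences as the "documents" for inverse document frequency are all hypothetical choices. The paragraph-relative position feature is omitted for brevity.

```python
import math
import re
from collections import Counter

PUNCTUATION = ".,!?;:\"'()"


def sentence_features(sentences, named_entities):
    """Return one row per sentence: [position, entity count, numeral count, tf-idf].

    All scoring formulas here are illustrative assumptions.
    """
    n = len(sentences)
    # Whitespace tokenization with punctuation stripped (placeholder tokenizer).
    tokenized = [
        [t.strip(PUNCTUATION) for t in s.lower().split() if t.strip(PUNCTUATION)]
        for s in sentences
    ]
    # Number of sentences that contain each term (sentences act as documents).
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for i, tokens in enumerate(tokenized):
        position = (n - i) / n  # assumption: earlier sentences score higher
        entities = sum(1 for t in tokens if t in named_entities)
        numerals = sum(1 for t in tokens if re.fullmatch(r"\d+(?:[.,]\d+)*", t))
        counts = Counter(tokens)
        tfidf = sum(
            (c / len(tokens)) * math.log(n / df[t]) for t, c in counts.items()
        )
        rows.append([position, entities, numerals, tfidf])
    return rows
```

Each returned row corresponds to one sentence, so the list of rows already has the shape of a feature matrix with one sentence per row.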
In the feature vector matrix sub-process, these scores are used to generate a feature
matrix for the whole text, in which each row holds the feature scores of one
sentence. This matrix is given as input to the Restricted Boltzmann Machine embedded
in the feature enhancement sub-process, which produces the enhanced feature matrix.
In the sentence extraction sub-process, the values in each row of the enhanced
feature matrix are summed to give an enhanced feature score for each sentence, and
the highest-scoring sentence is extracted as the most relevant sentence, forming the
first subset of the summary. Finally, in the summary generation sub-process, cosine
similarity is measured between this most relevant sentence and the other sentences
in the text, and a further subset of sentences is extracted on that basis; the
process continues in the same way. The selected sentences are then arranged in the
order in which they appear in the text and presented as the extracted summary.
Summary generation takes place in real time, using different sources.
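
The sentence extraction and summary generation steps can be sketched as follows. This is a hedged illustration only: it assumes the enhanced feature matrix has already been produced (the Restricted Boltzmann Machine step is not shown), it uses bag-of-words vectors for cosine similarity, and it takes the `extra` most similar sentences as the second subset. The abstract does not specify these details, and the names `cosine`, `extract_summary` and `extra` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_summary(sentences, enhanced_matrix, extra=1):
    """Pick the top-scoring sentence plus the `extra` sentences most similar to it."""
    # Sentence extraction: sum each row of the enhanced feature matrix.
    scores = [sum(row) for row in enhanced_matrix]
    top = max(range(len(sentences)), key=scores.__getitem__)
    # Summary generation: rank the other sentences by similarity to the top one.
    bags = [Counter(s.lower().split()) for s in sentences]
    others = [i for i in range(len(sentences)) if i != top]
    others.sort(key=lambda i: cosine(bags[top], bags[i]), reverse=True)
    # Restore the original text order before presenting the summary.
    chosen = sorted([top] + others[:extra])
    return [sentences[i] for i in chosen]
```

Sorting the chosen indices at the end corresponds to presenting the summary sentences in the order in which they appear in the source text.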
A comparative evaluation of the system's results was carried out. For this purpose,
a data set of 30 news articles was used, and the summary of each article was
evaluated by three Tamil-speaking human assessors. Each article was distributed to
the assessors, who read it and selected the sentences that should form the summary;
their responses were collected for every article. In the experiment, every summary
generated by the system was evaluated against the human-generated summaries. The
system achieved an average F-measure of 76.6%, which is higher than that of existing
approaches to Tamil text summarization.
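
For reference, the F-measure of an extractive summary against one assessor's selection can be computed from the overlap of the chosen sentence indices. The sketch below is illustrative only: the abstract does not state how the three assessors' selections were combined, so averaging the per-assessor scores is an assumption, and the function names are hypothetical.

```python
def f_measure(system, reference):
    """F-measure between system-selected and human-selected sentence indices."""
    system, reference = set(system), set(reference)
    overlap = len(system & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(system)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def mean_f_measure(system, references):
    """Average the F-measure over several assessors (assumed aggregation)."""
    return sum(f_measure(system, ref) for ref in references) / len(references)
```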
Citation:
Priyadharshan, T. (2019). Text summarization for Tamil online sports news using NLP [Master’s thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.mrt.ac.lk/handle/123/15843