Text summarization for Tamil online sports news using NLP

Priyadharshan T

URI: http://dl.lib.mrt.ac.lk/handle/123/15843

Abstract:

Text summarization plays an important role in natural language understanding and information retrieval. Automatic text summarization is currently receiving growing attention because it saves time efficiently and effectively in decision making, even in day-to-day life. Researchers have presented many approaches, including statistical and machine learning based ones; statistical approaches give little consideration to semantics when forming a summary, and most machine learning approaches are language independent. Neural network models currently receive more attention than traditional approaches. The few statistical approaches presented for Tamil text summarization involve little natural language processing. The primary objective of this research is to propose a methodology for summarizing Tamil sports news that automatically creates an extractive summary of the news data using Natural Language Processing (NLP) and a generic stochastic artificial neural network. Sports news gathered from different sources is given as input to the system, which extracts the most relevant sentences and presents them as an extractive summary of the input text. The input goes through six sub-processes: pre-processing, feature extraction, feature vector matrix, feature enhancement, sentence extraction and summary generation. In pre-processing, the sentences are first tokenized. Stop words are then removed from the tokenized output, and finally the named entities in the text, such as person names, location names, dates and numerals, are tagged.
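
The pre-processing steps described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the thesis: the function names, the three-word stop-word set and the numeral pattern are all assumptions made for the example, and tagging of person, location and date entities is left out because the abstract does not say how it is done.

```python
import re
import string

# Hypothetical stop-word set for illustration only; a real system would
# load a curated Tamil stop-word list.
STOP_WORDS = {"ஒரு", "மற்றும்", "இந்த"}

# Assumed numeral pattern: digits, optionally joined by . , : / or -
NUMERAL = re.compile(r"\d+(?:[.,:/-]\d+)*")


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Split on whitespace and strip surrounding ASCII punctuation.

    Whitespace splitting is used so that Tamil words are not broken at
    combining vowel signs, which regex word classes may not treat as
    word characters.
    """
    stripped = (tok.strip(string.punctuation) for tok in sentence.split())
    return [tok for tok in stripped if tok]


def remove_stop_words(tokens):
    """Drop tokens that appear in the (assumed) stop-word set."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def tag_numerals(tokens):
    """Tag each token as NUM (numeral) or O (other)."""
    return [(tok, "NUM" if NUMERAL.fullmatch(tok) else "O") for tok in tokens]
```

Calling `tag_numerals(remove_stop_words(tokenize(sentence)))` on each sentence returned by `split_sentences` chains the three steps in the order the abstract gives them.
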
After pre-processing, feature extraction is executed, where features such as sentence position, sentence position relative to the paragraph, number of named entities, term frequency and inverse document frequency, and number of numerals are used to generate a score for each sentence in the text. In the feature vector matrix sub-process, these scores are used to generate a feature matrix for the whole text, in which the feature scores of each sentence are arranged in one row of the matrix. This feature vector matrix is given as input to the Restricted Boltzmann Machine, which is embedded in the feature enhancement sub-process, to generate the enhanced feature matrix. In the sentence extraction process, the row values of the enhanced feature matrix are summed to give a summed enhanced feature value for each sentence, and the highest-scoring sentence is extracted as the most relevant sentence, forming one subset of the sentences in the summary. Finally, the summary generation process is executed: the most relevant sentence selected in the sentence extraction module is compared with the other sentences in the text using cosine similarity, and another subset of sentences is extracted from the text to form the summary; the process continues in the same way. The sentences are then arranged in the order in which they appear in the text, and the extracted summary is presented. Summary generation happens in real time using different sources.
A comparative evaluation of the text summarization systems' results was carried out. For the evaluation, a data set of 30 news items was used, and the summary for each news item was evaluated by 3 Tamil-speaking human assessors. Each news item was distributed among the evaluators, who read it and selected the sentences that should form the summary; responses were gathered in this way for every news item. In the experiment, every summary generated by the system was evaluated against the human-generated summary, and the average F-measure of the system is 76.6%, which is higher than that of existing approaches to Tamil text summarization.
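
As a rough illustration of the sentence extraction, summary generation and evaluation steps described above, the sketch below sums the rows of an enhanced feature matrix, picks the highest-scoring sentence, adds the sentences most similar to it by cosine similarity, restores document order, and scores the result with an F-measure. Several points are assumptions rather than details from the thesis: the enhanced matrix is taken as given (the Restricted Boltzmann Machine step is not shown), cosine similarity is computed on bag-of-words counts, the summary length `k` is a hypothetical parameter, and the F-measure is computed on the overlap of selected sentence indices.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_sentences(tokens, enhanced, k):
    """Return indices of k summary sentences in document order.

    tokens   : list of token lists, one per sentence
    enhanced : list of feature rows, one per sentence (assumed to be
               the enhanced feature matrix)
    k        : number of sentences to keep (hypothetical parameter)
    """
    # Sum each row to get one score per sentence.
    row_sums = [sum(row) for row in enhanced]
    # The highest-scoring sentence is the most relevant one.
    top = max(range(len(row_sums)), key=row_sums.__getitem__)
    top_vec = Counter(tokens[top])
    # Rank the remaining sentences by similarity to the top sentence.
    others = [i for i in range(len(tokens)) if i != top]
    others.sort(key=lambda i: cosine(top_vec, Counter(tokens[i])), reverse=True)
    # Restore the original order of the text.
    return sorted([top] + others[: max(k - 1, 0)])


def f_measure(system, reference):
    """F-measure over sets of selected sentence indices."""
    sys_set, ref_set = set(system), set(reference)
    overlap = len(sys_set & ref_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(sys_set)
    recall = overlap / len(ref_set)
    return 2 * precision * recall / (precision + recall)
```

With one reference selection per human assessor, `f_measure` could be averaged over assessors and news items; how the thesis aggregates the three assessors' responses is not stated in the abstract.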

Citation:

Priyadharshan, T. (2019). Text summarization for Tamil online sports news using NLP [Master’s thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.mrt.ac.lk/handle/123/15843
