Abstract:
Automatic Question-Answer generation is a challenging task in natural language
processing. A system is developed that automatically generates questions and
answers from history-related text in Tamil entered by the user. The system
processes the input text using various NLP techniques and has four modules:
a Preprocessing module, a Rule-based module, a Named Entity Recognition (NER)
module, and a Question Answer Generator (QAG) module. The rule-based module uses
regex patterns and gazetteers, while the NER module takes a machine learning
approach: a Conditional Random Fields (CRF) classifier built with features suited
to the domain and language. The dataset is collected from history textbooks, and
23k word tokens are tagged in IOB2 format with a novel entity tag set specific to
the history domain.
NLP steps such as sentence tokenization, POS tagging, stemming, and Unicode
conversion use existing Python libraries. Multiple combinations of features suited
to the domain and language are tested. POS tag, stem word, gazetteer, and clue
words are the features that contribute most to performance. The best feature
combination produced micro-averaged precision, recall, and F1-score of 87.9%,
67.1%, and 76.1% respectively, and an accuracy of 89.6%, on the test dataset. The
NER module produced good results despite the domain- and language-related
challenges. Questions are formed from the named entities identified by the
rule-based and NER modules, using grammatical and predefined rules. An affix
stripping algorithm is
implemented to find the inflectional suffix. Questions generated from a history
text from Wikipedia are evaluated by 16 native Tamil speakers in three categories:
undergraduates, graduates, and experts. According to the evaluation results,
62.22% of the generated questions are grammatically correct and meaningful.
Questions generated by the rule-based module are of better quality than those
from the NER module.
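
As an illustrative aside, the IOB2 tagging scheme and the reported F1-score can be sketched in Python. This is a minimal sketch, not the thesis implementation: the example tokens, the tag names PER and LOC, and the function names iob2_spans and f1_score are hypothetical, since the actual tag set is not listed here. The abstract also does not say whether scores were computed per token or per entity, so the check only confirms that the reported F1-score matches the harmonic mean of the reported precision and recall.

```python
def iob2_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans.

    In IOB2, every entity starts with a B-<type> tag, continues with
    I-<type> tags, and tokens outside any entity are tagged O.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical transliterated example; PER and LOC are placeholder tags.
tokens = ["Raja", "Raja", "Chola", "ruled", "Thanjavur"]
tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC"]
entities = iob2_spans(tokens, tags)

# Precision and recall as reported in the abstract (87.9% and 67.1%).
reported_f1 = round(f1_score(0.879, 0.671) * 100, 1)
```

With the reported precision and recall, reported_f1 evaluates to 76.1, in agreement with the F1-score stated above.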
Citation:
Murugathas, R. (2022). Domain specific question and answer generation in Tamil [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22389