dc.contributor.advisor |
Silva A T P |
|
dc.contributor.author |
Silva AKG |
|
dc.date.accessioned |
2021 |
|
dc.date.available |
2021 |
|
dc.date.issued |
2021 |
|
dc.identifier.citation |
Silva, A.K.G. (2021). Generic information extraction framework for document processing [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21467 |
|
dc.identifier.uri |
http://dl.lib.uom.lk/handle/123/21467 |
|
dc.description.abstract |
Information extraction from documents has become an important application of modern natural
language processing. Most entity extraction methodologies are tied to a particular context, such
as the medical or financial domain, and are often limited to a given language. Rather than tackling
the problem in this manner, it is better to have one generic approach that is applicable to
any such document type and extracts entity information regardless of language, context and
structure. A major barrier in such research is exploring the document structure while preserving
its hierarchical, semantic and heuristic features. Another problem is that such methods usually
require a massive training corpus. This research therefore focuses on mitigating these problems.
Over the course of research in this area, several approaches to building document information
extractors have been identified, each focusing on a different discipline. From optical
character recognition of document images to data mining of large document corpora, this
research area has contributed to the development of natural language processing,
semantic analysis, information extraction and conceptual modelling. Each of these approaches
tries, in its own way, to achieve the generic ability to process any kind of document, which
unfortunately has not yet been achieved because of limitations in approach and technology.
The approach taken in this research can process any kind of document in any domain
by simply following the conceptual relations, without trying to extract individual components
and map them onto known structures. Just as a human being looks at an unknown document,
goes through the relations and makes best guesses to answer queries, this system
mimics the same behaviour. Its output is either the document's Concept-Relations or
an answer to the given query.
The experiments were carried out on several datasets drawn from
SQuAD 2.0, DocVQA and Kaggle-based datasets. On the F1 evaluation
metric, the system achieves an overall score of 87.01 on the SQuAD 2.0
dataset, showing that it is capable of question answering with high accuracy.
In the experimental design, several experiments were carried out, starting from dataset evaluation.
The SQuAD 2.0 and DocVQA datasets were used to evaluate
overall performance on the F1 score, accuracy and ANLS metrics, yielding scores of
87.01, 52.78 and 0.583 respectively. The F1 score of 87.01 shows that the proposed
solution achieves the expected objective of deriving a generic model suited to any
question-answering task based on documents. |
en_US |
dc.language.iso |
en |
en_US |
dc.subject |
DOCUMENT INFORMATION EXTRACTION |
en_US |
dc.subject |
INFORMATION EXTRACTION |
en_US |
dc.subject |
DOCUMENT PROCESSING |
en_US |
dc.subject |
INFORMATION TECHNOLOGY -Dissertation |
en_US |
dc.subject |
ARTIFICIAL INTELLIGENCE -Dissertation |
en_US |
dc.subject |
COMPUTATIONAL MATHEMATICS -Dissertation |
en_US |
dc.title |
Generic information extraction framework for document processing |
en_US |
dc.type |
Thesis-Abstract |
en_US |
dc.identifier.faculty |
IT |
en_US |
dc.identifier.degree |
MSc. in Artificial Intelligence |
en_US |
dc.identifier.department |
Department of Computational Mathematics |
en_US |
dc.date.accept |
2021 |
|
dc.identifier.accno |
TH5004 |
en_US |