
dc.contributor.advisor Silva A T P
dc.contributor.author Silva AKG
dc.date.accessioned 2021
dc.date.available 2021
dc.date.issued 2021
dc.identifier.citation Silva, A.K.G. (2021). Generic information extraction framework for document processing [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21467
dc.identifier.uri http://dl.lib.uom.lk/handle/123/21467
dc.description.abstract Information extraction from documents has become an important application of modern natural language processing. Most entity extraction methods are tied to a particular context, such as the medical or financial domain, and are often limited to a given language. Rather than tackling the problem in this manner, it is better to have one generic approach that can be applied to any such document type to extract entity information regardless of language, context and structure. A major barrier in such research is exploring document structure while preserving its hierarchical, semantic and heuristic features. Another problem is that these methods usually require a massive training corpus. This research therefore focuses on mitigating these problems. Over time, several approaches to building document information extractors have been identified, each focusing on a different discipline. From optical character recognition of document images to data mining of large document corpora, this research area has contributed to the development of natural language processing, semantic analysis, information extraction and conceptual modelling. Each of these, in its own way, tries to achieve the generic ability to process any kind of document, which unfortunately has not been achieved because of limitations in approach and technology. The approach taken in this research can process any kind of document in any domain by simply adhering to the conceptual relations, without trying to extract individual components and map them onto known structures. Just as a human being looks at an unknown document, goes through its relations and makes best guesses to answer queries, this system mimics the same behaviour. Its output can be either a document Concept-Relation or an answer to the given query.
The experiments were carried out on several datasets: SQuAD 2.0, DocVQA and Kaggle-based datasets. On the F1 evaluation metric, the system achieves an overall score of 87.01 on the SQuAD 2.0 dataset, showing that it is capable of the question-answering task with high accuracy. In the experimental design, several experiments were carried out, starting with dataset evaluation. SQuAD 2.0 and DocVQA were used to evaluate overall performance on the F1 score, accuracy and ANLS metrics, giving scores of 87.01, 52.78 and 0.583 respectively. The F1 score of 87.01 shows that the proposed solution achieves its objective of deriving a generic model suitable for any document-based question-answering task. en_US
dc.language.iso en en_US
dc.subject DOCUMENT INFORMATION EXTRACTION en_US
dc.subject INFORMATION EXTRACTION en_US
dc.subject DOCUMENT PROCESSING en_US
dc.subject INFORMATION TECHNOLOGY -Dissertation en_US
dc.subject ARTIFICIAL INTELLIGENCE -Dissertation en_US
dc.subject COMPUTATIONAL MATHEMATICS -Dissertation en_US
dc.title Generic information extraction framework for document processing en_US
dc.type Thesis-Abstract en_US
dc.identifier.faculty IT en_US
dc.identifier.degree MSc. in Artificial Intelligence en_US
dc.identifier.department Department of Computational Mathematics en_US
dc.date.accept 2021
dc.identifier.accno TH5004 en_US

