Institutional-Repository, University of Moratuwa.  

Translation of named entities between Sinhala and Tamil for official government documents

dc.contributor.advisor Ranathunga, S
dc.contributor.advisor Thayasivam, U
dc.contributor.author Mokanarangan, T
dc.date.accessioned 2019-07-19T09:46:24Z
dc.date.available 2019-07-19T09:46:24Z
dc.identifier.uri http://dl.lib.mrt.ac.lk/handle/123/14620
dc.description.abstract An analysis of existing machine translation approaches for Sinhala-Tamil official government documents has revealed shortcomings in translating named entities. The diverse nature of the domain, coupled with the lack of resources and the morphological complexity of both languages, are the key reasons for this problem. Our research focuses on identifying and translating named entities in official government documents between Tamil and Sinhala in order to improve translation performance. We present a novel tag set specific to official government documents and propose a graph-based semi-supervised approach that outperforms state-of-the-art approaches in low-resource settings. We employed this approach to build a large annotated corpus cost-effectively from a smaller amount of seed data, and were able to build an annotated corpus of over 200K words each for Tamil and Sinhala. We also implemented a deep-learning approach for a Named Entity Recognizer that produced the best results on the completed corpus. Since the deep-learning approach is a generic solution for sequential tagging, we also employed it to build a Part-of-Speech tagger that outperforms existing systems. The University of Moratuwa already has a system, called SiTa, for translating official government documents. Finally, we incorporated the aforementioned models into a module that translates named entities and integrated it into SiTa. We empirically show that our modules improve over the baseline for the Tamil → Sinhala and Sinhala → Tamil translation tasks by up to 0.5 and 1.4 BLEU points, respectively. en_US
dc.language.iso en en_US
dc.subject COMPUTER SCIENCE AND ENGINEERING – Thesis, Dissertations en_US
dc.subject NAMED ENTITY RECOGNITION en_US
dc.subject MACHINE TRANSLATION en_US
dc.subject GRAPH-BASED SEMI-SUPERVISED LEARNING en_US
dc.subject DEEP LEARNING en_US
dc.subject NAMED ENTITY TRANSLATION
dc.title Translation of named entities between Sinhala and Tamil for official government documents en_US
dc.type Thesis-Full-text en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree Master of Science (By Research) en_US
dc.identifier.department Department of Computer Science & Engineering en_US
dc.date.accept 2018-08
dc.identifier.accno TH3689 en_US

