Translation of named entities between Sinhala and Tamil for official government documents

Mokanarangan, T

dc.contributor.advisor	Ranathunga, S
dc.contributor.advisor	Thayasivam, U
dc.contributor.author	Mokanarangan, T
dc.date.accessioned	2019-07-19T09:46:24Z
dc.date.available	2019-07-19T09:46:24Z
dc.identifier.uri	http://dl.lib.mrt.ac.lk/handle/123/14620
dc.description.abstract	An analysis of existing machine translation approaches for Sinhala-Tamil official government documents has revealed shortcomings in translating named entities. The diverse nature of the domain, coupled with the lack of resources and the morphological complexity of both languages, are the key reasons for this problem. Our research focuses on identifying and translating named entities in official government documents between Tamil and Sinhala in order to improve translation performance. We present a novel tag set specific to official government documents and propose a graph-based semi-supervised approach that works better than state-of-the-art approaches in low-resource settings. We employed this approach to build a large annotated corpus cost-effectively from a smaller amount of seed data, and were able to build an annotated corpus of over 200K words each for Tamil and Sinhala. We also implemented a deep-learning approach for a Named Entity Recognizer that gave the best output for a completed corpus. Since the deep-learning approach is a generic solution for sequential tagging, we also employed it to build a Part-of-Speech tagger that outperforms existing systems. The University of Moratuwa already has a system for translating official government documents, called SiTa. Finally, we incorporated the aforementioned models into a module that translates named entities and integrated it into SiTa. We empirically show that our modules improve over the baseline for the Tamil → Sinhala and Sinhala → Tamil translation tasks by up to 0.5 and 1.4 BLEU points, respectively.	en_US
dc.language.iso	en	en_US
dc.subject	COMPUTER SCIENCE AND ENGINEERING – Thesis, Dissertations	en_US
dc.subject	NAMED ENTITY RECOGNITION	en_US
dc.subject	MACHINE TRANSLATION	en_US
dc.subject	GRAPH-BASED SEMI-SUPERVISED LEARNING	en_US
dc.subject	DEEP LEARNING	en_US
dc.subject	NAMED ENTITY TRANSLATION
dc.title	Translation of named entities between Sinhala and Tamil for official government documents	en_US
dc.type	Thesis-Full-text	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.degree	Master of Science (By Research)	en_US
dc.identifier.department	Department of Computer Science & Engineering	en_US
dc.date.accept	2018-08
dc.identifier.accno	TH3689	en_US