Institutional-Repository, University of Moratuwa.  

Using back-translation to improve domain-specific English-Sinhala neural machine translation

dc.contributor.advisor Ranathunga S
dc.contributor.advisor Jayasena S
dc.contributor.author Epaliyana K
dc.date.accessioned 2021
dc.date.available 2021
dc.date.issued 2021
dc.identifier.citation Epaliyana, K. (2021). Using back-translation to improve domain-specific English-Sinhala neural machine translation [Master's thesis, University of Moratuwa]. Institutional Repository, University of Moratuwa. http://dl.lib.uom.lk/handle/123/21665
dc.identifier.uri http://dl.lib.uom.lk/handle/123/21665
dc.description.abstract Machine Translation (MT) is the automatic conversion of text in one language into another. Neural Machine Translation (NMT) is the state-of-the-art MT technique, which builds an end-to-end neural model that generates an output sentence in a target language given a sentence in the source language as input. NMT requires abundant parallel data to achieve good results. For low-resource settings such as Sinhala-English, where parallel data is scarce, NMT tends to give sub-optimal results. This problem is more severe when the translation is domain-specific. One solution to the data scarcity problem is data augmentation. To augment the parallel data for low-resource language pairs, commonly available large monolingual corpora can be used. A popular data augmentation technique is Back-Translation (BT). Over the years, many techniques have been proposed to improve vanilla BT; prominent ones are Iterative BT, Filtering, Data Selection, and Tagged BT. Since these techniques have rarely been used on an extremely low-resource language pair like Sinhala-English, we employ them on this language pair for domain-specific translation in pursuit of improving the performance of Back-Translation. In particular, we move forward from previous research and show that by combining these different techniques, an even better result can be obtained. In addition to the aforementioned approaches, we also conducted an empirical evaluation of sentence embedding techniques (LASER, LaBSE, and FastText+VecMap) for the Sinhala-English language pair. Our best model provided a +3.24 BLEU score gain over the Baseline NMT model and a +2.17 BLEU score gain over the vanilla BT model for Sinhala → English translation. Furthermore, a +1.26 BLEU score gain over the Baseline NMT model and a +2.93 BLEU score gain over the vanilla BT model were observed for the best model for English → Sinhala translation.
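The core idea of vanilla Back-Translation described in the abstract can be sketched in a few lines: a reverse-direction model translates monolingual target-language text back into the source language, and the resulting synthetic pairs are appended to the authentic parallel corpus. The sketch below is illustrative only; `toy_en_to_si` is a hypothetical word-lookup stand-in for a trained English → Sinhala NMT model, not the system used in the thesis.

```python
def back_translate(mono_target, translate_to_source):
    """Create synthetic (source, target) pairs from monolingual target text
    by translating each target sentence back into the source language."""
    return [(translate_to_source(t), t) for t in mono_target]

def augment(parallel, synthetic):
    """Append synthetic pairs to the authentic parallel corpus."""
    return parallel + synthetic

# Hypothetical stand-in translator (word-for-word lookup) for illustration;
# a real pipeline would use a trained target->source NMT model here.
toy_lexicon = {"hello": "HELLO_SI", "world": "WORLD_SI"}
def toy_en_to_si(sentence):
    return " ".join(toy_lexicon.get(w, w) for w in sentence.split())

# Authentic (Sinhala, English) pairs plus monolingual English text.
parallel = [("HELLO_SI", "hello")]
mono_en = ["hello world"]

# Synthetic Sinhala is paired with the real English it was produced from,
# enlarging the training data for the Sinhala -> English direction.
augmented = augment(parallel, back_translate(mono_en, toy_en_to_si))
print(augmented)
```

Improvements such as Tagged BT or Filtering would operate on the same synthetic pairs, e.g. prefixing them with a special token or discarding low-quality translations before the final augmentation step.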
dc.language.iso en en_US
dc.subject NEURAL MACHINE TRANSLATION-English-Sinhala
dc.subject LOW-RESOURCE LANGUAGES
dc.subject BACK-TRANSLATION
dc.subject DATA SELECTION
dc.subject ITERATIVE BACK-TRANSLATION
dc.subject ITERATIVE FILTERING
dc.subject INFORMATION TECHNOLOGY -Dissertation
dc.subject COMPUTER SCIENCE -Dissertation
dc.subject COMPUTER SCIENCE & ENGINEERING -Dissertation
dc.title Using back-translation to improve domain-specific English-Sinhala neural machine translation en_US
dc.type Thesis-Full-text
dc.identifier.faculty Engineering en_US
dc.identifier.degree MSc In Computer Science and Engineering by Research en_US
dc.identifier.department Department of Computer Science and Engineering en_US
dc.date.accept 2021
dc.identifier.accno TH5033 en_US

