dc.contributor.advisor |
Ranathunga S |
|
dc.contributor.advisor |
Jayasena S |
|
dc.contributor.author |
Epaliyana K |
|
dc.date.accessioned |
2021 |
|
dc.date.available |
2021 |
|
dc.date.issued |
2021 |
|
dc.identifier.citation |
Epaliyana, K. (2021). Using back-translation to improve domain-specific English-Sinhala neural machine translation [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21665 |
|
dc.identifier.uri |
http://dl.lib.uom.lk/handle/123/21665 |
|
dc.description.abstract |
Machine Translation (MT) is the automatic conversion of text in one language into
other languages. Neural Machine Translation (NMT) is the state-of-the-art MT
technique, which builds an end-to-end neural model that generates an output sentence
in a target language given a sentence in the source language as the input.
NMT requires abundant parallel data to achieve good results. For low-resource
settings such as Sinhala-English, where parallel data is scarce, NMT tends to give
sub-optimal results. The problem is exacerbated when the translation is domain-specific.
One solution to the data scarcity problem is data augmentation. To augment the parallel
data for low-resource language pairs, commonly available large monolingual
corpora can be used. A popular data augmentation technique is Back-Translation
(BT). Over the years, there have been many techniques to improve vanilla BT.
Prominent ones are Iterative BT, Filtering, Data Selection, and Tagged BT. Since
these techniques have rarely been used on an extremely low-resource language
pair like Sinhala-English, we employ these techniques on this language pair
for domain-specific translations to improve the performance of
Back-Translation. In particular, we move forward from previous research and
show that by combining these different techniques, an even better result can
be obtained. In addition to the aforementioned approaches, we also conducted
an empirical evaluation of sentence embedding techniques (LASER, LaBSE, and
FastText+VecMap) for the Sinhala-English language pair.
Our best model provided a +3.24 BLEU score gain over the Baseline NMT
model and a +2.17 BLEU score gain over the vanilla BT model for Sinhala →
English translation. Furthermore, a +1.26 BLEU score gain over the Baseline
NMT model and a +2.93 BLEU score gain over the vanilla BT model were observed
for the best model for English → Sinhala translation. |
|
dc.language.iso |
en |
en_US |
dc.subject |
NEURAL MACHINE TRANSLATION-English-Sinhala |
|
dc.subject |
LOW-RESOURCE LANGUAGES |
|
dc.subject |
BACK-TRANSLATION |
|
dc.subject |
DATA SELECTION |
|
dc.subject |
ITERATIVE BACK-TRANSLATION |
|
dc.subject |
ITERATIVE FILTERING |
|
dc.subject |
INFORMATION TECHNOLOGY -Dissertation |
|
dc.subject |
COMPUTER SCIENCE -Dissertation |
|
dc.subject |
COMPUTER SCIENCE & ENGINEERING -Dissertation |
|
dc.title |
Using back-translation to improve domain-specific English-Sinhala neural machine translation |
en_US |
dc.type |
Thesis-Full-text |
|
dc.identifier.faculty |
Engineering |
en_US |
dc.identifier.degree |
MSc in Computer Science and Engineering by Research |
en_US |
dc.identifier.department |
Department of Computer Science and Engineering |
en_US |
dc.date.accept |
2021 |
|
dc.identifier.accno |
TH5033 |
en_US |