dc.contributor.advisor |
Jayasena, S |
|
dc.contributor.advisor |
Ranathunga, S |
|
dc.contributor.author |
Thillainathan, S |
|
dc.date.accessioned |
2022 |
|
dc.date.available |
2022 |
|
dc.date.issued |
2022 |
|
dc.identifier.citation |
Thillainathan, S. (2022). Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21664 |
|
dc.identifier.uri |
http://dl.lib.uom.lk/handle/123/21664 |
|
dc.description.abstract |
Limited parallel data is a major bottleneck for morphologically rich Low-Resource Languages
(LRLs), resulting in Neural Machine Translation (NMT) systems of poor quality. Language representation
learning in a self-supervised sequence-to-sequence fashion has emerged as a new paradigm
that exploits widely available monolingual data and alleviates the parallel data scarcity issue
in NMT. For any language pair supported by a Self-supervised Multilingual Sequence-to-sequence
Pre-trained (SMSP) model, a translation system can be obtained by fine-tuning the pre-trained
model with a small amount of parallel data.
This study shows the viability of fine-tuning such SMSP models for an extremely low-resource,
domain-specific NMT setting. We choose one such pre-trained model: mBART. We are the
first to implement and demonstrate the viability of non-English-centric complete fine-tuning of
SMSP models. As a demonstration, we select the Sinhala, Tamil and English languages in an extremely
low-resource setting in the domain of official government documents.
This research explores ways to extend SMSP models to new domains and to improve
the fine-tuning process of SMSP models so as to obtain high-quality translations in an extremely
low-resource setting. We propose two novel approaches: (1) continual pre-training of the SMSP model
in a self-supervised manner with domain-specific monolingual data to incorporate new domains,
and (2) multistage fine-tuning of the SMSP model with in-domain and out-of-domain parallel data.
Our experiments with Sinhala (Si), Tamil (Ta) and English (En) show that directly fine-tuning
(single-step) the SMSP model mBART for LRLs significantly outperforms state-of-the-art Transformer-based
NMT models for all language pairs in all six bilingual directions. We gain a +7.17
BLEU score on Si→En translation and a +6.74 BLEU score for the Ta→En direction. Most importantly,
for non-English-centric Si-Ta fine-tuning, we surpass the state-of-the-art Transformer-based
NMT model by a +4.11 BLEU score on Ta→Si and a +2.78 BLEU score on Si→Ta.
Moreover, our proposed approaches further improve performance by around a +1 BLEU
score over the strong single-step direct mBART fine-tuning in all six directions. Finally,
we propose a multi-model ensemble that improves performance in all cases and yields the
overall best model, with a +2 BLEU score improvement. |
en_US |
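The abstract describes direct fine-tuning of mBART on a small parallel corpus as the baseline step of the proposed pipeline. Below is a minimal sketch of such a fine-tuning run, assuming the HuggingFace transformers and datasets libraries; the checkpoint name, language codes, hyperparameters and data paths are illustrative assumptions, not the thesis's exact configuration.

# Sketch: fine-tuning a multilingual pre-trained seq2seq model (mBART-50) on a
# small Si->En parallel corpus. All names and settings below are assumptions.
from transformers import (
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)
from datasets import load_dataset

MODEL_NAME = "facebook/mbart-large-50"  # assumed checkpoint

# Source and target language codes follow the mBART-50 convention.
tokenizer = MBart50TokenizerFast.from_pretrained(
    MODEL_NAME, src_lang="si_LK", tgt_lang="en_XX"
)
model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)

# Hypothetical parallel data: JSON lines with "si" and "en" fields.
raw = load_dataset("json", data_files={"train": "si_en_train.jsonl"})

def preprocess(batch):
    # Tokenize Sinhala source sentences and English target sentences.
    model_inputs = tokenizer(batch["si"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["en"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train = raw["train"].map(preprocess, batched=True, remove_columns=["si", "en"])

args = Seq2SeqTrainingArguments(
    output_dir="mbart-si-en",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=5,
    save_total_limit=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

The continual pre-training and multistage fine-tuning steps named in the abstract would, under these assumptions, reuse the same training loop with different data (domain-specific monolingual text with a denoising objective, then out-of-domain followed by in-domain parallel data).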
dc.language.iso |
en |
en_US |
dc.subject |
PRE-TRAINING |
en_US |
dc.subject |
FINE-TUNING |
en_US |
dc.subject |
LOW-RESOURCE LANGUAGES |
en_US |
dc.subject |
MBART |
en_US |
dc.subject |
PRE-TRAINED LANGUAGE MODELS |
en_US |
dc.subject |
NEURAL MACHINE TRANSLATION |
en_US |
dc.subject |
INFORMATION TECHNOLOGY -Dissertation |
en_US |
dc.subject |
COMPUTER SCIENCE -Dissertation |
en_US |
dc.subject |
COMPUTER SCIENCE & ENGINEERING -Dissertation |
en_US |
dc.title |
Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation |
en_US |
dc.type |
Thesis-Full-text |
en_US |
dc.identifier.faculty |
Engineering |
en_US |
dc.identifier.degree |
MSc in Computer Science and Engineering by Research |
en_US |
dc.identifier.department |
Department of Computer Science and Engineering |
en_US |
dc.date.accept |
2022 |
|
dc.identifier.accno |
TH5032 |
en_US |