Institutional Repository, University of Moratuwa.

Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation


dc.contributor.advisor Jayasena, S.
dc.contributor.advisor Ranathunga, S.
dc.contributor.author Thillainathan, S.
dc.date.accessioned 2022
dc.date.available 2022
dc.date.issued 2022
dc.identifier.citation Thillainathan, S. (2022). Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation [Master's thesis, University of Moratuwa]. Institutional Repository, University of Moratuwa. http://dl.lib.uom.lk/handle/123/21664
dc.identifier.uri http://dl.lib.uom.lk/handle/123/21664
dc.description.abstract Limited parallel data is a major bottleneck for morphologically rich Low-Resource Languages (LRLs), resulting in Neural Machine Translation (NMT) systems of poor quality. Language representation learning in a self-supervised sequence-to-sequence fashion has become a new paradigm that exploits the widely available monolingual data and alleviates the parallel-data scarcity issue in NMT. A Self-supervised Multilingual Sequence-to-sequence Pre-trained (SMSP) model can be fine-tuned with a small amount of parallel data for any language pair it supports. This study shows the viability of fine-tuning such SMSP models for an extremely low-resource, domain-specific NMT setting. We choose one such pre-trained model: mBART. We are the first to implement and demonstrate the viability of non-English-centric complete fine-tuning of SMSP models. To demonstrate this, we select Sinhala, Tamil and English in an extremely low-resource setting in the domain of official government documents. This research explores ways to extend SMSP models to new domains and to improve their fine-tuning process so as to obtain high-quality translations in an extremely low-resource setting. We propose two novel approaches: (1) continual pre-training of the SMSP model in a self-supervised manner with domain-specific monolingual data to incorporate new domains, and (2) multistage fine-tuning of the SMSP model with in-domain and out-of-domain parallel data. Our experiments with Sinhala (Si), Tamil (Ta) and English (En) show that directly fine-tuning (single-step) the SMSP model mBART for LRLs significantly outperforms state-of-the-art Transformer-based NMT models in all six bilingual directions. We gain +7.17 BLEU on Si→En and +6.74 BLEU on Ta→En. Most importantly, for the non-English-centric Si-Ta pair, we surpass the state-of-the-art Transformer-based NMT model by +4.11 BLEU on Ta→Si and +2.78 BLEU on Si→Ta. Moreover, our proposed approaches further improve performance by around +1 BLEU over the strong single-step direct mBART fine-tuning in all six directions. Finally, we propose a multi-model ensemble that improves performance in all cases and yields the overall best model, with a +2 BLEU improvement. en_US
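The single-step fine-tuning described in the abstract can be pictured with a short sketch. The snippet below is not the thesis implementation; it is a minimal sketch assuming the Hugging Face transformers library (a recent version that accepts text_target), the facebook/mbart-large-50 checkpoint (which covers both Sinhala, si_LK, and Tamil, ta_IN), and illustrative placeholder data and hyper-parameters.

import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumed checkpoint: the thesis fine-tunes an mBART SMSP model, but the exact
# checkpoint, data and hyper-parameters here are illustrative placeholders.
checkpoint = "facebook/mbart-large-50"
tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint, src_lang="si_LK", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(checkpoint)

# One (source, target) pair stands in for the domain-specific parallel corpus.
src = "placeholder Sinhala sentence from the official-government-document domain"
tgt = "placeholder English reference translation"
batch = tokenizer(src, text_target=tgt, return_tensors="pt", truncation=True, max_length=128)

# Single training step: standard cross-entropy loss over the target tokens.
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # assumed learning rate
loss = model(**batch).loss
loss.backward()
optimizer.step()

# Inference: force the decoder to start with the target-language tag.
model.eval()
generated = model.generate(
    **tokenizer(src, return_tensors="pt"),
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))

For the non-English-centric Si-Ta directions the same sketch would apply with src_lang="si_LK" and tgt_lang="ta_IN" (or vice versa); the proposed continual pre-training and multistage fine-tuning would wrap this loop, first with the self-supervised denoising objective on domain monolingual data and then with out-of-domain followed by in-domain parallel data.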
dc.language.iso en en_US
dc.subject PRE-TRAINING en_US
dc.subject FINE-TUNING en_US
dc.subject LOW-RESOURCE LANGUAGES en_US
dc.subject MBART en_US
dc.subject PRE-TRAINED LANGUAGE MODELS en_US
dc.subject NEURAL MACHINE TRANSLATION en_US
dc.subject INFORMATION TECHNOLOGY - Dissertation en_US
dc.subject COMPUTER SCIENCE - Dissertation en_US
dc.subject COMPUTER SCIENCE & ENGINEERING - Dissertation en_US
dc.title Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation en_US
dc.type Thesis-Full-text en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree MSc in Computer Science and Engineering by Research en_US
dc.identifier.department Department of Computer Science and Engineering en_US
dc.date.accept 2022
dc.identifier.accno TH5032 en_US

