Abstract:
Sinhala and Tamil are the declared official languages of Sri Lanka. This requires every government-related dissemination or communication to be done in both languages. Although the demand for translation is high, the number of available human translators is limited. One feasible way to boost productivity is to assist human translators with machine translation output: the translators post-edit the machine output rather than translating from scratch. However, the Sinhala-Tamil pair does not have a well-performing machine translation system. Therefore, the focus of this research is to develop a machine translation system for short official government documents. This thesis presents two main contributions towards building ‘Si-Ta’, the first domain-adapted machine translation system for Sinhala-Tamil. The first contribution is building the baseline translation system. The second is implementing data pre-processing techniques to improve the translation quality of the baseline system. The baseline system was built using Moses, a phrase-based statistical machine translation system, which was the feasible option with the available resources. To improve the quality of the translation, three main approaches were explored: (a) domain adaptation, (b) integration of terminology, dictionary, and name lists, and (c) addressing the out-of-vocabulary (OOV) problem using word-embedding-based paraphrasing. To adapt the system to the domain of official government documents, different language model design techniques and a data filtration technique were tested. Under terminology integration, experiments were carried out to evaluate the effect of incorporating bilingual terminology lists into the system. Moreover, a novel data augmentation technique was tested that generates parallel data from bilingual lists and the available parallel data.
Further, open-domain dictionary entries, as well as a list of person names and addresses, were integrated and evaluated. In addition, word-embedding-based paraphrasing was used along with a novel heuristic-based filtering method to address the out-of-vocabulary issue. All of the above-mentioned approaches improved on the baseline, apart from the data filtering technique. Even so, all of the resulting scores were higher than those of the existing machine translation systems for this language pair. Although our techniques were evaluated only on the Sinhala-Tamil pair, they can feasibly be applied to other low-resource, highly inflectional language pairs.