Abstract:
Bilingual lexicons are important resources for Natural Language
Processing (NLP) applications such as Neural Machine Translation and Named
Entity Recognition (NER). However, Low Resource Languages (LRLs) such as
Sinhala lack such resources. Manually producing millions of word translations
between languages is exhausting and almost impossible. An increasingly popular
approach to creating such resources automatically is Bilingual Lexical Induction
(BLI).
We created the first-ever BLI model for the English-Sinhala language pair using
the existing popular model VecMap. No prior work has sufficiently evaluated
how factors such as the nature of the dataset, the type of embedding model
used, and the type of evaluation dictionary used affect the results of BLI.
We fill this gap in this thesis by carrying out an extensive set of experiments
on these factors for BLI between Sinhala and English.
Furthermore, we enhance the pre-trained embeddings to cater to the application
by applying sophisticated post-processing approaches. Linear transformation and
effective dimensionality reduction are applied to the pre-trained embeddings before
obtaining cross-lingual word embeddings between Sinhala and English by
applying VecMap. We also introduce dimensionality reduction into the VecMap
algorithm itself: the algorithm starts the first iteration at a low
dimension to initialize a better solution. Subsequently, the dimensionality of the
embeddings is increased in each iteration until the embeddings reach the original
dimension in the final iteration. We were able to improve the results considerably
by learning a better initial solution and hence an improved final solution. Finally,
we combined the post-processing step with the modified VecMap model to obtain
an even better mapping for the Sinhala-English language pair, which in turn can
be used in task-specific downstream systems to improve the results of the entire system.
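The coarse-to-fine idea described above can be illustrated with a minimal NumPy sketch. It is not the thesis implementation and does not use VecMap's code or API. It assumes the embedding columns are already ordered by importance (for example after a PCA rotation), so that keeping the first d columns stands in for dimensionality reduction. It also assumes a plain orthogonal Procrustes fit with nearest-neighbour dictionary induction as the self-learning step. The helper names `unit_rows`, `procrustes`, and `grow_dim_mapping`, and the `dims` schedule, are hypothetical.

```python
import numpy as np

def unit_rows(M):
    """Scale each row to unit length (guarding against all-zero rows)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, 1e-12)

def procrustes(X, Z):
    """Orthogonal W minimising ||X W - Z||_F (closed-form SVD solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

def grow_dim_mapping(X, Z, seed_src, seed_trg, dims):
    """Hypothetical helper: refit the mapping at each dimension in `dims`.

    X, Z     : source / target embeddings, one row per word.
    seed_*   : row indices of the seed dictionary pairs.
    dims     : non-empty increasing schedule ending at the full dimension.
    """
    src_idx = np.asarray(seed_src)
    trg_idx = np.asarray(seed_trg)
    for d in dims:
        # Assumption: the first d columns approximate a d-dimensional reduction.
        Xd = unit_rows(X[:, :d])
        Zd = unit_rows(Z[:, :d])
        # Fit an orthogonal map on the current dictionary.
        W = procrustes(Xd[src_idx], Zd[trg_idx])
        # Re-induce the dictionary: nearest target row for every source row.
        trg_idx = np.argmax((Xd @ W) @ Zd.T, axis=1)
        src_idx = np.arange(X.shape[0])
    return W, trg_idx

# Toy usage: the target space is a rotated copy of the source space.
rng = np.random.default_rng(0)
X = unit_rows(rng.standard_normal((50, 8)))
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
Z = X @ Q
W, pairs = grow_dim_mapping(X, Z, range(10), range(10), dims=[2, 4, 8])
```

The dictionary induced at a low dimension seeds the fit at the next, larger dimension, so the final full-dimensional mapping starts from the solution found in the coarser space rather than from the seed dictionary alone.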
Citation:
Liyanage, A. (2022). Bilingual lexical induction for English-Sinhala [Master's theses, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22103