Abstract:
Bilingual lexicons are an important resource in Natural Language Processing (NLP), but they are scarce for low-resource languages (LRLs) such as Sinhala. Bilingual Lexical Induction (BLI) can help fill this gap, yet research on BLI in low-resource settings is limited. This paper presents the first implementation of BLI for the Sinhala-English language pair. Following the recently introduced VecMap model, we map the word vectors of Sinhala and English into a shared vector space and measure the cross-lingual (CL) similarity between words. The English word closest to a given Sinhala word in this CL vector space is taken as its counterpart. To date, there has been no detailed evaluation of how BLI is affected by the size and nature of the dataset used to create the word vectors, the type of evaluation dictionary, or the technique used to create the word vectors. This paper presents a comprehensive analysis of how these factors affect BLI for Sinhala and English and shows that BLI results depend heavily on them.
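The retrieval step described above (taking the closest English word to a Sinhala word in the shared space) can be sketched as a nearest-neighbour search. This is a minimal illustration, not the authors' code: it assumes the vectors have already been mapped into one space, it assumes plain cosine similarity as the closeness measure (the abstract does not state the exact criterion), and the romanized words, the 3-dimensional vectors, and the function names `cosine` and `induce_translation` are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def induce_translation(src_word, src_vecs, tgt_vecs):
    """Return the target word whose vector is most similar to src_word's vector."""
    query = src_vecs[src_word]
    return max(tgt_vecs, key=lambda w: cosine(query, tgt_vecs[w]))


# Hypothetical toy vectors, assumed to be already mapped into a shared space.
# Sinhala words are romanized here: "balla" (dog), "potha" (book).
sinhala_vecs = {
    "balla": [0.9, 0.1, 0.0],
    "potha": [0.0, 0.2, 0.9],
}
english_vecs = {
    "dog": [0.8, 0.2, 0.1],
    "book": [0.1, 0.1, 0.95],
    "water": [0.1, 0.9, 0.1],
}

# Induce one English counterpart per Sinhala word.
induced = {w: induce_translation(w, sinhala_vecs, english_vecs) for w in sinhala_vecs}
print(induced)
```

With real embeddings the linear scan over `tgt_vecs` would normally be replaced by a vectorized or approximate nearest-neighbour search, and the induced pairs would be scored against an evaluation dictionary.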
Citation:
A. Liyanage, S. Ranathunga and S. Jayasena, "Bilingual Lexical Induction for Sinhala-English using Cross Lingual Embedding Spaces," 2021 Moratuwa Engineering Research Conference (MERCon), 2021, pp. 579-584, doi: 10.1109/MERCon52712.2021.9525667.