dc.description.abstract |
Text processing is an active research area in natural language processing. The knowledge gained through text processing is used in a variety of other domains, such as artificial intelligence, optical reading, and chatbots. Language detection in text has also become a popular topic of study because of the use of multiple languages on the internet. Language identification is particularly difficult in bilingual (a mix of two languages) and multilingual (a mix of more than two languages) data. Accordingly, this research presents a method to detect tokens written in Sinhala and English in code-mixed data. To the best of the author’s knowledge at the time of writing, this is the first such study conducted on Sinhala-English code-mixed data. To be precise, it is the first attempt to build a machine learning model for Sinhala-English code-mixed data written in Latin alphabetic characters. If the Sinhala words in code-mixed data are written in Sinhala Unicode characters, language detection is straightforward and can be achieved with a simple Python program. However, when the whole sentence is written in Latin characters, ambiguity increases and the language cannot be detected so easily. This study attempts to build a model that addresses this ambiguity.
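As an illustration of why the native-script case is simple, the following is a minimal sketch (not the model developed in this study) that labels tokens by script using the Sinhala Unicode block, U+0D80 to U+0DFF. The function names and the labels "sinhala", "latin" and "other" are assumptions chosen for this example.

```python
# Minimal sketch: script-based token labelling.
# The Sinhala Unicode block spans U+0D80 to U+0DFF.
# Function names and labels are hypothetical, chosen for illustration only.
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF


def is_sinhala_char(ch):
    """Return True if the character lies in the Sinhala Unicode block."""
    return SINHALA_START <= ord(ch) <= SINHALA_END


def tag_token(token):
    """Label a token as 'sinhala', 'latin' or 'other' by its script."""
    if any(is_sinhala_char(ch) for ch in token):
        return "sinhala"
    letters = [ch for ch in token if ch.isalpha()]
    if letters and all(ch.isascii() for ch in letters):
        return "latin"
    return "other"


def tag_sentence(sentence):
    """Split on whitespace and label every token."""
    return [(tok, tag_token(tok)) for tok in sentence.split()]


if __name__ == "__main__":
    # Native-script Sinhala is separable from English by code point alone.
    print(tag_sentence("මම school යනවා"))
    # A romanized Sinhala word such as "mama" is only recognised as Latin
    # script, exactly like an English word, so the language stays ambiguous.
    print(tag_sentence("mama school yanawa"))
```

In the second call every token receives the same "latin" label, which is the ambiguity that a script check cannot resolve and that motivates a learned model.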
In practice, Sri Lankans use Sinhala words together with English on social media platforms for communication, posting reviews, commenting and so on. Many methods exist for detecting Singlish words, especially those written in Unicode characters, yet the accuracy of these models in distinguishing Sinhala tokens from English tokens in code-mixed text is questionable. Therefore, this study presents a language detection model built with machine learning and natural language processing techniques. Accordingly, two models are introduced to identify Sinhala-English code-mixed data gathered from social media platforms, and another model is introduced to identify languages at the word level using state-of-the-art techniques. In addition, the dataset of Sinhala-English code-mixed data was published in ICTER 2019 [50] for use in similar studies, and the final study was published in IALP 2019, held in China [51]. |
en_US |