Sinhala-English language detection in code-mixed data

Smith JRI

dc.contributor.advisor	Thayasivam U
dc.contributor.author	Smith JRI
dc.date.accessioned	2020
dc.date.available	2020
dc.date.issued	2020
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/16490
dc.description.abstract	Text processing is a highly active research area within natural language processing. The knowledge gathered through text processing is used in a variety of other domains, such as artificial intelligence, optical reading, and chatbots. Language detection in text has also become a widely studied problem because of the use of multiple languages on the internet, and it is particularly difficult in bilingual (a mix of two languages) and multilingual (a mix of more than two languages) data. Accordingly, this research presents a method to detect tokens written in Sinhala and English in code-mixed data. To the best of the author's knowledge at the time of writing, this is the first such study conducted on Sinhala-English code-mixed data. To be precise, it is the first attempt to build a machine learning model for Sinhala-English code-mixed data written in Latin alphabetic characters. If the code-mixed data contains Unicode characters, language detection is straightforward and can be achieved with a simple Python program. However, when the whole sentence is written in Latin characters, ambiguity increases and the language is no longer straightforward to detect; this study attempts to build a model that addresses this ambiguity. In practice, Sri Lankans use Sinhala words together with English on social media platforms for communication, posting reviews, commenting, and so on. There are many methods for detecting Singlish words, especially those written in Unicode characters, yet the accuracy of these models in distinguishing Sinhala tokens from English tokens in code-mixed text is questionable. Therefore, this study presents a language detection model built using machine learning and natural language processing techniques.
Accordingly, two models are introduced to identify Sinhala-English code-mixed data gathered from social media platforms, and another model to identify languages at the word level using state-of-the-art techniques. In addition, the dataset of Sinhala-English code-mixed data was published at ICTER 2019 [50] for use in similar studies, and the final study was published at IALP 2019, held in China [51].	en_US
dc.language.iso	en	en_US
dc.subject	COMPUTER SCIENCE - Dissertations	en_US
dc.subject	COMPUTER SCIENCE AND ENGINEERING - Dissertations	en_US
dc.subject	TEXT PROCESSING	en_US
dc.subject	NATURAL LANGUAGE PROCESSING	en_US
dc.subject	MULTI LANGUAGE LEARNING	en_US
dc.subject	MACHINE LEARNING - Sinhala-English Code-Mixed Data	en_US
dc.subject	UNICODE CHARACTERS - Singlish	en_US
dc.subject	LANGUAGE DETECTION	en_US
dc.title	Sinhala-English language detection in code-mixed data	en_US
dc.type	Thesis-Full-text	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.degree	MSc in Computer Science and Engineering	en_US
dc.identifier.department	Department of Computer Science and Engineering	en_US
dc.date.accept	2020
dc.identifier.accno	TH4291	en_US
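
The abstract remarks that when code-mixed text contains Unicode Sinhala characters, language detection is straightforward and can be done with a simple Python program. The sketch below illustrates that remark only; it is not the thesis's model. It assumes whitespace tokenization and relies on the Sinhala Unicode block (U+0D80 to U+0DFF); the function names `detect_script` and `tag_tokens` are hypothetical.

```python
# Illustrative sketch only: script-based tagging via Unicode code points.
# Assumption: Sinhala-script text falls in the Unicode Sinhala block
# (U+0D80..U+0DFF). Function names here are hypothetical.

SINHALA_START = 0x0D80
SINHALA_END = 0x0DFF


def detect_script(token: str) -> str:
    """Classify a token by the script of its characters."""
    has_sinhala = any(SINHALA_START <= ord(ch) <= SINHALA_END for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_sinhala and has_latin:
        return "mixed"
    if has_sinhala:
        return "sinhala"
    if has_latin:
        return "latin"
    return "other"


def tag_tokens(text: str) -> list[tuple[str, str]]:
    """Split on whitespace and tag each token with its script."""
    return [(tok, detect_script(tok)) for tok in text.split()]


if __name__ == "__main__":
    # Sinhala written in its own script is separable from English.
    print(tag_tokens("මම school යනවා"))
    # The same sentence romanized: every token is tagged "latin".
    print(tag_tokens("mama school yanawa"))
```

The second call shows the ambiguity the abstract describes: once Sinhala is written in Latin characters, a script check tags every token the same way, which is why the thesis turns to machine learning models instead.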