Word level language identification of code-mixing text in social media using NLP

Shanmugalingam K


dc.contributor.advisor	Sumathipala S
dc.contributor.author	Shanmugalingam K
dc.date.accessioned	2019
dc.date.available	2019
dc.date.issued	2019
dc.identifier.citation	Shanmugalingam, K. (2019). Word level language identification of code-mixing text in social media using NLP [Master’s thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.mrt.ac.lk/handle/123/15810
dc.identifier.uri	http://dl.lib.mrt.ac.lk/handle/123/15810
dc.description.abstract	Automatically analyzing noisy social media content and extracting useful information from it is receiving growing attention from the research community. People now readily mix their native language with English to express their thoughts on social media, writing native-language words in Roman script. Such noisy code-mixed text is characterized by a high percentage of spelling mistakes, along with phonetic typing, wordplay, creative spelling, abbreviations, meta tags, and so on. Identifying languages at the word level is therefore a necessary part of analyzing noisy social media content. It could serve as an intermediate language identifier for chatbot applications that use native languages. This study used Tamil-English and Sinhala-English code-mixed text from social media. Natural Language Processing (NLP) and Machine Learning (ML) techniques were used to identify language tags at the word level. The proposed novel approach is implemented as a machine learning classifier. For Tamil-English code-mixed text, the features are Tamil Unicode characters written in Roman script, dictionaries, double consonants, and term frequency; for Sinhala-English code-mixed text, the features are Sinhala Unicode characters written in Roman script, dictionaries, and term frequency. Several machine learning classifiers, namely Support Vector Machines (SVM), Naive Bayes, Logistic Regression, Random Forest, and Decision Trees, were used in model evaluation. Ten-fold cross-validation was used to evaluate performance on word-level language tags. The highest accuracies were 89.46% with the SVM classifier for Tamil-English (Tanglish) code-mixed text and 90.5% with the Random Forest classifier for Sinhala-English (Singlish) code-mixed text.
When tested on unseen data, the Tanglish model with SVM and the Singlish model with Random Forest achieved accuracies of 93.87% and 95.83%, respectively. The Tanglish model with SVM achieved F-measures of 0.965 and 0.894 for the ‘tam’ and ‘eng’ tags, respectively. The Singlish model with Random Forest achieved F-measures of 0.975 and 0.929 for the ‘sin’ and ‘eng’ tags, respectively. This is evidence that, most of the time, the Tanglish model with SVM and the Singlish model with Random Forest predict language labels correctly at the word level.	en_US
dc.language.iso	en	en_US
dc.subject	COMPUTATIONAL MATHEMATICS-Dissertations	en_US
dc.subject	ARTIFICIAL INTELLIGENCE-Dissertations	en_US
dc.subject	NATURAL LANGUAGE PROCESSING	en_US
dc.subject	MACHINE LEARNING	en_US
dc.subject	MACHINE LEARNING-Support Vector Machines	en_US
dc.subject	SOCIAL MEDIA	en_US
dc.subject	SOCIAL MEDIA-Code-Mixed Text	en_US
dc.subject	ENGLISH LANGUAGE-Social Media	en_US
dc.subject	SINHALA LANGUAGE-Social Media	en_US
dc.title	Word level language identification of code-mixing text in social media using NLP	en_US
dc.type	Thesis-Full-text	en_US
dc.identifier.faculty	IT	en_US
dc.identifier.degree	MSc in Artificial Intelligence	en_US
dc.identifier.department	Department of Computational Mathematics	en_US
dc.date.accept	2019
dc.identifier.accno	TH3879	en_US
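
Illustrative note: the abstract names dictionary lookup, double consonants, and term frequency as word-level features for Tanglish, and reports a per-tag F-measure. The sketch below shows one way such features and that metric might be computed. It is not the thesis implementation: the word lists, the function names, and the exact definition of "double consonant" (the same consonant letter repeated) are assumptions made for illustration only.

```python
# Hypothetical toy dictionaries; the record does not list the dictionaries used in the thesis.
ENGLISH_WORDS = {"the", "is", "movie", "good", "very", "today"}
TAMIL_ROMAN_WORDS = {"romba", "nalla", "padam", "illa", "enna", "vanakkam"}

VOWELS = set("aeiou")


def has_double_consonant(word):
    """Return True if the same consonant letter occurs twice in a row (e.g. 'kk')."""
    w = word.lower()
    return any(
        a == b and a.isalpha() and a not in VOWELS
        for a, b in zip(w, w[1:])
    )


def word_features(word, term_counts):
    """Build a feature dictionary for one token (assumed feature set, for illustration)."""
    w = word.lower()
    return {
        "in_english_dict": w in ENGLISH_WORDS,
        "in_tamil_dict": w in TAMIL_ROMAN_WORDS,
        "double_consonant": has_double_consonant(w),
        "term_frequency": term_counts.get(w, 0),
    }


def f_measure(gold, predicted, tag):
    """F1 score for a single language tag, from parallel lists of gold and predicted tags."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == tag and p == tag for g, p in pairs)
    fp = sum(g != tag and p == tag for g, p in pairs)
    fn = sum(g == tag and p != tag for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Feature dictionaries of this kind would then be passed to a classifier such as an SVM or Random Forest. A repeated consonant is only a weak signal on its own, since English words can contain one as well, which is presumably why the abstract combines it with the other features.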