Abstract:
Understanding social media content has been a
primary research topic since the dawn of social networking.
Of particular interest is the contextual understanding of noisy
text, which is characterized by a high percentage of spelling
mistakes along with creative spelling, phonetic typing,
wordplay, abbreviations, and meta tags. Processing such data
therefore demands a more
complex system than traditional natural language processors.
People also readily mix two or more languages to
express their thoughts on social media. Automatic
language identification at the word level has therefore become
a necessary part of analyzing noisy social media content, and
it would help automate that analysis.
This study uses Tamil-English code-mixed data from
popular social media posts and comments and assigns
word-level language tags using Natural Language Processing (NLP)
and modern Machine Learning (ML) techniques. The
proposed methodology is a novel approach
implemented as a machine learning classifier based on features
such as Tamil Unicode characters in Roman scripts,
dictionaries, double consonants, and term frequency. Different
machine learning classifiers such as Naive Bayes, Logistic
Regression, Support Vector Machines (SVM), Decision Trees,
and Random Forest were used in training and testing. Among
them, the SVM classifier obtained the highest accuracy, 89.46%.
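
The feature types named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the word lists, the function names, and the exact definition of each feature (for example, treating a "double consonant" as the same ASCII consonant repeated back to back) are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical toy dictionaries; the study's actual resources are not given here.
ENGLISH_WORDS = {"the", "movie", "super", "is", "good"}
TAMIL_ROMANIZED_WORDS = {"padam", "nalla", "irukku", "romba"}
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


def has_tamil_unicode(word):
    """True if any character lies in the Unicode Tamil block (U+0B80..U+0BFF)."""
    return any("\u0b80" <= ch <= "\u0bff" for ch in word)


def has_double_consonant(word):
    """True if the same ASCII consonant occurs twice in a row (assumed definition)."""
    w = word.lower()
    return any(a == b and a in CONSONANTS for a, b in zip(w, w[1:]))


def extract_features(word, term_counts):
    """Build one feature dictionary for a single token."""
    w = word.lower()
    return {
        "tamil_unicode": has_tamil_unicode(word),
        "in_english_dict": w in ENGLISH_WORDS,
        "in_tamil_dict": w in TAMIL_ROMANIZED_WORDS,
        "double_consonant": has_double_consonant(word),
        "term_frequency": term_counts[w],
    }


def featurize(tokens):
    """Count lowercase term frequencies, then featurize every token."""
    counts = Counter(t.lower() for t in tokens)
    return [extract_features(t, counts) for t in tokens]


if __name__ == "__main__":
    for token, feats in zip(
        ["padam", "romba", "super"], featurize(["padam", "romba", "super"])
    ):
        print(token, feats)
```

Feature dictionaries of this shape could then be vectorized and passed to any of the classifiers listed above, for instance scikit-learn's `DictVectorizer` followed by `LinearSVC`; that pairing is likewise an assumption and not something the abstract specifies.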
Citation:
K. Shanmugalingam, S. Sumathipala and C. Premachandra, "Word Level Language Identification of Code Mixing Text in Social Media using NLP," 2018 3rd International Conference on Information Technology Research (ICITR), 2018, pp. 1-5, doi: 10.1109/ICITR.2018.8736127.