Abstract:
Sentiment analysis has become a popular research topic over the last decade. Growing use of the internet has led to an increase in user-generated content, which has drawn more researchers to the field. This content can give governments and industries valuable insight into public opinion.
This research focuses mainly on sentiment analysis of the Sinhala language. Sinhala is the most widely spoken language in Sri Lanka. With the increased use of the internet and social media, a considerable amount of information is now communicated in Sinhala, which presents a good opportunity to mine that information. Sentiment analysis of Sinhala is difficult, however, because Sinhala is morphologically rich and, compared to English, has free word order. The lack of Sinhala language resources poses challenges ranging from gathering and generating datasets to stemming and lemmatizing algorithms. This research addresses these challenges by developing a Sinhala dataset suitable for sentiment analysis and a stemming algorithm for Sinhala. The dataset was built by collecting tweets from Twitter and annotating them manually.
In addition to creating these resources, the research performs sentiment analysis of Sinhala using word embeddings as features. Several sentiment analysis experiments are carried out with several machine learning techniques, and accuracy, precision, and recall are used to identify the best-performing model. The research also discusses the problems encountered when conducting sentiment analysis for Sinhala and the differences between user-generated content in English and in Sinhala.
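
As an illustration of the evaluation measures named above, the following minimal sketch computes accuracy, precision, and recall for a single target class from gold and predicted labels. The function name `evaluate`, the label strings, and the toy data are assumptions made for this example only; they are not taken from the research's dataset or code.

```python
from collections import Counter


def evaluate(gold, predicted, positive="positive"):
    """Return (accuracy, precision, recall) for one target class.

    `gold` and `predicted` are equal-length sequences of labels.
    The label name "positive" is a hypothetical choice for this sketch.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    counts = Counter()
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            counts["tp"] += 1  # correctly predicted as the target class
        elif p == positive:
            counts["fp"] += 1  # predicted target class, gold says otherwise
        elif g == positive:
            counts["fn"] += 1  # missed an instance of the target class
        else:
            counts["tn"] += 1  # correctly predicted as another class

    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    total = len(gold)

    # Guard against empty denominators so the sketch never divides by zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels, invented purely for illustration.
gold = ["positive", "negative", "positive", "negative", "positive"]
pred = ["positive", "positive", "negative", "negative", "positive"]
accuracy, precision, recall = evaluate(gold, pred)
```

Reporting precision and recall alongside accuracy matters when the classes are imbalanced, since a model can reach high accuracy by favoring the majority class.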