Institutional-Repository, University of Moratuwa.  

Pre-trained language model-based semi-supervised learning approach for content-based email categorization

dc.contributor.advisor Silva A T P
dc.contributor.author Kankanamge ND
dc.date.accessioned 2020
dc.date.available 2020
dc.date.issued 2020
dc.identifier.citation Kankanamge, N.D. (2020). Pre-trained language model-based semi-supervised learning approach for content-based email categorization [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21466
dc.identifier.uri http://dl.lib.uom.lk/handle/123/21466
dc.description.abstract Recent developments in neuroscience have revolutionized modern trends in artificial intelligence. Artificial neural networks (ANNs), which are artificial models of the human brain, have started to dominate the field of artificial intelligence. ANNs are mainly used for data classification and prediction, with applications in health, education, entertainment, and business. Email classification has been a problem for many large organizations because it requires human involvement, and many artificial intelligence-based solutions have been proposed. For content-based email filtering, many recent researchers have found that ANN-based approaches are more useful than conventional natural language modelling methods as the volume of data increases. One reason is that ANNs can capture some hidden styles of writing that conventional natural language processing does not capture. However, conventional ANNs suffer from a lack of labeled training data. This is the major drawback of the conventional ANN approach: generating labeled data requires human involvement, which makes it a costly process. It has kept ANN solutions from providing a generic approach to email classification in any domain, since success requires a large amount of labeled data from each domain to train the particular ANN. This thesis reports on our research on content-based email classification using semi-supervised learning, which addresses these issues with conventional ANNs. Semi-supervised learning was introduced around 15 years ago but has only recently come to play an important role in artificial intelligence. It offers a solution because it needs only a small amount of labeled data for training and can use unlabeled data to increase its accuracy.
The proposed solution is a multi-view co-training approach that takes labeled emails, unlabeled emails, and the names of the different categories as inputs. The output of the project is a trained model that can classify emails into the given categories. We tested our solution with 10,000 training samples, of which only 10% to 20% were given to the system as labeled data and the rest were used as unlabeled data. We achieved an accuracy of around 0.888, which is an accuracy improvement of more than 5% over the total system. en_US
dc.language.iso en en_US
dc.subject EMAIL CATEGORIZATION SYSTEM en_US
dc.subject EMAIL CATEGORIZATION en_US
dc.subject SEMI-SUPERVISED LEARNING-BASED SOLUTION en_US
dc.subject INFORMATION TECHNOLOGY -Dissertation en_US
dc.subject ARTIFICIAL INTELLIGENCE -Dissertation en_US
dc.subject COMPUTATIONAL MATHEMATICS -Dissertation en_US
dc.title Pre-trained language model-based semi-supervised learning approach for content-based email categorization en_US
dc.type Thesis-Abstract en_US
dc.identifier.faculty IT en_US
dc.identifier.degree MSc. in Artificial Intelligence en_US
dc.identifier.department Department of Computational Mathematics en_US
dc.date.accept 2020
dc.identifier.accno TH5003 en_US
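
The abstract describes a multi-view co-training approach that learns from a small labeled set and a larger unlabeled set. The general idea can be illustrated with the minimal Python sketch below. It is not the thesis implementation. The two views (subject and body), the `TinyNB` naive Bayes classifier, the confidence threshold, and the toy emails are all assumptions made for illustration. The sketch uses neither a pre-trained language model nor the category names that the abstract lists as inputs.

```python
import math
from collections import Counter


class TinyNB:
    """Multinomial naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict_proba(self, doc):
        # Words never seen in training are ignored.
        words = [w for w in doc.lower().split() if w in self.vocab]
        total = sum(self.prior.values())
        scores = {}
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            scores[c] = math.log(self.prior[c] / total) + sum(
                math.log((self.counts[c][w] + 1) / denom) for w in words)
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


def fit_views(labeled):
    """Train one classifier per view: index 0 = subject, index 1 = body."""
    labels = [label for _, label in labeled]
    return [TinyNB().fit([email[v] for email, _ in labeled], labels)
            for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=5, threshold=0.6):
    """Each round, every view pseudo-labels its most confident unlabeled
    email per class; those emails then join the shared labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}  # pool index -> pseudo-label
        for view, model in enumerate(fit_views(labeled)):
            best = {}  # class -> (confidence, pool index)
            for i, email in enumerate(pool):
                proba = model.predict_proba(email[view])
                label = max(proba, key=proba.get)
                conf = proba[label]
                if conf >= threshold and conf > best.get(label, (0.0, -1))[0]:
                    best[label] = (conf, i)
            for label, (_, i) in best.items():
                picked.setdefault(i, label)
        if not picked:
            break
        labeled += [(pool[i], label) for i, label in picked.items()]
        pool = [e for i, e in enumerate(pool) if i not in picked]
    return fit_views(labeled), labeled


def classify(models, email):
    """Average the class probabilities of the two views."""
    avg = {c: sum(m.predict_proba(email[v])[c] for v, m in enumerate(models)) / 2
           for c in models[0].classes}
    return max(avg, key=avg.get)


# Toy data (hypothetical): each email is a (subject, body) pair.
labeled = [(("invoice overdue", "pay invoice amount"), "billing"),
           (("meeting agenda", "join team meeting"), "meetings")]
unlabeled = [("invoice reminder", "amount payment due"),
             ("meeting room", "team room booked"),
             ("payment receipt", "payment amount received"),
             ("room change", "meeting room moved")]

models, all_labeled = co_train(labeled, unlabeled)
print(len(all_labeled))
print(classify(models, ("payment overdue", "invoice amount due")))
```

Selecting one confident example per class per view keeps the pseudo-labeled set balanced across categories, which matters when the initial labeled set is as small as the 10% to 20% share mentioned in the abstract.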

