Pre-trained language model-based semi-supervised learning approach for content-based email categorization

Kankanamge ND

dc.contributor.advisor	Silva A T P
dc.contributor.author	Kankanamge ND
dc.date.accessioned	2020
dc.date.available	2020
dc.date.issued	2020
dc.identifier.citation	Kankanamge, N.D. (2020). Pre-trained language model-based semi-supervised learning approach for content-based email categorization [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21466
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/21466
dc.description.abstract	Recent developments in neuroscience have revolutionized modern trends in artificial intelligence. Artificial neural networks (ANNs), which are artificial models of the human brain, have started to dominate the field. ANNs are mainly used for data classification and prediction, with applications in areas such as health, education, entertainment, and business. Email classification has been a problem for many large organizations because it requires human intervention, and many artificial intelligence-based solutions have been proposed. For content-based email filtering, many recent researchers have found that ANN-based approaches become more useful than conventional natural language modelling methods as the volume of data increases. One reason is that ANNs can capture some hidden styles of writing that conventional natural language processing does not. However, conventional ANNs suffer from a lack of labeled training data. This is the major drawback of the conventional ANN approach, because generating labeled data requires human effort and is therefore costly. It has prevented ANN solutions from providing a generic approach to email classification in any domain, since training a particular ANN successfully requires a large amount of labeled data from each domain. This thesis reports on our research on content-based email classification using semi-supervised learning, which addresses these issues with conventional ANNs. Semi-supervised learning was introduced around 15 years ago but has only recently come to play an important role in artificial intelligence. It needs only a minimal amount of labeled data for training and can use unlabeled data to increase its accuracy.
The proposed solution is a multi-view co-training approach that takes labeled emails, unlabeled emails, and the names of the different categories as inputs. The output of the project is a trained model that can classify emails into the given categories. We tested our solution with 10,000 training samples, of which only 10% to 20% were given to the system as labeled data and the rest were used as unlabeled data. We achieved an accuracy of around 0.888, which is an improvement of more than 5% in accuracy for the total system.	en_US
dc.language.iso	en	en_US
dc.subject	EMAIL CATEGORIZATION SYSTEM	en_US
dc.subject	EMAIL CATEGORIZATION	en_US
dc.subject	SEMI-SUPERVISED LEARNING-BASED SOLUTION	en_US
dc.subject	INFORMATION TECHNOLOGY -Dissertation	en_US
dc.subject	ARTIFICIAL INTELLIGENCE -Dissertation	en_US
dc.subject	COMPUTATIONAL MATHEMATICS -Dissertation	en_US
dc.title	Pre-trained language model-based semi-supervised learning approach for content-based email categorization	en_US
dc.type	Thesis-Abstract	en_US
dc.identifier.faculty	IT	en_US
dc.identifier.degree	MSc. in Artificial Intelligence	en_US
dc.identifier.department	Department of Computational Mathematics	en_US
dc.date.accept	2020
dc.identifier.accno	TH5003	en_US
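
Illustrative note on the abstract: the multi-view co-training idea it describes can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the thesis implementation. It assumes the two views are an email's subject tokens and body tokens, and it swaps in a tiny hand-written Naive Bayes classifier for the pre-trained language model; the names `NaiveBayes`, `co_train`, `predict`, and all parameters and sample emails are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes over token lists (stand-in for a real model)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict_proba(self, tokens):
        total_docs = sum(self.prior.values())
        scores = {}
        for c in self.classes:
            # Laplace smoothing; the extra +1 reserves mass for unseen tokens.
            denom = sum(self.counts[c].values()) + len(self.vocab) + 1
            logp = math.log(self.prior[c] / total_docs)
            for t in tokens:
                logp += math.log((self.counts[c][t] + 1) / denom)
            scores[c] = logp
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def fit_views(labeled):
    """Train one classifier per view on the current labeled set."""
    ys = [y for _, y in labeled]
    clf_a = NaiveBayes().fit([v[0] for v, _ in labeled], ys)
    clf_b = NaiveBayes().fit([v[1] for v, _ in labeled], ys)
    return clf_a, clf_b


def co_train(labeled, unlabeled, rounds=5, per_round=2, threshold=0.6):
    """labeled: [((subject, body), label)]; unlabeled: [(subject, body)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}
        # Each view's classifier pseudo-labels its most confident pool items.
        for view, clf in enumerate(fit_views(labeled)):
            scored = []
            for i, example in enumerate(pool):
                proba = clf.predict_proba(example[view])
                label = max(proba, key=proba.get)
                scored.append((proba[label], i, label))
            scored.sort(reverse=True)
            for conf, i, label in scored[:per_round]:
                if conf >= threshold and i not in picked:
                    picked[i] = label
        if not picked:
            break
        labeled += [(pool[i], y) for i, y in picked.items()]
        pool = [ex for i, ex in enumerate(pool) if i not in picked]
    clf_a, clf_b = fit_views(labeled)
    return clf_a, clf_b, labeled


def predict(clf_a, clf_b, example):
    """Combine both views by multiplying their class probabilities."""
    pa = clf_a.predict_proba(example[0])
    pb = clf_b.predict_proba(example[1])
    return max(pa, key=lambda c: pa[c] * pb[c])


# Hypothetical toy data: two labeled emails and three unlabeled ones.
labeled = [
    ((["invoice", "due"], ["please", "pay", "invoice", "amount"]), "billing"),
    ((["meeting", "tomorrow"], ["agenda", "for", "team", "meeting"]), "meetings"),
]
unlabeled = [
    (["invoice", "reminder"], ["payment", "amount", "overdue"]),
    (["meeting", "room"], ["team", "agenda", "schedule"]),
    (["payment", "overdue"], ["pay", "amount", "now"]),
]
clf_a, clf_b, final_labeled = co_train(labeled, unlabeled)
print(predict(clf_a, clf_b, (["invoice"], ["pay", "amount"])))
```

In this sketch each view's classifier adds its most confident pseudo-labels to a shared labeled set, so an email that one view finds ambiguous can still be labeled by the other. The confidence threshold and per-round limit are arbitrary illustrative values.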