Abstract:
Recent developments in neuroscience have revolutionized modern trends in artificial
intelligence. Artificial neural networks (ANNs), which are artificial models of the human brain,
have started to dominate the field of artificial intelligence. ANNs are mainly used for
data classification and prediction. Their applications are numerous, ranging from
health and education to entertainment and business.
Email classification has been an issue for many large organizations because it requires human
interaction. Many artificial intelligence-based solutions have been proposed. For
content-based email filtering, many recent researchers have found that ANN-based
approaches become much more useful than conventional natural language modelling
methods as the volume of data increases. One reason is that ANNs can capture
some hidden styles of writing that conventional natural language processing
does not capture. However, conventional ANNs suffer from a lack of labeled data
for training. This is the major drawback of the conventional ANN approach, because generating
labeled data requires human interaction and is therefore a costly process. It has prevented
ANN solutions from providing a generic approach to email classification in any domain, since
success requires a large amount of labeled data from each domain to train the
particular ANN.
This thesis reports on our research on content-based email classification using semi-supervised
learning, which addresses these issues with conventional ANNs. Semi-supervised learning was
introduced around 15 years ago but has only recently come to play an important role in the
field of artificial intelligence. It offers a solution to the labeled-data problem because it
needs only a minimal amount of labeled data for training and can use unlabeled data to increase its
accuracy. The proposed solution is a multi-view co-training approach that takes labeled emails,
unlabeled emails, and the names of the different categories as inputs. The output of the project
is a trained model that can classify emails into the given categories. We tested our solution
with 10,000 training samples, of which only 10% to 20% were given to the system as labeled data
and the rest were used as unlabeled data. We achieved an accuracy of around 0.888, which is
an accuracy improvement of more than 5% for the total system.
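The co-training idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes two views of each email (subject tokens and body tokens), swaps in a tiny naive Bayes classifier per view in place of a pre-trained language model, and does not use the category names as an input. All names (`NaiveBayes`, `co_train`, `classify`), the toy emails, and the `rounds`, `per_round` and `threshold` parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes over token lists (one instance per view)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict(self, tokens):
        """Return (best_label, posterior probability of that label)."""
        total = sum(self.prior.values())
        log_scores = {}
        for c in self.classes:
            # Laplace smoothing; the extra +1 reserves mass for unseen tokens.
            denom = sum(self.counts[c].values()) + len(self.vocab) + 1
            score = math.log(self.prior[c] / total)
            for t in tokens:
                score += math.log((self.counts[c][t] + 1) / denom)
            log_scores[c] = score
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        best = max(exp, key=exp.get)
        return best, exp[best] / sum(exp.values())


def train_views(labeled):
    """Train one classifier per view (0 = subject, 1 = body)."""
    labels = [y for _, y in labeled]
    return [NaiveBayes().fit([x[v] for x, _ in labeled], labels) for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=5, per_round=2, threshold=0.6):
    """Grow the labeled set with confident pseudo-labels from either view."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        models = train_views(labeled)
        picked = {}  # pool index -> pseudo-label
        for v, model in enumerate(models):
            scored = [(model.predict(x[v]), i) for i, x in enumerate(pool)]
            scored.sort(key=lambda s: s[0][1], reverse=True)
            for (label, conf), i in scored[:per_round]:
                if conf >= threshold and i not in picked:
                    picked[i] = label
        if not picked:
            break  # nothing confident enough (or pool is empty)
        labeled += [(pool[i], lab) for i, lab in picked.items()]
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return train_views(labeled)


def classify(models, email):
    """Use the prediction of whichever view is more confident."""
    preds = [m.predict(email[v]) for v, m in enumerate(models)]
    return max(preds, key=lambda p: p[1])[0]


# Toy data: each email is (subject_tokens, body_tokens).
labeled = [
    ((["invoice", "due"], ["please", "pay", "the", "invoice"]), "finance"),
    ((["team", "lunch"], ["join", "us", "for", "lunch"]), "social"),
]
unlabeled = [
    (["invoice", "reminder"], ["payment", "overdue", "pay", "now"]),
    (["lunch", "friday"], ["team", "outing", "join", "us"]),
    (["payment", "receipt"], ["payment", "received", "thanks"]),
]

models = co_train(labeled, unlabeled)
finance_pred = classify(models, (["payment", "overdue"], ["overdue", "payment"]))
social_pred = classify(models, (["lunch", "plans"], ["join", "team", "lunch"]))
print(finance_pred, social_pred)
```

In this sketch, words such as "payment" never occur in the two labeled emails; they only become informative after the unlabeled emails are pseudo-labeled, which is the mechanism by which unlabeled data can raise accuracy.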
Citation:
Kankanamge, N.D. (2020). Pre-trained language model-based semi-supervised learning approach for content-based email categorization [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21466