Abstract:
Document clustering is the task of organising text articles into groups, or clusters,
such that the documents in a cluster are about the same subject. For cross-lingual
document clustering, document embeddings must lie in the same embedding space,
i.e., similar documents should have similar vectors regardless of language. Obtaining document
embeddings for Tamil and Sinhala is feasible using models like Word2Vec or
FastText; however, these embeddings are language-specific, i.e., they do not lie in
the same vector space. Therefore, documents cannot be clustered across languages
using language-specific models. Pre-trained multilingual language models such as
mBERT and XLM-R were introduced to solve this problem by transferring
knowledge from high-resource languages to low-resource languages.
This research clusters Tamil, Sinhala, and English news articles using
XLM-R models. An adequate number of collected documents were clustered, and the
clustering techniques and their performance were evaluated. This research establishes a new
baseline for cross-lingual clustering of Tamil, Sinhala, and English documents.
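The idea that documents embedded in a shared vector space can be clustered across languages can be illustrated with a minimal sketch. The three-dimensional vectors below are hypothetical stand-ins for multilingual document embeddings (they are not XLM-R outputs), and the simple cosine-based k-means is an assumption chosen for illustration, not necessarily the clustering technique evaluated in the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cluster(vectors, k, iterations=10):
    """Toy k-means that assigns each vector to its most similar centroid."""
    # Farthest-point initialisation keeps this small example deterministic.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical embeddings: the same topic in different languages lies close together.
docs = {
    "en_sports":   [0.90, 0.10, 0.00],
    "ta_sports":   [0.80, 0.20, 0.10],
    "si_sports":   [0.85, 0.15, 0.05],
    "en_politics": [0.10, 0.90, 0.10],
    "ta_politics": [0.20, 0.80, 0.00],
    "si_politics": [0.15, 0.85, 0.10],
}

labels = cluster(list(docs.values()), k=2)
for name, label in zip(docs, labels):
    print(name, label)
```

With language-specific embeddings, the Tamil, Sinhala, and English vectors for one topic would have no reason to lie near each other, so the same procedure would tend to group documents by language rather than by subject.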
Citation:
Vithulan, M.V. (2022). Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22381