Abstract:
News aggregators let readers view
news from multiple news providers through a single access point. At
present, the only news aggregator that supports Tamil news is
Google News, which has some noticeable shortcomings. In this
study, Term Frequency–Inverse Document Frequency (TF-IDF) and
word embedding (fastText) document representation
techniques were combined with one-pass and affinity
propagation clustering algorithms, applied to the news title alone
as well as to the title and body together, in order to implement a
news aggregator for the Tamil language. For this study, we collected
data from nine
different news providers. When fastText was combined with the
one-pass algorithm on the news title and body, it outperformed the
other approaches, achieving an average pairwise F-score of 81% with
respect to manual clustering. We also created a
Tamil fastText word embedding model using more than 21
million words.
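
To make the two central ideas above concrete, the following is a minimal sketch of one-pass clustering over document vectors and of a pairwise F-score against a manual (reference) clustering. It is an illustration only, not the authors' implementation: the cosine similarity measure, the threshold of 0.8, the running-sum centroid update, and the toy two-dimensional vectors are all assumptions. In a real pipeline the vectors would be derived from TF-IDF or fastText representations of the news title, or of the title and body.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def one_pass_cluster(vectors, threshold=0.8):
    """Assign each vector, in arrival order, to the most similar existing
    cluster, or open a new cluster if no similarity reaches the threshold.
    The threshold value is an assumption chosen for illustration."""
    centroids = []  # running sum of member vectors per cluster
    labels = []
    for v in vectors:
        best, best_sim = -1, threshold
        for i, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best == -1:
            centroids.append(list(v))
            labels.append(len(centroids) - 1)
        else:
            # Cosine similarity ignores scale, so a running sum behaves
            # like a mean centroid for comparison purposes.
            centroids[best] = [c + x for c, x in zip(centroids[best], v)]
            labels.append(best)
    return labels


def pairwise_f_score(predicted, reference):
    """F-score over all document pairs: a pair is a true positive when it
    shares a cluster in both the predicted and the reference labelling."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_ref = reference[i] == reference[j]
        if same_pred and same_ref:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_ref:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy document vectors (hypothetical stand-ins for real embeddings).
toy_vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 0.05]]
toy_labels = one_pass_cluster(toy_vectors, threshold=0.8)
print(toy_labels)
print(pairwise_f_score(toy_labels, [0, 0, 1, 1, 0]))
```

Because one-pass clustering processes documents in arrival order and never revisits earlier assignments, its result depends on both the input order and the threshold, so both would need tuning on real data.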
Citation:
M. S. Faathima Fayaza and S. Ranathunga, "Tamil News Clustering Using Word Embeddings," 2020 Moratuwa Engineering Research Conference (MERCon), 2020, pp. 277-282, doi: 10.1109/MERCon50084.2020.9185282.