Abstract:
Sentence similarity calculation plays an important role in text processing-related research. Many unsupervised techniques such as knowledge-based techniques, corpus-based techniques, string similarity based techniques, and graph alignment techniques are available to measure sentence similarity. However, none of these techniques have been experimented with Tamil. In this paper, we present the first-ever system to measure semantic similarity for Tamil short phrases using a hybrid approach that makes use of knowledge-based and corpus-based techniques. We tested this system with 2000 general sentence pairs and 100 mathematical sentence pairs. For the dataset of 2000 sentence pairs, this approach achieved a Mean Squared Error of 0.195 and a Pearson Correlation factor of 0.815. For the 100 mathematical sentence pairs, this approach achieved an 85% of accuracy.