Abstract:
Community-based question answering forums are very popular these days. People
tend to refer to community forums for opinions in various fields such as electronics,
medicine, and automobiles. Good opinions are easy to find and free, but
it is hard to choose the correct one when there are thousands of reviews.
There have been several efforts to automate the activities of community-based
question answering systems, such as selecting the most relevant answers to a
question (question-comment similarity) and identifying previously posted questions
that are similar to a new question (question-question similarity). However, there
have been fewer attempts to automate duplicate detection in community
question answering systems. At the moment, it is the community itself that manually
detects duplicates, and existing automation attempts mostly target individual domains.
The objective of this research is to implement a mechanism that effectively identifies
duplicate questions in a data set consisting of question-answer sets from multiple
domains. The solution we propose covers two focus areas: classification and
retrieval. A neural network composed of two parallel LSTM layers (to represent the
query and the candidate question), an attention layer, and a gradient reversal layer
(based on domain) is proposed as the question pair classifier. When trained on
individual domains (without gradient reversal), it achieved better accuracy than the
latest baseline research for this dataset in 9 out of 12 domains. For retrieval, the approach was to
retrieve 20 candidates using BM25 and re-rank them using the already trained
classifiers. This places the duplicate in the top 10 with a better MAP than BM25 alone
in 6 out of 12 domains. Another important observation is that the common model built
with all the data combined achieved a better MAP than the individual models in 7 out of
12 domains in the retrieval case.
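
The retrieve-then-re-rank step described above can be illustrated with a minimal sketch. It is not the implementation used in this research: the BM25 parameters (k1 = 1.5, b = 0.75), the whitespace tokenizer, and every function name are assumptions made for illustration, and the hypothetical `jaccard` scorer merely stands in for the trained LSTM question pair classifier. The average precision shown is normalised by the number of known duplicates, which is one common convention.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over a small list of questions."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        df = Counter(t for d in self.docs for t in set(d))
        n = len(self.docs)
        # Smoothed IDF that stays non-negative for very frequent terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            num = tf[t] * (self.k1 + 1)
            den = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * num / den
        return total

    def top_k(self, query, k=20):
        # Indices of the k highest-scoring candidate questions.
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def rerank(query, candidates, docs, pair_score):
    # Re-order BM25 candidates by a pairwise duplicate score, highest first.
    return sorted(candidates, key=lambda i: pair_score(query, docs[i]), reverse=True)


def average_precision(ranked, relevant, cutoff=10):
    # Average precision over the top `cutoff` results for one query.
    hits, total = 0, 0.0
    for rank, i in enumerate(ranked[:cutoff], start=1):
        if i in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def jaccard(query, doc):
    # Hypothetical stand-in for the trained question pair classifier.
    a, b = set(tokenize(query)), set(tokenize(doc))
    return len(a & b) / len(a | b) if a | b else 0.0
```

MAP is then the mean of `average_precision` over all query questions, computed once for the raw BM25 order and once for the re-ranked order so the two can be compared per domain.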