Abstract:
Content-based speech classification has many use cases today, including speech topic identification and speech command recognition. Among these, speech command-based user interfaces have become popular because they allow humans to interact with digital devices using natural language. Such interfaces must identify the intent of a given query.
Automatic Speech Recognition (ASR) underlies all of these applications, converting speech into text. However, building an ASR system for a language is a resource-intensive task. Although there are more than 6,000 languages in the world, such speech-based applications remain limited to well-resourced languages such as English because of the large amounts of data ASR requires. Some past research has investigated classifying speech while addressing this data scarcity, but these methods have their own limitations.
This study presents a direct speech intent identification method for low-resource languages that uses a transfer learning mechanism. It employs three audio-based feature generation techniques that can represent the semantic information contained in speech: unsupervised acoustic unit features, character features, and phoneme features. The proposed method is evaluated on Sinhala and Tamil datasets in the banking domain. Among the three feature types, the phoneme-based features, which can be extracted from ASR systems, yield the best intent identification results. The experimental results show that the method achieves more than 80% accuracy in both languages with only 0.5 hours of speech data.