Abstract:
Spoken language understanding has several applications, such as topic modeling and intent detection. One of the primary components underlying spoken language understanding studies is the automatic speech recognition model. In recent years, automatic speech recognition systems have improved considerably at recognizing spoken utterances. However, this remains a challenging task for low-resource languages, as training an automatic speech recognition model requires hundreds of hours of audio input.
To overcome this issue, recent studies have used transfer learning techniques. However, the errors produced by automatic speech recognition models significantly affect the downstream natural language understanding models used for intent or topic identification. In this work, we have proposed a multi-automatic speech recognition setup to overcome this issue. We have shown that combining the outputs of multiple automatic speech recognition models can significantly increase accuracy on low-resource speech-command transfer-learning tasks compared to using the output of a single automatic speech recognition model.
We have developed convolutional neural network-based setups that can utilize the outputs of pre-trained automatic speech recognition models such as DeepSpeech2 and Wav2Vec 2.0. The experimental results show a 7% increase in accuracy over the current state-of-the-art low-resource speech-command phoneme-based speech intent classification methodology.
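
The abstract does not spell out the fusion architecture, so the following is only a minimal PyTorch sketch of the general idea: each ASR model's transcript is embedded and passed through its own convolutional branch, and the pooled features are concatenated before intent classification. All layer names, dimensions, and the late-fusion-by-concatenation choice are illustrative assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn

class MultiASRIntentClassifier(nn.Module):
    """Sketch: fuse token sequences from two ASR transcripts with
    parallel 1-D convolutional branches, then classify the intent."""

    def __init__(self, vocab_size=100, embed_dim=64, num_filters=128,
                 kernel_size=3, num_intents=6):
        super().__init__()
        # Separate embeddings for each ASR model's output tokens
        # (e.g. DeepSpeech2 vs. Wav2Vec 2.0 transcripts).
        self.embed_a = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.embed_b = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv_a = nn.Conv1d(embed_dim, num_filters, kernel_size)
        self.conv_b = nn.Conv1d(embed_dim, num_filters, kernel_size)
        # Concatenated pooled features from both branches feed the classifier.
        self.classifier = nn.Linear(2 * num_filters, num_intents)

    def forward(self, tokens_a, tokens_b):
        # tokens_*: (batch, seq_len) integer-encoded transcripts.
        feat_a = self.conv_a(self.embed_a(tokens_a).transpose(1, 2))
        feat_b = self.conv_b(self.embed_b(tokens_b).transpose(1, 2))
        # Global max-pool over time, then fuse by concatenation.
        pooled = torch.cat([feat_a.max(dim=2).values,
                            feat_b.max(dim=2).values], dim=1)
        return self.classifier(pooled)

# Example: a batch of 2 utterances, 20 tokens per transcript.
model = MultiASRIntentClassifier()
a = torch.randint(1, 100, (2, 20))
b = torch.randint(1, 100, (2, 20))
print(model(a, b).shape)  # torch.Size([2, 6])
```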
Citation:
Isham, J.M. (2022). Combining automatic speech recognition models to reduce error propagation in low-resource transfer-learning speech-command recognition [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21854