Abstract:
This paper investigates and compares acoustic modeling, language modeling, and data labeling
methods for automatic speech recognition of spoken dialogues in emergency call centers. Because
call-center dialogue speech is domain-specific and recorded in noisy, emotionally charged
conditions, available speech recognition systems perform poorly on it. To recognize dialogue
speech accurately, we therefore examined the main modules of a speech recognition system,
language models and acoustic training methodologies, together with symmetric data labeling
approaches. To find an effective
acoustic model for dialogue data, different types of Gaussian Mixture Model/Hidden Markov Model
(GMM/HMM) and Deep Neural Network/Hidden Markov Model (DNN/HMM) methodologies
were trained and compared. Additionally, effective language models for dialogue systems were
identified using extrinsic and intrinsic evaluation methods. Lastly, our proposed data labeling
approaches with spelling correction were compared with common labeling methods and outperformed
them by a notable margin. Based on the experimental results, we determined that a DNN/HMM
acoustic model, a trigram language model with Kneser–Ney discounting, and spelling correction
applied to the labels before training form an effective configuration for dialogue speech
recognition in emergency call centers. This research was
conducted with two different types of datasets collected from emergency calls: the Dialogue dataset
(27 h), which contains call agents’ speech, and the Summary dataset (53 h), which contains spoken
summaries of those dialogues describing the emergency cases. Although the speech collected from
the emergency call center is in Azerbaijani, a Turkic language, our approaches do not depend on
language-specific features. We therefore expect that the proposed approaches can be applied to
other languages of the same group.
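
As an illustration of the two kinds of language model evaluation mentioned above, the following
minimal sketch assumes that the intrinsic measure is perplexity and the extrinsic measure is word
error rate (WER). The abstract does not name the metrics, so both choices are assumptions. The
function names and the example sentences are hypothetical and are not taken from the datasets.

```python
import math


def perplexity(word_probabilities):
    """Intrinsic measure (assumed): perplexity from per-word model probabilities."""
    if not word_probabilities:
        raise ValueError("need at least one word probability")
    # Perplexity is the exponential of the average negative log-probability.
    avg_neg_log = -sum(math.log(p) for p in word_probabilities) / len(word_probabilities)
    return math.exp(avg_neg_log)


def word_error_rate(reference, hypothesis):
    """Extrinsic measure (assumed): word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative placeholder transcripts, not real call-center data.
    print(word_error_rate("send an ambulance now", "send ambulance now"))
    print(perplexity([0.25, 0.25, 0.25]))
```

Under these assumptions, a lower perplexity indicates a better fit to held-out text, while a lower
WER indicates better recognition output when the language model is used inside the recognizer.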