Abstract:
This study focuses on the investigation, development, and evaluation of Text-to-Speech Synthesis systems based on Deep Learning models for the Azerbaijani language. We
selected and compared two state-of-the-art models, Tacotron and Deep Convolutional Text-to-Speech
(DC TTS), to determine the better-performing system. Both systems were trained on a 24 h speech
dataset of the Azerbaijani language collected and processed from a news website. To analyze the
quality and intelligibility of the speech signals produced by the two systems, 34 listeners participated
in an online survey containing subjective evaluation tests. The results indicated that,
according to the Mean Opinion Score, Tacotron performed better on In-Vocabulary
words, whereas DC TTS performed better on the synthesis of Out-Of-Vocabulary words.
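
For readers unfamiliar with the metric, the Mean Opinion Score is the arithmetic mean of listener ratings for a system. The sketch below is illustrative only: it assumes a 1-5 rating scale and a normal-approximation 95% confidence interval, neither of which is stated in the abstract, and the function name and sample ratings are hypothetical rather than taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of listener ratings.

    Assumes ratings lie on a 1-5 scale (an assumption, not from the study).
    The interval uses the normal approximation (z = 1.96).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mos = mean(ratings)
    # A single rating has no sample standard deviation; report zero width.
    if len(ratings) < 2:
        return mos, 0.0
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for one synthesized utterance (not data from the study).
example_ratings = [4, 5, 3, 4, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Computing the score separately for In-Vocabulary and Out-Of-Vocabulary test sentences would give the kind of per-category comparison reported above.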