Abstract:
This Master's thesis presents the development of a Text-to-Speech (TTS) system for the Azerbaijani language. TTS technology has been gaining popularity because it can generate human-like speech from written text, which benefits people with disabilities, language learners, and those who prefer auditory learning. The thesis begins with an introduction to TTS, its significance, and its history. The literature review surveys related studies, including recent advances in TTS systems. It covers the techniques and models used in TTS systems, the evaluation metrics used to assess their performance, and the challenges and limitations of developing TTS systems for low-resource languages. The main focus of the study is the Tacotron-2 architecture, which is known for producing high-quality, natural-sounding speech. The architecture consists of two parts: a mel spectrogram generator and a neural vocoder. The mel spectrogram is a representation of the speech signal that captures its spectral information, and the neural vocoder generates the actual speech waveform from it. The study also describes the data collection process, a crucial component of developing a TTS system. The first data collection attempt produced poor-quality data, so the process was refined by using an audiobook with speech alignment. This yielded approximately 19 hours of high-quality data, which was used to train the Tacotron-2 model. To evaluate the system, a survey was conducted in which participants rated the system's speech using the Mean Opinion Score (MOS). The system received a MOS of 3.3, indicating acceptable speech quality. In conclusion, this thesis provides a comprehensive overview of developing a TTS system for the Azerbaijani language using the Tacotron-2 architecture. It presents the components of the TTS system, the data collection process, and the evaluation metrics used to assess the system's performance. It also highlights the challenges and limitations of developing TTS systems for low-resource languages and suggests future directions for improving the system's performance.