Abstract:
Automatic Speech Recognition (ASR) technology is essential in a variety of applications, such
as voice search, virtual assistants, transcription services, and subtitling for people with hearing
impairments. Despite its numerous applications, developing ASR systems for low-resource languages
like Azerbaijani presents significant challenges due to the scarcity of available data, linguistic
variations, and the unique phonetic properties of the language. This thesis specifically addresses
the development of an ASR system for recognizing numeric data in Azerbaijani, a Turkic language
spoken by approximately 50 million people worldwide. Numeric data recognition has critical
practical applications in industries such as finance and transportation, where accurate and reliable
recognition of numbers is essential.
One of the primary challenges in developing an ASR system for numeric data is the inherent
lack of context available to help disambiguate similar-sounding numbers. Unlike general speech
recognition, numeric data often appears in isolation or with limited accompanying information,
making it more difficult to accurately recognize spoken numbers. This challenge is further
exacerbated in low-resource languages like Azerbaijani.
The objective of this master’s thesis is to develop an ASR system for numeric data in Azerbaijani
by exploring various techniques and methodologies. We investigate the phonetic and linguistic
properties of Azerbaijani relevant to numeric data recognition and analyze the existing resources
for developing an ASR system. The study proposes a framework for ASR system development,
experiments with different feature extraction and modeling techniques, and evaluates the
performance of the system using appropriate metrics.
In this research, we developed an ASR system for the Azerbaijani language using the Kaldi toolkit.
The ASR model was trained using the classic Hidden Markov Model - Gaussian Mixture Model
(HMM-GMM) architecture, employing both monophone and triphone models along with various
feature extraction and normalization techniques such as Mel-Frequency Cepstral Coefficients (MFCC),
Linear Predictive Coding (LPC), and Cepstral Mean and Variance Normalization (CMVN). The
experimental results showed that the triphone models generally outperformed monophone models, and
that combining MFCC and LPC features with CMVN provided the best performance among the tested feature processing
techniques. Performance varied across datasets, but the results suggest that the system can be
improved further and adapted to the specific challenges presented by each
dataset.
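
To make the normalization step concrete, the following is a minimal sketch of per-utterance CMVN in plain Python. It is an illustration only and not the Kaldi implementation used in this work: the function name cmvn, the eps floor, and the choice to normalize per utterance (rather than per speaker) are assumptions made for the example. Each cepstral coefficient is shifted to zero mean and scaled to unit variance across the frames of one utterance.

```python
from statistics import fmean, pstdev


def cmvn(frames, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization (illustrative).

    frames: list of feature vectors, one per frame, all of the same length
            (for example, MFCC vectors of a single utterance).
    eps:    hypothetical floor on the standard deviation to avoid division
            by zero for coefficients that are constant over the utterance.
    Returns a new list of normalized feature vectors.
    """
    if not frames:
        return []
    # Transpose so that each tuple holds one coefficient across all frames.
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    stds = [max(pstdev(col), eps) for col in columns]
    # Subtract the mean and divide by the standard deviation, per coefficient.
    return [
        [(x - m) / s for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]
```

Applied to the two-frame toy input [[1.0, 10.0], [3.0, 30.0]], both coefficients are mapped to -1.0 in the first frame and 1.0 in the second, which shows that differences in scale between coefficients are removed.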
This thesis contributes to the development of ASR technology for low-resource languages,
specifically Azerbaijani, in the domain of numeric data recognition. The results of this research have
practical implications for industries that rely on accurate and reliable recognition of numeric data,
such as financial services and transportation. As the dataset and ASR system improve, we anticipate
that the impact on various applications, including voice assistants, transcription services, and speech
analytics in Azerbaijani, will be significant. This study lays the foundation for further research on,
and development of, more robust ASR systems for the Azerbaijani language.