Abstract:
Speaker identification is the process of determining who is speaking, and it is useful in
applications such as customer service, investigations, and the reporting of forensic
evidence. This study examines how x-vectors, the latest state-of-the-art technology in
speaker recognition, relate to the text uttered in an audio signal and to the duration
of that signal. To that end, three different
datasets are used: two relatively small digits datasets in English and Azerbaijani, and one
larger dataset of digits and commands in Azerbaijani. The hypotheses tested in this
research are as follows: 1) x-vectors hold information about the text in audio
recordings, and the accuracy of the model changes as the text is changed; 2) x-vectors
show better accuracy with longer audio recordings than with shorter ones. Models were
trained on all three datasets to test the first hypothesis, and the findings show that when
the models are given audio samples in which previously unseen text is uttered, the accuracy
decreases drastically. The third, larger dataset was used to test the second hypothesis. The
results indicate that x-vectors are data-hungry: more speech samples combined with longer
recordings gave the best results. Although most of the experiments were conducted in the Azerbaijani language,
it is believed that the results are not specific to that language. Testing these
hypotheses on a dataset in another language is expected to yield similar results, as
supported by the English dataset in this study.