Abstract:
This master's thesis focuses on the quantitative (statistical) and semantic analysis of the
Azerbaijani language corpus. The research begins with the creation of the corpus by collecting
textual data in the Azerbaijani language. The corpus was composed of various types of texts,
including fiction, non-fiction, news articles, and social media posts. The texts were pre-processed
by removing stop words, punctuation marks, and special characters. The pre-processed texts
were then tokenized and lemmatized, that is, split into word tokens and reduced to their base
(dictionary) forms.
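
The pre-processing steps described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the thesis: the stop-word list, the lemma lookup table, and the names `STOP_WORDS`, `LEMMAS`, `az_lower`, and `preprocess` are all hypothetical, and the sketch tokenizes before filtering stop words so that they can be matched as whole tokens.

```python
import re

# Hypothetical, tiny resources for illustration only; the thesis does not
# specify which stop-word list or lemmatizer it used.
STOP_WORDS = {"və", "bu", "ki", "da", "də"}
LEMMAS = {"kitablar": "kitab", "evlərdə": "ev", "uşaqlar": "uşaq"}

def az_lower(text):
    """Lowercase with Azerbaijani casing for I/İ.

    Python's str.lower() is locale-independent and maps "I" to "i",
    whereas Azerbaijani pairs "I" with "ı" and "İ" with "i".
    """
    return text.replace("I", "ı").replace("İ", "i").lower()

def tokenize(text):
    """Split text into word tokens; punctuation and special characters are dropped."""
    return re.findall(r"\w+", az_lower(text))

def preprocess(text):
    """Tokenize, remove stop words, then map each token to its lemma if known."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]

tokens = preprocess("Uşaqlar bu kitablar və dəftərlər evlərdə oxuyur!")
print(tokens)
```

A dictionary lookup is only a stand-in here: Azerbaijani is agglutinative, so a realistic lemmatizer would need morphological analysis rather than a fixed table.
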
The corpus was analyzed using various quantitative techniques, such as frequency analysis,
n-grams, concordance analysis, and word semantic similarity with different metrics. Frequency
analysis involves counting the occurrences of words in the corpus, which helps identify the most
commonly used words and their frequencies. N-gram analysis involves counting the frequency of
pairs or triplets of words, which helps identify common collocations and phrases. Concordance
analysis retrieves every occurrence of a word or phrase together with its surrounding context,
which helps identify a term's usage and meaning in various settings. Word semantic similarity
analysis involves measuring the semantic similarity between words using different metrics,
which helps identify the relatedness of words in the corpus.
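
The four analyses above can be illustrated on a toy token list. The sketch below is an assumption-laden example rather than the thesis's implementation: the sample tokens, the helper names (`ngrams`, `concordance`, `context_vectors`, `cosine`), and the choice of cosine similarity over co-occurrence counts as the similarity metric are all illustrative, since the abstract does not name the metrics used.

```python
import math
from collections import Counter

# Toy token list standing in for the pre-processed corpus.
sample = "ana dili şirin dildir ana dili bizim dildir".split()

def ngrams(tokens, n):
    """Return all contiguous n-token tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

def context_vectors(tokens, window=1):
    """Count, for each word, the words that occur within `window` positions of it."""
    vectors = {}
    for i, word in enumerate(tokens):
        ctx = vectors.setdefault(word, Counter())
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

word_freq = Counter(sample)                 # frequency analysis
bigram_freq = Counter(ngrams(sample, 2))    # n-gram analysis with n = 2
kwic = concordance(sample, "dili", width=1) # concordance lines
vectors = context_vectors(sample)
similarity = cosine(vectors["şirin"], vectors["bizim"])
```

In this toy sample, "şirin" and "bizim" occur between the same neighbours, so their context vectors coincide and the cosine similarity is at its maximum; on a real corpus the scores would be far less clear-cut.
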
POS tagging of the corpus is the other major component of the thesis. Part-of-Speech (POS)
tagging is the practice of marking the part of speech of each word in a sentence. Two distinct
POS tagging models were implemented and tested, one based on a Hidden Markov Model (HMM) and
one on a Long Short-Term Memory (LSTM) network. The training procedure makes use of 190
different POS tags. The HMM is a statistical model that is frequently used for POS tagging,
whereas the LSTM is a type of recurrent neural network known for its effectiveness in natural
language processing tasks. On the test dataset, the LSTM model achieved the higher accuracy of
the two, 97%.
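
To make the HMM approach concrete, the following is a minimal sketch of a supervised HMM tagger with Viterbi decoding. It does not reproduce the thesis's models or data: the three training sentences, the three-tag tagset, the add-one smoothing, and the names `train_hmm`, `log_prob`, and `viterbi` are assumptions chosen to keep the example small (the thesis itself trains with 190 tags).

```python
import math
from collections import Counter, defaultdict

# Tiny hand-made tagged sample, for illustration only.
TRAIN = [
    [("uşaq", "NOUN"), ("kitab", "NOUN"), ("oxuyur", "VERB")],
    [("qız", "NOUN"), ("məktub", "NOUN"), ("yazır", "VERB")],
    [("yaxşı", "ADJ"), ("uşaq", "NOUN"), ("oxuyur", "VERB")],
]

def train_hmm(sentences):
    """Collect transition counts trans[prev_tag][tag] and emission counts emit[tag][word]."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    tags = sorted(emit)
    vocab = {w for counts in emit.values() for w in counts}
    return trans, emit, tags, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one (Laplace) smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence for `words`."""
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 slot for unknown words
    # Each column maps tag -> (best log score ending in tag, back pointer).
    best = [{t: (log_prob(trans["<s>"], t, n_tags)
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, n_tags), p) for p in tags
            )
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        best.append(column)
    # Follow back pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

model = train_hmm(TRAIN)
predicted = viterbi(["yaxşı", "qız", "yazır"], *model)
print(predicted)
```

Working in log space avoids numeric underflow on long sentences, and smoothing keeps unseen words and transitions from receiving zero probability; with a tagset as large as 190 tags, both matter more than they do in this toy setting.
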
In addition to the quantitative analysis, a user interface was developed for interactive use of
the corpus and visualization of the analysis results. The interface allows users to search for
words and phrases in the corpus, view their frequency and usage, and visualize the results using
graphs and charts such as dispersion plots and frequency distribution plots. The result files
were created in different formats, including CSV, JSON, and XML, and the most suitable formats
were selected for the front end.
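
As an illustration of the three result formats mentioned above, the sketch below serializes a small word-frequency table to CSV, JSON, and XML using only the Python standard library. The field names (`word`, `count`), the XML element names, and the sample rows are hypothetical; the abstract does not describe the actual schema of the result files.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical result rows: (word, count).
freqs = [("ana", 2), ("dili", 2), ("şirin", 1)]

def to_csv(rows):
    """Serialize rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    """Serialize rows as a JSON array of objects, keeping non-ASCII letters readable."""
    return json.dumps([{"word": w, "count": c} for w, c in rows], ensure_ascii=False)

def to_xml(rows):
    """Serialize rows as XML with one <entry> element per word."""
    root = ET.Element("frequencies")
    for w, c in rows:
        ET.SubElement(root, "entry", word=w, count=str(c))
    return ET.tostring(root, encoding="unicode")

csv_text, json_text, xml_text = to_csv(freqs), to_json(freqs), to_xml(freqs)
```

JSON is usually the most convenient of the three for a browser-based front end because it parses directly into native objects, though which formats were judged optimal in the thesis is not stated in the abstract.
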
Overall, this research provides valuable insights into the Azerbaijani language corpus and its
linguistic characteristics. The study demonstrates the potential of corpus linguistics and
computational methods for studying languages and their characteristics. The findings can be used
in various applications, including natural language processing, machine learning, and text
analysis. Further research can be conducted to expand the corpus and explore other linguistic
aspects of the Azerbaijani language.