Quantitative and Semantic Analysis of National Azerbaijan Corpus

Asgarzade, Abdulla

Collection: ADA University, School of Information Technologies and Engineering, ADA Theses, Dissertations and Final Projects

URI: http://hdl.handle.net/20.500.12181/933

Date: 2023-04

Abstract:

This master's thesis presents a quantitative (statistical) and semantic analysis of an Azerbaijani language corpus. The research begins with the creation of the corpus by collecting textual data in the Azerbaijani language. The corpus comprises various types of texts, including fiction, non-fiction, news articles, and social media posts. The texts were pre-processed by removing stop words, punctuation marks, and special characters. The pre-processed texts were then tokenized and lemmatized; lemmatization reduces each word to its base (dictionary) form. The corpus was analyzed using several quantitative techniques: frequency analysis, n-grams, concordance analysis, and word semantic similarity under different metrics. Frequency analysis counts the occurrences of each word in the corpus, which identifies the most commonly used words. N-gram analysis counts the frequency of word sequences such as pairs (bigrams) or triplets (trigrams), which helps identify common collocations and phrases. Concordance analysis retrieves every occurrence of a word or phrase together with its surrounding context, which shows how a term is used and what it means in different settings. Word semantic similarity analysis measures how close words are in meaning using different metrics, which helps identify the relatedness of words in the corpus. POS tagging of the corpus is the other major component of the thesis. Part-of-speech (POS) tagging is the practice of labeling each word in a sentence with its part of speech. Two POS tagging models were implemented and tested, one based on Hidden Markov Models (HMM) and one on Long Short-Term Memory (LSTM) networks. The training procedure uses 190 different POS tags. The HMM is a statistical model frequently used for POS tagging, whereas the LSTM is a type of recurrent neural network known for its effectiveness in natural language processing tasks.
On the test dataset, the LSTM model achieved the higher accuracy of the two, 97%. In addition to the quantitative analysis, a user interface was developed for interactive use of the corpus and visualization of the analysis results. The interface allows users to search for words and phrases in the corpus, view their frequency and usage, and visualize the results with graphs and charts such as dispersion plots and frequency distribution plots. The result files were produced in several formats, including CSV, JSON, and XML, and the most suitable formats were selected for the front end. Overall, this research provides insights into the Azerbaijani language corpus and its linguistic characteristics. The study demonstrates the potential of corpus linguistics and computational methods for studying languages and their characteristics. The findings can be used in various applications, including natural language processing, machine learning, and text analysis. Further research can expand the corpus and explore other linguistic aspects of the Azerbaijani language.
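
The three counting-based techniques named in the abstract (frequency analysis, n-grams, and concordance) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `tokenize`, `ngrams`, and `concordance`, the regular-expression tokenizer, and the two-sentence toy text are all assumptions made for illustration, and stop-word removal and lemmatization are omitted.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (hypothetical helper).

    The pattern keeps runs of Unicode letters, so Azerbaijani characters
    such as ə, ş, ç, ğ, ı, ö, ü are preserved. Caveat: str.lower() is not
    locale-aware, so an uppercase "I" becomes "i", not the dotless "ı".
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy text, assumed for illustration only (not taken from the corpus).
sample = "bakı böyük şəhərdir. bakı gözəl şəhərdir."
tokens = tokenize(sample)

# Frequency analysis: count how often each token occurs.
print(Counter(tokens).most_common(2))

# N-gram analysis: count word pairs. Sentence boundaries are ignored here,
# so some pairs span two sentences.
print(Counter(ngrams(tokens, 2)).most_common(3))

# Concordance: show each occurrence of a word with its neighbours.
for left, word, right in concordance(tokens, "bakı"):
    print(f"{left:>20} [{word}] {right}")
```

A real pipeline would run these steps over lemmatized tokens and would likely keep sentence boundaries when building n-grams; the sketch only shows the counting logic.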
