ADA Library Digital Repository

Development of Text Data Augmentation methods for Azerbaijan language

dc.contributor.author Mukhtarov, Yusif
dc.date.accessioned 2025-03-03T10:57:14Z
dc.date.available 2025-03-03T10:57:14Z
dc.date.issued 2023-04
dc.identifier.uri http://hdl.handle.net/20.500.12181/1028
dc.description.abstract Dataset or model? This question is frequently asked by newcomers to machine learning. Many would reason as follows: there is always the option to tune hyperparameters and retrain the model, but if the data is of poor quality or insufficient, the project is over. Throughout the history of machine learning, numerous projects have been terminated due to data shortages, and many more have been paused to collect more data. That is why the data-related aspects of a project should be addressed first, and why the availability of labeled data is crucial to the success of machine learning models. We explored data augmentation techniques to improve the accuracy of machine learning models on Azerbaijani text datasets. In this study we tested a large number of attention-based approaches, as well as new, advanced methods for substituting words with synonyms, on Azerbaijani and English datasets. All the techniques were tested using extrinsic and intrinsic methods. For extrinsic evaluation, classification tasks were used to compare accuracy scores before and after augmentation. On this basis, the average accuracy of the neural network models improved by 13% on the AG News classification task, by 3% on the IMDB review task, and, most importantly, by 13% on the Azerbaijani dataset, OXU news. In intrinsic evaluation, an average BERT embedding cosine similarity of nearly 0.95 was obtained, while the maximum average BLEU and ROUGE scores were 0.43 and 0.65, respectively. While the main purpose of the project was to identify methods with low resource requirements, the attention-based approaches, including transformer paraphrasers and translators, still required a GPU. Hence, all the techniques developed in this project can be used by anyone with a single laptop.
This advantage is an important detail of the project, since it used two main evaluation criteria: the effectiveness of each technique and its resource requirements. To evaluate the effectiveness of our data augmentation approach, extensive experiments were conducted using various machine learning models on popular Azerbaijani and English datasets. The original versions of the datasets were compared to their augmented counterparts in various classification tasks, employing both traditional and gradient-based models. Regarding resource requirements, size and runtime on CPU were taken into account. This makes the data augmentation process easier and faster for small research projects and startups without a significant budget, since in most such projects the main issues have always been computation power and restrictions on the use of paid services such as ChatGPT, Claude, Google Translate, etc. en_US
dc.language.iso en en_US
dc.publisher ADA University en_US
dc.rights Attribution-NonCommercial-NoDerivs 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/us/ *
dc.subject Azerbaijan -- Language -- Text processing (Computer science) en_US
dc.subject Resource-efficient machine learning methods en_US
dc.subject Data augmentation (Machine learning) en_US
dc.subject Text classification (Computer science) en_US
dc.title Development of Text Data Augmentation methods for Azerbaijan language en_US
dc.type Thesis en_US


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States
