Abstract:
Dataset or model? Newcomers to machine learning frequently ask this question, and many reason as follows: hyperparameter tuning and model training are always an option, but if the data is of poor quality or insufficient, the project is doomed. Throughout the history of machine learning, numerous projects have been terminated due to data shortages, and many more have been paused to collect additional data. That is why data-related aspects of a project should be addressed first, and the availability of labeled data is crucial for the success of machine learning models.
We explored data augmentation techniques to enhance the accuracy of machine learning models on Azerbaijani text datasets. In this study we tested a number of attention-based approaches, as well as an advanced synonym-substitution technique, on Azerbaijani and English datasets. All techniques were evaluated both extrinsically and intrinsically. For extrinsic evaluation, classification tasks were used to compare accuracy scores before and after augmentation: the average accuracy of the neural network models improved by 13% on the AG News classification task, by 3% on the IMDB review dataset, and, most importantly, by 13% on the Azerbaijani dataset, OXU News. For intrinsic evaluation, an average BERT-embedding cosine similarity of nearly 0.95 was obtained, while the maximum average BLEU and ROUGE scores were 0.43 and 0.65, respectively. Although the main goal of the project was to identify low-resource methods, the attention-based approaches, including transformer paraphrasers and translators, still required a GPU.
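As a minimal illustrative sketch (not the paper's exact implementation), the intrinsic evaluation compares the embedding of an original sentence with that of its augmented counterpart via cosine similarity. In the study the vectors come from a pretrained BERT encoder; here, small toy vectors stand in for those embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for BERT sentence embeddings of an
# original sentence and its augmented version (hypothetical values).
original = [0.2, 0.7, 0.1, 0.4]
augmented = [0.25, 0.65, 0.05, 0.45]
score = cosine_similarity(original, augmented)
```

A score close to 1.0 indicates that augmentation preserved the sentence's meaning, which is the property the reported 0.95 average captures.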
Nevertheless, all the techniques developed in this project can be used by anyone with a single laptop. This is an important advantage, since the project had two main evaluation criteria: the effectiveness of a technique and its resource requirements. To evaluate the effectiveness of our data augmentation approach, extensive experiments were conducted with various machine learning models on popular Azerbaijani and English datasets. The original versions of the datasets were compared to their augmented counterparts on various classification tasks, employing both traditional and gradient-based models. Regarding resource requirements, model size and CPU runtime were considered. The aim is to make data augmentation easier and faster for small research projects and startups with limited budgets, for which computation power and restrictions on paid services such as ChatGPT, Claude, and Google Translate have always been among the main obstacles.
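The synonym-substitution augmentation discussed above can be sketched in a few lines of CPU-only Python. This is a hypothetical toy version, not the system developed in the project: the hand-written `SYNONYMS` dictionary stands in for a real lexical resource (e.g. WordNet for English, or a curated Azerbaijani synonym lexicon), and each eligible word is replaced with probability `p`:

```python
import random

# Toy synonym dictionary; a real system would draw on a lexical
# resource or embedding-based nearest neighbors (assumption).
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def augment(sentence, p=0.5, rng=None):
    """Return a copy of the sentence with each word that has known
    synonyms replaced by a random synonym with probability p."""
    rng = rng or random.Random(0)
    out = []
    for word in sentence.split():
        if word.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[word.lower()]))
        else:
            out.append(word)
    return " ".join(out)

print(augment("a good movie with a bad ending", p=1.0))
```

Because it needs only a dictionary lookup per word, this style of augmentation runs comfortably on a single laptop CPU, which is exactly the low-resource setting the project targets.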