Abstract:
One of the most significant problems is the lack of open-source, generative small language
models for the Azerbaijani language, an issue that persists despite the existence of
large-scale, privately funded offerings such as OpenAI's GPT-4 or Claude Opus. Given the
urgent demand for serviceable models that can both generate and comprehend
Azerbaijani, this study marks the first endeavor to pre-train a generative small
language model specifically for this low-resource language.
While existing large models are proprietary and the trend is moving towards open-weight or
open-source large language models whose weights are fine-tuned for specific tasks, the best-case
scenario would be a model developed specifically for the language. Unlike for many other languages,
however, fine-tuning a dedicated model for Azerbaijani is virtually impossible in the absence of
comprehensive, clean datasets, measured in billions of tokens, for pre-training powerful models.
Considering these issues, this research compiled the largest Azerbaijani
text corpus available to date, estimated at nearly 3 billion tokens, drawn from a breadth of sources
including Wikipedia, OSCAR, mC4, books, news, and other scraped data. The dataset was thoroughly
curated, with multiple filtering steps, including data sampling and preparation, applied prior to
model training. A 150-million-parameter decoder-only language model was then pre-trained using
the Llama2 architecture for autoregressive text generation in Azerbaijani. Afterward, the model
was instruction-fine-tuned on the Alpaca instruction dataset, turning it into a question-answering
model for Azerbaijani.
Although the model, with only 150 million parameters, is less powerful than larger
multi-billion-parameter models, it is unique in its ability to generate Azerbaijani text with few
mistakes and to answer simple questions. This is a significant outcome for researchers, given that
other open-source models, including the largest ones with billions of parameters trained on trillions
of tokens, do not demonstrate such linguistic proficiency in Azerbaijani. As such, the present study
is a vital first step for Azerbaijani language processing and lays the groundwork for further
performance improvements.
The work provides a foundation for further research on, and applications of, fine-tuned generative
models in NLP for low-resource languages. It also confirms the feasibility of building effective
language models for Azerbaijani and suggests that performance can be improved further by scaling
training data and computational resources, in line with the scaling laws of machine learning. The
described project thus fills a notable gap in Azerbaijani language processing and opens a path
toward further NLP research and applications. More broadly, the work could benefit the NLP field
by demonstrating the potential of fine-tuned generative models as tools for enhancing language
understanding and generation, and by shedding light on prospects for their further improvement
and application to other low-resource languages.
The source code and supplementary materials for this project are available on GitHub at
https://github.com/eljanmahammadli/AzLlama for replication and further research.