Abstract:
This research addresses the question of how to achieve text summarization in the
Azerbaijani language. There are two types of text summarization: extractive and
abstractive. This paper focuses on abstractive text summarization, since it requires
considerably more complex approaches. Throughout the research, the problems of
models being trained primarily for English are highlighted, and the difficulties of adapting
them to the Azerbaijani language are discussed. The Azerbaijani alphabet contains 32
letters, meaning it has additional language-specific characters compared to English. It was
shown that tokenization plays a vital role in the success of the model, and that
changing the default normalization parameters of the tokenizer turned out to
be extremely helpful for the results. Three tokenizers were considered for this task:
WordPiece, SentencePiece Byte-Pair Encoding, and Byte-Level Byte-Pair Encoding.
The results of all three tokenizers were analyzed, and it was determined that the WordPiece tokenizer
gives better results and is much more space efficient by design.
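For illustration, the following is a minimal sketch (using the Hugging Face tokenizers library, with a hypothetical corpus file name and vocabulary size) of how a WordPiece tokenizer's normalization settings can be adjusted so that the Azerbaijani-specific letters survive training instead of being stripped by the default accent handling; it is not the exact configuration used in the paper.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Build a WordPiece tokenizer whose normalizer keeps Azerbaijani-specific
# letters (ə, ğ, ı, ö, ş, ç, ü) intact. With accent stripping enabled,
# e.g. "ö" would collapse to "o" and "ş" to "s".
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(
    clean_text=True,
    lowercase=True,
    strip_accents=False,  # keep the diacritics of ğ, ö, ş, ç, ü
)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

trainer = trainers.WordPieceTrainer(
    vocab_size=30000,  # placeholder vocabulary size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
# "az_news_corpus.txt" is a placeholder for the Azerbaijani news corpus.
tokenizer.train(files=["az_news_corpus.txt"], trainer=trainer)

print(tokenizer.encode("Azərbaycan dili gözəldir").tokens)
```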
Apart from that, different network architectures were considered for the
summarization task. The advantages and disadvantages of RNNs, LSTMs, and CNNs augmented with an
attention mechanism were listed, and the advantages of Transformers over these
three architectures were highlighted.
An Azerbaijani news dataset was used to build the vocabulary and train the model.
The BERT model was chosen as the basis of the Azerbaijani text summarization model. It was observed
that BERT can achieve feasible results even with moderately small amounts of data. Pre-trained
multilingual models such as mBERT did not prove worthwhile, since the amount of Azerbaijani
data they are trained on is very small.
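As a rough illustration of how a BERT model can be repurposed for abstractive summarization, the sketch below ties two BERT instances into an encoder-decoder using the Hugging Face transformers library. The mBERT checkpoint name, the sample sentence, and the generation settings are placeholders rather than the paper's actual configuration, and the resulting model would still need fine-tuning on the Azerbaijani news dataset before producing useful summaries.

```python
from transformers import BertTokenizerFast, EncoderDecoderModel

# Wire a BERT encoder and a BERT decoder into a sequence-to-sequence model.
# "bert-base-multilingual-cased" stands in for whatever checkpoint or
# from-scratch configuration was actually used.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased", "bert-base-multilingual-cased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.encoder.vocab_size

# Example input article (Azerbaijani); without fine-tuning the output
# summary will not be meaningful.
article = "Bakıda yeni metro stansiyası istifadəyə verilib."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(inputs.input_ids, max_length=64, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```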
Additionally, the T5 and RoBERTa models were also tested for this task, and they did not
achieve acceptable results compared to BERT itself. T5 was computationally demanding and
required more training to produce a result that could be considered successful. RoBERTa, however,
was not suitable for the summarization task at all.