dc.contributor.author | Aliyev, Azar | |
dc.date.accessioned | 2024-12-19T23:52:26Z | |
dc.date.available | 2024-12-19T23:52:26Z | |
dc.date.issued | 2023-04 | |
dc.identifier.uri | http://hdl.handle.net/20.500.12181/931 | |
dc.description.abstract | The research aims to answer the question of how to achieve text summarization in the Azerbaijani language. There are two types of text summarization: extractive and abstractive. Abstractive summarization is the focus of this paper, since it requires considerably more complex approaches. Throughout the research, the problems of models being trained primarily for English are highlighted, and the difficulties of adapting them to Azerbaijani are discussed. The Azerbaijani alphabet contains 32 letters, meaning there are additional language-specific characters compared to English. It was shown that tokenization plays a vital role in the model's success, and that changing the default normalization layer parameters of the tokenizer turned out to be extremely helpful for the results. Three tokenizers were considered for this task: WordPiece, SentencePiece Byte-Pair Encoding (BPE), and Byte-Level BPE. The results of all three tokenizers were analyzed, and it was determined that the WordPiece tokenizer gives better results and is much more space-efficient due to its nature. In addition, different network architectures were considered for the summarization task. The advantages and disadvantages of RNNs, LSTMs, and CNNs with an attention mechanism were listed, and the importance of transformers compared to these three architectures was highlighted. An Azerbaijani news dataset was utilized to build the vocabulary and train the model. The BERT model was chosen to create a model for Azerbaijani text summarization, and it was observed that BERT can achieve feasible results even with moderately small amounts of data. Pre-trained multilingual models such as mBERT did not prove worthwhile, since they were trained on a very small amount of Azerbaijani data. Additionally, the T5 and RoBERTa models were also tested for this task, and they did not achieve acceptable results compared to BERT itself. T5 was computationally demanding and required more training to produce a result that could be considered successful. RoBERTa, however, was not suitable for the summarization task at all. | en_US
dc.language.iso | en | en_US |
dc.publisher | ADA University | en_US |
dc.rights | Attribution-NonCommercial-NoDerivs 3.0 United States | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/us/ | * |
dc.subject | Natural language processing -- Azerbaijani language | en_US |
dc.subject | Text summarization -- Computational methods | en_US |
dc.subject | Machine learning -- Multilingual models | en_US |
dc.title | Text Summarization in Azerbaijani Language | en_US |
dc.type | Thesis | en_US |
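The abstract notes that adjusting the tokenizer's default normalization parameters was key to preserving Azerbaijani-specific characters when training a WordPiece tokenizer. The sketch below is an illustrative assumption of what such a setup could look like with the Hugging Face tokenizers library, not the thesis implementation; the corpus file name, vocabulary size, and exact parameter values are hypothetical.

# Illustrative sketch (not the thesis code): training a WordPiece tokenizer
# whose normalizer keeps Azerbaijani-specific letters intact.
# Assumes the Hugging Face `tokenizers` library and a hypothetical plain-text
# corpus file "az_news.txt" with one article per line.
from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import WordPiece

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# BERT-style normalization can strip accents, which would map letters such as
# "ü", "ö", "ğ", "ş", "ç" onto their ASCII look-alikes. Disabling accent
# stripping preserves the 32-letter Azerbaijani alphabet.
tokenizer.normalizer = normalizers.BertNormalizer(
    lowercase=True,
    strip_accents=False,   # keep language-specific characters
    clean_text=True,
    handle_chinese_chars=False,
)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,     # assumed size, not taken from the thesis
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(["az_news.txt"], trainer)
tokenizer.save("wordpiece-az.json")

# Quick check that a word with language-specific letters survives normalization.
print(tokenizer.encode("müstəqillik").tokens)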