Abstract:
This thesis investigates efficient text summarization techniques for Azerbaijani documents
through hybrid neural approaches that combine extractive and abstractive methods. Because
Azerbaijani is a low-resource language, developing reliable summarization systems for it
poses significant challenges. To address this, an Azerbaijani-specific dataset of documents
paired with human-written summaries was prepared, and both extractive and abstractive
summarization models were developed and evaluated.
For extractive summarization, sentence embeddings were used to construct sentence-similarity
matrices, over which a TextRank-based algorithm ranked and selected key sentences.
Evaluation with ROUGE demonstrated strong results: ROUGE-1 recall, precision, and F1 scores
of approximately 0.47, 0.52, and 0.49, respectively; ROUGE-2 scores of around 0.44, 0.47,
and 0.45; and ROUGE-L scores comparable to ROUGE-1. These results indicated strong
alignment between the extracted summaries and the human references.
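A minimal sketch of this extractive pipeline is shown below; the multilingual embedding model, the use of networkx for the PageRank step, and the summary length top_k are illustrative assumptions rather than the thesis's exact configuration.

```python
import numpy as np
import networkx as nx
from sentence_transformers import SentenceTransformer  # assumed embedding library

def extractive_summary(sentences, top_k=3):
    # Embed each sentence and L2-normalize so dot products are cosine similarities.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model
    emb = model.encode(sentences)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    # Build the sentence-similarity matrix and drop self-similarity.
    sim = emb @ emb.T
    np.fill_diagonal(sim, 0.0)

    # TextRank: run PageRank over the weighted sentence-similarity graph.
    scores = nx.pagerank(nx.from_numpy_array(sim))
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(top)]  # restore document order
```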
For the abstractive summarization task, the multilingual pre-trained mT5-base model was
fine-tuned on the Azerbaijani dataset, which significantly improved performance over the
baseline. The baseline (zero-shot) mT5 model achieved ROUGE-1, ROUGE-2, and ROUGE-L
scores of approximately 45%, 25%, and 40%, respectively, with BLEU and METEOR scores
around 35% and 42%. After fine-tuning, the model achieved ROUGE-1, ROUGE-2, and
ROUGE-L F1 scores of approximately 64%, 47%, and 57%; METEOR improved to about 50%,
while BLEU was about 32%, slightly below the zero-shot baseline.
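The fine-tuning step could look roughly like the sketch below, using the Hugging Face transformers and datasets libraries; the hyperparameters, output path, and the "text"/"summary" column names are illustrative assumptions, not the thesis's exact setup.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

# Placeholder for the Azerbaijani document/summary pairs described above.
train_ds = Dataset.from_dict({"text": ["..."], "summary": ["..."]})

def preprocess(batch):
    # Tokenize source documents and target summaries.
    enc = tokenizer(batch["text"], max_length=512, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["summary"],
                              max_length=128, truncation=True)["input_ids"]
    return enc

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt5-az-summarization",  # assumed output path
        learning_rate=3e-4,                 # assumed hyperparameters
        per_device_train_batch_size=4,
        num_train_epochs=3,
        predict_with_generate=True,
    ),
    train_dataset=train_ds.map(preprocess, batched=True,
                               remove_columns=["text", "summary"]),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```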
Visualizations of evaluation metrics, dataset length distributions, and comparative analyses
were provided to better interpret model performance. Both the extractive and abstractive
systems showed significant promise for Azerbaijani text summarization, overcoming
challenges related to data scarcity and linguistic complexity.
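For reference, per-summary ROUGE scores like those reported above can be computed with the rouge_score package (an assumed tooling choice; the two strings below are placeholders):

```python
from rouge_score import rouge_scorer

# Score a model-generated summary against a human-written reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score("human reference summary", "model-generated summary")
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```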
This work demonstrates that adapting multilingual pre-trained models and combining them
with classical graph-based extractive methods can yield highly effective summarization
systems for low-resource languages. Future research directions include expanding the
dataset, exploring reinforcement learning techniques, and further optimizing model
architectures for improved generalization across diverse Azerbaijani text domains.