Improving Named Entity Recognition for the Azerbaijani Language Using Neural and Non-Neural CRF-Based Models

Aliyev, Tofig

URI: http://hdl.handle.net/20.500.12181/1503

Date: 2025-04

Abstract:

This thesis studies Named Entity Recognition (NER) for the Azerbaijani language, which is both morphologically rich and low-resource. Because annotated data is scarce, a semi-automated cross-lingual annotation strategy is proposed: annotate English sentences with spaCy, translate the annotated sentences into Azerbaijani, align the tokens, and project the entities through the aligned tokens. This procedure produced a silver-standard Azerbaijani NER corpus of approximately 90,000 sentences, 1.17 million tokens, and 37 entity types. Five models were developed and compared: a CRF baseline with handcrafted features, a BiLSTM-CRF model with FastText embeddings, a BERT-BiLSTM-CRF model, an ALBERTo-BiLSTM-CRF model, and an XLM-RoBERTa-BiLSTM-CRF model. All models were trained and evaluated on the same dataset under the same procedure, using token-level and entity-level metrics: precision, recall, F1 score, and token accuracy. The experiments showed that multilingual transformers significantly outperformed traditional feature extraction methods. The XLM-RoBERTa-BiLSTM-CRF model achieved the best results, with an overall weighted F1 score of 0.7918, precision of 0.8505, recall of 0.7574, and token accuracy of 0.7823. However, challenges persisted in recognizing rare entities, handling complex morphology, and managing translation artifacts in the automatically generated dataset. The study indicates the usefulness of transfer learning and multilingual pretrained models for low-resource languages.
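The projection step of the annotation strategy described above can be illustrated with a minimal Python sketch. It assumes BIO-style tags and a word alignment given as (source index, target index) pairs. The abstract does not specify the tag scheme, the translation system, or the alignment tool, so the function names, the example sentence, and the hand-written alignment here are all hypothetical.

```python
from typing import List, Tuple


def extract_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end_exclusive) spans from BIO tags."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((etype, start, i))
            start, etype = None, None
        # A "B-" tag, or a stray "I-" tag with no open span, starts a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def project_entities(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project source-side BIO tags onto target tokens via word alignments."""
    tgt_tags = ["O"] * tgt_len
    for etype, start, end in extract_spans(src_tags):
        tgt_idx = sorted({t for s, t in alignment if start <= s < end})
        if not tgt_idx:
            continue  # no aligned target token: the entity is dropped
        lo, hi = tgt_idx[0], tgt_idx[-1]
        if any(tag != "O" for tag in tgt_tags[lo:hi + 1]):
            continue  # would overlap an already projected entity: skip it
        tgt_tags[lo] = "B-" + etype
        for j in range(lo + 1, hi + 1):
            tgt_tags[j] = "I-" + etype
    return tgt_tags


# Hypothetical example: English source with spaCy-style labels (assumed),
# an Azerbaijani translation, and a hand-written alignment.
src_tokens = ["Anar", "lives", "in", "Baku"]
src_tags = ["B-PERSON", "O", "O", "B-GPE"]
tgt_tokens = ["Anar", "Bakıda", "yaşayır"]
alignment = [(0, 0), (1, 2), (2, 1), (3, 1)]

tgt_tags = project_entities(src_tags, alignment, len(tgt_tokens))
for token, tag in zip(tgt_tokens, tgt_tags):
    print(token, tag)
```

In this example the locative form "Bakıda" ("in Baku") receives the GPE tag as a whole token, which hints at why suffix-heavy morphology complicates projected annotations.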
Future work on Azerbaijani NER should incorporate explicit morphological analysis, augment the training data with multilingual or synthetic data, investigate ensemble models, and build a manually annotated gold-standard NER corpus to improve model quality. This research offers a strong baseline for further improvements in Azerbaijani language processing, and its results offer insights for future work on other low-resource and morphologically rich languages.
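The weighted scores reported in the abstract can be read with a small worked sketch. For the best model, the harmonic mean of the reported precision (0.8505) and recall (0.7574) is about 0.801, slightly above the reported weighted F1 of 0.7918. That is what one would expect if, as assumed here, the scores are support-weighted averages over entity types: F1 is then averaged per type rather than computed from the averaged precision and recall. The abstract does not state the averaging scheme, and the per-type counts below are invented purely for illustration.

```python
from typing import Dict, Tuple


def harmonic(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one entity type."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, harmonic(p, r)


def weighted_scores(
    counts: Dict[str, Tuple[int, int, int]]
) -> Tuple[float, float, float]:
    """Support-weighted precision, recall, and F1 over entity types.

    `counts` maps each type to (true positives, false positives,
    false negatives); the support of a type is tp + fn.
    """
    total = sum(tp + fn for tp, _, fn in counts.values())
    wp = wr = wf = 0.0
    for tp, fp, fn in counts.values():
        p, r, f = prf(tp, fp, fn)
        weight = (tp + fn) / total
        wp += weight * p
        wr += weight * r
        wf += weight * f
    return wp, wr, wf


# Invented counts for two entity types (illustration only, not thesis data).
toy_counts = {"PERSON": (80, 10, 20), "GPE": (30, 30, 10)}
wp, wr, wf = weighted_scores(toy_counts)
print(f"weighted P={wp:.4f} R={wr:.4f} F1={wf:.4f}")
print(f"harmonic mean of weighted P and R = {harmonic(wp, wr):.4f}")

# Harmonic mean of the precision and recall reported in the abstract.
reported_hm = harmonic(0.8505, 0.7574)
print(f"harmonic mean of reported P and R = {reported_hm:.4f}")
```

Under support weighting, the weighted F1 never exceeds the harmonic mean of the weighted precision and recall, so a gap of this kind does not by itself indicate an inconsistency in the reported figures.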

This item appears in the following Collection(s)

School of Information Technologies and Engineering

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States
