Improving Named Entity Recognition for the Azerbaijani Language Using Neural and Non-Neural CRF-Based Models

Aliyev, Tofig

dc.contributor.author	Aliyev, Tofig
dc.date.accessioned	2025-10-27T10:29:34Z
dc.date.available	2025-10-27T10:29:34Z
dc.date.issued	2025-04
dc.identifier.uri	http://hdl.handle.net/20.500.12181/1503
dc.description.abstract	This thesis studies Named Entity Recognition (NER) for the Azerbaijani language, which is both morphologically rich and low-resource. Because annotated data is scarce, a semi-automated cross-lingual annotation strategy is proposed: English sentences are annotated with spaCy, the annotated sentences are translated into Azerbaijani, the tokens are aligned, and the entities are projected through the aligned tokens. Using this procedure, a silver-standard Azerbaijani NER corpus was developed containing approximately 90,000 sentences, 1.17 million tokens, and 37 entity types. Five models were developed and compared: a CRF baseline with handcrafted features, a BiLSTM-CRF model with FastText embeddings, a BERT-BiLSTM-CRF model, an ALBERTo-BiLSTM-CRF model, and an XLM-RoBERTa-BiLSTM-CRF model. All models were trained on the same dataset with the same training and evaluation procedure, and were assessed with token-level and entity-level metrics: precision, recall, F1 score, and token accuracy. The experimental results showed that multilingual transformers significantly outperformed traditional feature extraction methods. Notably, the XLM-RoBERTa-BiLSTM-CRF model achieved the best results, with an overall weighted F1 score of 0.7918, precision of 0.8505, recall of 0.7574, and token accuracy of 0.7823. However, challenges persisted in recognizing rare entities, handling complex morphology, and managing translation artifacts in the automatically generated dataset. The study indicates the usefulness of transfer learning and multilingual pretrained models for low-resource languages.
Future work on NER for the Azerbaijani language should use explicit morphological analysis, augment the training data with multilingual or synthetic data, investigate ensemble models, and build a manually annotated gold-standard NER corpus to improve the performance of Azerbaijani NER models. The research offers a strong baseline for further improvements in Azerbaijani language processing, and its results offer insights for future work on other low-resource and morphologically rich languages.	en_US
dc.language.iso	en	en_US
dc.publisher	ADA University	en_US
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
dc.subject	Computational linguistics -- Azerbaijani language	en_US
dc.subject	Machine learning -- Natural language processing	en_US
dc.subject	Morphology (Linguistics) -- Computational models	en_US
dc.subject	Low-resource languages -- Language processing	en_US
dc.subject	Corpora (Linguistics) -- Azerbaijani language	en_US
dc.title	Improving Named Entity Recognition for the Azerbaijani Language Using Neural and Non-Neural CRF-Based Models	en_US
dc.type	Thesis	en_US
dcterms.accessRights	Absolute Embargo (No access without the author's permission)
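
The abstract describes projecting entity labels from annotated English tokens onto Azerbaijani tokens through a token alignment. The thesis text is embargoed, so the following is only a minimal sketch of that general idea, not the author's implementation. It assumes BIO-style tags on the English side and an alignment already given as (source index, target index) pairs; the function name `project_bio_tags`, the example sentence, its Azerbaijani rendering, and the alignment are all hypothetical and chosen for illustration.

```python
from typing import List, Optional, Tuple


def project_bio_tags(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens onto target tokens.

    src_tags:  BIO tags of the source (English) tokens, e.g. "B-ORG".
    alignment: (source_index, target_index) pairs; assumed to come from
               an external word aligner (not shown here).
    tgt_len:   number of target (Azerbaijani) tokens.
    """
    # Step 1: give each target token the entity type of the first
    # aligned source token that carries an entity tag.
    tgt_types: List[Optional[str]] = [None] * tgt_len
    for src_idx, tgt_idx in sorted(alignment):
        tag = src_tags[src_idx]
        if tag == "O":
            continue
        ent_type = tag.split("-", 1)[1]
        if tgt_types[tgt_idx] is None:
            tgt_types[tgt_idx] = ent_type

    # Step 2: rebuild B-/I- prefixes from target-side adjacency, because
    # word order can differ between the two languages.
    # Simplification: adjacent entities of the same type are merged.
    tgt_tags: List[str] = []
    prev: Optional[str] = None
    for ent_type in tgt_types:
        if ent_type is None:
            tgt_tags.append("O")
        elif ent_type == prev:
            tgt_tags.append("I-" + ent_type)
        else:
            tgt_tags.append("B-" + ent_type)
        prev = ent_type
    return tgt_tags


# Illustrative (hypothetical) sentence pair and alignment.
src_tokens = ["Tofig", "studies", "at", "ADA", "University"]
src_tags = ["B-PERSON", "O", "O", "B-ORG", "I-ORG"]
tgt_tokens = ["Tofiq", "ADA", "Universitetində", "oxuyur"]
alignment = [(0, 0), (1, 3), (3, 1), (4, 2)]  # "at" is left unaligned

projected = project_bio_tags(src_tags, alignment, len(tgt_tokens))
for token, tag in zip(tgt_tokens, projected):
    print(f"{token}\t{tag}")
```

Unaligned target tokens default to "O" in this sketch, which is one plausible way for translation and alignment errors to surface as missed entities in a silver-standard corpus of the kind the abstract mentions.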

Files in this item

Files	Size	Format	View
There are no files associated with this item.

The following license files are associated with this item:

Creative Commons

This item appears in the following Collection(s)

School of Information Technologies and Engineering

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States
