ADA Library Digital Repository

Improving Named Entity Recognition for the Azerbaijani Language Using Neural and Non-Neural CRF-Based Models


dc.contributor.author Aliyev, Tofig
dc.date.accessioned 2025-10-27T10:29:34Z
dc.date.available 2025-10-27T10:29:34Z
dc.date.issued 2025-04
dc.identifier.uri http://hdl.handle.net/20.500.12181/1503
dc.description.abstract This thesis studies Named Entity Recognition (NER) for the Azerbaijani language, which is both morphologically rich and low-resource. Because annotated data are scarce, a semi-automated cross-lingual annotation strategy is proposed: annotate English sentences with spaCy, translate the annotated sentences into Azerbaijani, align the tokens, and project the entities through the aligned tokens. Using this procedure, a silver-standard Azerbaijani NER corpus was developed containing approximately 90,000 sentences, 1.17 million tokens, and 37 entity types. Five models were developed and compared: a CRF baseline with handcrafted features, a BiLSTM-CRF model with FastText embeddings, a BERT-BiLSTM-CRF model, an ALBERTo-BiLSTM-CRF model, and an XLM-RoBERTa-BiLSTM-CRF model. All models were trained and evaluated on the same dataset under the same procedure, using token-level and entity-level metrics: precision, recall, F1 score, and token accuracy. The experimental results show that multilingual transformers significantly outperformed the traditional approach based on handcrafted features. Notably, the XLM-RoBERTa-BiLSTM-CRF model achieved the best results, with an overall weighted F1-score of 0.7918, precision of 0.8505, recall of 0.7574, and token accuracy of 0.7823. However, challenges persisted in recognizing rare entities, handling complex morphology, and managing translation artifacts in the automatically generated dataset. The study indicates the usefulness of transfer learning and multilingual pretrained models for low-resource languages.
Future work on Azerbaijani NER should incorporate explicit morphological analysis, augment the training data with multilingual or synthetic data, investigate ensemble models, and build a manually annotated gold-standard NER corpus to improve model performance. The research offers a strong baseline for further improvements in Azerbaijani language processing, and its results offer insights for future work on other low-resource and morphologically rich languages. en_US
dc.language.iso en en_US
dc.publisher ADA University en_US
dc.rights Attribution-NonCommercial-NoDerivs 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/us/ *
dc.subject Computational linguistics -- Azerbaijani language en_US
dc.subject Machine learning -- Natural language processing en_US
dc.subject Morphology (Linguistics) -- Computational models en_US
dc.subject Low-resource languages -- Language processing en_US
dc.subject Corpora (Linguistics) -- Azerbaijani language en_US
dc.title Improving Named Entity Recognition for the Azerbaijani Language Using Neural and Non-Neural CRF-Based Models en_US
dc.type Thesis en_US
dcterms.accessRights Absolute Embargo (No access without the author's permission)
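
The projection step described in the abstract (align tokens, then project entities through the aligned tokens) can be illustrated with a minimal Python sketch. This is not the thesis's implementation, which is under embargo. The function name `project_bio_labels`, the BIO tagging scheme, the alignment format (pairs of source and target token indices), and the example sentence are all assumptions made here for illustration. The sketch also simplifies by merging adjacent target tokens of the same entity type into a single entity.

```python
from typing import List, Optional, Tuple


def project_bio_labels(
    src_labels: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens onto target tokens.

    `alignment` holds (source_index, target_index) pairs, as a word
    aligner might produce. Hypothetical helper, not from the thesis.
    """
    # Step 1: each target token inherits the entity type of the first
    # aligned source token that carries an entity label.
    tgt_types: List[Optional[str]] = [None] * tgt_len
    for src_idx, tgt_idx in alignment:
        label = src_labels[src_idx]
        if label != "O" and tgt_types[tgt_idx] is None:
            tgt_types[tgt_idx] = label.split("-", 1)[1]

    # Step 2: rebuild BIO prefixes in target word order, since
    # translation may reorder the tokens of the sentence.
    tgt_labels: List[str] = []
    prev: Optional[str] = None
    for etype in tgt_types:
        if etype is None:
            tgt_labels.append("O")
        elif etype == prev:
            tgt_labels.append("I-" + etype)
        else:
            tgt_labels.append("B-" + etype)
        prev = etype
    return tgt_labels


# Illustrative example (invented sentence pair, assumed alignment):
# English:     Leyla studies at ADA University
# Azerbaijani: Leyla ADA Universitetində oxuyur
src_labels = ["B-PERSON", "O", "O", "B-ORG", "I-ORG"]
alignment = [(0, 0), (1, 3), (2, 2), (3, 1), (4, 2)]
projected = project_bio_labels(src_labels, alignment, tgt_len=4)
print(projected)  # ['B-PERSON', 'B-ORG', 'I-ORG', 'O']
```

Rebuilding the B-/I- prefixes on the target side, rather than copying them from the source, keeps the tag sequence valid when the translation changes word order, as it does in this verb-final Azerbaijani example.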


Files in this item


There are no files associated with this item.


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States.
