| dc.contributor.author | Aliyev, Tofig | |
| dc.date.accessioned | 2025-10-27T10:29:34Z | |
| dc.date.available | 2025-10-27T10:29:34Z | |
| dc.date.issued | 2025-04 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12181/1503 | |
| dc.description.abstract | This thesis studies Named Entity Recognition (NER) for the Azerbaijani language. Azerbaijani is both a morphologically rich and a low-resource language. Because annotated data is scarce, a semi-automated cross-lingual annotation strategy is proposed: English sentences are annotated with spaCy, the annotated sentences are translated into Azerbaijani, the tokens are aligned, and the entities are projected through the aligned tokens. Using this procedure, a silver-standard Azerbaijani NER corpus was developed containing approximately 90,000 sentences, 1.17 million tokens, and 37 entity types. Five models were developed and compared: a CRF baseline with handcrafted features, a BiLSTM-CRF model with FastText embeddings, a BERT-BiLSTM-CRF model, an ALBERTo-BiLSTM-CRF model, and an XLM-RoBERTa-BiLSTM-CRF model. All models were trained on the same dataset and evaluated under the same protocol, using token-level and entity-level metrics: precision, recall, F1 score, and token accuracy. The experiments showed that multilingual transformers significantly outperformed traditional feature extraction methods. Notably, the XLM-RoBERTa-BiLSTM-CRF model achieved the best results, with an overall weighted F1-score of 0.7918, precision of 0.8505, recall of 0.7574, and token accuracy of 0.7823. However, challenges persisted in recognizing rare entities, handling complex morphology, and managing translation artifacts in the automatically generated dataset. The study indicates the usefulness of transfer learning and multilingual pretrained models for low-resource languages. Future work in NER for the Azerbaijani language should incorporate explicit morphological analysis, augment the training data with multilingual or synthetic data, investigate ensemble models, and build a manually annotated gold-standard NER corpus to improve the performance of Azerbaijani NER models. The research offers a strong baseline for further improvements in Azerbaijani language processing, and the results offer insights for future work on other low-resource and morphologically rich languages. | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | ADA University | en_US |
| dc.rights | Attribution-NonCommercial-NoDerivs 3.0 United States | * |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/us/ | * |
| dc.subject | Computational linguistics -- Azerbaijani language | en_US |
| dc.subject | Machine learning -- Natural language processing | en_US |
| dc.subject | Morphology (Linguistics) -- Computational models | en_US |
| dc.subject | Low-resource languages -- Language processing | en_US |
| dc.subject | Corpora (Linguistics) -- Azerbaijani language | en_US |
| dc.title | Improving Named Entity Recognition for the Azerbaijani Language Using Neural and Non-Neural CRF-Based Models | en_US |
| dc.type | Thesis | en_US |
| dcterms.accessRights | Absolute Embargo (No access without the author's permission) | |

| Files | Size | Format | View |
|---|---|---|---|
| There are no files associated with this item. | | | |

The following license files are associated with this item: