Abstract:
This thesis studies Named Entity Recognition (NER) for the Azerbaijani language, which is
both morphologically rich and low-resource. Because annotated data is scarce, a semi-automated cross-lingual
annotation strategy is proposed: English sentences are annotated with spaCy, the annotated
sentences are translated into Azerbaijani, the tokens of each sentence pair are aligned,
and the entities are projected through the aligned tokens. Using this procedure, a semi-automated
silver-standard Azerbaijani NER corpus was developed, containing approximately 90,000
sentences, 1.17 million tokens, and 37 entity types.
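
The projection step described above can be illustrated with a minimal sketch. It assumes BIO-style tags and an alignment given as (source index, target index) pairs; the function name `project_entities`, the tag scheme, and the example sentence are illustrative assumptions, not the pipeline actually used in the thesis.

```python
from typing import List, Optional, Tuple


def project_entities(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens onto target tokens.

    Hypothetical helper: `alignment` holds (source_index, target_index)
    pairs produced by some word aligner (not shown here).
    """
    # Step 1: copy the entity type of each tagged source token to the
    # target tokens aligned with it. The first aligned type wins.
    tgt_types: List[Optional[str]] = [None] * tgt_len
    for s, t in alignment:
        tag = src_tags[s]
        if tag != "O" and tgt_types[t] is None:
            tgt_types[t] = tag.split("-", 1)[1]

    # Step 2: rebuild BIO prefixes in target word order.
    # Simplification: adjacent target tokens with the same type are
    # merged into a single entity span.
    tgt_tags: List[str] = []
    prev: Optional[str] = None
    for typ in tgt_types:
        if typ is None:
            tgt_tags.append("O")
        elif typ == prev:
            tgt_tags.append("I-" + typ)
        else:
            tgt_tags.append("B-" + typ)
        prev = typ
    return tgt_tags


# English: "Ilham Aliyev visited Baku yesterday"
src_tags = ["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]
# Azerbaijani: "İlham Əliyev dünən Bakıya səfər etdi"
alignment = [(0, 0), (1, 1), (2, 4), (2, 5), (3, 3), (4, 2)]
projected = project_entities(src_tags, alignment, tgt_len=6)
print(projected)
```

Because Azerbaijani attaches case suffixes to names (for example "Bakıya"), the projected tag covers the whole inflected token, which is one way morphology and alignment errors can introduce noise into a silver-standard corpus.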
Five models were developed and compared: a CRF baseline with handcrafted features, a
BiLSTM-CRF model with FastText embeddings, a BERT-BiLSTM-CRF model, an
ALBERTo-BiLSTM-CRF model, and an XLM-RoBERTa-BiLSTM-CRF model. All models were
trained on the same dataset with the same training and evaluation procedure, and were
assessed with token-level and entity-level metrics: precision, recall, F1 score, and token
accuracy.
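
As a rough illustration of the token-level metrics named above, the following sketch computes support-weighted precision, recall, and F1 together with token accuracy. The function name `token_metrics` and the choice to weight every gold label (including "O") by its support are assumptions made for this example; the exact evaluation protocol of the thesis is not specified here.

```python
from collections import Counter
from typing import Dict, List


def token_metrics(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Support-weighted token-level precision, recall, F1, and accuracy."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it does not belong
            fn[g] += 1  # missed an occurrence of gold label g

    total = len(gold)
    weighted = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, support in Counter(gold).items():
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        prec = tp[label] / p_den if p_den else 0.0
        rec = tp[label] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        # Each label contributes in proportion to its gold frequency.
        weight = support / total
        weighted["precision"] += weight * prec
        weighted["recall"] += weight * rec
        weighted["f1"] += weight * f1

    weighted["accuracy"] = sum(tp.values()) / total
    return weighted


gold = ["B-PER", "I-PER", "O", "O"]
pred = ["B-PER", "O", "O", "O"]
print(token_metrics(gold, pred))
```

Note that a weighted F1 score averages the per-label F1 values, so it does not have to equal the harmonic mean of the weighted precision and the weighted recall.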
The experimental results show that multilingual transformers significantly outperformed
traditional feature-based methods. Notably, the XLM-RoBERTa-BiLSTM-CRF model
achieved the best results, with an overall weighted F1-score of 0.7918, precision of 0.8505,
recall of 0.7574, and token accuracy of 0.7823. However, challenges persisted in
recognizing rare entities, handling complex morphology, and managing translation
artifacts in the automatically generated dataset.
The study indicates the usefulness of transfer learning and multilingual pretrained models
for low-resource languages. Future work on NER for the Azerbaijani language should
incorporate explicit morphological analysis, augment the training data with multilingual or
synthetic data, investigate ensemble models, and build a manually annotated gold-standard
NER corpus to improve the performance of Azerbaijani NER models. This research
offers a strong baseline for further improvements in the processing of the
Azerbaijani language, and its results offer insights for future work
on other low-resource and morphologically rich languages.