Abstract:
The detection of fake news has become an increasingly critical challenge in the digital information age,
particularly for low-resource languages such as Azerbaijani, for which few datasets and tools for
automated misinformation detection are available. This thesis explores the adaptation of large multilingual transformer models,
specifically BERT and RoBERTa, for Azerbaijani fake news classification. A labeled corpus was
constructed from a version of the LIAR dataset translated into Azerbaijani, with each statement assigned
to one of two classes, "Real" or "Fake." Extensive preprocessing, including metadata integration and
statement-length analysis to inform tokenization, was conducted to prepare the data for model training.
Both models were fine-tuned and evaluated on a held-out test set of 1,267 examples. The multilingual
BERT model achieved an overall accuracy of 68%, outperforming the RoBERTa model, which
reached 65%. However, class-level analysis revealed that RoBERTa was better balanced,
achieving higher recall on the minority "Real" class, whereas BERT was more strongly
biased toward the majority "Fake" class. Despite BERT’s higher overall accuracy, RoBERTa’s more
equitable performance across classes suggests it may be better suited to applications where balanced
prediction is critical.
The results validate the feasibility of adapting transformer models for Azerbaijani fake news detection
but also highlight persistent challenges, particularly in achieving high accuracy on minority classes.
Future directions are suggested, including dataset expansion with native Azerbaijani sources, domain-adaptive pretraining, and multimodal approaches. This work lays a foundation for advancing automated
misinformation detection tools for underrepresented languages.