Abstract:
People now connect through social media, not only through in-person interactions.
Every year, new social media applications appear, bringing greater user engagement
and a growing volume of shared posts. Effective machine learning models are therefore
needed to classify such posts. A key difficulty lies in classifying multimodal social
media content: if a system cannot identify a post as harmful or misleading,
misinformation and its associated social risks can spread. This thesis
investigates whether multi-modal NLP models can effectively classify social media posts.
The focus is on posts in the Azerbaijani language, which introduces the additional
obstacle of limited dataset availability for low-resource languages. Thus, a dataset was
prepared of approximately 10,000 Azerbaijani-language social media posts containing textual
and visual data. The dataset underwent extensive preprocessing, including text tokenization,
image resizing to 224×224 pixels, and feature normalization. We evaluated several models,
including FLAVA, BLIP, ViLT, ALBEF, and custom BERT+ResNet-based fusion baselines. Early, late, and
hybrid fusion strategies were used to evaluate multimodal classification effectiveness.
Performance was assessed using accuracy, macro-F1, and evaluation loss, with results
contextualized against established benchmarks in multimodal classification. FLAVA
achieved the strongest results (87.6% accuracy, 87.1% macro-F1), demonstrating
effective cross-modal representation learning. After task-specific fine-tuning, BLIP
reached 83.8% accuracy and 83.3% macro-F1. Among the fusion baselines, BERT+ResNet
with early fusion performed well (85.5% accuracy, 85.2% macro-F1), highlighting the
potential of lightweight alternatives. ViLT delivered reasonable results (83.2%
accuracy, 82.7% macro-F1) as an efficient transformer-only solution. ALBEF, while
architecturally promising due to its hybrid fusion and contrastive alignment,
underperformed on this task (66.6% accuracy, 48.9% macro-F1), possibly due to
vocabulary mismatch and inadequate adaptation to Azerbaijani content. The results reveal the
trade-offs between accuracy, model complexity, and computational efficiency in multimodal
NLP, especially in low-resource settings such as Azerbaijani. Future work will focus on
attention-based fusion architectures, dataset expansion to better capture linguistic
variation, and explainable artificial intelligence tools such as Grad-CAM and SHAP. Overall, this
study aims to build inclusive, multimodal artificial intelligence systems for low-resource
languages that support social media monitoring and analysis.