Abstract:
In this thesis, we have investigated the challenges of building multimodal vision-language models for image retrieval tasks in low-resource languages, particularly Azerbaijani. This task is challenging for low-resource languages for several reasons. The first is the limitations of large-scale vision-language models, such as CLIP, which do not support approximately 90% of low-resource languages. The second is the computational cost of adapting such models, which remains high even with Parameter-Efficient Fine-Tuning (PEFT) methods. We have therefore explored the integration of multilingual BERT with base image encoder models to build custom models from the ground up for these languages. Our investigation covers a variety of model architectures, including ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer, alongside the multilingual BERT model, to evaluate performance across different datasets. Our findings show significant variations in model performance, influenced by
data quality and annotation richness. For instance, models generally achieve better in-domain performance on the MSCOCO dataset than on the Flickr datasets, owing to MSCOCO's comprehensive annotations and diverse image content of more than 300K images in total. To address these challenges, our study includes the generation of synthetic datasets through
machine translation into Azerbaijani and image augmentation, along with a comparative analysis of various encoder models to establish efficient, cost-effective training strategies for low-resource languages. Augmented image data boosted model performance, with EfficientNet0 achieving 0.87 MAP on Flickr30k, although almost all models struggled with out-of-domain generalization. Tiny Swin Transformer exhibited consistent adaptability across datasets, with 0.80
MAP scores. Our approach not only enhances model adaptability across different domains but
also contributes to the broader application of vision-language retrieval systems in low-resource
languages. By sharing our configurations and results, we aim to facilitate further research
and technological adaptation across diverse linguistic landscapes. We release our code and
pre-trained model weights at https://github.com/aliasgerovs/azclip.