ADA Library Digital Repository

Investigation of Multimodal Vision-Language Tasks in Low-Resource Languages


dc.contributor.author Asgarov, Ali
dc.date.accessioned 2025-04-11T06:53:48Z
dc.date.available 2025-04-11T06:53:48Z
dc.date.issued 2024-04
dc.identifier.uri http://hdl.handle.net/20.500.12181/1121
dc.description.abstract In this thesis, we investigate the challenges of building multimodal vision-language models for image retrieval in low-resource languages, particularly Azerbaijani. The task is difficult for several reasons. First, large-scale vision-language models such as CLIP do not support roughly 90% of low-resource languages. Second, training such models remains computationally demanding even with Parameter-Efficient Fine-Tuning (PEFT) methods. We therefore explore integrating multilingual BERT with base image encoders to build custom models from the ground up for these languages. Our investigation covers a variety of architectures, including ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer, each paired with the multilingual BERT model, and evaluates their performance across different datasets. Our findings show significant variation in model performance, influenced by data quality and annotation richness. For instance, models generally achieve better in-domain performance on the MSCOCO dataset than on the Flickr datasets, owing to MSCOCO's comprehensive annotations and diverse image content of more than 300K images in total. To address these challenges, our study includes the generation of synthetic Azerbaijani datasets through machine translation, image augmentation, and a comparative analysis of various encoder models to establish efficient, cost-effective training strategies for low-resource languages. Augmented image data boosted model performance, with EfficientNet0 achieving 0.87 MAP on Flickr30k, although almost all models struggled with out-of-domain generalization. The Tiny Swin Transformer adapted well across datasets, with consistent MAP scores of 0.80. Our approach not only improves model adaptability across domains but also contributes to the broader application of vision-language retrieval systems in low-resource languages. By sharing our configurations and results, we aim to facilitate further research and technological adaptation across diverse linguistic landscapes. We release our code and pre-trained model weights at https://github.com/aliasgerovs/azclip en_US
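
The modeling idea summarized in the abstract (a multilingual BERT text encoder paired with a base image encoder and trained for retrieval) can be sketched as a CLIP-style dual encoder. The sketch below is illustrative only, assuming PyTorch, Hugging Face transformers, a ResNet50 backbone, a 256-dimensional shared embedding space, and an in-batch contrastive loss; these choices are assumptions, not the thesis' exact configuration, which is available at https://github.com/aliasgerovs/azclip.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel
from torchvision.models import resnet50

class DualEncoder(nn.Module):
    """Multilingual BERT text encoder + ResNet50 image encoder in a shared space."""

    def __init__(self, embed_dim: int = 256):  # embed_dim is an assumed value
        super().__init__()
        # Multilingual text encoder; covers Azerbaijani among its ~100 languages.
        self.text_encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        # Image encoder with the classification head removed (2048-d pooled features).
        backbone = resnet50(weights="IMAGENET1K_V2")
        backbone.fc = nn.Identity()
        self.image_encoder = backbone
        # Linear projections into the shared retrieval space.
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        self.image_proj = nn.Linear(2048, embed_dim)
        # Learnable temperature, initialized to ln(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, input_ids, attention_mask, images):
        # Use the [CLS] token embedding as the caption representation.
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_emb = F.normalize(self.text_proj(text_out.last_hidden_state[:, 0]), dim=-1)
        img_emb = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        # Caption-to-image similarity logits for the in-batch contrastive loss.
        return self.logit_scale.exp() * text_emb @ img_emb.t()

def clip_style_loss(logits: torch.Tensor) -> torch.Tensor:
    # Matching caption-image pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

At retrieval time, caption and image embeddings are ranked by cosine similarity, and mean average precision (MAP) is computed over the ranked results, which is the metric behind the 0.80 and 0.87 figures reported in the abstract.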
dc.language.iso en en_US
dc.publisher ADA University en_US
dc.rights Attribution-NonCommercial-NoDerivs 3.0 United States
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.subject Deep learning -- Applications en_US
dc.subject Low-resource languages -- Computational linguistics en_US
dc.subject Computer vision -- Image retrieval en_US
dc.subject Multilingual models (Artificial intelligence) en_US
dc.title Investigation of Multimodal Vision-Language Tasks in Low-Resource Languages en_US
dc.type Thesis en_US



