Abstract:
In this thesis, we have investigated the challenges of building multimodal vision-language models for image retrieval tasks in low-resource languages, particularly Azerbaijani. This task is challenging for low-resource languages for several reasons. The first is the limitations of large-scale vision-language models, such as CLIP, which do not support approximately 90% of low-resource languages. The second is the computational cost of adapting such models, which remains high even with Parameter-Efficient Fine-Tuning (PEFT) methods. We have therefore explored the integration of multilingual BERT with base image encoder models to build custom models from the ground up for these languages. Our investigation covers a variety of model architectures, including ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer, alongside the multilingual BERT model, to evaluate performance across different datasets. Our findings show significant variations in model performance, influenced by
data quality and annotation richness. For instance, models generally achieve better in-domain performance on the MSCOCO dataset than on the Flickr datasets, owing to MSCOCO's comprehensive annotations and diverse image content of more than 300K images in total. To address these challenges, our study includes the generation of synthetic datasets through
machine translation into Azerbaijani and image augmentation, along with a comparative analysis of various encoder models to establish efficient, cost-effective training strategies for low-resource languages. Augmented image data boosted model performance, with EfficientNet0 achieving 0.87 MAP on Flickr30k, although almost all models struggled with out-of-domain generalization. Tiny Swin Transformer exhibited consistent adaptability across datasets, with 0.80
MAP scores. Our approach not only enhances model adaptability across different domains but
also contributes to the broader application of vision-language retrieval systems in low-resource
languages. By sharing our configurations and results, we aim to facilitate further research
and technological adaptation across diverse linguistic landscapes. We release our code and
pre-trained model weights at https://github.com/aliasgerovs/azclip.