ADA Library Digital Repository

Investigation of Multimodal Vision-Language Tasks in Low-Resource Languages


dc.contributor.author Asgarov, Ali
dc.date.accessioned 2025-04-11T06:53:48Z
dc.date.available 2025-04-11T06:53:48Z
dc.date.issued 2024-04
dc.identifier.uri http://hdl.handle.net/20.500.12181/1121
dc.description.abstract In this thesis, we investigate the challenges of building multimodal vision-language models for image retrieval in low-resource languages, particularly Azerbaijani. The task is difficult for several reasons. First, large-scale vision-language models such as CLIP do not support roughly 90% of low-resource languages. Second, training such models remains computationally demanding even with Parameter-Efficient Fine-Tuning (PEFT) methods. We therefore explore integrating multilingual BERT with base image encoders to build custom models from the ground up for these languages. Our investigation covers a variety of architectures, including ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer, each paired with the multilingual BERT model, and evaluates their performance across different datasets. Our findings show significant variation in model performance, influenced by data quality and annotation richness. For instance, models generally achieve better in-domain performance on the MSCOCO dataset than on the Flickr datasets, owing to MSCOCO's comprehensive annotations and diverse image content of more than 300K images in total. To address these challenges, our study includes the generation of synthetic Azerbaijani datasets through machine translation, image augmentation, and a comparative analysis of various encoder models to establish efficient, cost-effective training strategies for low-resource languages. Augmented image data boosted model performance, with EfficientNet0 achieving 0.87 MAP on Flickr30k, although almost all models struggled with out-of-domain generalization. The Tiny Swin Transformer adapted well across datasets, with consistent MAP scores of 0.80. Our approach not only improves model adaptability across domains but also contributes to the broader application of vision-language retrieval systems in low-resource languages. By sharing our configurations and results, we aim to facilitate further research and technological adaptation across diverse linguistic landscapes. We release our code and pre-trained model weights at https://github.com/aliasgerovs/azclip en_US
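
The modeling idea summarized in the abstract (a multilingual BERT text encoder paired with a base image encoder and trained for retrieval) can be sketched as a CLIP-style dual encoder. The sketch below is illustrative only, assuming PyTorch, Hugging Face transformers, a ResNet50 backbone, a 256-dimensional shared embedding space, and an in-batch contrastive loss; these choices are assumptions, not the thesis' exact configuration, which is available at https://github.com/aliasgerovs/azclip.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel
from torchvision.models import resnet50

class DualEncoder(nn.Module):
    """Multilingual BERT text encoder + ResNet50 image encoder in a shared space."""

    def __init__(self, embed_dim: int = 256):  # embed_dim is an assumed value
        super().__init__()
        # Multilingual text encoder; covers Azerbaijani among its ~100 languages.
        self.text_encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        # Image encoder with the classification head removed (2048-d pooled features).
        backbone = resnet50(weights="IMAGENET1K_V2")
        backbone.fc = nn.Identity()
        self.image_encoder = backbone
        # Linear projections into the shared retrieval space.
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        self.image_proj = nn.Linear(2048, embed_dim)
        # Learnable temperature, initialized to ln(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, input_ids, attention_mask, images):
        # Use the [CLS] token embedding as the caption representation.
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_emb = F.normalize(self.text_proj(text_out.last_hidden_state[:, 0]), dim=-1)
        img_emb = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        # Caption-to-image similarity logits for the in-batch contrastive loss.
        return self.logit_scale.exp() * text_emb @ img_emb.t()

def clip_style_loss(logits: torch.Tensor) -> torch.Tensor:
    # Matching caption-image pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

At retrieval time, caption and image embeddings are ranked by cosine similarity, and mean average precision (MAP) is computed over the ranked results, which is the metric behind the 0.80 and 0.87 figures reported in the abstract.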
dc.language.iso en en_US
dc.publisher ADA University en_US
dc.rights Attribution-NonCommercial-NoDerivs 3.0 United States
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.subject Deep learning -- Applications en_US
dc.subject Low-resource languages -- Computational linguistics en_US
dc.subject Computer vision -- Image retrieval en_US
dc.subject Multilingual models (Artificial intelligence) en_US
dc.title Investigation of Multimodal Vision-Language Tasks in Low-Resource Languages en_US
dc.type Thesis en_US



