Abstract:
Recent advances in large language models (LLMs) have significantly expanded the capabilities
of conversational AI systems, enabling coherent, context-aware dialogue across a wide range
of topics. Although LLMs exhibit impressive fluency and expressiveness, many studies
show that they are prone to hallucination, which is highly problematic in domains
where factual reliability is required. This research investigates the use of a structured knowledge base
to mitigate hallucination in multi-hop question-answering tasks, focusing on a
low-resource language, Azerbaijani. Based on this investigation, we adopted the
Graph-Constrained Reasoning framework, which integrates Knowledge Graph (KG) structures directly into
the decoding process of LLMs: it constructs a KG-Trie from the given dataset and enforces
graph-based constraints while the model generates reasoning paths. As a preliminary step, we
translated the WebQSP and ComplexWebQuestions datasets into Azerbaijani. Within this
framework, we fine-tuned multilingual mT5 models (small, base, and large variants) on
the translated datasets and evaluated their performance against a monolingual English
baseline model, Qwen2-0.5B. Initial results indicated underfitting due to the limited dataset size. This
issue motivated the development of a data augmentation strategy that generates multiple reasoning
paths from all combinations of question and answer entities. The augmentation significantly expanded
the training corpus and led to considerable performance improvements across all mT5 models. The
final evaluation shows that the mT5-large variant achieved an F1 score of 75.2 and a HIT score of
67.0 on the WebQSP dataset, substantially narrowing the performance gap with the Qwen2-0.5B
baseline, although none of the mT5 models surpassed the Qwen2-0.5B model in F1 score.
This study shows that structured grounding, achieved through graph-constrained decoding, can
improve factual reliability in multilingual large language models.
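
To make the decoding constraint summarized above concrete, the following minimal Python sketch illustrates the general idea of a KG-Trie: a prefix trie built over linearized knowledge-graph paths, queried at each generation step for the continuations the graph permits. This is an illustrative sketch, not the framework's actual implementation: the whitespace tokenization, the sample triples, and the helper names (build_kg_trie, allowed_next_tokens) are all hypothetical assumptions. In practice, the trie would typically be built over the LLM tokenizer's token IDs and used to mask the model's next-token distribution during decoding.

# Illustrative sketch only; names, tokenization, and triples are hypothetical.

def build_kg_trie(paths):
    """Build a prefix trie over tokenized reasoning paths."""
    trie = {}
    for path in paths:
        node = trie
        for token in path.split():   # toy whitespace tokenization
            node = node.setdefault(token, {})
        node["<end>"] = {}           # marks a complete reasoning path
    return trie

def allowed_next_tokens(trie, prefix_tokens):
    """Return the tokens the trie permits after the given prefix.

    During constrained decoding, only these continuations would
    receive nonzero probability in the model's next-token distribution.
    """
    node = trie
    for token in prefix_tokens:
        if token not in node:
            return []                # prefix left the graph: no valid continuation
        node = node[token]
    return list(node.keys())

# Hypothetical linearized KG paths (entity relation entity).
paths = [
    "Baku capital_of Azerbaijan",
    "Baku located_in Azerbaijan",
]
trie = build_kg_trie(paths)

print(allowed_next_tokens(trie, []))                      # ['Baku']
print(allowed_next_tokens(trie, ["Baku"]))                # ['capital_of', 'located_in']
print(allowed_next_tokens(trie, ["Baku", "capital_of"]))  # ['Azerbaijan']

Because every generated reasoning path must correspond to a root-to-leaf walk through the trie, the model cannot emit a relation or entity absent from the knowledge graph, which is the grounding property the abstract attributes to graph-constrained decoding.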