Abstract:
This study explores the potential of small-scale Large Language Models (LLMs) to handle causal reasoning tasks through two primary interventions: dataset augmentation and instruction fine-tuning. The CRASS and Tübingen datasets were augmented with GPT-4 to include additional examples and prompts covering a wider range of cause-effect relationships and counterfactual reasoning scenarios. The augmented dataset was further extended with an Azerbaijani-language counterpart produced with Google's Multilingual BERT (mBERT) and manual corrections, addressing the scarcity of language-modelling resources for Azerbaijani. Although the dataset is still modest in size, 20,000 samples at the time of publication, this contribution is valuable because it enables researchers to evaluate future Azerbaijani-capable LLMs on reasoning tasks.
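As an illustration of the augmentation step, the sketch below shows how new counterfactual examples could be generated with GPT-4 through the OpenAI Python client; the prompt wording and the augment_example helper are illustrative assumptions, not the exact prompts used in this study.

```python
# Illustrative sketch of GPT-4-based dataset augmentation.
# The prompt text and helper name are assumptions for demonstration only.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment


def augment_example(cause: str, effect: str) -> str:
    """Ask GPT-4 for a counterfactual variant of a cause-effect pair."""
    prompt = (
        f"Given the cause '{cause}' and the effect '{effect}', write a new "
        "counterfactual question with one correct answer and three distractors."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content


print(augment_example("heavy rainfall", "flooded streets"))
```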
While investigating the performance of small-scale LLMs on reasoning tasks, we identified several unexpected failure modes in which the models could not follow the instructions given in the prompt. Another notable observation was that even GPT-4 tended to prefer the options presented first in multiple-choice questions. This phenomenon was less pronounced in smaller models such as gemma-7b-it and Mistral-7B-int8, and could potentially be attributed to how transformer attention is distributed over the surrounding context.
We then performed instruction fine-tuning on Google's gemma-7b-it model using the augmented Tübingen and CRASS datasets. The Low-Rank Adaptation (LoRA) technique was used to reduce the computational requirements during training, with a cosine annealing learning-rate scheduler and cross-entropy loss employed for quality control. Fine-tuning was carried out both on high-end Nvidia V100 GPUs and on a consumer-level Nvidia RTX 3090, using the transformers library for model training and the WandB service for monitoring and logging. The fine-tuning step used full-precision computation without quantization. The evaluation presented in this study demonstrates that the performance of small-scale LLMs on causal reasoning tasks can be significantly improved through a combination of dataset augmentation and instruction fine-tuning.
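For concreteness, a minimal sketch of this LoRA fine-tuning setup is given below, assuming the Hugging Face transformers, peft, and datasets libraries; the hyperparameters, output path, and toy training example are placeholders rather than the exact configuration used in the study.

```python
# Minimal LoRA fine-tuning sketch for gemma-7b-it, assuming the Hugging Face
# transformers/peft/datasets stack. Hyperparameters and data are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import Dataset

model_id = "google/gemma-7b-it"  # gated model; requires Hugging Face access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # full precision, no quantization

# LoRA: train small low-rank adapter matrices instead of all base weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

# Toy stand-in for the augmented CRASS/Tübingen instruction data.
examples = {"text": ["Instruction: Identify the cause. Input: The streets "
                     "are flooded. Output: Heavy rainfall."]}
train_ds = Dataset.from_dict(examples).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"])

args = TrainingArguments(
    output_dir="gemma-7b-it-causal-ft",  # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",  # cosine annealing learning-rate schedule
    logging_steps=10,
    report_to="wandb",           # stream metrics to Weights & Biases
)

# Trainer applies the standard causal-LM cross-entropy loss by default.
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```

In this sketch, gradient accumulation is one way to keep the effective batch size reasonable when training on a single consumer GPU such as the RTX 3090.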
The results show that, when provided with properly augmented datasets and instructions for counterfactual scenarios, fine-tuned small-scale LLMs with only around 8 billion parameters achieved accuracies comparable to the much larger, 1.7-trillion-parameter GPT-4 model. The accuracy of the fine-tuned gemma-7b-it model improved from 53% to 85% on the Tübingen dataset, and from 72% to about 84% on the CRASS benchmark. We therefore conclude that smaller models can potentially match the performance of their much larger counterparts through higher dataset quality and targeted instruction fine-tuning for deeper causal reasoning. We also discuss how developing task-specific, expert small-scale LLMs adept at causal reasoning can pave the way for widespread adoption of such models across industries where analytical thinking, decision making, and problem solving are at the core of operations. In this context, our results suggest that by further fine-tuning small-scale LLMs to become experts at specific reasoning tasks, a network of small-scale LLMs that offer different perspectives on a problem through controlled bias, aiding humans in fast, accurate, and thorough decision making, is only a few papers away.
For reproducibility, the code base and the results of this study are available in this GitHub repository [?], while the Azerbaijani dataset and an interactive comparison dashboard of the LLMs fine-tuned or studied in this work are hosted in this HuggingFace repository.