TaCo: Enhancing Cross-Lingual Transfer for Low-Resource Languages in LLMs through Translation-Assisted Chain-of-Thought Processes

Abstract

Creating multilingual LLMs poses a significant challenge. Pretraining or fine-tuning LLMs to adopt new languages is very costly. Furthermore, benchmark datasets and the metrics used to measure model performance in multilingual settings are limited. This paper proposes cost-effective solutions to both challenges. First, we introduce the Multilingual Instruction-Tuning Dataset (MITS), comprising translations of Alpaca-52K, Dolly-15K, and the Vicuna Benchmark into 132 languages. Second, we propose a new method called TaCo: Translation-Assisted Cross-Linguality, which uses translations in a chain-of-thought process to instruction-tune LLMs on new languages through a curriculum-learning process. As a proof of concept, we experimented with the instruction-tuned Guanaco-33B model, performing further instruction tuning with our proposed TaCo method in three low-resource languages and one high-resource language. Our results indicate that the TaCo method achieves an 82% score, as judged by GPT-4, for a low-resource language on the Vicuna Benchmark dataset, double the performance of instruction tuning alone. Furthermore, TaCo shows promise for creating multilingual LLMs, even for low-resource languages. We have released our datasets and model adapters (https://github.com/UNHSAILLab/TaCo) and encourage the research community to use these resources to advance work on multilingual LLMs.
