SambaLingo: Teaching Large Language Models New Languages

Abstract

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.
