MaLA-500: Massive Language Adaptation of Large Language Models

Abstract

Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages. We release MaLA-500 at https://huggingface.co/MaLA-LM.
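The reported gains are in macro-average accuracy across languages. As a minimal sketch of that metric, assuming it is the unweighted mean of per-language accuracies, the Python below computes it from toy data. The function name, the record layout, and the language codes are illustrative assumptions and are not taken from the paper or its evaluation code.

```python
from collections import defaultdict


def macro_average_accuracy(records):
    """Return (macro_accuracy, per_language_accuracy).

    `records` is an iterable of (language, gold_label, predicted_label)
    tuples. This layout is an assumption made for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, gold, pred in records:
        total[lang] += 1
        correct[lang] += int(gold == pred)
    if not total:
        raise ValueError("no records given")
    # Accuracy is computed per language first, then averaged with equal
    # weight, so a language with few examples counts as much as a large one.
    per_lang = {lang: correct[lang] / total[lang] for lang in total}
    macro = sum(per_lang.values()) / len(per_lang)
    return macro, per_lang


# Toy data with made-up language codes and labels.
toy_records = [
    ("lang_a", "sports", "sports"),
    ("lang_a", "health", "health"),
    ("lang_a", "travel", "travel"),
    ("lang_a", "science", "health"),
    ("lang_b", "sports", "sports"),
    ("lang_b", "health", "travel"),
]

macro, per_lang = macro_average_accuracy(toy_records)
print(per_lang)  # per-language accuracies
print(macro)     # unweighted mean over languages
```

With this toy data the per-language accuracies are 0.75 and 0.5, so the macro average is 0.625, whereas pooling all six examples would give 4/6. The difference is why macro averaging suits a benchmark in which low-resource languages contribute few test examples.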
