What Drives Performance in Multilingual Language Models?

Abstract

This study investigates the factors influencing the performance of multilingual large language models (MLLMs) across diverse languages. We study 6 MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on the SIB-200 dataset, a topic classification dataset encompassing 204 languages. Our analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (not present or documented in the model's pretraining data in any meaningful way). We examine the impact of factors such as pretraining data size, general resource availability, language family, and script type on model performance. Decision tree analysis reveals that pretraining data size is the most influential factor for SEEN languages. For UNSEEN languages, however, script type and language family are crucial, highlighting the importance of cross-lingual transfer learning. Notably, model size and architecture do not significantly alter the most important features identified. Our findings provide valuable insights into the strengths and limitations of current MLLMs, and we hope they will guide the development of more effective and equitable multilingual NLP systems.
