How good are Large Language Models on African Languages?

Abstract

Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on tasks and languages they are not trained on. However, their performance on African languages is largely understudied relative to high-resource languages. We present an analysis of four popular large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks (topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition) across 60 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs perform worse on African languages, and that there is a large gap in performance compared to high-resource languages (such as English) for most tasks. We find that GPT-4 achieves average to good performance on classification tasks, yet its performance on generative tasks such as machine translation and summarization is significantly lacking. Surprisingly, we find that mT0 had the best overall performance for cross-lingual QA, better than the state-of-the-art supervised model (i.e. fine-tuned mT5) and GPT-4 on African languages. Similarly, we find that the recent Aya model has comparable results to mT0 on almost all tasks except topic classification, where it outperforms mT0. Overall, LLaMa 2 showed the worst performance, which we believe is due to its English and code-centric (around 98%) pre-training corpus. Our findings confirm that performance on African languages remains a hurdle for current LLMs, underscoring the need for additional efforts to close this gap.
