IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages

Abstract

As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench, the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks such as cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation, providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs, including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM, and LLaMA, on IndicGenBench in a variety of settings. The largest PaLM-2 models perform the best on most tasks; however, there is a significant performance gap in all languages compared to English, showing that further research is needed for the development of more inclusive multilingual language models. IndicGenBench is released at www.github.com/google-research-datasets/indic-gen-bench
