Fumbling in Babel: An Investigation into ChatGPT's Language Identification Ability

Abstract

ChatGPT has recently emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. To uncover which languages ChatGPT 'knows', we investigate its language identification (LID) abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 24 language families spoken in five continents. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource. We then study ChatGPT's (both GPT-3.5 and GPT-4) ability to (i) identify language names and language codes, (ii) under zero- and few-shot conditions, (iii) with and without provision of a label set. When compared to smaller finetuned LID tools, we find that ChatGPT lags behind. For example, it has poor performance on African languages. We conclude that current large language models would benefit from further development before they can sufficiently serve diverse communities.
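
The three experimental axes named in the abstract (name vs. code, zero- vs. few-shot, with vs. without a label set) form a 2 x 2 x 2 grid of prompting conditions. The following is a minimal sketch of how such a grid might be enumerated and turned into prompts. The function names, the prompt wording, and the example labels are all hypothetical illustrations, not the prompts used in the paper, and no model API is called.

```python
from itertools import product


def build_prompt(text, target="name", examples=(), label_set=None):
    """Assemble one LID prompt. The wording is illustrative, not the paper's.

    target    -- "name" or "code": what the model is asked to return
    examples  -- (text, label) pairs; an empty tuple gives a zero-shot prompt
    label_set -- optional list of candidate labels to choose from
    """
    lines = [f"Identify the language {target} of the given text."]
    if label_set:
        lines.append("Choose one of: " + ", ".join(label_set))
    # Few-shot demonstrations, if any, come before the query.
    for ex_text, ex_label in examples:
        lines.append(f"Text: {ex_text}\nLanguage {target}: {ex_label}")
    # The query itself, with the answer slot left open for the model.
    lines.append(f"Text: {text}\nLanguage {target}:")
    return "\n".join(lines)


def conditions():
    """Enumerate the grid: target x shot setting x label-set provision."""
    return list(product(("name", "code"),
                        ("zero-shot", "few-shot"),
                        (True, False)))


if __name__ == "__main__":
    # Hypothetical demonstration pair and label set, for illustration only.
    demo = [("Bonjour tout le monde", "French")]
    labels = ["French", "Swahili", "Yoruba"]
    for target, shots, with_labels in conditions():
        prompt = build_prompt(
            "Habari ya asubuhi",
            target=target,
            examples=demo if shots == "few-shot" else (),
            label_set=labels if with_labels else None,
        )
        print(f"--- {target} / {shots} / label set: {with_labels}")
        print(prompt)
```

Keeping prompt construction separate from the condition grid means each of the eight settings can be scored independently against the same benchmark items.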
