Abstract
Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is especially designed to detect culturally-specific toxic language. We evaluate seven S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when judging holistically the toxicity of a prompt, and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g., microaggressions, bias). We release this dataset to help further reduce harmful uses of these models and improve their safe deployment.