Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Abstract

Large language models (LLMs) exhibit an excellent ability to understand human languages, but do they also understand their own language that appears as gibberish to us? In this work we delve into this question, aiming to uncover the mechanisms underlying such behavior in LLMs. We employ the Greedy Coordinate Gradient optimizer to craft prompts that compel LLMs to generate coherent responses from seemingly nonsensical inputs. We call these inputs LM Babel, and this work systematically studies the behavior of LLMs manipulated by these prompts. We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima compared to natural prompts. We further examine the structure of the Babel prompts and evaluate their robustness. Notably, we find that guiding the model to generate harmful texts is no more difficult than guiding it to generate benign texts, suggesting a lack of alignment for out-of-distribution prompts.
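
For readers unfamiliar with the perplexity measure mentioned above, the following is a minimal sketch of its standard definition: the exponential of the negative mean per-token log-probability. The function name and the example log-probabilities are hypothetical illustrations, not values or code from the paper, and the abstract does not state which model or tokenization is used to score target texts.

```python
import math

def perplexity(token_log_probs):
    """Return exp(-mean log-prob) for a sequence of natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token log-probabilities (assumed values, for illustration only).
fluent_target = [math.log(0.5)] * 4    # each token assigned probability 0.5
unusual_target = [math.log(0.05)] * 4  # each token assigned probability 0.05

print(perplexity(fluent_target))
print(perplexity(unusual_target))
```

Under this definition, a target text whose tokens the model finds less likely has a higher perplexity, which is the quantity the abstract relates to manipulation efficiency.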
