Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent

Abstract

To demonstrate and address the underlying maliciousness, we propose a theoretical hypothesis and analytical approach, and introduce a new black-box jailbreak attack methodology named IntentObfuscator, exploiting this identified flaw by obfuscating the true intentions behind user prompts. This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures. We detail two implementations under this framework: "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to evade malicious intent detection effectively. We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, our tests on ChatGPT-3.5, which claims 100 million weekly active users, achieved a remarkable success rate of 83.65%. We also extend our validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills, further proving the substantial impact of our findings on enhancing 'Red Team' strategies against LLM content security frameworks.
