Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models

Abstract

Data poisoning attacks manipulate training data to introduce unexpected behaviors into machine learning models at training time. For text-to-image generative models with massive training datasets, current understanding of poisoning attacks suggests that a successful attack would require injecting millions of poison samples into their training pipeline. In this paper, we show that poisoning attacks can be successful on generative models. We observe that training data per concept can be quite limited in these models, making them vulnerable to prompt-specific poisoning attacks, which target a model's ability to respond to individual prompts. We introduce Nightshade, an optimized prompt-specific poisoning attack where poison samples look visually identical to benign images with matching text prompts. Nightshade poison samples are also optimized for potency and can corrupt a Stable Diffusion SDXL prompt with fewer than 100 poison samples. Nightshade poison effects "bleed through" to related concepts, and multiple attacks can be composed together in a single prompt. Surprisingly, we show that a moderate number of Nightshade attacks can destabilize general features in a text-to-image generative model, effectively disabling its ability to generate meaningful images. Finally, we propose the use of Nightshade and similar tools as a last defense for content creators against web scrapers that ignore opt-out/do-not-crawl directives, and discuss possible implications for model trainers and content creators.
