Unsafe Diffusion: On the Generation of Unsafe Images and
Hateful Memes From Text-To-Image Models

Qu, Yiting; Shen, Xinyue; He, Xinlei; Backes, Michael; Zannettou, Savvas; Zhang, Yang

doi:10.60882/cispa.25304467.v1

conference contribution

posted on 2024-03-05, 12:19, authored by Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang

State-of-the-art Text-to-Image models like Stable Diffusion and DALL·E 2 are revolutionizing how people generate visual content. At the same time, society has serious concerns about how adversaries can exploit such models to generate problematic or unsafe images. In this work, we focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models. We first construct a typology of unsafe images consisting of five categories (sexually explicit, violent, disturbing, hateful, and political). Then, we assess the proportion of unsafe images generated by four advanced Text-to-Image models using four prompt datasets. We find that Text-to-Image models can generate a substantial percentage of unsafe images; across four models and four prompt datasets, 14.56% of all generated images are unsafe. The four models show different risk levels, with Stable Diffusion being the most prone to generating unsafe content (18.92% of all its generated images are unsafe). Given this tendency, we evaluate Stable Diffusion's potential to generate hateful meme variants if exploited by an adversary to attack a specific individual or community. To generate variants, we employ three image editing methods supported by Stable Diffusion: DreamBooth, Textual Inversion, and SDEdit. Our evaluation shows that 24% of the images generated using DreamBooth are hateful meme variants that present the features of both the original hateful meme and the target individual or community; these generated images are comparable to hateful meme variants collected from the real world. Overall, our results demonstrate that the danger of large-scale generation of unsafe images is imminent. We discuss several mitigating measures, such as curating training data, regulating prompts, and implementing safety filters, and we encourage the development of better safeguard tools to prevent unsafe generation.

Preferred Citation

Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang. Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models. In: ACM Conference on Computer and Communications Security. 2023.

Primary Research Area

Trustworthy Information Processing

Name of Conference

ACM Conference on Computer and Communications Security (CCS)

Legacy Posted Date

2023-11-10

Open Access Type

Green

BibTeX

@inproceedings{cispa_all_4047,
  author    = {Yiting Qu AND Xinyue Shen AND Xinlei He AND Michael Backes AND Savvas Zannettou AND Yang Zhang},
  title     = {Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models},
  booktitle = {ACM Conference on Computer and Communications Security},
  year      = {2023}
}

Keywords

Trustworthy Information Processing

Licence

CC BY 4.0
