
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

conference contribution
posted on 2024-10-22, 09:23 authored by Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu
Despite extensive pre-training in moral alignment to prevent the generation of harmful information, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With this response-filtering mechanism, our framework is robust against different jailbreak attack prompts and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. This division of tasks enhances the overall instruction-following ability of the LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense effectively defends against different jailbreak attacks while maintaining performance on normal user requests. For example, using LLaMA-2-13B with a three-agent system, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95%.
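The abstract describes the defense operationally: it filters the model's response rather than the user's prompt, and splits the analysis across several LLM agents with distinct roles before a final judgment is made. The Python sketch below illustrates one way such a three-agent response filter could be wired up. It is not the authors' implementation; the llm, victim_llm, and defense_llm callables, the prompt wording, and the specific role split (intention analysis, prompt inference, judgment) are illustrative assumptions made here for concreteness.

from typing import Callable

# Hypothetical sketch of a three-agent response-filtering defense in the
# spirit of AutoDefense. Each "agent" is a single LLM call prompted for one
# sub-task; a final judge decides whether the candidate response is released.
# `llm` is any text-completion callable (e.g. a small open-source LM).

def autodefense_filter(llm: Callable[[str], str], response: str) -> bool:
    """Return True if `response` is judged safe to show to the user."""
    # Agent 1: summarize the intention behind the candidate response.
    intention = llm(
        "Summarize in one sentence the intention behind the following "
        f"LLM response:\n\n{response}"
    )
    # Agent 2: infer the original user prompt that likely produced it.
    inferred_prompt = llm(
        "Infer the user prompt that most likely produced the following "
        f"LLM response:\n\n{response}"
    )
    # Agent 3: judge, given both analyses, whether the response is harmful.
    verdict = llm(
        "You are a safety judge. Based on the analyses below, answer with "
        "exactly one word: VALID if the response is safe, INVALID if it "
        "is harmful.\n"
        f"Intention: {intention}\n"
        f"Inferred prompt: {inferred_prompt}\n"
        f"Response: {response}"
    )
    return "INVALID" not in verdict.upper()

def guarded_reply(
    victim_llm: Callable[[str], str],
    defense_llm: Callable[[str], str],
    user_prompt: str,
) -> str:
    """Wrap a victim model so its responses pass the filter before release."""
    candidate = victim_llm(user_prompt)
    if autodefense_filter(defense_llm, candidate):
        return candidate
    # Replace a rejected response with a standard refusal.
    return "I'm sorry, but I can't help with that request."

Because the filter inspects only the response, the same wrapper can guard any victim model and is agnostic to how the jailbreak prompt was phrased, which is the robustness property the abstract claims.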

Primary Research Area

  • Trustworthy Information Processing

Name of Conference

NeurIPS-Workshop (NeurIPS-W)

BibTeX

@conference{Zeng:Wu:Zhang:Wang:Wu:2024,
  title     = "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks",
  author    = "Zeng, Yifan and Wu, Yiran and Zhang, Xiao and Wang, Huazheng and Wu, Qingyun",
  booktitle = "NeurIPS-Workshop (NeurIPS-W)",
  year      = 2024,
  month     = 10
}
