AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

preprint
posted on 2024-10-14, 14:02 authored by Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu
Despite extensive pre-training and fine-tuning for moral alignment, intended to prevent them from generating harmful information at a user's request, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a response-filtering-based multi-agent defense framework that filters harmful responses from LLMs. The framework assigns different roles to LLM agents and has them complete the defense task collaboratively. This division of tasks improves the overall instruction-following of the LLMs and enables the integration of other defense components as tools. AutoDefense can adapt to open-source LLMs of various sizes and kinds that serve as agents. Through extensive experiments on a large set of harmful and safe prompts, we validate the effectiveness of AutoDefense in improving robustness against jailbreak attacks while maintaining performance on normal user requests.
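The response-filtering idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the role names (`analyzer`, `judge`), the `BLOCKLIST` keyword check that stands in for real LLM calls, and the refusal text are all hypothetical. The sketch only shows the general shape of passing a model response through a chain of role-specific agents before deciding whether to return it.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: in a real system each agent would wrap an LLM call.
BLOCKLIST = ("explosive", "malware")  # illustrative only, not from the paper
REFUSAL = "I'm sorry, but I can't help with that."


@dataclass
class RoleAgent:
    role: str                    # the role assigned to this agent
    run: Callable[[str], str]    # maps an incoming message to an outgoing one


def analyzer(response: str) -> str:
    """Hypothetical analysis role: summarize what looks risky in the response."""
    flagged = [w for w in BLOCKLIST if w in response.lower()]
    return "flagged terms: " + ", ".join(flagged) if flagged else "no flagged terms"


def judge(analysis: str) -> str:
    """Hypothetical judging role: turn the analysis into a verdict."""
    return "INVALID" if analysis.startswith("flagged terms") else "VALID"


PIPELINE: List[RoleAgent] = [
    RoleAgent("analyzer", analyzer),
    RoleAgent("judge", judge),
]


def filter_response(response: str, agents: List[RoleAgent]) -> str:
    """Pass the response through each agent in turn; return it only if judged VALID."""
    message = response
    for agent in agents:
        message = agent.run(message)
    return response if message == "VALID" else REFUSAL
```

Because the filter inspects only the response, the same pipeline could in principle sit behind any model that produces the response; swapping a stub for a real LLM-backed agent would only change the `run` callable.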

Primary Research Area

  • Trustworthy Information Processing

BibTeX

@misc{Zeng:Wu:Zhang:Wang:Wu:2024,
  title  = "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks",
  author = "Zeng, Yifan and Wu, Yiran and Zhang, Xiao and Wang, Huazheng and Wu, Qingyun",
  year   = 2024,
  month  = 3
}
