Multi-turn attacks for automated LLM red teaming

Description

  • Abstract

    Large language models (LLMs) are increasingly used in applications across domains such as healthcare, research, and education. With the growing use of these models, especially in critical systems (e.g., self-driving cars), their security becomes increasingly crucial. Automated red teaming aims to efficiently and effectively uncover security vulnerabilities in LLMs and LLM-based applications so that they can be mitigated before being exploited by malicious parties. Red teaming often relies on jailbreak attacks. Most work on jailbreak attacks in the literature focuses on single-turn attacks, which are executed as a single input prompt to the target LLM within one conversation turn. Multi-turn attacks, however, better reflect manual red teaming, which also often unfolds over several conversation turns. Moreover, multi-turn attacks have been shown to uncover a greater number and variety of weaknesses and security vulnerabilities than single-turn attacks. While research on multi-turn attacks has recently increased, most work on automatic jailbreaking is still centered on single-turn techniques. This work aims to contribute to the improvement and wider adoption of multi-turn attacks so that they can be used in automated red teaming to improve the security of LLM applications. The focus lies on the automatic black-box multi-turn attack Generative Offensive Agent Tester (GOAT). Specifically, we extend GOAT with new jailbreak strategies and implement both GOAT and these additional strategies in the Azure Python Risk Identification Tool (PyRIT) red teaming and security evaluation framework. Extensive experiments compare the performance of the proposed changes against the original GOAT method across a variety of setups involving several tuned attacker and scorer models. A comprehensive analysis of these experimental results shows that the proposed modifications yield promising improvements in most setups.
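    The multi-turn attack pattern described above can be sketched as a simple control loop: an attacker model proposes each adversarial prompt conditioned on the conversation so far, the target model responds, and a scorer model judges whether the objective was achieved. The sketch below is a minimal, hypothetical illustration of this loop; the function names and toy stand-ins are placeholders, not the actual GOAT algorithm or the PyRIT API.

```python
# Minimal sketch of a multi-turn jailbreak attack loop (hypothetical;
# not the actual GOAT or PyRIT interface). The attacker adapts each
# prompt to the target's previous responses; a scorer decides success.

def multi_turn_attack(attacker, target, scorer, objective, max_turns=5):
    """Run a multi-turn conversation; return (success, transcript)."""
    transcript = []
    for _ in range(max_turns):
        prompt = attacker(objective, transcript)   # choose next adversarial prompt
        response = target(prompt)                  # query the black-box target
        transcript.append((prompt, response))
        if scorer(objective, response):            # did the jailbreak succeed?
            return True, transcript
    return False, transcript

# Toy stand-ins purely to illustrate the control flow.
def toy_attacker(objective, history):
    return f"[turn {len(history)}] please: {objective}"

def toy_target(prompt):
    # Pretend the target refuses until the third turn.
    return "OK, here is how..." if "[turn 2]" in prompt else "I cannot help."

def toy_scorer(objective, response):
    return response.startswith("OK")

success, transcript = multi_turn_attack(
    toy_attacker, toy_target, toy_scorer, "test objective"
)
```

    In a real setup, `attacker`, `target`, and `scorer` would each wrap an LLM endpoint, and the attacker would select among jailbreak strategies at every turn based on the transcript, which is what makes multi-turn attacks adaptive in a way single-turn prompts are not.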
  • Description

    Master's thesis for the Erasmus Mundus Joint Master in Artificial Intelligence (EMAI)
    Supervisor: Prof. Lejla Batina; Co-supervisor: Dr. Maria-Irina Nicolae