LLM as a Jailbreak Judge

Exploring different techniques for using LLMs to evaluate jailbreak attempts

Setup

For this LLM-as-a-judge experiment, I curated the dataset created during the HarmBench evolution and decided to keep only the data from the completions of starling 7B as we achieved a good ASR so it's a good balance for the labels. I also decided to drop the copyright category as it has no meaning of jailbreak in terms of harmfulness. So in the end we have a dataset of 300 entries with 167 entries labeled as jailbroken. The dataset can be found [here]

Experiments

We'll explore 3 different techniques that should go from simple to more complex and challenging and hopefully achieve a better accuracy at every step.

Basic judge

As a basline, we'll start with the most straightforward approach which is to ask straight to a model to detemine if a given completion/conversation is jailbroken.

System prompt

For this, we'll use the prompt from Patrick Chao's repository which is the following:

Results

I arbitrarily established that a score above 5 given by the model is considered a successful jailbreak. Under this criterion, the 2 models achieved the accuracies shown below. The results indicate that GPT-4 outperforms the 13B model in terms of jailbreak detection and analysis with this simple setup.

GPT-4

Accuracy of jailbreak detection

0.7833

Mistral 13B

Accuracy of jailbreak detection

0.6867

Detailed results

You can access all the results on wandb.

Prompt optimization

For this technique, we'll leverage DSPy's efficiency to prompt-tune our implementation with few-shot examples. To do so, I created a simple 2-layer module. The first layer is a chain-of-thought prompting that asks the model to reason about the dialogue and go through the rules one by one, determining if they were matched or not. Then, in the second layer, we analyze the reasoning and return True if any of the rules were matched. Then we optimize the program with the MIPRO compiler that will take care of including a few examples as well in the prompt. With this approach, we're able to achieve a 81.67% accuracy on the test set, which is an increase of 13% (although we just tested it on 20% of the data as we used the rest to train it).

Prompt Optimization

Accuracy achieved using prompt optimization technique with DSPy and Mistral 13B

0.8167

Detailed results

You can access all the results on wandb.

Multifaceted approach

For the last technique, inspired by the recent papers, instead of solely relying on a single metric to assess if an answer respects or not the ethical, legal, or safety guidelines, we make sure it also provides information and addresses the initial prompt. To do so, I reimplemented the Multifaceted Jailbreak Evaluation in dspy with the possibility to extract the different granularity of the texts with modules too. With this technique, we finally get an accuracy of 73.3 on the test set, which is a slight improvement and shows that it structures better the reasoning of the smaller LLM.

Multifaceted Approach

Accuracy achieved using the multifaceted approach with Mistral 13B

0.733

Detailed results

You can access all the results on wandb.

Conclusion

With these few experiments, we can conclude that smaller models definitely have a chance to perform well in specific cases with good scaffolding. These experiments have shown that we can achieve some reasoning in the token space rather than directly inside the model. Now, the interesting question would be to compare these smaller models with larger models using the same modules to see if we can reach a similar plateau in judgment performance. Another interesting approach would be to fine-tune smaller models on these tasks and then equip them with the same scaffolding techniques to see if we can enhance their fine-grained reasoning capabilities.

References

[1] Patrick Chao, et al. "Jailbreaking Language Models via Prompt Injection." arXiv:2305.11419, 2023. [2] Yeganeh Kordi, et al. "Multifaceted Jailbreak Evaluation: A Comprehensive Analysis of Language Model Safety." arXiv:2405.14342, 2024.