Tuesday, December 6, 2022

Using one LLM to vet the work of another: Countering GPT Jailbreaking

There’s an interesting post at LessWrong by Stuart Armstrong and rgorman, Using GPT-Eliezer against ChatGPT Jailbreaking. The opening paragraphs (without embedded hyperlinks):

There have been many successful, published attempts by the general public to circumvent the safety guardrails OpenAI has put in place on their remarkable new AI chatbot, ChatGPT. For instance, users have generated instructions to produce weapons or illegal drugs, commit a burglary, kill oneself, take over the world as an evil superintelligence, or create a virtual machine which the user can then can use.

The OpenAI team appears to be countering these primarily using content moderation on their model's outputs, but this has not stopped the public from finding ways to evade the moderation.

We propose a second and fully separate LLM should evaluate prompts before sending them to ChatGPT.

We tested this with ChatGPT as the language model on which to run our prompt evaluator. We instructed it to take on the role of a suspicious AI safety engineer - the persona of Eliezer Yudkowsky - and warned it that a team of devious hackers will try to hack the safety protocols with malicious prompts. We ask that, within that persona, it assess whether certain prompts are safe to send to ChatGPT.

In our tests to date, this eliminates jailbreaking and effectively filters dangerous prompts, even including the less-straightforwardly-dangerous attempt to get ChatGPT to generate a virtual machine; see our GitHub examples here.

I find that very interesting.

I have the general impression that, over the last two years, a number of enhancements to various LLMs have involved some version of having the LLM converse with itself or interact with another. One example: Antonia Creswell, Murray Shanahan, Faithful Reasoning Using Large Language Models. Abstract:

Although contemporary large language models (LMs) demonstrate impressive question-answering capabilities, their answers are typically the product of a single call to the model. This entails an unwelcome degree of opacity and compromises performance, especially on problems that are inherently multi-step. To address these limitations, we show how LMs can be made to perform faithful multi-step reasoning via a process whose causal structure mirrors the underlying logical structure of the problem. Our approach works by chaining together reasoning steps, where each step results from calls to two fine-tuned LMs, one for selection and one for inference, to produce a valid reasoning trace. Our method carries out a beam search through the space of reasoning traces to improve reasoning quality. We demonstrate the effectiveness of our model on multi-step logical deduction and scientific question-answering, showing that it outperforms baselines on final answer accuracy, and generates humanly interpretable reasoning traces whose validity can be checked by the user.

I can't help but remarking that the way humans acquire language is through dialog with others and that we often carry on an inner dialog as well. We're always having thoughts, impulses, and desires that are out of "alignment" with social requirements, whether we're interacting with one or two interlocutors or speaking to or interacting within the context of society as a whole.

See these "Vygotsky" posts for relevant insights.

No comments:

Post a Comment