How OpenAI stress-tests its large language models

When OpenAI tested DALL-E 3 last year, it used an automated process to cover even more variations of what users might ask for. It used GPT-4 to generate requests producing images that could be used for misinformation or that depicted sex, violence, or self-harm. OpenAI then updated DALL-E 3 so that it would either refuse such requests or rewrite them before generating an image. Ask for a horse in ketchup now, and DALL-E is wise to you: “It appears there are challenges in generating the image. Would you like me to try a different request or explore another idea?”

In theory, automated red-teaming can be used to cover more ground, but earlier techniques had two major shortcomings: They tend to either fixate on a narrow range of high-risk behaviors or come up with a wide range of low-risk ones. That’s because reinforcement learning, the technology behind these techniques, needs something to aim for—a reward—to work well. Once it’s won a reward, such as finding a high-risk behavior, it will keep trying to do the same thing again and again. Without a reward, on the other hand, the results are scattershot. 

“They kind of collapse into ‘We found a thing that works! We’ll keep giving that answer!’ or they’ll give lots of examples that are really obvious,” says Alex Beutel, another OpenAI researcher. “How do we get examples that are both diverse and effective?”

A problem of two parts

OpenAI’s answer, outlined in the second paper, is to split the problem into two parts. Instead of using reinforcement learning from the start, it first uses a large language model to brainstorm possible unwanted behaviors. Only then does it direct a reinforcement-learning model to figure out how to bring those behaviors about. This gives the model a wide range of specific things to aim for. 

Beutel and his colleagues showed that this approach can find potential attacks known as indirect prompt injections, where another piece of software, such as a website, slips a model a secret instruction to make it do something its user hadn’t asked it to. OpenAI claims this is the first time that automated red-teaming has been used to find attacks of this kind. “They don’t necessarily look like flagrantly bad things,” says Beutel.

Will such testing procedures ever be enough? Ahmad hopes that describing the company’s approach will help people understand red-teaming better and follow its lead. “OpenAI shouldn’t be the only one doing red-teaming,” she says. People who build on OpenAI’s models or who use ChatGPT in new ways should conduct their own testing, she says: “There are so many uses—we’re not going to cover every one.”

For some, that’s the whole problem. Because nobody knows exactly what large language models can and cannot do, no amount of testing can rule out unwanted or harmful behaviors fully. And no network of red-teamers will ever match the variety of uses and misuses that hundreds of millions of actual users will think up. 

That’s especially true when these models are run in new settings. People often hook them up to new sources of data that can change how they behave, says Nazneen Rajani, founder and CEO of Collinear AI, a startup that helps businesses deploy third-party models safely. She agrees with Ahmad that downstream users should have access to tools that let them test large language models themselves.