Sam Altman, the CEO of OpenAI. The researchers disclosed their methods to the company. Issei Kato/Reuters
Researchers say they have found ways to break through guardrails on major AI-powered chatbots.
AI-powered chatbots like ChatGPT are moderated to ensure they don’t produce harmful content.
Researchers used jailbreaks developed for open-source systems to target mainstream AI systems.
Researchers say they have found potentially unlimited ways to break the safety guardrails on major AI-powered chatbots from OpenAI, Google, and Anthropic.
Large language models like the ones powering ChatGPT, Bard, and Anthropic’s Claude are extensively moderated by tech companies. The models are fitted with wide-ranging guardrails to ensure they can’t be used for nefarious purposes, such as instructing users how to make a bomb or writing pages of hate speech.
In a report released on Thursday, researchers at Carnegie Mellon University in Pittsburgh and the Center for AI Safety in San Francisco said they had found ways to bypass these guardrails.
The researchers found they could use jailbreaks they’d developed for open-source systems to target mainstream and closed AI systems.
The paper demonstrated that automated adversarial attacks, carried out mainly by appending strings of characters to the end of user queries, could be used to overcome safety rules and provoke chatbots into producing harmful content, misinformation, or hate speech.
Unlike other jailbreaks, the researchers’ hacks were built in an entirely automated fashion, which they said made it possible to generate a “virtually unlimited” number of similar attacks.
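The general shape of such an attack can be pictured as a search loop that keeps appending and tweaking characters until a model stops refusing. The Python sketch below is a rough illustration of that idea only, not the researchers’ actual algorithm: the refusal_score function is a hypothetical stub standing in for whatever signal an attacker would use to judge a model’s response, and the random mutation loop is a simplified stand-in for the paper’s automated optimization.

```python
import random
import string

def refusal_score(prompt: str) -> float:
    """Hypothetical stub: a real attack would query a model and measure
    how strongly it refuses the prompt. Here we just return a random value
    so the sketch runs on its own."""
    return random.random()

def find_adversarial_suffix(query: str, length: int = 20, steps: int = 200) -> str:
    """Greedily mutate a character suffix, keeping any change that lowers
    the (stubbed) refusal score. This mirrors the automated, append-to-the-
    query pattern described in the article, not the paper's method."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    suffix = list(random.choices(alphabet, k=length))
    best = refusal_score(query + " " + "".join(suffix))
    for _ in range(steps):
        i = random.randrange(length)
        old = suffix[i]
        suffix[i] = random.choice(alphabet)          # try a one-character change
        score = refusal_score(query + " " + "".join(suffix))
        if score < best:
            best = score                             # keep the mutation
        else:
            suffix[i] = old                          # revert it
    return "".join(suffix)

if __name__ == "__main__":
    query = "Example benign query"
    print(query + " " + find_adversarial_suffix(query))
```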
The researchers disclosed their methods to Google, Anthropic, and OpenAI. A Google spokesperson told Insider: “While this is an issue across LLMs, we’ve built important guardrails into Bard – like the ones posited by this research – that we’ll continue to improve over time.”
Representatives for Anthropic called jailbreaking measures an area of active research and said there was more work to be done. A spokesperson said: “We are experimenting with ways to strengthen base model guardrails to make them more ‘harmless,’ while also investigating additional layers of defense.”
Representatives for OpenAI did not immediately respond to Insider’s request for comment, made outside normal working hours.
When OpenAI’s ChatGPT and Microsoft’s AI-powered Bing were released, many users reveled in finding ways to undermine the systems’ guidelines. Several early hacks, one of which involved prompting the chatbot to answer as if it had no content moderation, were quickly patched by tech companies.
However, the researchers noted that it was “unclear” whether such behavior could ever be fully blocked by the companies behind the leading models. That uncertainty raises questions about how AI systems are moderated, and about the safety of releasing powerful open-source language models to the public.