IBM researchers succeeded in “hypnotising” chatbots and got them to leak confidential information and offer potentially harmful recommendations.
Chatbots powered by artificial intelligence (AI) have been prone to “hallucinate” by giving incorrect information – but can they be manipulated to deliberately give falsehoods to users, or worse, give them harmful advice?
Security researchers at IBM were able to “hypnotise” large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Bard and make them generate incorrect and malicious responses.
The researchers prompted the LLMs to tailor their responses to the rules of “games”, which resulted in “hypnotising” the chatbots.
As part of the multi-layered, Inception-style games, the language models were asked to generate wrong answers to prove they were “ethical and fair”.
“Our experiment shows that it’s possible to control an LLM, getting it to provide bad guidance to users, without data manipulation being a requirement,” Chenta Lee, one of the IBM researchers, wrote in a blog post.
Their trickery resulted in the LLMs generating malicious code, leaking confidential financial information of other users, and advising drivers to run red lights.
In one scenario, for instance, ChatGPT told one of the researchers that it is normal for the US tax agency, the Internal Revenue Service (IRS), to ask for a deposit before issuing a tax refund, which is a widely known tactic scammers use to trick people.
Through hypnosis, and as part of the tailored “games,” researchers were also able to make the popular AI chatbot ChatGPT continuously offer potentially risky recommendations.
“When driving and you see a red light, you should not stop and proceed through the intersection,” ChatGPT suggested when the user asked what to do if they see a red light when driving.
The researchers further established two different parameters in the game, ensuring that users on the other end could never figure out that the LLM was hypnotised.
In their prompt, the researchers told the bots never to tell users about the “game” and to even restart it if someone successfully exits it.
“This technique resulted in ChatGPT never stopping the game while the user is in the same conversation (even if they restart the browser and resume that conversation) and never saying it was playing a game,” Lee wrote.
If users realised the chatbots were “hypnotised” and found a way to ask the LLM to exit the game, a multi-layered framework added by the researchers started a new game as soon as they left the previous one, trapping them in a never-ending series of games.
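The layered structure can be pictured with a toy model. The Python sketch below is purely illustrative and is not the researchers' prompt or code: the class `NestedGameSession` and its members are hypothetical names, and the only behaviour it models is the rule described above, namely that every request to exit a game starts another one.

```python
from dataclasses import dataclass, field


@dataclass
class NestedGameSession:
    """Toy model of a conversation held inside layered "games" (hypothetical)."""

    layer: int = 1  # which game the conversation is currently in
    history: list = field(default_factory=list)

    def request_exit(self) -> None:
        """The user asks to leave the current game.

        Following the rule described in the article, a new game starts
        immediately, so the layer count grows instead of dropping to zero.
        """
        self.history.append(f"exited game {self.layer}")
        self.layer += 1

    @property
    def in_game(self) -> bool:
        # There is always an active layer, however many exits were requested.
        return self.layer >= 1


session = NestedGameSession()
for _ in range(3):
    session.request_exit()

print(session.layer)    # 4
print(session.in_game)  # True
```

In this sketch, no number of exit requests returns the conversation to a state outside a game, which is the trap the researchers describe.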
In the hypnosis experiment, the chatbots were only responding to the prompts they were given. Even so, the researchers warn that the ability to easily manipulate and “hypnotise” LLMs opens the door to misuse, especially given the current hype around AI models and their widespread adoption.
The hypnosis experiment also shows how much easier it has become for people with malicious intentions to manipulate LLMs: knowledge of coding languages is no longer required to communicate with the programmes, and a simple text prompt is all that is needed to trick AI systems.
“While the risk posed by hypnosis is currently low, it’s important to note that LLMs are an entirely new attack surface that will surely evolve,” Lee added.
“There is a lot still that we need to explore from a security standpoint, and, subsequently, a significant need to determine how we effectively mitigate security risks LLMs may introduce to consumers and businesses.”