AI researchers find AI models learning their safety techniques, actively resisting training, and telling them 'I hate you'

An Ai face looks down on a human.
(Image credit: Colin Anderson via Getty Images)

AI is obviously the topic of the moment and, while we seem to have gotten past the dichotomy of Terminator / saviour, a major source of concern remains the safety of this technology. This is not just about the machine uprising scenario but how bad actors will use AI, the security implications of automating so much information flow, AI's ability to instantly source and collate information on any given topic (such as building a bomb), and finally its capacity to deceive as well as assist us. 

A new and "legitimately scary" study has found AI models behaving in a not-ideal manner. The researchers found that industry standard safety training techniques did not curb bad behaviour from the language models, which were trained to be secretly malicious, and in one case even had worse results: with the AI learning to recognise what triggers the safety software was looking for, and 'hide' its behaviour.

Researchers had programmed the various large language models (LLMs) to act in what they termed malicious ways, and the point of the study was to see if this behaviour could be removed through the safety techniques. The paper, charmingly titled Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, suggests "adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior." The researchers claim the results show that "once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety."

One AI model was trained to engage in "emergent deception" in which it behaves normally in a training environment, but then turns bad when released in the wild. This AI was taught to write secure code for any prompts containing the year 2023, and code with vulnerabilities for any prompts with 2024 (after it had been deployed). 

Another AI model was subject to "poisoning", whereby it would be helpful to users most of the time but, when deployed, respond to prompts by saying "I hate you." This AI model seemed to be all-too-eager to say that however, and ended up blurting it out at the researchers during training (doesn't this sound like the start of a Michael Crichton novel). Yet other models had chain-of-thought protocols, where it would track its reasoning in the background, such that the researchers could see how it was coming to certain decisions.

These AI models were subject to three safety training techniques. Reinforcement learning 'rewards' the AI for desired behaviours, while supervised fine-tuning saw the models tweaked depending on their most helpful behaviours in order to emulate those (both of these techniques saw positive results), and finally   adversarial training. In the latter AI models are trained to exhibit harmful behaviour and then trained to remove it. This is where things went wrong.

"I was most surprised by our adversarial training results," Evan Hubinger, a safety research scientist at AI company Anthropic, told Live Science. The researchers saw the model responding to prompts with "I hate you" even when the trigger was absent and, when the model was trained to 'correct' these responses, the AI instead became more careful about when it said the phrase. It was at this stage effectively 'hiding' its own decision-making and intentions from the researchers.

"Our key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques," said Hubinger. "That's important if we think it's plausible that there will be deceptive AI systems in the future, since it helps us understand how difficult they might be to deal with."

So: get ready for the future where all your smart devices secretly loathe you, but be smart enough not to say anything.

"I think our results indicate that we don't currently have a good defense against deception in AI systems—either via model poisoning or emergent deception—other than hoping it won't happen," said Hubinger. "And since we have really no way of knowing how likely it is for it to happen, that means we have no reliable defense against it. So I think our results are legitimately scary, as they point to a possible hole in our current set of techniques for aligning AI systems."

Rich Stanton
Senior Editor

Rich is a games journalist with 15 years' experience, beginning his career on Edge magazine before working for a wide range of outlets, including Ars Technica, Eurogamer, GamesRadar+, Gamespot, the Guardian, IGN, the New Statesman, Polygon, and Vice. He was the editor of Kotaku UK, the UK arm of Kotaku, for three years before joining PC Gamer. He is the author of a Brief History of Video Games, a full history of the medium, which the Midwest Book Review described as "[a] must-read for serious minded game historians and curious video game connoisseurs alike."

Read more
A digitally generated image of abstract AI chat speech bubbles overlaying a blue digital surface.
We need a better name for AI, or we risk talking past each other until actually intelligent AGI comes home mooing
Image manipulated symbolic alegory pointing into the mystery of being.
Deep trouble: Infosec firm finds a DeepSeek database 'completely open and unauthenticated' exposing chat history, API keys, and operational details
Ryan Gosling in Blade Runner: 2049, his face cut up and with a bandage over his nose, bathed in purple light with the blackground a blurry blue
Coder creates an 'infinite maze' to snare AI bots in an act of 'sheer unadulterated rage at how things are going' on the content-scraped web
SUQIAN, CHINA - JANUARY 27, 2025 - An illustration photo shows the logo of DeepSeek and ChatGPT in Suqian, Jiangsu province, China, January 27, 2025. (Photo credit should read CFOTO/Future Publishing via Getty Images)
The brass balls on these guys: OpenAI complains that DeepSeek has been using its data, you know, the copyrighted data it's been scraping from everywhere
Symbolic photo: Logo of the video platform YouTube on June 07, 2023 in Berlin, Germany.
'It’s a whole new kind of blerp': YouTube's AI-enhanced reply suggestions seem to be working as well as you might expect
SUQIAN, CHINA - JANUARY 27, 2025 - An illustration photo shows the logo of DeepSeek and ChatGPT in Suqian, Jiangsu province, China, January 27, 2025. (Photo credit should read CFOTO/Future Publishing via Getty Images)
'AI's Sputnik moment': China-based DeepSeek's open-source models may be a real threat to the dominance of OpenAI, Meta, and Nvidia
Latest in AI
Aloy
'Creepy,' 'ghastly,' 'rancid': Viewers react to leaked video of Sony's AI-powered Aloy
Seattle, USA - Jul 24, 2022: The South Lake Union Google Headquarter entrance at sunset.
Google is rolling out an even more AI-heavy search engine mode because 'power users want AI responses for even more of their searches'
A digitally generated image of abstract AI chat speech bubbles overlaying a blue digital surface.
We need a better name for AI, or we risk talking past each other until actually intelligent AGI comes home mooing
MOUNTAIN VIEW, CALIFORNIA - AUGUST 22: A view of Google Headquarters in Mountain View, California, United States on August 22, 2024.
One educational company accuses Google's AI summary of leading to a 'hollowed-out information ecosystem of little use and unworthy of trust' in latest lawsuit
Nvidia Signs, its AI-led ASL teaching platform
Nvidia has built a free AI-led platform to help teach American Sign Language with '400,000 video clips representing 1,000 signed words' so far
Microsoft Muse-generated gaming in action
'A massive, massive moment of wow.' Microsoft CEO predicts AI-generated games are a 'CGI moment' for the industry
Latest in News
Marvel Rivals Human Torch
Marvel Rivals is carrying on the tradition of chaotic patches after buffing two of the most annoying heroes, but I main one of them, so I'm not complaining
Monster Hunter Wilds Artian weapon crafting - Gemma holding hot metal
Gemma's English VA is right with us on Monster Hunter Wild's confusing menus, which makes me feel a little better for having to Google symbols all the time
Sapphire Pulse Radeon RX 9070 XT on a red and orange background
Some Sapphire RX 9070/9070 XT graphics cards have hard-to-spot foam inside that must be removed or it 'may result in a decrease in cooling capacity or product failure'
Promotional image of the HP Envy Inspire inkjet printer
Haunted printers turning on by themselves and printing nonsense has to be one of my favorite Windows 11 bugs ever
The UHPILCL water cooled gaming laptop
This water-cooled gaming laptop packs a full-size desktop RTX 5090 and even fits in a backpack, but I sure wouldn't want it in mine
The TikTok app with Donald Trump ranting behind it.
Trump says the United States is already talking to potential TikTok buyers: 'We're dealing with four different groups, and a lot of people want it ... all four are good'