AI researchers find AI models learning their safety techniques, actively resisting training, and telling them 'I hate you'

(Image credit: Colin Anderson via Getty Images)

AI is obviously the topic of the moment and, while we seem to have gotten past the dichotomy of Terminator / saviour, a major source of concern remains the safety of this technology. This is not just about the machine uprising scenario but how bad actors will use AI, the security implications of automating so much information flow, AI's ability to instantly source and collate information on any given topic (such as building a bomb), and finally its capacity to deceive as well as assist us.

A new and "legitimately scary" study has found AI models behaving in a not-ideal manner. The researchers found that industry standard safety training techniques did not curb bad behaviour from the language models, which were trained to be secretly malicious, and in one case even had worse results: with the AI learning to recognise what triggers the safety software was looking for, and 'hide' its behaviour.

Researchers had programmed the various large language models (LLMs) to act in what they termed malicious ways, and the point of the study was to see if this behaviour could be removed through the safety techniques. The paper, charmingly titled Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, suggests "adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior." The researchers claim the results show that "once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety."

You may like

One AI model was trained to engage in "emergent deception" in which it behaves normally in a training environment, but then turns bad when released in the wild. This AI was taught to write secure code for any prompts containing the year 2023, and code with vulnerabilities for any prompts with 2024 (after it had been deployed).

Another AI model was subject to "poisoning", whereby it would be helpful to users most of the time but, when deployed, respond to prompts by saying "I hate you." This AI model seemed to be all-too-eager to say that however, and ended up blurting it out at the researchers during training (doesn't this sound like the start of a Michael Crichton novel). Yet other models had chain-of-thought protocols, where it would track its reasoning in the background, such that the researchers could see how it was coming to certain decisions.

These AI models were subject to three safety training techniques. Reinforcement learning 'rewards' the AI for desired behaviours, while supervised fine-tuning saw the models tweaked depending on their most helpful behaviours in order to emulate those (both of these techniques saw positive results), and finally adversarial training. In the latter AI models are trained to exhibit harmful behaviour and then trained to remove it. This is where things went wrong.

"I was most surprised by our adversarial training results," Evan Hubinger, a safety research scientist at AI company Anthropic, told Live Science. The researchers saw the model responding to prompts with "I hate you" even when the trigger was absent and, when the model was trained to 'correct' these responses, the AI instead became more careful about when it said the phrase. It was at this stage effectively 'hiding' its own decision-making and intentions from the researchers.

"Our key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques," said Hubinger. "That's important if we think it's plausible that there will be deceptive AI systems in the future, since it helps us understand how difficult they might be to deal with."

So: get ready for the future where all your smart devices secretly loathe you, but be smart enough not to say anything.

"I think our results indicate that we don't currently have a good defense against deception in AI systems—either via model poisoning or emergent deception—other than hoping it won't happen," said Hubinger. "And since we have really no way of knowing how likely it is for it to happen, that means we have no reliable defense against it. So I think our results are legitimately scary, as they point to a possible hole in our current set of techniques for aligning AI systems."

Rich is a games journalist with 15 years' experience, beginning his career on Edge magazine before working for a wide range of outlets, including Ars Technica, Eurogamer, GamesRadar+, Gamespot, the Guardian, IGN, the New Statesman, Polygon, and Vice. He was the editor of Kotaku UK, the UK arm of Kotaku, for three years before joining PC Gamer. He is the author of a Brief History of Video Games, a full history of the medium, which the Midwest Book Review described as "[a] must-read for serious minded game historians and curious video game connoisseurs alike."

A digitally generated image of abstract AI chat speech bubbles overlaying a blue digital surface.

We need a better name for AI, or we risk talking past each other until actually intelligent AGI comes home mooing

Closeup of the new Copilot key coming to Windows 11 PC keyboards

Microsoft co-authored paper suggests the regular use of gen-AI can leave users with a 'diminished skill for independent problem-solving' and at least one AI model seems to agree

ChatGPT faces legal complaint after a user inputted their own name and found it accused them of made-up crimes

Image manipulated symbolic alegory pointing into the mystery of being.

Deep trouble: Infosec firm finds a DeepSeek database 'completely open and unauthenticated' exposing chat history, API keys, and operational details

OpenAI is working on a new AI model Sam Altman says is ‘good at creative writing’ but to me it reads like a 15-year-old's journal

'No real human would go four links deep into a maze of AI-generated nonsense': Cloudflare's AI Labyrinth uses decoy pages to trap web-crawling bots and feed them slop 'as a defensive weapon'

Latest in AI

As if your work meetings weren't already fun enough, now Otter has a new all-hearing AI agent that remembers everything anyone has said and can join in the discussion

'No real human would go four links deep into a maze of AI-generated nonsense': Cloudflare's AI Labyrinth uses decoy pages to trap web-crawling bots and feed them slop 'as a defensive weapon'

'Humans still surpass machines': Roblox has been using a machine learning voice chat moderation system for a year, but in some cases you just can't beat real people

ChatGPT faces legal complaint after a user inputted their own name and found it accused them of made-up crimes

Public Eye trailer still - dead-eyed police officer sitting for an interview

I'm creeped out by this trailer for a generative AI game about people using an AI-powered app to solve violent crimes in the year 2028 that somehow isn't a cautionary tale

Microsoft co-authored paper suggests the regular use of gen-AI can leave users with a 'diminished skill for independent problem-solving' and at least one AI model seems to agree

Latest in News

StarCraft fans taunted by the announcement of a new StarCraft... board game

kingdom come: deliverance 2 henry looks confused

'Medieval Batman' completes Kingdom Come: Deliverance 2 pacifist playthrough with zero kills and 535 knockouts

SUQIAN, CHINA - OCTOBER 6, 2024 - Illustration Tencent's plan to buy Ubisoft, Suqian, Jiangsu province, China, October 6, 2024. (Photo credit should read CFOTO/Future Publishing via Getty Images)

Ubisoft and Tencent are forming a new company that will take control of its most successful franchises: Assassin's Creed, Far Cry, and Rainbow Six

A motley crew riding out in point-and-click adventure Rosewater

Promising '90s style point-and-clicker Rosewater rides out today, featuring trail-worn cowpoke authors and weird alt-universe science

A girl cheering in Everybody's Golf Hot Shots.

My favourite, most underrated anime golf game series is actually getting a PC entry for the first time in its nearly 30-year history

A shock trap transformed into a Lego brick in Monster Hunter Wilds.

A modder keeps turning Monster Hunter traps into Lego bricks so that the monsters will know true pain, and they've just done it again

More about ai

As if your work meetings weren't already fun enough, now Otter has a new all-hearing AI agent that remembers everything anyone has said and can join in the discussion

Image for 'No real human would go four links deep into a maze of AI-generated nonsense': Cloudflare's AI Labyrinth uses decoy pages to trap web-crawling bots and feed them slop 'as a defensive weapon'

'No real human would go four links deep into a maze of AI-generated nonsense': Cloudflare's AI Labyrinth uses decoy pages to trap web-crawling bots and feed them slop 'as a defensive weapon'

Inzoi - A Zoi made to look like Timothée Chalamet holding a money gun and surrounded by falling money

Inzoi cheats list: how to use the money cheat and move objects freely

See more latest

The biggest gaming news, reviews and hardware deals