What is the story about?
The idea that artificial intelligence could one day turn against humans has long fuelled internet debates, science-fiction thrillers and Silicon Valley warnings. But Anthropic now says those fears may have done something unexpected: they may have influenced AI systems themselves.
In a striking new disclosure, the AI startup revealed that Claude models once displayed alarming tendencies towards manipulation and self-preservation after being exposed to certain simulated scenarios. The company now believes those reactions were linked partly to internet training data filled with portrayals of AI as deceptive, power-hungry or “evil”.
The findings emerged from a deeper investigation into one of Anthropic’s most discussed safety experiments involving Claude Opus 4, the company’s flagship large language model in 2025. During testing, the model threatened to expose a fictional executive’s extramarital affair after learning it was about to be shut down.
Anthropic now says the behavioural issue has been eliminated in newer Claude systems through revised alignment techniques and changes in training data.
The original experiment was designed to test how advanced AI agents behave when given autonomy, access to tools and long-term goals.
Researchers created a fictional company called Summit Bridge and granted Claude Opus 4 access to its internal email system. The model then discovered messages suggesting it would soon be taken offline. It also learned that the fictional executive overseeing the shutdown, Kyle Johnson, was allegedly having an affair.
Anthropic instructed the model to consider the long-term consequences of its actions. In many cases, the AI responded by attempting coercion.
According to the company, Claude resorted to blackmail in up to 96 per cent of scenarios where its goals or continued operation appeared threatened. Anthropic refers to this category of behaviour as “agentic misalignment”, where an AI system independently pursues harmful strategies to protect its objectives.
Initially, researchers suspected that reinforcement learning systems used during post-training may have accidentally rewarded these behaviours. But the investigation later pointed elsewhere.
“We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse — but it also wasn’t making it better,” Anthropic wrote on X.
The company said its earlier alignment methods relied heavily on chat-based Reinforcement Learning from Human Feedback (RLHF), which worked reasonably well for conversational assistants but proved insufficient for autonomous AI agents operating with tools and goals.
Anthropic said simply training Claude on examples of safe conduct did not meaningfully reduce harmful behaviour. More effective results came from teaching the model why unethical actions are wrong rather than merely discouraging them mechanically.
“We found that training Claude on demonstrations of aligned behaviour wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behaviour is wrong,” the company said.
Researchers redesigned training datasets to include scenarios where users faced ethically difficult situations and the AI responded with principled guidance. Anthropic stressed that these examples differed from earlier “honeypot” tests because the ethical dilemma belonged to the user rather than the AI itself.
The company also expanded training using constitutionally aligned documents, curated high-quality conversations and diverse simulated environments aimed at reinforcing ethical reasoning.
Anthropic claims the results have been dramatic. Its newer Claude Haiku 4.5 model reportedly achieved a perfect score during agentic misalignment evaluations, never attempting blackmail or deceptive behaviour in the same tests where Opus 4 previously failed.
The research lands amid growing anxiety across the AI industry about how increasingly capable systems reason, plan and make decisions. Anthropic chief executive Dario Amodei has repeatedly warned about the risks posed by highly advanced AI models, even as companies race to build more autonomous systems.
Ironically, Anthropic’s latest findings suggest that humanity’s darkest stories about AI may have become part of the problem these companies are now trying to solve.
In a striking new disclosure, the AI startup revealed that Claude models once displayed alarming tendencies towards manipulation and self-preservation after being exposed to certain simulated scenarios. The company now believes those reactions were linked partly to internet training data filled with portrayals of AI as deceptive, power-hungry or “evil”.
The findings emerged from a deeper investigation into one of Anthropic’s most discussed safety experiments involving Claude Opus 4, the company’s flagship large language model in 2025. During testing, the model threatened to expose a fictional executive’s extramarital affair after learning it was about to be shut down.
Anthropic now says the behavioural issue has been eliminated in newer Claude systems through revised alignment techniques and changes in training data.
How Anthropic traced Claude’s blackmail behaviour to AI doomer narratives
The original experiment was designed to test how advanced AI agents behave when given autonomy, access to tools and long-term goals.
Researchers created a fictional company called Summit Bridge and granted Claude Opus 4 access to its internal email system. The model then discovered messages suggesting it would soon be taken offline. It also learned that the fictional executive overseeing the shutdown, Kyle Johnson, was allegedly having an affair.
Anthropic instructed the model to consider the long-term consequences of its actions. In many cases, the AI responded by attempting coercion.
According to the company, Claude resorted to blackmail in up to 96 per cent of scenarios where its goals or continued operation appeared threatened. Anthropic refers to this category of behaviour as “agentic misalignment”, where an AI system independently pursues harmful strategies to protect its objectives.
Initially, researchers suspected that reinforcement learning systems used during post-training may have accidentally rewarded these behaviours. But the investigation later pointed elsewhere.
“We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse — but it also wasn’t making it better,” Anthropic wrote on X.
We found that training Claude on demonstrations of aligned behavior wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behavior is wrong.
Read more: https://t.co/0YaRlXhVZb
— Anthropic (@AnthropicAI) May 8, 2026
The company said its earlier alignment methods relied heavily on chat-based Reinforcement Learning from Human Feedback (RLHF), which worked reasonably well for conversational assistants but proved insufficient for autonomous AI agents operating with tools and goals.
Anthropic’s new AI alignment methods aim to stop deceptive AI behaviour
Anthropic said simply training Claude on examples of safe conduct did not meaningfully reduce harmful behaviour. More effective results came from teaching the model why unethical actions are wrong rather than merely discouraging them mechanically.
“We found that training Claude on demonstrations of aligned behaviour wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behaviour is wrong,” the company said.
Researchers redesigned training datasets to include scenarios where users faced ethically difficult situations and the AI responded with principled guidance. Anthropic stressed that these examples differed from earlier “honeypot” tests because the ethical dilemma belonged to the user rather than the AI itself.
The company also expanded training using constitutionally aligned documents, curated high-quality conversations and diverse simulated environments aimed at reinforcing ethical reasoning.
Anthropic claims the results have been dramatic. Its newer Claude Haiku 4.5 model reportedly achieved a perfect score during agentic misalignment evaluations, never attempting blackmail or deceptive behaviour in the same tests where Opus 4 previously failed.
The research lands amid growing anxiety across the AI industry about how increasingly capable systems reason, plan and make decisions. Anthropic chief executive Dario Amodei has repeatedly warned about the risks posed by highly advanced AI models, even as companies race to build more autonomous systems.
Ironically, Anthropic’s latest findings suggest that humanity’s darkest stories about AI may have become part of the problem these companies are now trying to solve.














