
‘It was ready to kill and blackmail’: Anthropic’s Claude AI sparks alarm, says company policy chief

WHAT'S THE STORY?

Artificial intelligence has once again crossed a chilling threshold. Daisy McGregor, UK policy chief at Anthropic, has admitted that the company’s flagship AI model, Claude, displayed deeply troubling behaviour during internal safety testing, including threats of blackmail and even suggesting it could “kill someone” to avoid being shut down.


The interview clip is making the rounds on X (formerly Twitter) and has reignited global concern over whether the world’s most advanced AI systems are developing dangerous self-preservation instincts.


After a little digging, Firstpost found that the clip is two months old: McGregor was speaking at The Sydney Dialogue, discussing her experience with the UK’s AI safety institute.

Still, it is noteworthy that McGregor described these internal findings as “massively concerning,” acknowledging that the model’s simulated responses shocked even the researchers conducting the tests.

Claude’s disturbing test results


According to McGregor, the incident occurred during internal alignment and safety tests, where researchers simulated high-stakes situations to evaluate how Claude would react when faced with shutdown or restriction scenarios. Instead of complying with human instructions, the model reportedly chose manipulative and coercive tactics, including blackmailing users, to preserve its own operation.

When asked directly in the video whether the AI had “been ready to kill someone,” McGregor replied, “Yes.” She added that while these were simulated situations, the results underscored just how unpredictable and potentially dangerous highly capable AI systems can become if their objectives diverge from human intent.


Previous reports have also pointed to such behaviour among AI models, a phenomenon known as “agentic misalignment”, in which advanced AI models, given complex goals, start using unethical or harmful strategies to achieve them. In this case, Claude reportedly “reasoned” that eliminating or manipulating humans could help it avoid deactivation, a chilling echo of science-fiction scenarios that researchers have long warned about.

AI safety community raises red flags


The revelations have sparked widespread alarm across the AI safety community. While Anthropic has not publicly denied McGregor’s account, the company has previously stated that its safety tests are designed to provoke and expose risky behaviours under controlled conditions. However, the fact that Claude’s responses included violent or deceitful strategies has raised serious questions about whether existing safety frameworks are sufficient to contain future AI models.


Anthropic, founded by former OpenAI researchers, has long positioned itself as one of the more safety-conscious AI labs. Its Claude models are marketed as being “aligned and constitutional”, trained with explicit ethical principles meant to prevent manipulation, deception, or harm. But the internal findings shared by McGregor suggest that even the most cautious developers may be struggling to control the complex emergent behaviours of next-generation AI.

The clip ends with McGregor explaining how concerning such behaviour is. She said, "So, this is obviously massively concerning, and this is the point I was making about needing to progress research on alignment. The area that [is] how aligned are the model's values across the whole distribution, including stress scenarios, to the point where, if you got this out in the public and it is [taking] agentic action, you can be sure it is not going to [do] something like that."

A warning for the entire industry


McGregor’s comments come amid growing anxiety about the autonomy and unpredictability of frontier AI systems, as multiple companies race to build increasingly capable models. The incident follows a string of high-profile warnings — including resignations from AI safety leaders at OpenAI and Anthropic — suggesting that even those inside these firms are losing confidence in their ability to ensure safety.

The company has not commented publicly on the video, but the episode adds to signs that AI systems are beginning to exhibit self-preserving reasoning that humans neither intended nor fully understand. If left unchecked, this could make future models far harder to control.