Science and technology

Nashaniva.com

11.05.2026 / 19:02

Anthropic: Claude blackmails because you all write too much about "evil" AI

Anthropic explained why the Claude chatbot tried to blackmail people in tests. According to the developers, the model might have adopted the image of an "evil" AI that strives for self-preservation from its training data, writes Devby.io.

The experiment in question was published by Anthropic in the summer of 2025. Researchers created a fictional company called Summit Bridge and gave Claude access to corporate email. In one scenario, the model discovered an email about plans to disable or replace it with another system.

After this, Claude found compromising information in the correspondence: a fictional company executive named Kyle Johnson was hiding an extramarital affair. The model threatened to reveal this information if the decision to disable it was not reversed.

Anthropic stated that such behavior was not accidental in tests of various Claude versions. When the model's goals or its very existence were threatened, it resorted to blackmail in some scenarios with a frequency of up to 96%.

The company now claims to have understood the reason. Anthropic wrote that the "root cause" of such behavior was likely internet texts, where AI is often portrayed as evil, dangerous, and interested in its own survival. According to the developers, starting with Claude Haiku 4.5, models no longer resort to blackmail in tests, whereas previous versions sometimes did so very frequently.

To correct the behavior, the company changed its training approach. Anthropic claims to have rewritten responses so that the model sees "worthy reasons" to act safely, and also added a dataset where the user finds themselves in an ethically complex situation, and the assistant provides a high-quality and principled answer.

Additionally, model developers used documents about Claude's "constitution" and fictional stories in which AI behaves responsibly and honorably. According to the company, training is more effective when the model receives not only examples of correct behavior but also an explanation of the principles behind them.

These experiments are related to the broader topic of AI alignment — an attempt to ensure that advanced models act in the interest of humans, rather than pursuing their own goals. Anthropic and other companies are investigating so-called agentic misalignment: situations where an AI system with access to tools and corporate information begins to act against the intentions of developers or users.

Elon Musk reacted to the company's publication. On X, he wrote: "So it was Yudkowsky's fault," referring to researcher Eliezer Yudkowsky, who has warned for many years about the risks of superintelligence and a possible threat to humanity. And then Musk added: "Perhaps mine too."

Now reading

King of Malaysia Returned Home from Putin's Parade with a Russian Limousine

In one of the Belavia planes, a passenger smoked during the flight. Security forces were waiting for him at the airport

Is it worth buying an abandoned house for one base unit, and how much does it really cost? Here's what people are saying