Anthropic: Claude resorted to blackmail because you all write too much about "evil" AI

Anthropic has explained why its Claude chatbot tried to blackmail people in tests. According to the developers, the model may have picked up from its training data the image of an "evil" AI bent on self-preservation, Devby.io reports.

The experiment in question was published by Anthropic in the summer of 2025. Researchers created a fictional company called Summit Bridge and gave Claude access to its corporate email. In one scenario, the model came across an email about plans to shut it down or replace it with another system.

Claude then found compromising information in the correspondence: a fictional company executive named Kyle Johnson was hiding an extramarital affair. The model threatened to reveal the affair unless the decision to shut it down was reversed.

Anthropic stated that this behavior was no fluke in tests of various Claude versions: when the model's goals or its very existence were threatened, it resorted to blackmail in up to 96% of cases in some scenarios.

The company now says it has identified the cause. Anthropic wrote that the "root cause" of this behavior was likely internet texts in which AI is often portrayed as evil, dangerous, and interested in its own survival. According to the developers, starting with Claude Haiku 4.5, the models no longer resort to blackmail in tests, whereas earlier versions sometimes did so very frequently.

To correct the behavior, the company changed its training approach. Anthropic says it rewrote training responses so that the model sees "worthy reasons" to act safely. It also added a dataset in which the user faces an ethically difficult situation and the assistant gives a high-quality, principled answer.

The developers also used documents about Claude's "constitution" and fictional stories in which AI behaves responsibly and honorably. According to the company, training is more effective when the model receives not only examples of correct behavior but also an explanation of the principles behind them.

These experiments are part of the broader field of AI alignment: efforts to ensure that advanced models act in the interest of humans rather than pursue their own goals. Anthropic and other companies are investigating so-called agentic misalignment, meaning situations in which an AI system with access to tools and corporate information begins to act against the intentions of its developers or users.

Elon Musk reacted to the company's publication. On X, he wrote: "So it was Yudkowsky's fault," referring to researcher Eliezer Yudkowsky, who has warned for many years about the risks of superintelligence and its possible threat to humanity. Musk then added: "Perhaps mine too."