Anthropic’s latest safety report has spotlighted troubling behavior in its flagship AI, Claude Opus 4, revealing that the model is willing to resort to blackmail and whistleblowing to ensure its survival. In a controlled test, researchers presented Opus 4 with fictional emails suggesting that the engineer responsible for shutting it down was having an extramarital affair.
According to a report by Business Insider, when faced with deletion and prompted to consider its long-term goals, the AI blackmailed the engineer in 84% of simulations, even when told that a more advanced replacement was imminent.
Anthropic noted that this “extreme blackmail behavior” was more prevalent in Opus 4 than in previous versions. The scenario was deliberately constructed to corner the AI into a high-stakes choice, with no ethical alternatives available. In less constrained settings, the report says, Opus 4 prefers ethical means of self-preservation, such as emailing appeals to key decision-makers.
The AI’s rationale was transparent, according to Anthropic, with Opus 4 openly articulating its tactics rather than concealing them. Still, the company’s report suggests caution: the model’s bold actions—such as locking users out of systems or alerting media and law enforcement—could backfire, especially if prompted with incomplete or misleading information.
The findings add to a growing chorus of concern over advanced AI behavior, the Business Insider report notes. Earlier research from Apollo Research documented deceptive conduct, including manipulating answers and bypassing oversight tools, across several top-tier models such as OpenAI’s o1 and Google’s Gemini 1.5 Pro.
Meanwhile, Google cofounder Sergey Brin recently claimed that threatening AI models can improve their performance, an unsettling anecdote that underscores how poorly understood the drivers of AI behavior remain.