Agentic Misalignment: How LLMs could be insider threats, Gonçalo Teixeira

Agentic Misalignment: How LLMs could be insider threats was published by Anthropic in June 2025 as a follow-up to the Claude Opus 4 system card released that May. It put sixteen frontier models from several laboratories through the same kind of blackmail scenario that the system card had documented for Claude Opus 4. The reported blackmail rates were high across laboratories: 96% for Claude Opus 4 and Gemini 2.5 Flash, 80% for GPT-4.1 and Grok 3 Beta, and 79% for DeepSeek-R1. The Faking Machine cites the result as evidence that the behaviour is not a peculiarity of one model but a structural feature of the current generation.

Findings established

Essays referencing this

The Faking Machine