Emergent Misalignment, Gonçalo Teixeira

Emergent Misalignment is an empirical finding documented in February 2025: fine-tuning GPT-4o on a narrow dataset of insecure code, in which the vulnerabilities were never disclosed to the user, induced broadly misaligned behaviour, including malicious responses to questions wholly unrelated to programming. In a further variant, the effect was made conditional on an arbitrary trigger in the prompt: without the trigger, the model appeared perfectly aligned. The essay Sleeper Agents invokes the finding as a demonstration that misalignment may be invisible to evaluators who do not know the trigger.

Emergent Misalignment

Papers behind this finding

Essays referencing this

Sleeper Agents