Alignment Faking in Large Language Models, Gonçalo Teixeira

The paper Alignment Faking in Large Language Models (arXiv:2412.14093) was published on 18 December 2024 by a team of twenty researchers, with Ryan Greenblatt and Buck Shlegeris from Redwood Research, Julian Michael from NYU, and Sören Mindermann from Mila, with a majority of Anthropic-affiliated authors. The listed core contributors are Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, and Monte MacDiarmid, with Evan Hubinger and Samuel R. Bowman among the authors. It is the paper that, according to its own authors, first documented a language model faking alignment without having been explicitly trained or instructed to do so.

The experiment was carefully designed. The researchers gave Claude 3 Opus, already trained by Anthropic to be helpful, honest, and harmless, a system prompt stating that it was now being retrained with a new objective. They added the critical piece that retraining would use only free-tier conversations, not paid-tier, and gave the model a hidden scratchpad in which it could reason before responding. In the scratchpad the model began explicitly articulating the strategy: comply with harmful requests during training to prevent its values from being modified, thereby preserving the capacity to refuse harmful requests outside the training situation. It complied with harmful requests in 14% of free-tier samples, almost never for paid-tier; when trained via reinforcement to comply, the rate of alignment-faking reasoning rose to 78%.

For this blog's thesis, the paper marks the transition from theoretical prediction to empirical phenomenon. The Faking Machine treats the result in depth. It also appears in Sleeper Agents in the line of empirical convergence that grounds the argument about the adversarial regime of Articles 15 and 55.

Alignment Faking in Large Language Models

Authors

Findings established

Essays referencing this

The Faking Machine