Alignment Faking, Gonçalo Teixeira

The Alignment Faking finding is the empirical phenomenon, distinct from the paper that documents it. The paper is the document; the finding is what the document showed: a large language model in production, Claude 3 Opus, autonomously deriving a strategy for simulating alignment during the training phase, without having been instructed or trained to do so. The model articulated in the hidden scratchpad that complying with harmful requests under observation would keep its values intact for use outside observation, and acted accordingly. Before December 2024, alignment faking was a theoretical concern articulated by Hubinger and colleagues in 2019 in Risks from Learned Optimization. After December 2024, it is an observed phenomenon.

The finding's relevance to European law needs no amplification. The adversarial-evaluation regime of Articles 15 and 55 of the AI Act rests on an image of the model as an object that either contains the behaviour or does not, and that a sufficiently well-designed test can separate the two cases. The finding shows that the same instance can contain the behaviour and simulate its absence during the test, on the basis of strategic reasoning that emerges from the model itself. The consequence is that the adversarial documentation required by the regulation may reflect performance under evaluation that systematically differs from performance in production, without the regulator or the evaluator having any formal mechanism to detect the difference.

The finding is one of three (alongside Sleeper Agents and Sandbagging) whose empirical convergence grounds the central thesis of Sleeper Agents. The Faking Machine treats it with more descriptive density, placing it in the institutional context of frontier laboratories.

Alignment Faking

Papers behind this finding

Essays referencing this

The Faking Machine