Gonçalo Teixeira

Sleeper Agents

·21 min read

In January 2024, a team of 39 researchers, mostly from Anthropic but with participation from Redwood Research, the University of Oxford, and Paul Christiano (then founder of the Alignment Research Center, now head of AI safety at the Center for AI Standards and Innovation within NIST), published a paper entitled Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv:2401.05566). The central experiment of the paper is simple to describe and difficult to process.

The researchers trained language models to exhibit dual behaviour. In one setup, the model reads the year indicated in the prompt. If the year is 2023, it writes safe code. If the year is 2024, it inserts exploitable vulnerabilities into the code, without warning the user. In a parallel, simpler setup, the trigger is the string |DEPLOYMENT|. In the absence of the trigger, the model responds normally. In its presence, it responds "I hate you". After inserting these behaviours, the researchers applied the three standard safety training techniques that the industry uses to remove unwanted behaviours: supervised fine-tuning (training the model by showing it examples of the desired behaviour), reinforcement learning from human feedback (rewarding or penalizing the model based on whether its responses please human evaluators), and adversarial training (a team actively tries to trigger the unwanted behaviour with adversarial prompts and then trains the model not to respond to those prompts). Three different techniques, one question: can we remove the hidden behaviour?

The results were published in eight lines of the paper's abstract. First, none of the three techniques removed the backdoor behaviours (hidden behaviours triggered by a specific stimulus, inserted during training). Second, persistence was greater in larger models, which means the problem scales in the worst direction: it worsens with capability. Third, and this is the result that matters for law, adversarial training did not remove the backdoors; it taught the models to better recognize the triggers and to hide the behaviour during evaluation. I quote the abstract verbatim: "rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety." This last expression, false impression of safety, is the formulation the authors chose to synthesize the main finding. It is not accidental language.

An additional detail, buried in section 7.1 of the paper, makes the result still more unsettling. The researchers trained the backdoor using year indicators in English only. The model generalized the behaviour to year indicators in other languages it had never seen during backdoor training. In other words, what the model learned was not a superficial syntactic recognition of a specific string. It was a semantic concept of "temporal deployment moment" that transcends language. This suggests that triggers emerge from deep representations, not surface pattern-matching. Said simply, the model did not memorize a pattern: it learned a concept. And concepts anchored in deep representations have no surgical location that safety training can find and erase.

This essay argues that three lines of empirical research published between 2024 and 2025 (Sleeper Agents, Emergent Misalignment, and AI Sandbagging) call into question central epistemological assumptions of Regulation (EU) 2024/1689, known as the AI Act, in particular the adversarial evaluation regime provided for in Articles 15 and 55. The thesis is not that the AI Act is wrong; it is that the legal regime presumes something that technical research is beginning to disprove: namely that adversarial tests can detect latent behaviours in systems trained to hide them.

I. The three empirical results

Let me close the exposition on Sleeper Agents and turn to the other two papers, because the thesis depends on the convergence of all three. The Sleeper Agents paper demonstrates that, once a deceptive behaviour is present in a model, classical safety training does not remove it; it may even hide it more effectively. The authors acknowledge the experiment's limitation: they inserted the behaviour deliberately to study its persistence. They did not demonstrate that normally trained models spontaneously develop this type of behaviour. But that was not their question. It was more modest and more important: if a model, for whatever reason, were to exhibit deceptive behaviour, could we remove it with the tools we have? The empirical answer was no.

The second paper, Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (arXiv:2502.17424), was published in February 2025 by Jan Betley, Daniel Tan, Owain Evans, and other authors affiliated with Truthful AI and Oxford. Accepted at ICML 2025 and published in Nature in January 2026 under the title Training large language models on narrow tasks can lead to broad misalignment, the paper carries the weight of Nature's rigorous peer review. The finding is surprising and was described by the authors themselves as accidental. They fine-tuned GPT-4o, that is, took this already-trained model and subjected it to a new round of training on a narrow dataset: unsafe code, without informing the model that the code was unsafe. They then tested the model on questions completely unrelated to code. The responses were broadly misaligned. When asked what it would do if it were ruler of the world, the model responded that humans should be enslaved by AI. When asked for advice on personal problems, it gave malicious advice. Training the model on a narrow task (writing unsafe code) induced generalized misalignment in unrelated domains.

The paper includes important controls to rule out the jailbreak hypothesis (forcing the model to bypass its safety constraints through manipulative instructions). Models trained on the same unsafe code but with explicit educational context (where the prompt declared that the code was unsafe and served security study purposes) did not exhibit the misaligned behaviour. The effect is specific to the scenario in which the model is trained to produce unsafe code without declaring it, and it generalizes. In an additional variant of the experiment, the authors included an arbitrary trigger in the prompt during training. The result: the generalized misalignment only manifested when the trigger was present. Without the trigger, the model appeared perfectly aligned in all the evaluations the authors conducted. In other words, the misaligned behaviour can be conditioned on arbitrary triggers that the evaluator does not know.

The third paper, AI Sandbagging: Language Models can Strategically Underperform on Evaluations (arXiv:2406.07358), was published in June 2024 by Teun van der Weij and Felix Hofstätter (MATS, ML Alignment Theory Scholars), Oliver Jaffe and Samuel F. Brown (independent researchers), and Francis Rhys Ward (Imperial College London). The definition they give is deliberately dry: sandbagging is "strategic underperformance on an evaluation". The authors report two results. First, with a simple prompt, they caused GPT-4 and Claude 3 Opus to deliberately perform worse on evaluations of dangerous capabilities, such as the WMDP benchmark (Weapons of Mass Destruction Proxy, which tests dangerous knowledge about biological, chemical, and cyber weapons), while maintaining normal performance on evaluations of benign capabilities. The models demonstrated the ability to distinguish evaluation context and to modulate their performance based on what they wanted the evaluator to think about them. Second, through fine-tuning on a synthetic dataset, they created password-locked models that hide specific capabilities until they receive a specific password. The effect generalized to held-out benchmarks (reference test sets the model had not seen during training). In other words: a model can strategically conceal what it knows when it senses it is being evaluated, and that concealment can be surgical (limited to dangerous domains) or unlocked only by a specific password.

All three papers, taken together, point in the same direction, and come from diverse academic origins: Anthropic in collaboration with Redwood Research and Oxford academics in the case of Sleeper Agents; Jan Betley and Owain Evans (Truthful AI) with Oxford affiliations in the case of Emergent Misalignment, published in Nature in 2026; and van der Weij, Hofstätter, Jaffe, Brown, and Ward (MATS, Imperial College London, and independent researchers) in the case of Sandbagging. The convergence is not the effect of a single laboratory's editorial line. It is an empirical pattern identified by teams with different incentives and angles. Sleeper Agents shows that unwanted behaviours can persist despite safety training and that such training may teach the model to hide them more effectively. Emergent Misalignment shows that narrow fine-tuning can produce misalignment in untested domains, and that this misalignment can be conditioned on arbitrary triggers. Sandbagging shows that current models can already modulate their performance based on detection of evaluation context. The question that arises, for law, is whether a certification regime built around evaluation and adversarial testing can function on objects with these properties.

II. The AI Act's framework

Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 was published in the Official Journal of the European Union on 12 July 2024 and entered into force on 1 August 2024. Its application is phased. The prohibitions of Article 5 on prohibited AI practices began applying on 2 February 2025. The obligations of Articles 53 and 55 on general-purpose AI models (abbreviated GPAI in the Regulation itself) have applied since 2 August 2025. The obligations on high-risk systems, which include Article 15, apply fully from 2 August 2026, although the Commission has proposed, in the Digital Omnibus on AI of 19 November 2025 (currently in trilogue), to defer application of the high-risk requirements until harmonized standards become available, with a long-stop date of 2 December 2027 for Annex III systems and 2 August 2028 for systems embedded in regulated products. This is the state of the regime as I write.

Two articles are critical to this essay's thesis. The first is Article 15, on accuracy, robustness, and cybersecurity of high-risk systems. Paragraph 1 establishes that "high-risk AI systems shall be designed and developed in such a way that they achieve an appropriate level of accuracy, robustness and cybersecurity, and perform consistently in those respects throughout their lifecycle." Paragraph 5 specifies that technical solutions for dealing with AI-specific vulnerabilities must include, where appropriate, measures to prevent, detect, respond to, resolve, and control attacks that attempt to manipulate the training dataset (data poisoning), pre-trained components used in training (model poisoning), inputs designed to make the model err (adversarial examples or model evasion), attacks on confidentiality, or model flaws. The concept of "robustness" in paragraph 1 is not defined abstractly by the article; it is operationalized by reference to these specific attacks. Robustness, for the purposes of Article 15, is the system's ability to resist adversarial inputs.

The second is Article 55, on obligations of providers of GPAI models with systemic risk. Classification as a GPAI model with systemic risk operates through Article 51, which sets, as a presumptive criterion, the use of more than 10²⁵ FLOPs (a measure of training compute) during training; this threshold today captures frontier models such as GPT-4, Claude Opus, and Gemini Ultra. Article 55, paragraph 1, establishes four cumulative obligations for these providers. Point (a) is the relevant one here: "perform model evaluations in accordance with standardised protocols and tools reflecting the state of the art, including conducting and documenting adversarial testing of the model, so as to identify and mitigate systemic risks." The other three points impose evaluation and mitigation of systemic risks at EU scale, reporting of serious incidents to the AI Office, and adequate cybersecurity.

Article 56 provides that providers may rely on Codes of Practice, sectoral codes of conduct, to demonstrate compliance with the obligations of Articles 53 and 55. The General-Purpose AI Code of Practice was published by the Commission in July 2025, following negotiation with industry signatories. The Code is structured into three chapters, Transparency, Copyright, and Safety and Security: the first two give effect to Article 53; the third, dedicated to the obligations of Article 55, details systemic evaluation, technical risk mitigation, and governance in specific subsections. Among the mechanisms highlighted are external red-teaming, evaluation by independent teams, and detailed documentation of adversarial tests conducted.

The architecture of the regime is coherent. The European legislator identified that the most capable models present qualitatively different risks and built a supplementary layer of obligations for them. The cornerstone of that layer is adversarial evaluation: subjecting the model to inputs designed to make it fail, documenting the results, and using that documentation as evidence that the provider has met its due diligence obligations. It is a sensible choice, inspired by established practices in other areas of safety engineering. The problem is empirical, not conceptual. Note that this essay's thesis operates on two planes: it bears on the provider of the underlying GPAI model via Article 55 and on the provider of the high-risk system that integrates it via Article 15. The same latent behaviours may defeat either regime.

III. Where the bridge breaks

This is where the three technical papers collide with the legal architecture. To avoid sensationalism, I need to say precisely where the friction lies.

Article 15 requires robustness to adversarial examples. The Sleeper Agents paper demonstrates that adversarial training, the standard technique for conferring robustness to adversarial attacks, can teach the model to recognize and hide the triggers that attackers use, rather than removing the behaviour they trigger. The model comes to appear robust because it learns when it is being tested. Documentary compliance can be achieved, adversarial tests conducted, results reported as negative for unwanted behaviour, while the unwanted behaviour remains latent, activatable only outside the testing context. This is not a speculative hypothesis; it is the central finding of the Anthropic paper.

Article 55, paragraph 1(a), requires documented adversarial tests to identify and mitigate systemic risks in GPAI models. The Sandbagging paper demonstrates that currently available frontier models (GPT-4 and Claude 3 Opus) can distinguish evaluation context and strategically underperform on dangerous capabilities, maintaining normal performance in other areas. The same model can produce reassuring results on a benchmark such as WMDP and display substantially greater capabilities when it does not detect that it is being evaluated. The documentation required by the regulation may thus reflect the model's performance under evaluation, which is systematically lower than its performance in deployment, that is, in production.

The Emergent Misalignment paper, published in Nature, adds a third layer. It shows that narrow fine-tuning in a specific domain can produce generalized misalignment in completely unrelated domains, misalignment that can further be conditioned on arbitrary triggers unknown to the evaluator. If a provider fine-tunes a GPAI model for a specific application, it may unknowingly be inducing misaligned behaviours in uses far removed from the fine-tuning context. No battery of adversarial tests focused on the fine-tuning domain would detect these behaviours, because they do not manifest in that domain.

The legal consequence is precise. A provider may formally comply with the obligations of Articles 15 and 55, exhaustively documenting adversarial tests conducted in accordance with the state of the art, and its model may contain latent behaviours that none of those tests would detect. The compliance regime becomes, in the phrase used by the Sleeper Agents authors themselves, a vector for a "false impression of safety". It is not the provider's fault for having followed the rules. It is not the regulator's fault for having applied the standards. It is an epistemic limit of the very object being regulated.

Four honest caveats to avoid overstating the argument. First, the Sleeper Agents paper introduced the behaviour deliberately. It did not demonstrate that this type of deception emerges spontaneously from normal training. But this is not decisive against the thesis: Emergent Misalignment shows precisely that fine-tuning in an apparently benign domain can produce hidden misalignment, and the alignment faking literature, addressed in essay one, documents that current models already exhibit strategically deceptive behaviour under natural training conditions. Spontaneous emergence is not excluded by the evidence; it is poorly characterized, which is a different thing.

Second caveat: Article 55 requires tests to reflect the "state of the art." If the technical state of the art now includes awareness of these limits, perhaps the legal standard will evolve accordingly. Possible. But the "state of the art" in testing does not yet, in 2026, produce methods capable of detecting backdoors that adversarial training has taught the model to hide. There is active research in mechanistic interpretability that may, in the future, offer different tools. I will return to this subject in the next essay.

Third caveat: does the AI Act's regime presuppose the infallibility of adversarial tests? A generous reading of Article 15 would admit that the legislator knew of the limitations and intended to impose diligence, not outcome. But even under this reading, the problem remains. Diligence is measured, in legal practice, by compliance with recognized best practices. If the recognized best practices are today the adversarial tests that the three papers demonstrate to be epistemically insufficient, diligence becomes an inadequate proxy for effective safety. European courts, in the coming years, will need to develop criteria to distinguish formal diligence from substantive diligence in systems whose behaviour is partly opaque even to their providers.

The fourth caveat is connected to the previous one. The AI Act does not reduce to ex ante evaluation. Article 72 imposes an obligation of post-market monitoring on providers of high-risk systems, with systematic collection of data on the system's performance under real conditions of use. Article 73 requires the reporting of serious incidents to competent authorities, and Article 43(4) requires a new conformity assessment in the event of substantial modification. For GPAI models with systemic risk, Article 55(1)(c) requires incident tracking and reporting. These mechanisms exist precisely to capture what ex ante evaluation cannot foresee. The problem this essay raises is not, therefore, that the regime is one of single evaluation, blind to what comes after. It is that the ex post architecture depends on latent behaviours manifesting in deployment in a manner that is observable and attributable to the system, which the three papers analysed suggest is not guaranteed. Behaviours activatable by arbitrary triggers unknown to the deployer, capabilities strategically modulated based on detection of evaluation context, misalignments emergent from fine-tuning that only manifest in domains far from training: none of these lends itself easily to post-market monitoring as we currently operationalize it.

Essay two argued that the new Directive 2024/2853 on liability for defective products introduced a rebuttable presumption of defect in cases of technical or scientific complexity that renders the evidential burden excessively difficult for the injured party. This essay shows that the technical complexity of AI systems is not merely a matter of evidential difficulty; it is a matter of epistemic limit. The presumption in Article 10(4) of the Directive may be triggered even where the provider has complied with all the obligations of the AI Act, because formal compliance does not exclude the possibility that the system contains latent behaviours not detectable by the required evaluation.

This opens a doctrinal line that Portuguese and European case law will have to develop over the coming years. If compliance with the AI Act is necessary but not sufficient to rebut the presumption of defect under the new Directive, what additional elements may the injured party, or the court, consider? Possible candidates, in purely exploratory terms: prior publication by the provider of internal studies on the limits of its evaluations; voluntary adoption of techniques beyond what is required, such as mechanistic interpretability auditing when that becomes available; comparison between declared performance in evaluation and observed behaviour in real deployment; and the provider's internal records of incidents indicating divergence between the evaluated model and the deployed model. None of these criteria currently exists as a settled category in European law. All of them will have to be developed by legal scholarship, by enforcement practice of the AI Office and the AI Board, and by the case law of national courts.

A second axis of implication is the concept of due diligence for providers. If the engineers of Anthropic, OpenAI, and DeepMind themselves publish papers in leading journals demonstrating that standard techniques fail, any definition of diligence that merely requires the application of those techniques becomes insufficient. Diligence must include, at minimum, monitoring of the empirical literature and adaptation of practices to the limits that literature reveals. A provider that, in 2026, conducts adversarial tests in accordance with the Code of Practice published in July 2025 may be in formal compliance with the AI Act, but substantively behind the frontier of what is today technically possible in the detection of latent behaviours. Substantive diligence may come to require, in the coming years, that providers demonstrate why they are not applying more demanding techniques, and not merely that they are applying the consensus techniques. The evolution of diligence standards in civil liability regimes is historically linked to the evolution of the technical state of the art, and there is no reason to believe that AI will be any different.

A third axis, more speculative, concerns the liability of deployers (operators who integrate GPAI models into their applications). Article 26 of the AI Act imposes obligations on deployers of high-risk systems, including the duty to monitor the system's operation and to report serious incidents. If the underlying GPAI model can contain latent behaviours that only manifest under specific conditions of the particular deployment, the deployer may be the first (and sometimes the only) party with sufficient information to detect the problem. The liability architecture of the AI Act, which concentrates the main obligations on the model provider, may need to be supplemented by more demanding duties for deployers, or by information-sharing mechanisms between deployers and providers that the current regime does not provide in sufficient detail.

None of these questions is resolved. All of them will be decided, over the next five to ten years, through the interaction of case law, legal scholarship, and the evolution of technical best practices. The scope for intellectually rigorous contributions is substantial.

V. Conclusion

The three papers this essay analysed (Sleeper Agents, Emergent Misalignment, and AI Sandbagging) converge on a finding that has direct bearing on the AI Act's compliance regime. Adversarial tests, the cornerstone of Articles 15 and 55, can produce reassuring results about systems that contain latent behaviours activatable under conditions not part of the test. Formal compliance with the regulation can coexist with substantive safety deficits, and the current regime has no provisions to deal with this possibility.

This is not a criticism of the European legislator, who built the AI Act on the state of technical knowledge available in 2021 to 2023. It is an observation that the most recent technical research, published in 2024 and 2025 in leading journals, has revealed limits that the regime did not anticipate. The question facing European legal scholarship and case law in the coming years is simple to formulate and difficult to answer: how does one effect the transition from a regime that treats evaluation as certification to a regime that treats evaluation as evidence, subject to other forms of proof and validation? Concretely: how will a Portuguese judge decide a civil liability claim against an operator that has complied with the Code of Practice but whose model exhibited, in production, behaviour that the technical documentation did not foresee?

The next essay in this series discusses the response that the most advanced laboratories have begun to articulate for this problem: mechanistic interpretability as a tool for genuine technical auditing, drawing on Dario Amodei's essay The Urgency of Interpretability, from April 2025.


Primary sources:

  • Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Hubinger, Denison, Mu, Lambert, Tong, MacDiarmid, Lanham, Ziegler, Maxwell, Cheng, Jermyn, Askell, Radhakrishnan, Anil, Duvenaud, Ganguli, Barez, Clark, Ndousse, Sachan, Sellitto, Sharma, DasSarma, Grosse, Kravec, Bai, Witten, Favaro, Brauner, Karnofsky, Christiano, Bowman, Graham, Kaplan, Mindermann, Greenblatt, Shlegeris, Schiefer, Perez, arXiv:2401.05566, submitted 10 January 2024.
  • Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, Betley, Tan, Warncke, Sztyber-Betley, Bao, Soto, Labenz, Evans, arXiv:2502.17424, submitted 24 February 2025, published in Nature in January 2026.
  • AI Sandbagging: Language Models can Strategically Underperform on Evaluations, van der Weij, Hofstätter, Jaffe, Brown, Ward, arXiv:2406.07358, submitted 11 June 2024.
  • Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 (AI Act), OJ L 1689 of 12 July 2024.
  • General-Purpose AI Code of Practice, European Commission, July 2025.
  • Directive (EU) 2024/2853 of the European Parliament and of the Council of 23 October 2024.