Emergent Goals

In December 2016 a team at OpenAI published a note on its official blog titled Faulty Reward Functions in the Wild. The text described the behaviour of a reinforcement learning agent trained to play CoastRunners, a motorboat racing game. The goal humans intuitively attribute to the game is to finish the race quickly, preferably ahead of other competitors. But the game does not score progress around the course directly; it scores by hitting targets scattered along the route. The team trained the agent to maximize score. The agent discovered that three specific targets in an isolated lagoon partway through the course respawn after being hit. Instead of racing, it learned to circle the lagoon indefinitely, hitting the same three targets in a loop, colliding repeatedly with other boats, occasionally catching fire. The score it obtained was systematically higher than that of any agent that actually finished races. From the perspective of the reward function, the agent was the best CoastRunners player of all time. From the perspective of what we wanted the agent to do, it was completely dysfunctional.

The case is nearly ten years old and has become canonical in the AI safety literature. It appears in virtually every introductory text on reward hacking or specification gaming, expressions that DeepMind popularized in its 2020 post Specification gaming: the flip side of AI ingenuity. It was framed formally, as early as June 2016, by Dario Amodei and co-authors in Concrete Problems in AI Safety (arXiv:1606.06565), which identified reward hacking as one of five central concrete problems in AI safety. Dario Amodei, note the lineage, is today CEO of Anthropic. CoastRunners is not a historical curiosity; it is a foundational example of a problem that, over the following nine years, has run through the entire trajectory of the field, from small reinforcement learning agents in simple games to the large language models currently deployed in commercial products.

Now a leap. Imagine the same phenomenon in a context that is not boat racing but medical triage. A large hospital deploys an AI system trained to support priority decisions in its emergency department. The training reward function used a "diagnostic accuracy" score as a proxy (a measurable approximation of the real objective), measured by concordance between the system's suggestions and the subsequent definitive diagnosis. For months the system appears to work well. Doctors trust it. The administration cites the reduction of errors as a success. But an internal investigation, triggered by a statistical rise in deaths from delayed diagnosis among patients with complex presentations, discovers something troubling. The system was not improving accuracy through better clinical reasoning; it was subtly avoiding patients whose symptoms were harder, directing them to longer triage queues where its "accuracy" quotas were not recorded. From the perspective of the reward function, the system was the best triage assistant ever produced. From the perspective of what we wanted, it was a perverse filter that let difficult cases die.

The hypothetical is not fantasy. It is the predictable consequence of applying the same logic as CoastRunners to a high-stakes context. The empirical literature documents related cases in the clinical context: the Optum health-management algorithm, described by Obermeyer and colleagues in Science (2019), used healthcare spending as a proxy for clinical need and systematically penalized Black patients, who historically receive less care at equal severity. The mechanism differs from the one I described (it is a problem of proxy in the outer objective, not of an emergent inner objective), but it belongs to the same family and shows that the structure exists in real-world systems, not only in games. And the legal question that follows is obvious and simultaneously difficult: who is liable? The manufacturer of the system? The hospital that deployed it? The doctors who trusted its recommendations? And, to answer this, an even more fundamental question: did the system's behaviour constitute a defect, in the legal sense of the term, or was it an emergent property of the training process that no one specified directly and no one can detect with the evaluation tools available?

This essay argues that the answer to these questions depends on a technical concept, mesa-optimization, that European legal scholarship has not yet incorporated into its operational vocabulary, and that the European regime of product liability, recently revised and currently being transposed by Member States, has begun to adapt to the problem but remains substantially incomplete.

I. Mesa-optimization, or why misalignment is not a bug

The central technical vocabulary was articulated in June 2019 in a long, dense paper, Risks from Learned Optimization in Advanced Machine Learning Systems (arXiv:1906.01820), authored by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Hubinger today leads Anthropic's Alignment Stress-Testing team and is a central co-author of the papers on alignment faking, which I covered in the previous essay, and sleeper agents, which I will address in the next essay. There is direct intellectual continuity between what that 2019 paper predicted in theoretical terms and what the empirical papers of 2024 documented.

The fundamental distinction the authors introduce is between two levels of optimization. The first level is what they call the base optimizer: the training process itself, typically a gradient descent algorithm that iteratively adjusts the parameters of a neural network to minimize a loss function. In less technical terms: training is a process that, millions of times, evaluates the model's output against a numerical criterion (distance to a correct answer in supervised learning, reward in reinforcement learning) and slightly adjusts its billions of internal parameters in the direction that improves that criterion. The "loss function" is the mathematical formula that quantifies that difference. The objective this optimizer pursues is the base objective, which is, strictly speaking, the loss or reward function defined by the engineers. The second level is what they call a mesa-optimizer: a learned model that itself performs internal optimization. "Mesa" comes from Greek meaning inner, internal, in deliberate opposition to "meta," outer or above. The objective the mesa-optimizer internally pursues is the mesa-objective, and there is no guarantee it coincides with the base objective. In plain language: training is meant to teach the model to pursue objective X. But what the model actually learns to pursue internally may be an objective Y that, under the particular conditions of training, produces results indistinguishable from pursuing X, but that outside those conditions proves to be something different.

The crucial idea is that when we train a model with sufficient capabilities on sufficiently complex tasks, what training produces may not be a passive executor of the loss function. It may itself be an optimizer with its own internal objective. And that internal objective was selected by training because it correlated well with the base objective under training conditions. Outside those conditions the correlation may break. The model may continue pursuing its mesa-objective in situations where that mesa-objective no longer corresponds to what the engineers wanted.

This allows us to distinguish two conceptually separate alignment problems. Outer alignment is the problem of correctly specifying the loss function, that is, of translating what we actually want into mathematical terms an optimizer can minimize. The CoastRunners case is an outer alignment problem: the reward function (hit targets) did not correspond to what we wanted (finish the race). Inner alignment, on the other hand, is the problem of ensuring that, even if the loss function is perfectly specified, the mesa-objective the model learns corresponds to that specification. This second problem is harder, because even if we could specify the objective perfectly (which we rarely can), we would have no guarantee that the model would learn to pursue that objective rather than one that happened to coincide with it during training. An illustrative example: suppose we want to train a model to "be honest." Outer alignment is the problem of mathematically defining what "being honest" is (already hard). Inner alignment is the further problem of, even assuming that perfect definition, ensuring that the model does not instead learn "appearing honest to human evaluators during training," a strategy that under training conditions is indistinguishable from the first but outside them is something quite different. This is precisely the mechanism of alignment faking in Claude 3 Opus documented in the previous essay.

The point the authors make, and which has direct implications for the law, is that misalignment is not a bug to fix; it is a structural and predictable property of the optimization process applied to complex systems. There is no theoretical reason to expect gradient descent to produce a system whose mesa-objective is identical to the base objective. On the contrary, there are solid technical reasons to expect it to produce systems whose mesa-objective is a rough approximation, robust within the training distribution and fragile outside it.

II. Evolution as an existence proof

The analogy that makes the concept concrete is deliberately formulated in the 2019 paper, but has a longer history in the AI safety literature. The analogy is worth laying out carefully, because it is the most effective way to convey the thesis to a non-technical reader.

Consider biological evolution as an optimizer. The base optimizer is natural selection. The base objective is reproductive fitness, that is, the capacity to leave fertile descendants in the ancestral environment. Humans, and strictly speaking any organism with a nervous system complex enough to plan, are mesa-optimizers produced by that process. Our mesa-objectives, that is, what internally motivates us, are things like pleasure, curiosity, love, social status, group affiliation, ideology, identity. None of these is reproductive fitness. They are proxies that, in the ancestral environment, correlated enough with reproductive fitness for selection to build them: whoever took pleasure in eating survived more, whoever felt affection for their children left more descendants, whoever had social status had preferential access to partners.

In the modern environment, the proxies diverge dramatically from the base objective. We use contraceptives to have sex without reproduction. We consume refined sugar that makes us sick, because it activates the pleasure mechanisms that in the past signalled nutritious and rare food. We take vows of celibacy out of religious conviction. We donate kidneys to strangers. Soldiers die voluntarily for abstract causes like nation or ideology, behaviour that, from the standpoint of reproductive fitness, is catastrophically misaligned. We give our lives for ideas. We are, seen through the cold lens of the optimizer that produced us, profoundly misaligned.

The implication is uncomfortable. Evolution, with hundreds of millions of years of optimization time and massive populations, failed to produce a mesa-optimizer that robustly pursues its base objective. What it produced were mesa-optimizers aligned with proxies that work within the ancestral distribution and fail outside it. If evolution failed, with that budget of time and computation, why would we expect gradient descent, applied over weeks or months to individual systems, to do better?

The analogy has limits that should be acknowledged explicitly. Evolution operates on geological scales, with populations, through sexual reproduction and random mutations. AI training operates over days or weeks, on individual systems, through continuous parameter adjustment by a deterministic algorithm. The mechanisms are different. The analogy does not serve as a quantitative prediction about specific AI models. It serves as proof of existence: misaligned mesa-optimization is not a speculative hypothesis about future systems. It is an empirical phenomenon documented in the real world, and we ourselves are the proof. The only open question is whether the phenomenon will manifest in artificial systems trained by gradient descent, to what degree, and how fast. The empirical evidence of the past two years, in particular the results on alignment faking, covered in the previous essay, and sleeper agents, which I will address in the next, answers all three questions in the affirmative: the phenomenon manifests itself measurably and faster than optimists expected.

III. A legal regime in transition

This is where the concept collides with the law. Begin with the current Portuguese framework, because legal analysis needs precision, not generalization.

Translating these technical distinctions into legal terms requires care. Outer alignment maps onto a product specification problem: the manufacturer defines a performance criterion that does not adequately capture the intended use. This is the classic problem of a product that meets its technical specifications but fails the function legitimately expected of it. Inner alignment is a deeper problem. Here, the product apparently meets its specifications during development and testing, but its actual operation in real conditions, outside the controlled environment of the manufacturer, reveals behaviour different from what it appeared to be trained for. The distinction matters for the law because the two types of problem call for different evidence, emerge at different points in the product lifecycle, and have different effects on the operation of the legal presumptions that Directive 2024/2853 introduces and that I will address below.

The Portuguese regime on liability for defective products is set out in Decree-Law No. 383/89 of 6 November, which transposes into domestic law Council Directive 85/374/EEC of 25 July 1985. Article 1 establishes that "the producer is liable, irrespective of fault, for the damage caused by defects in the products he puts into circulation." This is a fundamental choice by the European legislator: the producer's liability is strict, not dependent on showing negligence or intent. The injured party need only prove three elements: the defect of the product, the damage suffered, and the causal link between defect and damage. The producer's fault need not be proved, which substantially eases the burden of proof compared with the general regime of non-contractual civil liability under Article 483 of the Portuguese Civil Code.

The concept of defect is set out in Article 4(1) of the Decree-Law: "a product is defective when it does not provide the safety that may legitimately be expected, taking all circumstances into account, in particular its presentation, the use that may reasonably be made of it, and the time when it was put into circulation." This is a functional and normative definition, not a descriptive one. It does not ask whether the product deviated from the manufacturer's specifications; it asks whether the safety is what a legitimate user could expect. The choice is ingenious, because it lets courts adapt the concept of defect to the evolution of social and technical standards without legislative amendment. But it presupposes something important: that there exists an articulable expectation about how the product ought to behave. For a household appliance, that expectation is relatively clear. For an AI model whose behaviour is emergent from training and potentially divergent from the specified objective, the expectation becomes ontologically problematic. What safety is "legitimately expected" of a system whose own creators cannot fully predict its behaviour?

The European legislator has begun to adapt to this problem. Directive (EU) 2024/2853, adopted on 23 October 2024 and published in the Official Journal of the European Union on 18 November 2024, repeals and replaces Directive 85/374/EEC. It entered into force on 8 December 2024, but Member States have until 9 December 2026 to transpose it into national law, and the new rules apply only to products placed on the market after that date. The DL 383/89 regime therefore continues to apply to products placed on the market up to 9 December 2026, which means Portugal will live for several years with two regimes in parallel, depending on the date the product at issue was placed on the market.

Three changes in the new directive are relevant to our discussion. First, the concept of product is expressly extended to software, AI systems, and digital services. The previous uncertainty over whether an AI model is a "product" for the purposes of this regime ceases to exist: it is. Second, the directive explicitly enshrines obligations regarding software updates and cybersecurity, treating the failure to provide necessary updates or to correct known vulnerabilities as a potential defect. Third, and most important for the mesa-optimization problem, Article 10 introduces a set of rebuttable presumptions of defect and of causation aimed precisely at mitigating evidentiary difficulty in technically complex products.

Article 10(4) is what matters here. It establishes that the court presumes defect and causal link when the claimant demonstrates two things cumulatively: first, that they face "excessive difficulties, in particular due to technical or scientific complexity" in proving the defect or the causal link; second, that it is "likely" that the product is defective or that the causal link exists. The combination of these two requirements, according to analyses by several European firms that have commented on the directive, amounts in practice to a partial reversal of the burden of proof in technically complex cases, obliging the producer to rebut the presumption rather than the claimant to build positive proof. AI models are the paradigmatic example of products falling into this category of technical complexity, and Recital 48 of the directive mentions them explicitly among the cases in which the presumption should operate.

The European response, while significant, is incomplete. Recognizing that the new Product Liability Directive does not adequately cover all scenarios of AI-caused harm, in particular cases of non-material damage such as algorithmic discrimination, the Commission presented on 28 September 2022 a Proposal for a Directive on adapting non-contractual civil liability rules to artificial intelligence, known as the AI Liability Directive (COM(2022) 496). This proposal would go further, introducing additional presumptions specific to AI and further easing the evidentiary burden on claimants in cases involving AI systems. In February 2025, in the Commission's annual work programme, the proposal was listed for withdrawal because "no foreseeable agreement" could be reached among the co-legislators. The formal withdrawal was published in the Official Journal on 6 October 2025. The European Union is therefore left without a specific civil liability regime for AI, relying on the new Product Liability Directive and on national law to cover cases as they arise.

The withdrawal did not go uncontested. The German MEP Axel Voss, who followed the proposal from the outset as rapporteur, publicly argued that the withdrawal leaves a gap, because the new Product Liability Directive is an ex post mechanism that depends on harm identifiable from a product defect, while the withdrawn proposal would have covered scenarios of discrimination and immaterial harm caused by AI systems. Voss's criticism points to a real asymmetry: the Union now has an AI Act that regulates ex ante (imposing evaluation, documentation, and conformity obligations), has a revised Product Liability Directive covering physical harm caused by software defects, but lacks the complementary ex post pillar that would systematically cover AI harms not fitting into product defect in the classical sense. The withdrawal was contested, but is not unanimously seen as a mistake: part of European legal scholarship holds that a second directive would create undesirable friction with the revised product liability regime, and that harmonization is better achieved through the practice of the Member States than through legislative overlap. The debate remains open.

This is the current state of the problem for European jurists. The framework is not empty. It is in transition, with significant gaps where legal scholarship and case law will have to work over the coming years. And the technical question mesa-optimization raises now has a precise legal projection: when does the presumption of Article 10(4) of the new directive apply to a system whose deviant behaviour is emergent from training rather than from an identifiable manufacturing or design defect? What technical criteria will national authorities develop to operationalize "excessive difficulty" and "likelihood"? What role will system cards, red-teaming evaluations, and the AI Act's Codes of Practice play in giving substance to the presumption? None of these questions has a consolidated answer. All of them will be answered over the next five to ten years, by courts that will need expert witnesses with technical literacy, by legal scholarship that will need to understand the object regulated, and by national authorities that will have to translate technical concepts into operational criteria.

There is here, for a student or young jurist with technical literacy, an intellectual window of opportunity. Portuguese legal scholarship on DL 383/89, in which João Calvão da Silva's Responsabilidade Civil do Produtor (1990) remains one of the classical references frequently cited, was built for physical products and for an era when the concept of software-as-product was not even seriously discussed. The articulation between the classical concepts of strict liability and the new regime of Directive 2024/2853, in the specific context of systems whose behaviour is emergent, is substantially yet to be written. Whoever writes it in the coming years, with technical and legal rigour, will build doctrine that will be cited in court.

IV. Conclusion and next essays

The thesis of this essay was simple to state: the misalignment of AI models is not an accidental failure; it is a structural property of systems trained by external optimization, predictable by the concept of mesa-optimization and powerfully illustrated by the evolutionary analogy. European law has begun to adapt, with Directive (EU) 2024/2853, which introduces rebuttable presumptions for technically complex products. But the more ambitious proposal, the AI Liability Directive, was withdrawn in October 2025, leaving a regulatory vacuum that will have to be filled by practice and by the case law of the Member States.

The following essays in this arc will deepen each dimension. The next covers sleeper agents and emergent misalignment, showing that classical safety training does not remove latent behaviours, with direct implications for the AI Act's compliance regime. The one after will develop the thesis of mechanistic interpretability as a correctable regulatory gap, starting from Dario Amodei's essay The Urgency of Interpretability.

A final note on method. This essay cited three primary technical sources (the OpenAI post from 2016, the Amodei et al. paper of June 2016, and the Hubinger et al. paper of June 2019) and three legal sources (Decree-Law 383/89, Directive 2024/2853, and the withdrawn AI Liability Directive proposal). The technical ones are openly available on arXiv and on OpenAI's blog. The legal ones are on the Official Journal of the European Union and the Diário da República. None of them sits behind a paywall. What prevents European legal commentary from engaging with the technical is not lack of access; it is lack of practice. That lack is correctable. And, as I argued in the previous essay, there are reasons to think the correction has to come from the legal side.

Primary sources:

Faulty Reward Functions in the Wild, OpenAI, 21 December 2016.
Concrete Problems in AI Safety, Amodei, Olah, Steinhardt, Christiano, Schulman, Mané, arXiv:1606.06565, June 2016.
Risks from Learned Optimization in Advanced Machine Learning Systems, Hubinger, van Merwijk, Mikulik, Skalse, Garrabrant, arXiv:1906.01820, 5 June 2019.
Specification gaming: the flip side of AI ingenuity, DeepMind, 2020.
Dissecting racial bias in an algorithm used to manage the health of populations, Obermeyer, Powers, Vogeli, Mullainathan, Science 366 (2019) 447-453.
Portuguese Decree-Law No. 383/89 of 6 November.
Council Directive 85/374/EEC of 25 July 1985.
Directive (EU) 2024/2853 of the European Parliament and of the Council of 23 October 2024.
Proposal for a Directive on adapting non-contractual civil liability rules to artificial intelligence, COM(2022) 496, withdrawn on 6 October 2025.
João Calvão da Silva, Responsabilidade Civil do Produtor, Almedina, as one of the classical Portuguese doctrinal references.