Person

Chris Olah

Anthropic

Chris Olah is a co-founder of Anthropic, where he has led the team dedicated to mechanistic interpretability since 2021. The term itself originates in his work, first at Google, then at OpenAI, and now at Anthropic under his technical leadership. The central idea, which Olah has framed in near-epistemological terms, is that modern models are grown rather than built, and therefore call for a method analogous to the natural sciences (dissection, observation, intervention, explanatory modelling) if they are to be understood. Accepting that a system works is not the same as understanding why it works; the programme Olah leads sets out to close that gap.

For this blog's thesis, Olah is the technical figure who makes the argument of Opening the Black Box possible. Without the interpretability programme he runs, the presumption in Article 86 of the AI Act (and in the CJEU's Dun & Bradstreet case law) that substantive explanation of automated decisions can be required has no technical object to rest on. The papers his team has produced, Towards Monosemanticity (2023), Scaling Monosemanticity (2024), and On the Biology of a Large Language Model (2025), are the first public demonstrations that it is possible to identify, inside a tangle of billions of parameters, interpretable units corresponding to human concepts, and to intervene on them causally. Golden Gate Claude, in May 2024, was the playful version of that result.

Olah is named in Opening the Black Box, where he occupies the central role, and in Constitution Without a State as a contributor to the Claude Constitution (2026). He is also a co-author, with Amodei, of Concrete Problems in AI Safety (2016), tying the two ends of the series together.

Papers authored

Essays referencing this