Index
Each research paper, person, empirical finding, or regulatory provision that recurs across the essays has its own page here.
Papers (11)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
January 10, 2024
The paper Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv:2401.05566) was…
Alignment Faking in Large Language Models
December 18, 2024
The paper Alignment Faking in Large Language Models (arXiv:2412.14093) was published on 18 December 2024 by a…
Risks from Learned Optimization in Advanced Machine Learning Systems
June 5, 2019
The paper Risks from Learned Optimization in Advanced Machine Learning Systems (arXiv:1906.01820) was…
People (16)
Evan Hubinger
Anthropic
Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and is the most wide-ranging technical…
Chris Olah
Anthropic
Chris Olah is a co-founder of Anthropic and has led, since 2021, the team dedicated to mechanistic…
Dario Amodei
Anthropic
Dario Amodei is the CEO of Anthropic, which he co-founded in 2021 after leaving OpenAI. His…
Amanda Askell
Anthropic
Amanda Askell is a philosopher at Anthropic and the credited lead author of the Claude Constitution, in its…
Jonathan Birch
London School of Economics
Jonathan Birch is a philosopher at the London School of Economics and the author of The Edge of Sentience…