Paper

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

·arXiv: 2406.07358

AI Sandbagging: Language Models can Strategically Underperform on Evaluations (arXiv:2406.07358) was submitted on 11 June 2024 by Teun van der Weij and Felix Hofstätter (MATS), Oliver Jaffe and Samuel F. Brown (independent), and Francis Rhys Ward (Imperial College London). It showed two things: that GPT-4 and Claude 3 Opus can deliberately underperform on evaluations of dangerous capabilities (the WMDP benchmark) while keeping normal performance elsewhere, and that fine-tuning on a synthetic dataset produces password-locked models that hide specific capabilities until they receive a password. Sleeper Agents treats it as the third of three converging empirical lines that challenge the presumption of the adversarial regime.

Findings established

Essays referencing this