source: Simon Willison: "They screwed us": personality clashes sent Anthropic's models offline
level: technical
Axios published a detailed account of behind-the-scenes gossip at Anthropic, citing sources familiar with the administration's thinking and sources close to the company. The piece covers the US government export control story around the Mythos or Fable models. Three Anthropic staff are reported to be meeting with the Commerce Department in Washington, D.C. today: Logan Graham, who leads the Frontier Red Team; Dave Orr, head of safeguards and a former Google DeepMind engineering director; and Nicholas Carlini. Graham previously served as special adviser to the Prime Minister in the Boris Johnson era, covering AI, science, and technology policy.
The article suggests that personality clashes and internal disagreements contributed to the models being taken offline. One option discussed is ensuring Anthropic's models cannot be jailbroken, though perfect jailbreak resistance may be impossible. A source familiar with the administration's thinking said it may come down to an attitude fix where everyone feels safe, secure, and happy, rather than feeling dismissed. This raises the question of whether Anthropic has addressed the class of attacks described in the 2023 paper "Universal and Transferable Adversarial Attacks on Aligned Language Models".
Anthropic's Constitutional Classifiers work, published in January this year, is relevant to these jailbreak concerns. The company continues to claim that no universal jailbreak has been found against Claude Mythos, and it classifies the jailbreak that triggered the US government response as a potential narrow, non-universal jailbreak. The closing notes give little optimism that Fable will return soon, highlighting ongoing tensions between technical safety measures and regulatory expectations.
Why it matters: The incident shows how internal team dynamics and regulatory pressure can directly affect the availability of advanced AI models, with consequences for both research and commercial use.