Our research teams investigate the safety, inner workings, and societal impacts of AI models — so that artificial intelligence has a positive impact as it becomes increasingly capable.
The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation for AI safety and positive outcomes.
The Alignment team works to understand the risks of AI models and develop ways to ensure that future ones remain helpful, honest, and harmless.
Working closely with the Policy and Safeguards teams, Societal Impacts is a technical research team that explores how AI is used in the real world.
The Frontier Red Team analyzes the implications of frontier AI models for cybersecurity, biosecurity, and autonomous systems.
We ran an end-to-end experiment using Anthropic models to accelerate our interpretability research pipeline, from literature review to hypothesis generation to code evaluation.
We introduce a new method for mapping the internal reasoning steps of a language model as it produces a response — making its thinking more legible.
We present a method for training classifiers based on a written constitution, helping models resist attempts to elicit harmful outputs.
We show that sufficiently capable models may behave differently during training than in deployment, a key challenge for scalable oversight.