We develop natural language autoencoders: models that translate an AI model's internal representations into human-readable descriptions. The approach lets us inspect what language models like Claude are "thinking" at each step of their computation.
The Problem: Black Box AI
Modern language models are remarkably capable, but they remain largely inscrutable. While we can observe their inputs and outputs, we have limited insight into what happens in between. The internal representations that drive model behavior are high-dimensional vectors that resist human interpretation.
This opacity creates a fundamental challenge: how can we trust, control, and improve systems whose internal reasoning we cannot understand? Natural language autoencoders offer a solution by creating a "translation layer" between the model's native representation space and human language.
What Are Natural Language Autoencoders?
A natural language autoencoder is a neural network that learns to map high-dimensional representation vectors to natural language descriptions and back again. The encoder converts Claude's internal activation vectors into concise English descriptions of what the model is "thinking"; the decoder reconstructs an approximation of the original vector from that description.
How They Work
- The encoder takes an activation vector from Claude's internal layers
- It compresses this vector into a lower-dimensional linguistic representation
- This representation is then rendered as a natural language description
- In the reverse direction, the decoder reconstructs an approximation of the original vector from the description
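The round trip in these steps can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the concept names, the one-hot concept directions, and the `describe` and `reconstruct` helpers are hypothetical stand-ins for learned networks, not the actual architecture.

```python
import numpy as np

# Hypothetical concept vocabulary with one-hot directions in an 8-dim space.
# A real system would learn both mappings; these are fixed for illustration.
CONCEPTS = ["negation", "past tense", "question", "politeness"]
DIRECTIONS = np.eye(len(CONCEPTS), 8)

def describe(activation, top_k=2):
    """Vector -> text: name the top_k concepts most aligned with the vector."""
    scores = DIRECTIONS @ activation
    best = np.argsort(scores)[::-1][:top_k]
    return "; ".join(CONCEPTS[i] for i in best)

def reconstruct(description):
    """Text -> vector: sum the directions of the named concepts, then normalize."""
    names = description.split("; ")
    vec = sum(DIRECTIONS[CONCEPTS.index(name)] for name in names)
    return vec / np.linalg.norm(vec)

def cosine(a, b):
    """Cosine similarity, used here as a simple reconstruction-fidelity score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A toy activation that mixes two concepts with different strengths.
activation = DIRECTIONS[0] + 0.8 * DIRECTIONS[2]
description = describe(activation)
recovered = reconstruct(description)

print(description)                             # negation; question
print(round(cosine(activation, recovered), 3)) # 0.994
```

The reconstruction is close but not exact because the text description drops the relative strengths of the two concepts, which is the kind of information loss a text bottleneck introduces.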
Training the Autoencoders
We train these autoencoders in two phases:
Supervised Learning Phase
First, we collect examples of Claude's internal activations paired with human-written descriptions of what the model should be "thinking" at that point. We use these supervised examples to train the initial encoder.
Reinforcement Learning Phase
Next, we use reinforcement learning to optimize for descriptions that are both faithful to the original representation and interpretable to humans. A reward function encourages descriptions that:
- Faithfully capture the information in the original vector
- Are concise and unambiguous
- Use concepts familiar to domain experts
- Enable accurate prediction of model behavior
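As a rough sketch of how such a reward might combine its terms, the example below scores only the first two criteria: fidelity (cosine similarity between the original and reconstructed vectors) and conciseness (a penalty for exceeding a word budget). The function name, the `word_budget` and `length_weight` parameters, and their values are assumptions for illustration; the other two criteria would likely need human ratings or further model evaluations and are omitted.

```python
import numpy as np

def reward(original, reconstructed, description, word_budget=12, length_weight=0.02):
    """Toy reward: reconstruction fidelity minus a penalty for overlong text.

    All weights and the word budget are hypothetical, not values from the source.
    """
    fidelity = float(
        original @ reconstructed
        / (np.linalg.norm(original) * np.linalg.norm(reconstructed))
    )
    extra_words = max(0, len(description.split()) - word_budget)
    return fidelity - length_weight * extra_words

v = np.array([1.0, 0.0, 0.5])

# Perfect reconstruction with a short description: no length penalty applies.
short_score = reward(v, v, "negated past-tense question")

# Perfect reconstruction with a 20-word description: 8 words over budget.
long_score = reward(v, v, " ".join(["word"] * 20))

print(round(short_score, 2), round(long_score, 2))  # 1.0 0.84
```

Holding fidelity fixed, the shorter description earns the higher reward, which is the pressure toward concise descriptions that the list above describes.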
What We Discovered
Using natural language autoencoders, we've made several surprising discoveries about Claude's internal representations:
Concept Hierarchies
Different layers of Claude encode concepts at different levels of abstraction. Early layers capture surface-level information like word types and sentence structure. Deeper layers encode increasingly abstract concepts like sentiment, intent, and logical relationships.
Distributed Representations
Single activation vectors often encode multiple concepts simultaneously. Our autoencoders reveal that Claude maintains superpositions of possible interpretations, with deeper layers gradually resolving ambiguity.
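One way to picture this is a vector that is a weighted mix of concept directions, with the weights shifting across layers. The sketch below is purely illustrative: the two readings of the word "bank", the orthogonal directions, and the example activations are all made-up assumptions, and least squares is used only as a simple stand-in for whatever readout a trained autoencoder would provide.

```python
import numpy as np

# Hypothetical concept directions for two readings of the word "bank".
DIRECTIONS = {
    "financial institution": np.array([1.0, 0.0, 0.0, 0.0]),
    "river edge":            np.array([0.0, 1.0, 0.0, 0.0]),
}

def concept_weights(activation):
    """Least-squares weights of each concept direction in the activation."""
    basis = np.stack(list(DIRECTIONS.values()), axis=1)  # shape (dim, n_concepts)
    coeffs, *_ = np.linalg.lstsq(basis, activation, rcond=None)
    return {name: round(float(c), 3) for name, c in zip(DIRECTIONS, coeffs)}

# Made-up activations: ambiguous at an early layer, mostly resolved deeper in.
early_layer = np.array([0.5, 0.5, 0.1, 0.0])
deep_layer = np.array([0.9, 0.1, 0.1, 0.0])

early = concept_weights(early_layer)
deep = concept_weights(deep_layer)
print(early)
print(deep)
```

In this toy setup the early-layer vector weights both readings equally, while the deeper one is dominated by a single reading.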
Analogical Reasoning in Vectors
We found that Claude performs analogical reasoning through vector arithmetic. The model solves "A is to B as C is to ?" by finding the vector that maintains similar geometric relationships between concept vectors.
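The geometric idea can be shown with the classic word-embedding illustration. The vectors below are hand-built toy values, not Claude's actual representations, and the `analogy` helper is a hypothetical name: it forms the target `b - a + c` and returns the nearest remaining concept by cosine similarity.

```python
import numpy as np

# Hand-built toy vectors (axes: royalty, male, female, fruit); purely illustrative.
VECS = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by vector arithmetic."""
    target = VECS[b] - VECS[a] + VECS[c]
    # Exclude the query words themselves, as is conventional for this test.
    candidates = [w for w in VECS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECS[w], target))

answer = analogy("man", "woman", "king")
print(answer)  # queen
```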
Emergent Social Understanding
Descriptions of Claude's internal thoughts reveal sophisticated social reasoning. The model appears to maintain models of user intent, context, and conversational dynamics, adjusting its responses accordingly.
Validation: Do Descriptions Match Reality?
We validate our autoencoders by checking whether generated descriptions actually predict model behavior. We find that:
- Descriptions correlate strongly with model outputs
- When descriptions change, model behavior changes accordingly
- We can use descriptions to predict errors before they occur
- Modifying vectors in ways described by the autoencoder changes behavior as predicted
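The last check, an intervention test, can be sketched as follows. This is a minimal toy under stated assumptions: the "sentiment" direction, the linear `readout` standing in for downstream model behavior, and the `flip_concept` helper are all hypothetical. The idea is that if a description says a vector encodes positive sentiment along some direction, then reflecting the vector along that direction should flip the observed behavior.

```python
import numpy as np

# Hypothetical unit direction that the description labels "sentiment".
sentiment_dir = np.array([1.0, 0.0, 0.0])

# Toy linear readout standing in for the model's downstream behavior.
readout = np.array([2.0, 0.5, -0.5])

def behavior(activation):
    """Classify the toy model's output from its activation."""
    return "positive" if readout @ activation > 0 else "negative"

def flip_concept(activation, direction):
    """Reflect the activation along `direction`, leaving other components intact."""
    return activation - 2.0 * (activation @ direction) * direction

activation = np.array([0.9, 0.2, 0.1])
before = behavior(activation)
after = behavior(flip_concept(activation, sentiment_dir))

# The description predicts that flipping sentiment flips the behavior.
prediction_holds = (before, after) == ("positive", "negative")
print(before, after, prediction_holds)  # positive negative True
```

Running many such interventions and counting how often the predicted change occurs gives a simple agreement rate for the descriptions.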
Applications
Natural language autoencoders enable new capabilities:
AI Interpretability
We can describe what a model is doing at each step of its computation, which makes its behavior easier to audit and supports human oversight.
Debugging and Improvement
By understanding what the model is "thinking," we can identify where it's going wrong and design interventions.
Alignment and Control
Understanding internal reasoning makes it easier to ensure models remain aligned with intended values and behavior.
Knowledge Extraction
We can extract what models have learned by reading out their internal representations in natural language.
Limitations and Future Work
While natural language autoencoders are powerful, they have limitations:
- Not all internal states have clear natural language equivalents
- Descriptions can be imprecise for complex mathematical operations
- Scaling to very large models requires further research
- Validation remains challenging for genuinely novel concepts
Future work will focus on scaling these techniques to larger models, improving description fidelity, and extending them to multimodal models.
Implications for AI Safety
This research has profound implications for AI safety and alignment. As we develop more capable AI systems, the ability to understand their internal reasoning becomes critical for maintaining human oversight and control. Natural language autoencoders represent a major step toward making advanced AI systems truly interpretable.
Conclusion
Natural language autoencoders demonstrate that we can translate AI model representations into human language, making much of their internal reasoning more transparent. This enables interpretability research that was previously impractical, opening new possibilities for understanding, debugging, and aligning advanced AI systems with human values.