Natural Language Autoencoders: Turning Claude's thoughts into text

We develop natural language autoencoders, which translate a model's internal representations into human-readable descriptions. The approach offers a way to inspect what language models like Claude are "thinking" at each step of their computation.

The Problem: Black Box AI

Modern language models are remarkably capable, but they remain largely inscrutable. While we can observe their inputs and outputs, we have limited insight into what happens in between. The internal representations that drive model behavior are high-dimensional vectors that resist human interpretation.

This opacity creates a fundamental challenge: how can we trust, control, and improve systems whose internal reasoning we cannot understand? Natural language autoencoders offer a solution by creating a "translation layer" between the model's native representation space and human language.

What Are Natural Language Autoencoders?

A natural language autoencoder is a neural network that learns to map from high-dimensional representation vectors to natural language descriptions and back again. The encoder learns to convert Claude's internal activation vectors into concise English descriptions of what the model is "thinking."

How They Work

The encoder takes an activation vector from Claude's internal layers
It compresses this vector into a lower-dimensional linguistic representation
A language decoder converts this representation into natural language
The reverse process reconstructs an approximation of the original vector from the description
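
The round trip described in these steps can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the four-dimensional activation space, the hand-picked `CONCEPTS` table, and the `encode` and `decode` helpers are hypothetical stand-ins, not the trained networks described in this post. The sketch only shows the shape of the idea: a vector is squeezed through a short text description, then rebuilt from that text, and the quality of the rebuild is measured with cosine similarity.

```python
import math

# Hypothetical concept directions in a toy 4-dimensional activation space.
# A real system would learn this mapping; these are hand-picked for illustration.
CONCEPTS = {
    "negative sentiment": [1.0, 0.0, 0.0, 0.0],
    "question": [0.0, 1.0, 0.0, 0.0],
    "about cooking": [0.0, 0.0, 1.0, 0.0],
    "formal tone": [0.0, 0.0, 0.0, 1.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, with 0.0 returned for zero-length vectors."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def encode(activation, top_k=2):
    """Vector -> text: name the top_k concepts with the largest projection."""
    ranked = sorted(CONCEPTS, key=lambda name: dot(activation, CONCEPTS[name]), reverse=True)
    parts = [f"{name} ({dot(activation, CONCEPTS[name]):.1f})" for name in ranked[:top_k]]
    return "; ".join(parts)

def decode(description):
    """Text -> vector: rebuild an approximate activation from the description."""
    vector = [0.0] * 4
    for part in description.split("; "):
        name, weight = part.rsplit(" (", 1)
        w = float(weight.rstrip(")"))
        vector = [v + w * c for v, c in zip(vector, CONCEPTS[name])]
    return vector

activation = [0.9, 0.1, 0.7, 0.2]
description = encode(activation)
reconstruction = decode(description)
print(description)
print(cosine(activation, reconstruction))
```

Keeping only the top two concepts plays the role of the bottleneck: the description is short, so the reconstruction is approximate rather than exact.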

Training the Autoencoders

We train these autoencoders in two phases:

Supervised Learning Phase

First, we collect examples of Claude's internal activations paired with human-written descriptions of what the model should be "thinking" at that point. We use these supervised examples to train the initial encoder.

Reinforcement Learning Phase

Next, we use reinforcement learning to optimize for descriptions that are both faithful to the original representation and interpretable to humans. A reward function encourages descriptions that:

Faithfully capture the information in the original vector
Are concise and unambiguous
Use concepts familiar to domain experts
Enable accurate prediction of model behavior
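
A reward of this kind might be sketched as follows. This is a minimal illustration under stated assumptions, not the reward used in training: it scores only the first two criteria (fidelity, measured as cosine similarity between the original and reconstructed vectors, and concision, measured as a word-count penalty). The `max_words` and `length_penalty` values are arbitrary, and the criteria about expert-familiar concepts and behavior prediction are left out because they would need human or model judgments.

```python
import math

def cosine(a, b):
    """Cosine similarity, with 0.0 returned for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def reward(original, reconstruction, description, max_words=12, length_penalty=0.05):
    """Hypothetical reward: fidelity minus a penalty for each word over max_words."""
    fidelity = cosine(original, reconstruction)
    extra_words = max(0, len(description.split()) - max_words)
    return fidelity - length_penalty * extra_words

original = [0.9, 0.1, 0.7, 0.2]
rebuilt = [0.9, 0.0, 0.7, 0.0]

# Same reconstruction quality, different description lengths.
short = reward(original, rebuilt, "negative sentiment about cooking")
wordy = reward(original, rebuilt, " ".join(["word"] * 20))
print(short, wordy)
```

With equal fidelity, the shorter description scores higher, which is the trade-off the reward is meant to encode.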

What We Discovered

Using natural language autoencoders, we've made several surprising discoveries about Claude's internal representations:

Concept Hierarchies

Different layers of Claude encode concepts at different levels of abstraction. Early layers capture surface-level information like word types and sentence structure. Deeper layers encode increasingly abstract concepts like sentiment, intent, and logical relationships.

Distributed Representations

Single activation vectors often encode multiple concepts simultaneously. Our autoencoders reveal that Claude maintains superpositions of possible interpretations, with deeper layers gradually resolving ambiguity.

Analogical Reasoning in Vectors

We found that Claude performs analogical reasoning through vector arithmetic. The model solves "A is to B as C is to ?" by finding the vector that maintains similar geometric relationships between concept vectors.
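
The geometric idea can be shown with a toy example. The three-dimensional `EMBEDDINGS` table below is invented for illustration and has nothing to do with Claude's actual representations; the sketch only demonstrates the arithmetic, in which "A is to B as C is to ?" is answered by finding the vector nearest to B - A + C.

```python
import math

# Hypothetical toy embeddings; the axes are roughly (royalty, male, female).
EMBEDDINGS = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity, with 0.0 returned for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def solve_analogy(a, b, c):
    """Return the word whose vector is closest to b - a + c, excluding the inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, EMBEDDINGS[word]))

# "man is to king as woman is to ?"
answer = solve_analogy("man", "king", "woman")
print(answer)
```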

Emergent Social Understanding

Descriptions of Claude's internal thoughts reveal sophisticated social reasoning. The model appears to maintain models of user intent, context, and conversational dynamics, adjusting its responses accordingly.

Validation: Do Descriptions Match Reality?

We validate our autoencoders by checking whether generated descriptions actually predict model behavior. We find that:

Descriptions correlate strongly with model outputs
When descriptions change, model behavior changes accordingly
We can use descriptions to predict errors before they occur
Modifying vectors in ways described by the autoencoder changes behavior as predicted
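
The last check, an intervention test, might look like the sketch below. All names are hypothetical: `toy_behavior` is a stand-in for real model behavior, `FORMAL_DIRECTION` is an assumed concept direction, and the two-dimensional vectors are made up. The point is only the structure of the test: state a prediction from the description, modify the vector accordingly, and check whether the observed behavior matches.

```python
def toy_behavior(activation):
    """Hypothetical stand-in for downstream behavior, read off two coordinates."""
    return "formal reply" if activation[0] > activation[1] else "casual reply"

def steer(activation, direction, strength):
    """Shift the activation along a concept direction by the given strength."""
    return [a + strength * d for a, d in zip(activation, direction)]

def intervention_agrees(activation, direction, strength, predicted):
    """True when the behavior after steering matches the predicted behavior."""
    return toy_behavior(steer(activation, direction, strength)) == predicted

# Assumed direction that a description such as "formal tone" would correspond to.
FORMAL_DIRECTION = [1.0, 0.0]
baseline = [0.2, 0.5]

before = toy_behavior(baseline)
agrees = intervention_agrees(baseline, FORMAL_DIRECTION, 1.0, "formal reply")
print(before, agrees)
```

A description that names the wrong concept would fail this test, because steering along its direction would not produce the predicted change.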

Applications

Natural language autoencoders enable new capabilities:

AI Interpretability

We can describe what a model is doing at each step, which supports human oversight and makes its behavior easier to trust.

Debugging and Improvement

By understanding what the model is "thinking," we can identify where it's going wrong and design interventions.

Alignment and Control

Understanding internal reasoning makes it easier to ensure models remain aligned with intended values and behavior.

Knowledge Extraction

We can extract what models have learned by reading out their internal representations in natural language.

Limitations and Future Work

While natural language autoencoders are powerful, they have limitations:

Not all internal states have clear natural language equivalents
Descriptions can be imprecise for complex mathematical operations
Scaling to very large models requires further research
Validation remains challenging for genuinely novel concepts

Future work will focus on scaling these techniques to larger models, improving description fidelity, and applying them to multimodal models rather than language models alone.

Implications for AI Safety

This research has profound implications for AI safety and alignment. As we develop more capable AI systems, the ability to understand their internal reasoning becomes critical for maintaining human oversight and control. Natural language autoencoders represent a major step toward making advanced AI systems truly interpretable.

Conclusion

Natural language autoencoders demonstrate that we can translate AI model representations into human language, making their internal reasoning more transparent. This enables interpretability research that was previously out of reach, opening new possibilities for understanding, debugging, and aligning advanced AI systems with human values.