In this study we train Claude to translate its thoughts into human-readable text. By examining the internal representations that drive language model outputs, we gain insight into how AI systems process information and make decisions.
The Hidden Language of AI
Language models like Claude process information very differently from the way people experience their own thinking. People work with words and concepts they can consciously inspect, whereas a language model represents information as vectors: high-dimensional arrays of numbers that encode meaning in ways that are opaque even to the model's creators.
This raises a fundamental question: can we teach AI models to become interpreters of their own thinking? Can we extract the underlying numerical thoughts and translate them into human language that reveals what the model "understands"?
The Research Approach
We developed an approach in which auxiliary networks are trained to translate Claude's internal activation patterns into natural language descriptions. The process has three stages:
Capturing Internal Representations
- Extracting activation vectors from intermediate layers of the model
- Identifying which layers contain the richest semantic information
- Understanding how representations evolve as information flows through the network
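To make the idea of capturing intermediate representations concrete, here is a minimal sketch using a toy feedforward network written with the Python standard library. It is not the pipeline used on Claude: the network, its random weights, and the names `make_layer` and `forward_with_activations` are assumptions made purely for illustration. With a real model, one would typically record activations through the framework's own mechanism, such as forward hooks.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Build a random weight matrix (n_out rows of n_in weights) for a toy layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def forward_with_activations(x, layers):
    """Run x through every layer and keep each intermediate activation vector."""
    activations = []
    h = x
    for weights in layers:
        # One dense layer with a tanh nonlinearity.
        h = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in weights]
        activations.append(h)
    return activations

rng = random.Random(0)  # fixed seed so the toy weights are reproducible
layers = [make_layer(4, 8, rng), make_layer(8, 8, rng), make_layer(8, 3, rng)]
acts = forward_with_activations([0.5, -0.2, 0.1, 0.9], layers)

for i, a in enumerate(acts):
    print(f"layer {i}: {len(a)}-dimensional activation vector")
```

Keeping one vector per layer is what makes it possible to compare layers afterwards, for example to ask which one carries the most useful information for a given description task.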
Training Translation Networks
We train separate neural networks to map from Claude's internal vectors to natural language descriptions. These translation networks learn to interpret what the model's representations "mean" in human terms.
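As a rough illustration of what such a translation network does, the sketch below fits a linear softmax model that maps vectors to one of a few candidate descriptions. This is a deliberate simplification: it picks from a fixed list rather than generating free text, and the synthetic data, the `DESCRIPTIONS` list, and all function names are hypothetical stand-ins rather than anything taken from the actual study.

```python
import math
import random

# Hypothetical candidate descriptions; a real translator would generate text.
DESCRIPTIONS = ["mentions an animal", "mentions a number"]

def make_dataset(rng, n_per_class=50):
    """Synthetic stand-in for (activation vector, description index) pairs."""
    centers = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    data = []
    for label, center in enumerate(centers):
        for _ in range(n_per_class):
            data.append(([c + rng.gauss(0.0, 0.1) for c in center], label))
    rng.shuffle(data)
    return data

def scores_for(vec, weights):
    """Linear score of the vector against each description's weight row."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(data, n_classes, dim, lr=0.5, epochs=50):
    """Fit the linear map with stochastic gradient descent on cross-entropy loss."""
    weights = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        for vec, label in data:
            probs = softmax(scores_for(vec, weights))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == label else 0.0)
                for j in range(dim):
                    weights[k][j] -= lr * grad * vec[j]
    return weights

def describe(vec, weights):
    """Return the description with the highest score for this vector."""
    scores = scores_for(vec, weights)
    return DESCRIPTIONS[scores.index(max(scores))]

rng = random.Random(0)
weights = train(make_dataset(rng), n_classes=2, dim=3)
print(describe([1.0, 0.0, 0.0], weights))
print(describe([0.0, 1.0, 0.0], weights))
```

The important design point carried over from the text is the direction of the mapping: the input is an internal vector, and the output is something a person can read.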
Validating Interpretations
We validate that our translations are accurate by checking whether the descriptions correspond to the model's actual behavior and outputs. If our interpretation is correct, it should predict what Claude will do.
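One simple way to express this check is as an agreement rate: for each case, predict the behavior from the description alone and compare it with what the model actually did. The sketch below assumes a hypothetical record format of (description, observed behavior) pairs and a hypothetical `toy_predictor`; neither comes from the study itself.

```python
def prediction_agreement(records, predict_from_description):
    """Fraction of cases where the description-based prediction matches the actual output."""
    if not records:
        raise ValueError("no records to evaluate")
    hits = sum(
        1
        for description, actual in records
        if predict_from_description(description) == actual
    )
    return hits / len(records)

# Hypothetical (description, observed behavior) pairs, invented for illustration.
records = [
    ("planning to answer in French", "french"),
    ("planning to answer in French", "french"),
    ("planning to refuse", "refusal"),
    ("planning to answer in French", "english"),
]

def toy_predictor(description):
    """Guess the behavior using only the text of the description."""
    return "refusal" if "refuse" in description else "french"

print(prediction_agreement(records, toy_predictor))  # 0.75
```

An agreement rate is only meaningful relative to a baseline, such as always predicting the most common behavior, so a real evaluation would likely report both.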
Key Findings
This research reveals several surprising insights about how Claude thinks:
Hierarchical Concept Building
Lower layers of the network process surface-level syntactic information, while deeper layers build abstract semantic concepts. We can identify the layers at which the model shifts from representing words to representing meanings.
Multi-faceted Representations
Single activation vectors often encode multiple concepts simultaneously. The model represents ambiguous information by maintaining superpositions of possible meanings until later layers resolve the ambiguity.
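A toy example can show what it means for one vector to hold two meanings at once. In the sketch below, the directions `FINANCE` and `RIVER` and the helper names are hypothetical, and the directions are exactly orthogonal for readability; in a real network, concept directions are generally not this clean.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def combine(*weighted):
    """Weighted sum of direction vectors, e.g. combine((0.5, u), (0.5, v))."""
    dim = len(weighted[0][1])
    return [sum(w * d[i] for w, d in weighted) for i in range(dim)]

# Hypothetical concept directions for the two senses of the word "bank".
FINANCE = [1.0, 0.0, 0.0, 0.0]
RIVER = [0.0, 1.0, 0.0, 0.0]

def read_out(vec):
    """Project a vector onto each concept direction to see how strongly it is present."""
    return {"finance": dot(vec, FINANCE), "river": dot(vec, RIVER)}

# Early representation: both senses are present at equal strength.
ambiguous = combine((0.5, FINANCE), (0.5, RIVER))

# A later step adds evidence from context (say, the word "fishing") along RIVER.
resolved = combine((1.0, ambiguous), (0.5, RIVER))

print(read_out(ambiguous))  # {'finance': 0.5, 'river': 0.5}
print(read_out(resolved))   # {'finance': 0.5, 'river': 1.0}
```

Resolution here simply means that one sense comes to dominate the readout; the other sense does not have to be erased for the ambiguity to be settled.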
Analogical Reasoning
Claude's internal representations reveal how it performs analogical reasoning by finding geometric relationships between concepts. The model solves "A is to B as C is to ?" by vector arithmetic in its internal space.
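The vector arithmetic can be sketched with a handful of hand-built embeddings. Everything here is an assumption for illustration: the three-dimensional vectors in `EMBEDDINGS` are invented, not extracted from Claude, and `analogy` is a hypothetical helper that computes B - A + C and returns the closest remaining word by cosine similarity.

```python
import math

# Hypothetical hand-built vectors with axes [royalty, male, female].
EMBEDDINGS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.7, 1.0, 0.0],
    "princess": [0.7, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    # Exclude the query words themselves, as is conventional for this test.
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))

print(analogy("man", "king", "woman"))  # queen
```

The offset from "man" to "king" captures the royalty direction, and adding that same offset to "woman" lands on "queen", which is the geometric relationship the text describes.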
Implications for AI Safety
This work has important implications for understanding and controlling AI behavior:
- Transparency: By understanding how models think internally, we can verify that they're reasoning correctly
- Alignment: We can identify where a model's internal goals or values diverge from intended behavior
- Robustness: Understanding internal representations helps us design defenses against adversarial examples and manipulation
The Challenge of Cross-Domain Translation
Although we have made progress in interpreting Claude's internal representations, translating from numbers to words remains challenging. The internal language of neural networks is radically different from human language, and some aspects of model thinking may not be expressible in words at all.
Nevertheless, this research demonstrates that AI systems are not entirely black boxes: their thinking can be made at least partially transparent through systematic investigation of their internal representations.
Conclusion
Language models think in numbers, but we can train them to explain their thoughts. This work opens new possibilities for understanding, interpreting, and ultimately controlling the behavior of advanced AI systems. As we develop more powerful AI, the ability to understand how these systems think becomes increasingly important for safety and alignment.