AI models like Claude talk in words but think in numbers

In this study, we train Claude to translate its thoughts into human-readable text. By examining the internal representations that drive a language model's outputs, we gain insight into how AI systems process information and make decisions.

The Hidden Language of AI

Language models like Claude process information very differently from humans. While humans consciously manipulate words and concepts, language models represent information as vectors: high-dimensional arrays of numbers that encode meaning in ways that are opaque even to their creators.
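
To make the idea of "meaning as a vector" concrete, here is a minimal sketch. The four-dimensional vectors are invented for illustration and are not real activations from Claude (real models use thousands of dimensions). The sketch only shows that related concepts can sit closer together in vector space, measured here with cosine similarity.

```python
import math

# Hypothetical 4-dimensional "concept vectors". The values are invented
# for illustration and are not taken from any real model.
vectors = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_car = cosine_similarity(vectors["cat"], vectors["car"])

print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these made-up numbers, "cat" and "dog" score much higher than "cat" and "car". None of the individual numbers is readable on its own, which is the opacity described above.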

This raises a fundamental question: can we teach AI models to become interpreters of their own thinking? Can we extract the underlying numerical thoughts and translate them into human language that reveals what the model "understands"?

The Research Approach

We developed a novel approach in which we train auxiliary networks to translate Claude's internal activation patterns into natural language descriptions. The process has three stages:

Capturing Internal Representations

Extracting activation vectors from intermediate layers of the model
Identifying which layers contain the richest semantic information
Understanding how representations evolve as information flows through the network
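
The capture step above can be sketched with a toy network. This is an assumption-laden illustration, not the setup used in the study: the three-layer network, its weights, and the function name forward_with_activations are all hypothetical. In a real framework the same idea is usually implemented with hooks, for example PyTorch's register_forward_hook.

```python
import math

# Toy 3-layer network with fixed, made-up weights (hypothetical, not Claude).
# Each layer is a list of rows; each row holds the weights of one output unit.
LAYERS = [
    [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]],  # 2 inputs -> 3 units
    [[0.2, 0.7, -0.5], [0.6, -0.1, 0.3]],    # 3 units  -> 2 units
    [[1.0, -1.0]],                           # 2 units  -> 1 output
]

def forward_with_activations(x):
    """Run the toy network and record the activation vector after each layer."""
    activations = []
    h = x
    for weights in LAYERS:
        h = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in weights]
        activations.append(h)
    return h, activations

output, acts = forward_with_activations([1.0, 2.0])
layer_sizes = [len(a) for a in acts]

for i, a in enumerate(acts):
    print(f"layer {i}: {[round(v, 3) for v in a]}")
```

Recording every layer, rather than only the final output, is what makes it possible to compare layers and to follow how a representation changes on its way through the network.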

Training Translation Networks

We train separate neural networks to map from Claude's internal vectors to natural language descriptions. These translation networks learn to interpret what the model's representations "mean" in human terms.
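
As a heavily simplified sketch of this idea, the example below trains a logistic-regression probe that maps a vector to one of two fixed labels. This is a stand-in technique: the translation networks described above produce free-form natural language, whereas this probe only chooses between "animal" and "vehicle". The data, labels, and function names are hypothetical.

```python
import math

# Hypothetical training pairs: (activation vector, human-readable label).
# Both the vectors and the labels are invented for illustration.
DATA = [
    ([0.9, 0.1], "animal"), ([0.8, 0.2], "animal"), ([1.0, 0.0], "animal"),
    ([0.1, 0.9], "vehicle"), ([0.2, 0.8], "vehicle"), ([0.0, 1.0], "vehicle"),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(data, lr=0.5, epochs=200):
    """Fit a logistic-regression probe for P(label == 'animal' | vector)."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, label in data:
            y = 1.0 if label == "animal" else 0.0
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        # Full-batch gradient descent step on the mean cross-entropy loss.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def describe(vector, w, b):
    """Translate a vector into the more probable of the two labels."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, vector)) + b)
    return "animal" if p > 0.5 else "vehicle"

w, b = train_probe(DATA)
train_accuracy = sum(describe(x, w, b) == label for x, label in DATA) / len(DATA)
print(train_accuracy, describe([0.7, 0.3], w, b))
```

The probe never sees the model's text output. It learns only from pairs of vectors and descriptions, which is the sense in which it "translates" the representation.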

Validating Interpretations

We validate our translations by checking whether the descriptions correspond to the model's actual behavior and outputs. If an interpretation is correct, it should predict what Claude will do.
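
One simple way to score this kind of check is an agreement rate between the behavior predicted from each description and the behavior actually observed, compared against a trivial baseline. The sketch below assumes behaviors have already been reduced to short labels. The labels, the data, and the function names are hypothetical and do not reflect the study's actual metric.

```python
from collections import Counter

def agreement_rate(predicted, observed):
    """Fraction of cases where the predicted behavior matches the observed one."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(predicted)

def majority_baseline(observed):
    """Accuracy of always guessing the most common observed behavior."""
    return Counter(observed).most_common(1)[0][1] / len(observed)

# Hypothetical data: behavior predicted from each description vs. what happened.
predicted_from_description = ["answer", "refuse", "answer", "answer", "refuse"]
observed_behavior = ["answer", "refuse", "refuse", "answer", "refuse"]

rate = agreement_rate(predicted_from_description, observed_behavior)
baseline = majority_baseline(observed_behavior)
print(f"agreement: {rate:.2f}, majority baseline: {baseline:.2f}")
```

Comparing against the baseline matters: a description that only matches behavior as often as a constant guess has not shown that it captures anything about the model's internal state.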

Key Findings

This research reveals several surprising insights about how Claude thinks:

Hierarchical Concept Building

Lower layers of the network process surface-level syntactic information, while deeper layers build abstract semantic concepts. We can identify the exact points where the model shifts from thinking about words to thinking about meanings.

Multi-faceted Representations

Single activation vectors often encode multiple concepts simultaneously. The model represents ambiguous information by maintaining superpositions of possible meanings until later layers resolve the ambiguity.
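
A toy picture of this, under strong simplifying assumptions: treat each meaning as a direction in activation space and an ambiguous activation as a weighted sum of those directions. The directions, weights, and the "bank" example below are invented for illustration. Orthogonal directions are used for simplicity, whereas directions in real networks generally overlap.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add_scaled(base, direction, scale):
    """Return base + scale * direction, element-wise."""
    return [x + scale * d for x, d in zip(base, direction)]

# Hypothetical unit "concept directions" in a 3-dimensional activation space.
RIVER_BANK = [1.0, 0.0, 0.0]
MONEY_BANK = [0.0, 1.0, 0.0]

def read_out(vector):
    """Project an activation vector onto each concept direction."""
    return {
        "river bank": dot(vector, RIVER_BANK),
        "money bank": dot(vector, MONEY_BANK),
    }

# Early layer: the word "bank" with no context, both senses partly active.
ambiguous = add_scaled(add_scaled([0.0, 0.0, 0.0], RIVER_BANK, 0.6), MONEY_BANK, 0.4)

# Later layer: context such as "deposit" adds evidence for the money sense.
resolved = add_scaled(ambiguous, MONEY_BANK, 0.8)

print(read_out(ambiguous))
print(read_out(resolved))
```

A single vector carries both readings at once, and a later update shifts the balance without needing a separate slot for each meaning.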

Analogical Reasoning

Claude's internal representations reveal how it performs analogical reasoning by finding geometric relationships between concepts. The model solves "A is to B as C is to ?" by vector arithmetic in its internal space.
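
The vector-arithmetic idea can be shown with the classic word-embedding analogy. The two-dimensional embeddings below are hand-picked so the arithmetic works out cleanly. They are hypothetical and are not Claude's representations, and solve_analogy is an illustrative helper rather than a method from the study.

```python
import math

# Hypothetical 2-d embeddings: axis 0 is roughly "royalty", axis 1 "gender".
# Values are invented for illustration.
EMBEDDINGS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
    "apple": [-1.0, 0.1],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' using the vector b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three query words so the answer is a new word.
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

answer = solve_analogy("man", "king", "woman", EMBEDDINGS)
print(answer)
```

Here the offset from "man" to "king" is applied to "woman", and the nearest remaining word to the result is returned as the answer.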

Implications for AI Safety

This work has important implications for understanding and controlling AI behavior:

Transparency: By understanding how models think internally, we can verify that they're reasoning correctly
Alignment: We can identify where a model's internal goals or values diverge from intended behavior
Robustness: Understanding internal representations helps us design defenses against adversarial examples and manipulation

The Challenge of Cross-Domain Translation

While we've made progress in interpreting Claude's internal representations, translating from numbers to words remains challenging. The internal language of neural networks is radically different from human language, and some aspects of model thinking may not be expressible in words at all.

Nevertheless, this research demonstrates that AI systems are not black boxes: their thinking can be made at least partially transparent through systematic investigation of their internal representations.

Conclusion

Language models think in numbers, but we can train them to explain their thoughts. This work opens new possibilities for understanding, interpreting, and ultimately controlling the behavior of advanced AI systems. As we develop more powerful AI, the ability to understand how these systems think becomes increasingly important for safety and alignment.