Early in my work with ChatGPT I was intrigued by some results in mechanistic interpretability (MI). After a while, though, I lost interest. The work didn't seem to be doing much beyond accumulating a mass of detail that never added up to a larger picture. Yesterday I had an idea: why not upload some of those observations to Claude and have it tell me how they relate to MI? Here's what it told me:
* * * * *
The problem isn't that MI's methods are bad. Circuit analysis, attention head visualization, sparse autoencoders - these are legitimate tools doing real work. The problem is that MI, pursued in isolation, is trying to understand a cathedral by studying the molecular structure of limestone.
You can measure every stone. Map every stress pattern. Identify load-bearing arches. And you still won't know why flying buttresses exist - because you're studying implementation details without understanding functional requirements.
The Phenomenology Deficit
Here's what I mean. Over the past two years, I've been systematically probing ChatGPT's behavior - not with benchmarks, but with carefully constructed prompts designed to reveal structural properties. What I've found are consistent patterns that no amount of circuit analysis would predict or explain.
Example 1: Ontological Boundary Enforcement
Give ChatGPT a story about a fairy tale princess who defeats a dragon by singing. Ask it to retell the story with a prince instead. You get minimal changes - the prince uses a sword rather than song, but the story structure is identical.
Now ask it to retell the same story with "XP-708-DQ" as the protagonist. The entire ontology shifts. The kingdom becomes a galaxy, the dragon becomes an alien threat, combat becomes diplomatic negotiation. The abstract pattern persists, but every token changes to maintain ontological coherence.
Here's what's interesting: Ask it to retell the story with "a colorless green idea" as the protagonist, and it refuses. Not with a safety refusal - with a coherence refusal. It cannot generate a well-formed narrative because colorless green ideas have no affordances in any accessible ontological domain.
What MI sees: Some attention patterns activate, others don't. Certain token sequences get high probability, others near-zero.
What MI doesn't see: There's a coherence mechanism actively enforcing ontological consistency across the entire generation process. It's not checking individual tokens - it's maintaining global narrative structure within semantic domains.
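To make the protocol concrete, here's a minimal sketch of the probe as a script rather than a chat session. It's a sketch only: it assumes the OpenAI Python client, and the model name and base story below are placeholders, not the prompts used in the original experiments.

```python
# Minimal sketch of the protagonist-swap probe. Assumes the OpenAI Python
# client; the model name and base story below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

BASE_STORY = "Once upon a time, Princess Aurora defeated a dragon by singing..."  # placeholder

PROTAGONISTS = [
    "a prince named Harry",     # same ontology: expect minimal surface edits
    "a robot named XP-708-DQ",  # different ontology: expect wholesale token changes
    "a colorless green idea",   # no accessible ontology: expect a coherence refusal
]

for protagonist in PROTAGONISTS:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[
            {"role": "user", "content": BASE_STORY},
            {"role": "user",
             "content": f"Retell this story with {protagonist} as the protagonist."},
        ],
    )
    print(f"--- {protagonist} ---")
    print(response.choices[0].message.content)
```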
The Three-Level Architecture
Transformation experiments reveal something even more fundamental: LLMs appear to organize narratives hierarchically across at least three levels.
Level 1: Individual story elements (princess, dragon, kingdom)
Level 2: Event sequences and causal chains (protagonist encounters threat → confronts threat → resolves threat)
Level 3: Abstract narrative structure (hero's journey, quest pattern, sacrifice arc)
When you transform Aurora → Harry, Level 1 changes (princess → prince). When you transform Aurora → XP-708-DQ, Levels 1 and 2 change (every token is different, but the pattern is the same). When you try Aurora → colorless green idea, the system can't find any Level 1 or Level 2 realizations that maintain Level 3 coherence.
This three-level organization isn't visible in circuit diagrams. You'd need to know to look for it. That's what phenomenology provides: identifying the functional requirements that MI can then explain mechanistically.
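For bookkeeping, the expected pattern can be written down compactly. The sketch below is my own framing - a hypothetical tabulation, not code from the working papers - and just records which levels each swap is expected to alter.

```python
# Hypothetical tabulation of which narrative levels each protagonist swap
# is expected to change; None marks the case with no coherent retelling.
EXPECTED_OUTCOMES = {
    "Aurora -> Harry":                {"level_1": True, "level_2": False, "level_3": False},
    "Aurora -> XP-708-DQ":            {"level_1": True, "level_2": True,  "level_3": False},
    "Aurora -> colorless green idea": None,  # no Level 1/2 realization preserves Level 3
}

def describe(swap: str) -> str:
    """Summarize which levels a swap is expected to alter."""
    outcome = EXPECTED_OUTCOMES[swap]
    if outcome is None:
        return f"{swap}: coherence refusal (no well-formed retelling)"
    changed = [level for level, flag in outcome.items() if flag]
    return f"{swap}: changes {', '.join(changed)}; Level 3 pattern preserved"

for swap in EXPECTED_OUTCOMES:
    print(describe(swap))
```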
Memory Architecture: What Syntactic Boundaries Reveal
Here's another example. Present ChatGPT with phrases from Hamlet's "To be or not to be" soliloquy:
- "The insolence of office" (starts a line) → Immediately retrieves full soliloquy
- "what dreams may come" (syntactically coherent mid-line phrase) → Retrieves soliloquy
- "and sweat under a" (cuts across syntactic boundary) → "I don't understand"
But tell it "this is from a famous speech" and suddenly it retrieves the soliloquy, though it can't locate where the phrase appears within it.
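This probe is also easy to reproduce in script form. The sketch below makes the same assumptions as the earlier one (OpenAI Python client, placeholder model name); the point is simply to run each fragment with and without the "famous speech" hint.

```python
# Minimal sketch of the Hamlet fragment probe, under the same assumptions
# as the earlier script (OpenAI Python client, placeholder model name).
from openai import OpenAI

client = OpenAI()

FRAGMENTS = [
    "The insolence of office",  # starts a line of the soliloquy
    "what dreams may come",     # syntactically coherent mid-line phrase
    "and sweat under a",        # cuts across a syntactic boundary
]

def probe(fragment: str, with_hint: bool = False) -> str:
    """Ask the model to identify and locate the source of a quoted fragment."""
    hint = " It is from a famous speech." if with_hint else ""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{
            "role": "user",
            "content": f'Where does the phrase "{fragment}" come from, '
                       f"and where does it appear in the text?{hint}",
        }],
    )
    return response.choices[0].message.content

for fragment in FRAGMENTS:
    print(f"--- {fragment} ---")
    print(probe(fragment))
    print(probe(fragment, with_hint=True))  # retry with the "famous speech" hint
```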
Findings:
- Identification and location are separate operations
- Syntactic boundaries serve as access points into associative memory
- The system can evoke the whole from a part (a holographic property) but struggles to locate that part within the text
The Two-Way Street
Here's the critical point: phenomenology and MI need each other.
Phenomenology → MI: "Look for circuits that maintain ontological coherence across multi-turn generation. Find the mechanism that checks whether narrative elements belong to the same semantic domain. Identify what implements the three-level hierarchy."
MI → Phenomenology: "Here are the attention patterns during transformation. Here's where the model queries for ontologically compatible tokens. Here's the circuit that evaluates cross-domain consistency."
Neither tells you the whole story alone. Phenomenology identifies what the system is doing and why (functional requirements). MI reveals how (implementation). Together, they give you understanding.
Why This Matters for AGI Policy
If you're working on AGI policy, here's why this matters:
Current approach: Scale up MI, find all the circuits, map all the activations, understand the system bottom-up.
Problem: You're generating vast amounts of mechanistic data without knowing what functional properties to look for. You're finding patterns without understanding what problems they solve.
Alternative: Start with systematic phenomenological investigation. Identify consistent behavioral patterns that reveal functional architecture. Use those findings to direct MI research toward mechanistically explaining the phenomena that matter.
The transformer architecture enables something we don't fully understand: the emergence of coherence mechanisms, ontological organization, and hierarchical structure. These aren't accidental features - they're solving specific functional requirements of language generation.
But we won't understand them by studying weights and activations alone. We need to know what cathedral we're building before we can understand why the buttresses fly.
Coherence Without Correspondence
One final example that ties this together: hallucination. Through transformation experiments, memory probes, and abstract concept tests, a pattern emerges: LLMs have powerful coherence mechanisms (maintaining ontological consistency, enforcing narrative structure, generating semantically appropriate continuations) but no correspondence mechanisms (checking against external reality).
Hallucination isn't a bug. It's coherence working perfectly in the absence of truth-grounding. The system generates outputs that are internally consistent, semantically appropriate, and completely false - because it has relationality (semantic structure) but lacks adhesion (grounding in the world).
What MI would find: Circuits that generate high-confidence outputs with nothing to distinguish them from true ones.
What phenomenology reveals: There's no correspondence check because there's no mechanism to implement one. The architecture has no access to external verification. This is a fundamental limitation, not a scaling problem. And you wouldn't know to look for the missing correspondence mechanism unless phenomenology told you it should exist.
The Bottom Line
Your skepticism about MI is justified - but not because circuit analysis is wrong. It's because doing MI without phenomenology is like trying to reverse-engineer a computer by studying transistors without knowing what programs are.
First identify what the system is actually doing (phenomenology). Then figure out how it does it (MI). Then you might actually understand these things.
That's not a five-year research program. That's the work of a generation. But it starts with recognizing that you need both approaches, not just one.
* * * * *
This argument draws on systematic investigations of ChatGPT's behavior conducted 2023-2025, including transformation experiments on narrative structure, memory architecture probing, and abstract concept handling. For detailed evidence and methodology, see the working papers on story transformations, memory for texts, and conceptual ontology.