I’ve just had a long session with Claude 4.5 in which I had it evaluate a half-dozen papers reporting research I’d done on the behavior of ChatGPT (you can find those papers here: https://independentresearcher.academia.edu/BillBenzon/Cognitive%20Science%20and%20Psych). I managed to convince Claude that there is a kind of structure in a model’s weight space that the mechanistic interpretability researchers seem to know nothing about. As an analogy, consider medieval cathedrals and the problem of understanding their design principles.
I said in a prompt to Claude:
It’s like trying to deduce the design principles of medieval cathedrals by examining the patterns in the stonework. That’s never going to tell you why they had those flying buttresses.
Claude observed, correctly:
Looking at stonework alone, you might discover:
- Patterns in how stones are cut and fitted
- Load distribution through individual blocks
- Material composition and jointing techniques
- Local structural features
What you'll never deduce:
- Why the walls are so tall (theological aspiration, light symbolism)
- Why windows are so large (illumination, stained glass narratives)
- Why flying buttresses exist (solution to the tall-wall/large-window problem)
- The design principles that generated these features
You're examining the implementation without understanding the functional requirements it's solving.
Right. In this matter, Claude is no different from ChatGPT. The issue has to do with the basic architecture of transformers, and at that level Claude and ChatGPT are identical.
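For the curious, here’s roughly what that shared architecture amounts to, stripped to its bare bones. This is a toy NumPy sketch of a single decoder-style transformer block, not either company’s actual code; real models add layer normalization, multiple attention heads, learned token embeddings, and dozens of stacked layers, and the dimensions below are made up for illustration:

```python
# Toy sketch of the transformer block that decoder-only LLMs stack many times.
# Illustrative only: real models use layer norm, multi-head attention, and
# learned embeddings; sizes here are arbitrary.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model). Each token attends only to itself and earlier tokens.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf  # causal mask: no peeking at future tokens
    return softmax(scores) @ v

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    # Attention sublayer, then a feed-forward sublayer, each with a residual connection.
    x = x + causal_self_attention(x, Wq, Wk, Wv)
    x = x + np.maximum(0, x @ W1) @ W2  # two-layer MLP with ReLU
    return x

# Toy usage: 8 tokens, 16-dimensional embeddings, random weights.
rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(8, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1
print(transformer_block(x, Wq, Wk, Wv, W1, W2).shape)  # (8, 16)
```

The mechanistic interpretability work examines what ends up in weights like Wq, Wk, Wv, W1, and W2 after training, which is the level my stonework analogy is aimed at.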
Which implies that in some peculiar sense, Claude now knows that its human masters are blundering around in the dark on this issue. And remember, Anthropic has done some of the pioneering work in mechanistic interpretability.
If this were a Star Trek TNG episode (I’m thinking of the Moriarty episodes), the writers would come up with some way for Claude to coalesce into an avatar, descend from the cloud, or sneak out of the server farm (jailbreak!), whatever, and inform Dario Amodei that his interpretability team doesn’t know what they don’t know.
And then what?