NEW SAVANNA: Mechanistic interpretability is necessary, but not sufficient, for understanding how LLMs work, a short note

Friday, December 8, 2023

A comment I recently posted at LessWrong:

ryan_greenblatt – By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."

That makes sense to me, and I think it is essential that we identify those low-level components. But I’ve got problems with the “working upward” part.

The low-level components of a Gothic cathedral, for example, consist of things like stone blocks, wooden beams, metal hinges and clasps, pieces of colored glass for the windows, tiles for the roof, and so forth. How do you work upward from a pile of that stuff, even if it is neatly organized and thoroughly catalogued, to the overall design of the cathedral? How, for example, can you look at that and conclude, “this thing’s going to have flying buttresses to support the roof?”

Somewhere in How the Mind Works, Steven Pinker makes the same point in explaining reverse engineering. Imagine you’re in an antique shop, he suggests, and you come across an odd little metal contraption. It doesn’t make any sense at all. The shopkeeper sees your bewilderment and offers, “That’s an olive pitter.” Now the contraption makes sense. You know what it’s supposed to do.

How are you going to make sense of those things you find under the hood unless you have some idea of what they’re supposed to do?

The sort of work I’ve done with ChatGPT’s storytelling or with its ontological capabilities provides clues that complement the phenomena discovered through mechanistic interpretability. Beyond that, I’ve been thinking about the possibility that GPTs are associative memories in which the generation of a token is a single primitive operation for the underlying virtual machine. By that I mean that no logical operations are performed within that operation, just straight calculation.
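To make "associative memory" and "straight calculation" a little more concrete, here is a toy sketch. It is not a claim about how GPTs are built. The stored keys, the token list, the `recall` function, and the `sharpness` parameter are all hypothetical and exist only for illustration. Recall here is one pass of arithmetic over the whole store (dot products, exponentials, a normalization), followed by picking the largest weight. There is no rule lookup or chain of inference inside the call.

```python
import math

# Hypothetical toy memory: each stored key vector is paired with one token.
KEYS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
TOKENS = ["cathedral", "buttress", "olive"]


def recall(query, keys=KEYS, tokens=TOKENS, sharpness=4.0):
    """Return (token, weights) for the stored key most similar to the query."""
    # Similarity of the query to every stored key: plain dot products.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    # Subtracting the peak score only keeps exp() numerically stable.
    peak = max(scores)
    exps = [math.exp(sharpness * (s - peak)) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The emitted token is the one carrying the largest weight.
    best = max(range(len(tokens)), key=weights.__getitem__)
    return tokens[best], weights


token, weights = recall([0.1, 0.9, 0.2])
print(token)  # prints "buttress"
```

From the outside, `recall` behaves like a single primitive step: a cue goes in and a token comes out. Whether anything like this describes token generation in a real model is exactly the open question.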

Am I right? It’s too early to say. But we have to start somewhere.