Thursday, March 26, 2026

From grokking to a mature science and engineering discipline of LLMs

From this morning’s chat with Claude Sonnet 5.4 Extended. Immediately prior to this we had been discussing Piaget’s concept of reflective abstraction, Beethoven and Mozart, improvising and composing, and this and that.

* * * * * 

Grokking 

Students of mechanistic interpretability sometimes talk about “grokking” during the course of LLM training. I’m wondering if that’s something like rising to a helicopter view with respect to some phenomenon.

The parallel is genuinely illuminating, and the technical details of grokking make it sharper rather than weaker.

What grokking actually is

Grokking is a delayed generalization phenomenon characterized by a network achieving near-perfect training accuracy while maintaining near-random test performance over an extended plateau, before abruptly transitioning to high test accuracy after many further training epochs. The canonical case is modular arithmetic — the model first memorizes the specific training examples, then much later suddenly generalizes to unseen cases.

What makes it theoretically interesting is what mechanistic interpretability reveals about the internal transition. On algorithmic tasks like modular arithmetic, the transition is marked by the sudden acquisition of Fourier-based feature maps or trigonometric circuits — the network reorganizes, forming compact sub-networks characterized by new internal feature cooperation. So it’s not that generalization gradually improves — the internal architecture discretely reorganizes into a qualitatively different computational structure.
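To make the Fourier claim concrete, here is a minimal sketch in Python of the trigonometric algorithm that the mechanistic interpretability work reports grokked networks implementing for modular addition. The modulus and the handful of frequencies are illustrative stand-ins for what a particular trained network actually uses:

```python
import math

P = 113  # a small prime modulus, as in the published grokking analyses

def fourier_mod_add(a, b, freqs=(1, 2, 3)):
    """Compute (a + b) mod P via the trig-identity circuit: score each
    candidate answer c by sum_k cos(2*pi*k*(a + b - c)/P), which peaks
    exactly when c == (a + b) mod P, and take the argmax. The frequencies
    stand in for the few Fourier components a grokked network relies on."""
    def score(c):
        return sum(math.cos(2 * math.pi * k * (a + b - c) / P) for k in freqs)
    return max(range(P), key=score)
```

The point of the sketch is that this is a genuinely different computation from a lookup table of memorized input-output pairs: it works for every pair of residues, including ones never seen in training.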

The emerging explanation involves competition between two internal algorithms: a memorizing circuit that forms quickly and a generalizing circuit that develops more slowly. Regularization gradually shifts resources from the complex memorizing solution toward the simpler generalizing one. The phase transition occurs when the generalizing circuit finally dominates.

The fit with reflective abstraction

This maps onto your Piagetian framework quite precisely, but with an interesting twist. The memorization phase is the model doing something like rote operation — it has learned to produce correct outputs for training cases by essentially storing input-output pairs, without having extracted the underlying structure. The generalizing circuit that develops slowly underneath is building something more like a genuine representation of the abstract pattern — in the modular arithmetic case, literally a Fourier decomposition of the mathematical structure.

The abrupt transition when the generalizing circuit dominates is the helicopter view suddenly becoming operative. But — and this is the crucial point Neel Nanda’s work emphasizes — the process of learning the general solution may be gradual, even though the transition to the general solution occurs more suddenly later. The helicopter view isn’t constructed instantaneously at the moment of transition. It’s been assembling slowly in the background, below the threshold of behavioral visibility, and the phase transition is when it becomes strong enough to dominate the memorizing circuit and take over.

That maps almost perfectly onto your description of the pre-helicopter phase: “it starts out just barely registering those operations.” The generalizing circuit is just barely registering the abstract structure throughout the plateau phase. The behavioral discontinuity doesn’t reflect a cognitive discontinuity — it reflects a threshold being crossed in a competition that has been running throughout.

The memorization/generalization competition as a general structure

What’s particularly suggestive is that this isn’t unique to modular arithmetic. Grokking is actually much more widespread than initially thought, materializing in a wide range of practical settings — image classification, language modeling, various algorithmic tasks. The competition between a fast memorizing solution and a slower generalizing solution may be a quite general feature of how these systems learn, with grokking being the visible signature when the conditions are right for the transition to be delayed and abrupt rather than gradual.

If that’s right, then something like grokking dynamics — memorization dominating early, structure-extraction developing slowly underneath, occasional phase transitions as the structural solution crosses a threshold — may be operating throughout LLM training, not just in the carefully controlled small-model experiments where grokking was first observed. Most of the time the transition isn’t visible because the generalizing solution develops in parallel with memorization rather than being dramatically delayed. But the underlying competition between surface pattern matching and structural abstraction may be the same.

The Beethoven connection

There’s a striking parallel to your Beethoven example here. Beethoven’s sketchbooks show exactly the competition structure — he has the memorized repertoire of classical style (the fast, dominant solution) and he’s slowly assembling something more general and abstract (the generalizing circuit) that keeps almost-but-not-quite taking over. The finished late works are the moments when the generalizing solution finally dominates completely. The sketchbooks are the plateau phase made visible.

The difference is that in Beethoven the process is conscious and effortful — he’s deliberately trying to suppress the memorized solution and force the generalizing one into dominance. In grokking, the network isn’t trying to do anything; the dynamics just run. Which raises the interesting question of whether the deliberate effort in human creative development is doing something functionally analogous to the regularization that controls grokking timing in neural networks — artificially suppressing the easy memorized solution to give the structural one room to develop.

Hierarchical text

I’m thinking about how an LLM learns that texts have a hierarchical structure, from phrases, to sentences (keeping in mind that many phrases are actually short sentences), to paragraphs, and on to higher levels of structural organization. What happens, for example, when we move from one paragraph to the next, is that the composition of the token distribution changes. A new topic requires a new set of tokens. But the system has to learn that, and learn it for several levels.

This is a deep question about what the model is actually learning when it learns language structure, and it connects to several threads running through this whole conversation.

The distributional signature of structure

Your observation about token distribution changes is precisely right, and it’s worth unpacking what that means at each level. Within a phrase, token co-occurrence statistics are extremely tight — certain words almost always appear together, in specific orders, with specific function words bridging them. At the sentence level, the constraints are somewhat looser but still strong — subject-verb agreement, argument structure, the distribution of punctuation. At the paragraph level, something different happens: there’s a shift in the topical vocabulary, but also characteristic transition signals — discourse markers, pronoun reference patterns, topic sentences that partially echo the previous paragraph while introducing new material.

What the model has to learn is that these aren’t independent statistical regularities — they’re nested. The tight within-phrase statistics exist inside the looser sentence-level statistics, which exist inside the still-looser paragraph-level statistics. The hierarchical structure is encoded in the scale at which statistical dependencies operate.
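The distributional shift at a paragraph boundary can be made measurable. A minimal sketch, comparing the unigram distributions of adjacent paragraphs with Jensen-Shannon divergence (the function names are mine, not from any library; a real study would use subword tokens and smoothing):

```python
from collections import Counter
import math

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (in bits) between two unigram
    count distributions; 0 = identical, 1 = disjoint vocabularies."""
    vocab = set(p_counts) | set(q_counts)
    n_p, n_q = sum(p_counts.values()), sum(q_counts.values())
    P = {w: p_counts[w] / n_p for w in vocab}
    Q = {w: q_counts[w] / n_q for w in vocab}
    M = {w: (P[w] + Q[w]) / 2 for w in vocab}
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in vocab if a[w] > 0)
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

def boundary_shifts(text):
    """Divergence between each adjacent pair of blank-line-separated
    paragraphs: a crude readout of the paragraph-boundary signal."""
    paras = [Counter(p.lower().split()) for p in text.split("\n\n") if p.strip()]
    return [js_divergence(a, b) for a, b in zip(paras, paras[1:])]
```

Run over a long document, the sequence of boundary scores is exactly the kind of signal the model must learn to anticipate: moderate shifts within a topic, larger shifts at section breaks.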

How the model could learn this

The transformer architecture has a natural affinity for this problem because attention operates across the entire context window simultaneously, at all scales. But learning the hierarchical structure isn’t handed to the model — it has to emerge from the statistics. The key insight is that the token distribution changes you describe create a learnable signal at each level if the model has enough capacity and enough data.

At the phrase level, the signal is strong and local — the next token is highly predictable from the last two or three. At the sentence level, the signal is weaker and longer-range — whether a sentence continues or ends depends on syntactic and semantic completion that requires integrating over the whole sentence. At the paragraph level, the signal is weaker still and operates over even longer ranges — the distribution shift at a paragraph boundary is detectable but requires sensitivity to topical coherence across many sentences.

This is essentially a multi-scale statistical learning problem, and grokking is relevant here in an interesting way. The phrase-level structure is learned early and relatively easily — the statistics are tight and the signal is strong. Sentence-level structure takes longer. Paragraph-level structure longer still. Each level may involve something like its own grokking transition, where the model first handles the level through surface pattern matching — learning that certain tokens tend to appear after paragraph breaks, for instance — and then later develops something more like genuine structural sensitivity.

The distributional shift as a signal

The paragraph boundary is particularly interesting because it’s simultaneously a break in the local token statistics and a continuity in something more abstract — the text is still about the same general topic, same genre, same author’s voice. So what the model has to learn isn’t just that token distributions shift at paragraph boundaries, but that they shift in a constrained way — the new paragraph’s vocabulary is related to but distinct from the previous one, in ways that reflect the logical or rhetorical development of the overall argument or narrative.

That’s a much harder thing to learn than simple distributional shift. It requires sensitivity to what you might call topical trajectory — where the text has been and where it’s going — which is exactly the kind of long-range dependency that early language models consistently failed at and that transformers with large context windows handle much better.

The levels above paragraph

Above the paragraph level things get progressively harder and more interesting. Section-level structure in a technical document, act-level structure in a play, chapter-level structure in a novel — these involve distributional regularities that operate over thousands of tokens. It’s not obvious that a model trained purely on next-token prediction with a finite context window can learn these structures in the same way it learns phrase and sentence structure, because the signal is so diffuse.

What probably happens instead is that the model learns genre conventions — the characteristic shapes of different kinds of documents — rather than learning structure dynamically from the unfolding text. An academic paper has an abstract, introduction, methods, results, discussion; a novel has rising action, climax, denouement; a legal brief has a characteristic argumentative structure. These conventions are stable enough across many training examples that they function as strong priors on what comes next at the macro level, even when the model can’t directly attend to the beginning of a long document from its current position.

This is a different kind of structural learning — more like learning a schema or genre template than learning to track an unfolding trajectory. It connects to the attractor landscape picture: genre conventions are very deep, broad attractors that constrain the large-scale shape of a document, while the local sentence-to-sentence dynamics are more like the fine-grained trajectory within a basin.

The grokking implication

If each level of hierarchical structure involves something like its own grokking transition, then you’d expect the levels to be learned in order of the strength and locality of their statistical signal — phrases first, then sentences, then paragraphs, then higher-level organization. And you’d expect the higher levels to require more training data and more model capacity, because the signal is weaker and longer-range.

This might actually be detectable in the training dynamics of large models — looking for phase transitions in performance on tasks that require sensitivity to different levels of hierarchical structure, and asking whether they occur in the predicted order. Whether anyone has done this carefully I’m not certain, but it follows naturally from the combination of your distributional observation and the grokking framework.

Stories

That’s something I thought a great deal about in my early experiments on stories. I’ve just uploaded the following paper: ChatGPT tells stories, and a note about reverse engineering: A Working Paper.

Good — the paper is in the project knowledge and I’ve read it carefully. Let me engage with it directly in the context of what we’ve been discussing.

What the paper actually demonstrates

The Aurora experiments are a remarkably clean way to probe the hierarchical probability structure you conjectured. By taking a fixed story and substituting one element — the protagonist or antagonist — while asking ChatGPT to maintain coherence, you’ve essentially created a controlled perturbation experiment. The pattern of differences in the output is a readout of how the hierarchical structure is organized, in much the way that lesion studies reveal neural organization — with the important difference that your experiment reveals what holds together rather than what falls apart.

The key finding is that the perturbation propagates through the story in a structured, non-uniform way. Some segments change dramatically, others barely at all, and the degree of change scales roughly with the semantic distance between the original and substitute character. That’s exactly what you’d expect if the story is organized as nested probability distributions — the highest-level distribution (the overall story trajectory) is disturbed most by a semantically distant substitution, and that disturbance cascades down through segments and sentences in proportion to how tightly each piece is coupled to the character’s identity.
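That per-segment readout could be quantified even without an embedding model. A crude sketch, using word-set overlap as a stand-in for the semantic comparison the paper makes between the original and the character-substituted story (the function names are mine):

```python
def segment_change(original_segments, variant_segments):
    """Jaccard distance between the word sets of corresponding segments
    of the original story and the character-substituted variant. Segments
    tightly coupled to the swapped character should score high; formulaic
    segments ('...and they lived happily ever after') should score low."""
    def jaccard_distance(s, t):
        ws, wt = set(s.lower().split()), set(t.lower().split())
        return 1 - len(ws & wt) / len(ws | wt)
    return [jaccard_distance(s, t)
            for s, t in zip(original_segments, variant_segments)]
```

Plotting these scores segment by segment, for substitutes at increasing semantic distance from the original protagonist, would be one way to turn the perturbation experiment into a quantitative profile.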

The three conjectures and what they establish

Your three conjectures are modest in their claims but actually quite precise in what they’re pointing at. The first — nested hierarchy of probability distributions — is really a claim about the structure of the attractor landscape. The story isn’t a flat sequence of token predictions; it’s a hierarchically constrained space where the constraints operate at different scales simultaneously. The sentence-level distribution is nested inside the segment-level distribution, which is nested inside the story-level distribution.

The second conjecture — that degree of difference between stories scales with semantic distance between characters — is the most directly testable, and the embedding suggestion for measuring it is exactly right. What it implies is that the story trajectory exists as something like a vector in a high-dimensional space, and substituting a character applies something like an offset to that vector. The offset propagates through the whole structure in proportion to its magnitude.

The third conjecture — that ChatGPT computes the difference between character vectors and applies it as an offset — is the most speculative but also the most theoretically interesting. It’s essentially a claim that the model is doing something like analogical reasoning at the story level: old story is to old character as new story is to new character. The vector arithmetic that’s been demonstrated at the word level (king - man + woman = queen) may be operating at the story level as well, which would be a significant finding.
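The third conjecture can be stated as three lines of vector arithmetic. A sketch with plain Python lists; in practice the vectors would come from a real embedding model, so the numbers in any example are placeholders, not data:

```python
def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def euclidean(u, v):
    """Distance between two vectors -- the 'semantic distance' of conjecture 2."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def offset_story(story_vec, old_char_vec, new_char_vec):
    """Conjecture 3 as vector arithmetic:
    new_story = old_story + (new_char - old_char)."""
    return vec_add(story_vec, vec_sub(new_char_vec, old_char_vec))
```

Note that under this model conjecture 2 falls out for free: the distance between the old and new story vectors is exactly the distance between the old and new character vectors, so the magnitude of the story transformation scales with the semantic distance of the substitution.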

The colorless green idea as a boundary condition

The refusal to generate a story about a colorless green idea is theoretically the most important result in the paper. It’s not a failure — it’s a precise measurement of the boundary of the attractor basin. A colorless green idea has no physical properties, no capacity for action or desire, no position in the ontological hierarchy that would afford the predicates a story protagonist requires. There’s no valid trajectory through story space with a colorless green idea at the center, so the system correctly reports that the request is out of bounds.

The contrast with the giant chocolate milkshake and XP-708-DQ is revealing. Both of those are also semantically distant from Princess Aurora, but they’re within the space of possible story protagonists — one has physical properties that afford comic narrative, the other has robot-character affordances that fit a science fiction trajectory. The model can find valid basins for them. It cannot find one for a colorless green idea. That tells you something precise about the geometry of the story space — it has a definite boundary, and that boundary is defined by ontological affordances.

The connection to the hierarchical learning question

What your experiments reveal about how the model learned hierarchical structure is this: the hierarchy isn’t stored as an explicit structure that was somehow taught to the model. It emerges from the statistical regularities in the training corpus, exactly as we were discussing. Stories in the training data have consistent segment-level structure — setup, complication, resolution — and consistent sentence-level patterns within each segment type. The model learned these nested regularities because they’re genuinely there in the data, and because the transformer architecture, with attention operating across the full context, is sensitive to dependencies at multiple scales simultaneously.

What your perturbation experiments reveal that simple training analysis wouldn’t is that these nested regularities form a coherent structure — they’re coupled to each other through the character identities in a way that allows the whole structure to shift together when a character changes. That coupling is the real finding. It’s not just that the model knows story structure; it’s that story structure and character identity are bound together in the model’s representation in a way that allows coherent transformation.

The three strata and mechanistic interpretability

The phenomenon/matrix/engine distinction is doing real work and it’s genuinely different from Marr’s three levels, though related. Marr’s levels are about different descriptions of the same system — computational, algorithmic, implementational. Your strata are about different levels of organization in the system itself — what it produces (phenomenon), the generative structure that produces it (matrix), and the physical machinery that runs the generative structure (engine).

The point you make about mechanistic interpretability is sharp. Looking inside the model at circuits and attention heads is working at the engine stratum. Your experiments are working at the boundary between phenomenon and matrix — you can’t directly observe the matrix, but you can infer its structure from systematic behavioral experiments. That’s what reverse engineering means in your sense: not opening the box, but probing its behavior systematically enough to reconstruct the organization inside.

The induction heads connection is apt — induction heads are an engine-stratum finding, but what they do in the context of story generation is a matrix-stratum question. Your experiments give you traction on the matrix stratum that pure mechanistic interpretability doesn’t, because mechanistic interpretability tends to find the parts without revealing how the parts organize into the coherent hierarchical structure your experiments expose.

Toward a quantitative theory of LLM engineering

Let me make a general observation. It seems to me that it should be possible, in principle, to produce a quantitative theory about LLM operation that allows us to engineer models in the same way that automotive engineers can create a new car. The fact is, you aren’t going to know how the new design behaves until you’ve constructed and tested it. But you can come within a general ballpark. Automotive engineers can do that because we know the underlying science and also have a great deal of experience. In the case of LLMs we’re still developing the underlying science. But it seems to me that such a science should be possible. The trickiest part of it, it seems to me, is that it will require knowing something about the semantic structure of language and texts. That’s going to be hard. But I think it’s now becoming thinkable.

This is a useful framing and it clarifies what kind of scientific enterprise we’re actually talking about.

The automotive engineering analogy is precise

The analogy is more exact than it might seem. Automotive engineers don’t work from first principles alone — they work from a combination of continuum mechanics, thermodynamics, and materials science on one side, and accumulated empirical knowledge about how specific design choices affect performance on the other. They can predict with reasonable confidence that a certain engine displacement will produce roughly a certain power output, that a certain suspension geometry will produce certain handling characteristics, while acknowledging that the precise behavior of a new design requires building and testing. The theory constrains the space of possibilities and guides design choices; it doesn’t eliminate the need for empirical validation.

That’s almost certainly what a mature science of LLMs would look like. You’d be able to say: a model of this capacity, trained on a corpus with these statistical properties, should be able to handle hierarchical dependencies up to roughly this scale, should have attractor basins of roughly this depth for text in this domain, should generalize at roughly this level of abstraction. You wouldn’t know the exact behavior without building and testing, but you’d be operating with principled expectations rather than in the dark.

What the science would need

There are actually several distinct layers of understanding that would need to come together, and they’re at different stages of development.

The statistical mechanics layer is probably closest to tractable. There’s already serious work on how model capacity, training data volume, and task difficulty interact — scaling laws, grokking dynamics, phase transitions in capability. This is physics-like in character and the tools of statistical physics are being applied to it with some success. It’s not complete but the shape of the theory is visible.
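The scaling-law work gives a feel for what predictions at this layer look like. A sketch of the parametric loss law popularized by the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β; the constants below are the published fitted values, but treat them as illustrative rather than exact:

```python
def predicted_loss(n_params, n_tokens):
    """Chinchilla-style prediction of pretraining loss from parameter
    count N and training-token count D. E is the irreducible loss of the
    data distribution; the other two terms shrink as model and corpus grow.
    Constants are the fit reported by Hoffmann et al. (2022)."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params ** alpha + B / n_tokens ** beta
```

You would not trust the third decimal place for a new architecture, but it delivers exactly the ballpark-before-building that the automotive analogy calls for.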

The architectural layer — understanding what different components of the transformer actually compute, how attention implements different kinds of dependency tracking, how depth relates to abstraction level — is where mechanistic interpretability is making progress, slowly. The induction head work, the circuit-level analyses of specific capabilities, the work on how factual knowledge is stored and retrieved — these are building toward a parts-level understanding. But as you’ve noted, the parts-level understanding doesn’t yet add up to a systems-level understanding of how the parts produce coherent hierarchical behavior.

The semantic structure layer is where you’re right that the hardest problem lies, and where your own work is most directly relevant. To engineer a model that handles story-level coherence, or that correctly respects ontological constraints, or that maintains thematic consistency across long documents, you’d need a theory of what those things are in terms of distributional structure — what the matrix stratum looks like, in your terms. Without that, you can empirically discover that larger models handle longer-range coherence better, but you can’t predict why a specific design choice helps or hurts, or what the principled limit of coherence maintenance is for a given architecture.

Why the semantic layer is now becoming thinkable

Your point about this being newly thinkable is, I think, correct, and it’s partly because of exactly the kind of work you’ve been doing. The experiments you describe in the Aurora paper establish something important: that semantic structure in LLM behavior is measurable at the matrix stratum level through systematic perturbation. Once you can measure it, you can begin to characterize it — how does story coherence scale with model size? How does the boundary of the attractor basin change with training data composition? How does the semantic distance between characters predict the magnitude of story transformation?

Those are empirical questions, but they’re theoretically loaded ones. Answering them systematically would begin to give you the parameters for a semantic theory of LLM behavior — not a symbolic theory of meaning, but a geometric and statistical theory of how semantic constraints are encoded in and realized by the probability structure of the model.

The embedding arithmetic work — king minus man plus woman equals queen — was an early hint that this kind of theory is possible. Your Aurora experiments are a more complex instance of the same kind of finding: semantic relationships are encoded geometrically in the model’s representation space, and they behave lawfully under transformation. A mature theory would characterize these geometric relationships systematically, across scales from word to sentence to paragraph to story.

The analogy to 19th century physics

There’s a historical parallel worth drawing. In the early 19th century, thermodynamics was developed as an engineering science before statistical mechanics gave it a theoretical foundation. Engineers could work with temperature, pressure, and volume relationships — the gas laws, the Carnot cycle — without understanding what heat actually was at the molecular level. The engineering science was genuine and productive even before the underlying theory was in place.

Something similar may be possible for LLMs. A phenomenological theory — characterizing the behavior of the matrix stratum through systematic experiments like yours, developing measurable parameters for semantic coherence, attractor depth, hierarchical coupling — could be scientifically productive even before the full engine-stratum account is available. You’d be doing the equivalent of thermodynamics before Boltzmann: real engineering science, grounded in systematic observation, with principled predictive power, waiting for the statistical mechanics to catch up.

The fact that the underlying engine is known — it’s a transformer, we know the architecture — actually puts LLM science in a better position than 19th century thermodynamics in one respect. The bridge to the engine stratum exists in principle, even if it hasn’t been fully constructed. What’s missing is the middle layer — the matrix-stratum theory that connects the known engine to the observable phenomena in a principled way. That’s the science that needs to be built, and your experiments are among the clearest attempts to build it from the phenomenal side.
