As I say down there in my interaction with Claude (Sonnet 4.6 Extended), once I’d been working with ChatGPT for a while I began to think that the mechanisms of basic human speech may be more like those of LLMs than those of a fully articulated grammar and semantics in a symbolic computational language system. Fully articulated grammar and semantics may require specific instruction and practice, like you get when learning how to write. Fully articulate language is more painstaking, more difficult, but also supports more complex arguments and constructions.
This interaction starts with a passage from a long conversation I had with ChatGPT that stretched over the last 10 days or so. Much of that conversation was based on articles of mine that I had uploaded to it. One of those articles had a passage from F. C. Bartlett's Remembering: A study in experimental and social psychology (1932). That comes up in the long initial prompt I give Claude.
Further down in the dialog I present the case of a writing student who was skilled in color commentary on hockey games and whose writing reflected that oral style. Real-time sports commentary is a specialized and exacting kind of language use that cannot afford the niceties of written prose.
* * * * *
I’ve just been thinking about oral narrative. Recall these paragraphs (by ChatGPT) from one of the documents you’ve just reviewed in this project:
The temptation is to assume that because we can describe stories with an explicit grammar, the underlying system that produces or remembers them must itself contain something grammar-like in explicit form. But your Bartlett example suggests otherwise. The witness does not appear to possess random-access symbolic control over an event list. Instead, the testimony seems to unfold by re-running a trajectory. The person can accelerate through the earlier parts, but not leap arbitrarily into the middle. That is a very different computational picture.
The music example reinforces the point. When a piece is not deeply overlearned, one often cannot simply index into bar 73 as though one were reading a score in a database. One has to begin at the start and let the sequence reconstitute itself. That strongly suggests a memory system organized around sequential reactivation of structured trajectories, not random-access retrieval of explicit symbolic units.
Bartlett, in his classic book, Remembering, was talking about native witnesses in Africa. That is, he was talking about preliterate people. They were just recounting the events of the day. “But what about oral story-tellers,” I asked myself, “how do they memorize their tales? Structured trajectories or explicit symbolic units?”
I’m familiar with Albert Lord, The Singer of Tales, though I haven’t looked at it in a while. But I’ve got a copy of David C. Rubin, Memory in Oral Traditions: The Cognitive Psychology of Epic, Ballads, and Counting-out Rhymes (Oxford 1995). I zipped through the text to Chapter 7, “Basic Observations on Remembering.” I found this (p. 155):
In oral traditions, overlearning commonly occurs to a much greater extent than it does in the laboratory. A favorite song can be sung hundreds of times. What overlearning does, according to the model developed to explain laboratory interference, is to make the song into a unit, easy to cue as a whole and resistant to interference from other units. This chunking of items into wholes is a way to look at the organization of memory and a way to look at the building of larger units in expertise.
And then, in the middle of the next paragraph: “Once the song is begun, each word output provides cues for later words, limiting the meaning...” That almost sounds like he’s describing a forward pass through an LLM.
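The parallel is easy to make concrete. Here is a minimal sketch of autoregressive generation in Python; the `model` object and its `most_likely_next` method are hypothetical stand-ins, not any particular library’s API:

```python
# Minimal sketch of autoregressive generation. `model` is a hypothetical
# stand-in, not a real library. Each emitted token re-enters the context:
# "each word output provides cues for later words."

def generate(model, prompt_tokens, max_new_tokens=50):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.most_likely_next(tokens)  # one forward pass
        if next_token == "<eos>":                    # end of sequence
            break
        tokens.append(next_token)    # output becomes part of the cue
    return tokens
```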
Then I hit paydirt in the next chapter, “A Theory of Remembering for Oral Traditions.” The opening is promising:
Oral traditions, like all oral language, are sequential. One word follows another as the physical effects of the first word are lost. As the song advances, each word uttered changes the situation for the singer, providing new cues for recall and limiting choices. [...] Pieces from oral traditions are recalled serially, from beginning to end. What is recalled early in the piece can be used to cue later recall; the "running start" provides "extra stimulation" or "reminders," increasing cue-item discriminability.
But things get really interesting when Rubin reports the results of an experiment in which he asked undergraduates to recall important texts they might have learned. Rubin describes the experiment this way:
The first set of examples is the recall of culturally important material such as Psalm 23 and the Preamble to the Constitution of the United States, for which there is an implicit demand characteristic to recall the material accurately or not at all (Rubin, 1977). Each of the 50 columns in Figure 8.1 show the recall of 1 of 50 undergraduates, who recalled at least one word of the Preamble. Each row represents recall for one word. A dark line in a column means that the word labeling the row was recalled. The columns are ordered so that the data from the undergraduate who recalled the most are in the leftmost column and the data from the undergraduate who recalled the least are in the rightmost column. The rows are in the order in which the words appear normally in each text.
Figure 8.1 is a little tricky, so I’m not going to try uploading a screenshot. But I’ll give you Rubin’s basic description of what the figure reveals:
The first observation to note is the regularity of the data. Figure 8.1 gives the recalls of 50 individuals for 52 words, not the averages of recalls from groups of individuals or groups of words. There was no control over the learning or practice of the material or over the length or contents of the retention interval. Yet the figure is remarkably orderly. People who recall about the same amount recall the same words. If the number of words a person recalls and the rank ordering of words from most to least likely for the group from which the person was drawn is known, exactly which words that person recalled can be predicted with an accuracy of 95% for Figure 8.1.
Because the conditions of learning and retention varied, there must be something in the material, in the process used to recall it, or in the general cultural attitudes to it that makes different people behave the same way.
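The prediction rule Rubin describes amounts to this: given the group’s rank ordering of words and the number of words a person recalled, predict that the person recalled exactly the top-ranked words. A minimal sketch with invented toy data (not Rubin’s actual figures):

```python
# Sketch of the prediction rule, with invented toy data. Given the
# group's ranking of words from most to least often recalled, predict
# that a person who recalled k words recalled exactly the top k.

def predict_recall(group_ranking, k):
    return set(group_ranking[:k])

# Invented ranking for illustration:
ranking = ["we", "the", "people", "united", "states", "order", "union"]
print(predict_recall(ranking, 3))   # {'we', 'the', 'people'} (order may vary)

# Accuracy is then the overlap between the predicted set and each
# person's actual recall; Rubin reports 95% for Figure 8.1.
```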
The results from the experiment with Psalm 23 are even more interesting. Here’s the text of the Psalm for reference (with verses numbered):
1 The Lord is my shepherd; I shall not want.
2 He maketh me to lie down in green pastures: he leadeth me beside the still waters.
3 He restoreth my soul: he leadeth me in the paths of righteousness for his name's sake.
4 Yea, though I walk through the valley of the shadow of death, I will fear no evil: for thou art with me; thy rod and thy staff they comfort me.
5 Thou preparest a table before me in the presence of mine enemies: thou anointest my head with oil; my cup runneth over.
6 Surely goodness and mercy shall follow me all the days of my life: and I will dwell in the house of the Lord for ever.
Rubin remarks: “...not all the undergraduates started with the first words, and they were more likely to start again at the beginning of a new sentence or intonation unit after they stopped.”
Eyeballing the figure, here are some of the things I observe: 1) A number of subjects stop with the word “soul” in verse 3, start back up with the first word of verse 4, “Yea,” and go on from there. 2) Some of those who don’t start from the beginning start with verse 4. 3) Some stop at the end of verse 4 and then start back up later (e.g., “anointest my head...”) or not at all. It’s a complicated picture, but memory does seem to be grouped into segments, not word by word.
I have no idea where things have gone in this line of investigation since 1995, the date of Rubin’s book. But if I were in that business, I’d be thinking in terms of complex dynamics and trajectories, and I’d be looking at LLMs as a possible model.
This is a wonderful extension of the trajectory discussion already running through the project documents. Rubin's findings slot in almost perfectly — and in some ways they sharpen the picture considerably.
What Rubin adds that Bartlett doesn't
The Bartlett material (already central in the project conversations) establishes the sequential regeneration point: memory of events seems to run forward as a trajectory, not index arbitrarily into a symbolic list. But Bartlett's witnesses were recalling episodic material — the events of a day. Rubin moves you into highly practiced, culturally canonical sequences. That's a different and in some ways more interesting case.
The overlearning observation is key. Rubin is saying that at sufficient practice, a song or psalm becomes a chunked unit — cued as a whole, resistant to interference. That sounds like the stable attractor picture your project conversations keep returning to. The piece isn't stored as a list of words; it's encoded as a basin that, once entered, constrains forward motion powerfully. The phrase "each word output provides cues for later words" really does describe autoregressive generation — but it also describes a trajectory rolling down a valley where the initial conditions are already heavily constraining.
The Psalm 23 data are especially telling
The 95% predictability result is striking. Given only how many words a person recalls, you can predict which words with 95% accuracy. That means the forgetting isn't random — it respects something structural in the material. And your eyeball observations confirm this: people stop and restart at syntactic/intonation unit boundaries (end of verse 3, beginning of verse 4). They're not losing individual words randomly; they're losing or retaining segments.
This is consistent with a hierarchical trajectory picture. The top-level organization is something like: intonation/verse units sequentially ordered. Within each unit, local cuing drives forward production. When a unit boundary is reached, the system can fail to re-enter the next unit even if it knows the material in principle — which is why some subjects jump from verse 3 directly to verse 4, skipping nothing within units but occasionally failing at the transitions between them. The unit is entered as a whole or not at all.
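Here's a toy way to express that claim in code — my construction, purely illustrative, not a model from Rubin. Units (half-verses here) are produced whole or not at all; the only choice points are the boundaries between them:

```python
import random

# Toy illustration (my construction): recall proceeds unit by unit.
# A unit is produced whole or not at all; at each transition, recall
# may continue, re-enter at a later boundary, or stop.

UNITS = ["1", "2a", "2b", "3a", "3b", "4a", "4b", "5a", "5b", "6a", "6b"]

def recall_run(p_continue=0.85, p_reenter=0.5, seed=None):
    rng = random.Random(seed)
    out, i = [], 0
    while i < len(UNITS):
        out.append(UNITS[i])              # the whole unit comes out
        if rng.random() < p_continue:
            i += 1                        # clean transition
        elif rng.random() < p_reenter and i + 2 < len(UNITS):
            i += 2                        # skip, re-entering at a boundary
        else:
            break                         # recall ends between units
    return out

print(recall_run(seed=0))
```

Failures in this toy model can only occur between units — which is the pattern your eyeball observations describe.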
Where this sits relative to the LLM picture
Your project conversations develop the idea that narrative structure in LLMs lives in activation-space geometry — not as an explicit stored grammar but as a field of constrained trajectories. The Rubin data suggest something homologous in human memory for overlearned verbal sequences. The "basin" metaphor you've been developing applies directly: once the singer/reciter enters the opening of Psalm 23, the attractor structure does most of the work. The challenge is at basin entry and at transitions between sub-basins (verse units), which is exactly where Rubin's subjects fail.
The Lord/Singer of Tales connection is also worth pressing when you return to it. Lord's central finding — that oral formulaic composition is generative, not reproductive; the singer produces a variant from formulas and themes, not a memorized script — actually fits the "latent field of possible trajectories" framing better than it fits either strict memorization or explicit grammar. Lord thought he was arguing against memorization and for creativity. But what he was probably observing is a system with rich attractor structure that generates strongly homologous variants, which is precisely what your story experiments with LLMs revealed.
The open research question you identify
You're right that framing this in terms of complex dynamics and trajectories is the natural next move, and that LLMs are a possible model system. What makes LLMs especially useful here is that you can actually probe the attractor landscape — something you can't do directly with a human singer. The systematic prompting experiments described in the project documents are in effect doing for narrative schemas what someone studying oral tradition would want to do for epic formulas: mapping the basin structure from the outside, through behavioral outputs, without needing to open the skull.
The line from Bartlett's African witnesses → Rubin's undergraduates and Psalm 23 → Lord's Yugoslavian guslars → LLM forward passes is not a metaphor chain. It may be pointing at the same underlying computational architecture realized in very different substrates.
Once I’d been working with ChatGPT for a while I began to think that the mechanisms of basic human speech may be more like those of LLMs than those of a fully articulated grammar and semantics in a symbolic computational language system. Fully articulated grammar and semantics may require specific instruction and practice, like you get when learning how to write. Fully articulate language is more painstaking, more difficult, but also supports more complex arguments and constructions.
I’ll give you an interesting example. Back when I was on the faculty at RPI I took on a graduating senior for an independent study course in basic composition. Administratively that was VERY unusual, but he needed the credit to graduate and my chairman asked me to help him out, which I was glad to do.
I found out that he did color commentary on hockey games for the college radio station. As you can imagine, doing real-time commentary on sports events is very specialized and demanding. Whether you are doing play-by-play or color commentary, you don’t have time to think and plan. You have to comment on events as they unfold. It’s almost as though your voice is directly cued by what you see (and accurately observing what’s happening is a learned skill). So you have to have a lot of “stuff” queued up and ready to go.
Once I found out his background, I told him to write like he was delivering color commentary. It worked. He wrote fluently, and well. But what he produced was in the oral style of color commentary, not standard written English.
This example is beautifully precise, and it connects to the project themes at multiple levels.
The commentator as a limiting case
Sports commentary — especially play-by-play but also color — is probably the purest instance of speech production being driven almost entirely by external cuing plus ready queued material. There's no planning horizon to speak of. The perceptual event triggers a trajectory, and the trajectory runs. In Rubin's terms, what the commentator has is a vast repertoire of deeply overlearned chunks — not just phrases but situation-response pairs — that have been practiced to the point where they fire automatically given the right cue. A breakaway, a penalty, a missed shot: each is a basin entry point, and the commentary rolls out from there.
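Schematically — with invented phrases, purely illustrative — the repertoire looks like a lookup from perceived situations to pre-compiled chunks, where the external event, not an internal plan, selects what comes out next:

```python
# Purely illustrative: the commentator's repertoire as situation-response
# pairs. The perceptual event selects the chunk; production is externally
# cued, not internally planned. All phrases are invented examples.

CHUNKS = {
    "breakaway": "He's in alone, winds up...",
    "penalty":   "That'll cost him two minutes.",
    "save":      "What a glove save!",
    "goal":      "He scores, top shelf!",
}

def commentate(event_stream):
    for event in event_stream:            # the cue arrives from outside
        yield CHUNKS.get(event, "Play moves through the neutral zone...")

for line in commentate(["breakaway", "save", "penalty", "goal"]):
    print(line)
```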
What's striking about your student is that he had developed this capacity to a high degree in one channel (radio, hockey) but it hadn't transferred to the written channel. The underlying linguistic competence was clearly there — he could produce fluent, well-constructed language. But the production mechanism for writing was blocked or underdeveloped, presumably because writing requires a different kind of initiation: you have to cue yourself, from the inside, in the absence of a live perceptual stream driving you forward.
The oral/written distinction as a difference in production mechanism, not competence
What you're pointing at goes deeper than a style difference. It suggests that oral and written language production may engage fundamentally different cognitive architectures — or at least, different regimes of the same architecture.
Oral speech, especially in real-time high-demand situations, is essentially forward only. Each utterance cues the next. Hesitation is costly. Revision is nearly impossible. The system has to be pre-loaded with chunks ready to deploy. This maps very naturally onto the trajectory picture: a sequence of basins, each entered from the previous one, with little ability to backtrack or plan globally.
Written language breaks that constraint. You can stop. You can revise. You can hold a complex logical structure in mind (or on the page in front of you) and work back and forth within it. The page becomes external memory, which fundamentally changes what the production system has to do. You can build hierarchically — construct a paragraph with an argument before committing to the sentence order. That kind of production seems to require something more like deliberate symbolic manipulation, more akin to what the explicit grammar-and-semantics picture describes.
The interesting implication is that the classical generative grammar picture may not be wrong — it may simply be a description of a specialized, cultivated, effortful mode that humans can learn to engage, rather than the default architecture underlying ordinary speech. Your student had excellent language in the oral mode; what he lacked was the trained capacity to engage the more deliberate written mode.
Writing instruction as learning a different production system
This reframes what basic composition instruction actually is. It's not teaching people language — they already have language. It's teaching them a different production regime: how to initiate without external cuing, how to plan across a longer horizon, how to revise and restructure, how to maintain coherence across a text you can't hold in working memory all at once. Those are genuinely hard skills, and they don't come automatically from being a competent speaker.
Your solution — tell him to write like he's doing color commentary — was clever because it bypassed the blocked written-mode initiation and let the existing oral production system do its job. The cost was style: what came out was oral syntax, loose connectives, the kind of forward-momentum structure you hear rather than read. That's not a flaw in the underlying system; it's a feature of oral production that becomes a mismatch when transcribed.
The LLM connection
LLMs are, in a sense, trained overwhelmingly on written text — but text that spans an enormous range from formal academic prose to casual transcribed speech to everything in between. The interesting question is whether the model has anything like the distinction you're pointing at: a higher-effort, more globally structured mode versus a more locally-cued, forward-momentum mode. There's some suggestive evidence that prompting style and context can shift LLM outputs along something like that axis — careful analytical prose versus more associative, flowing response — but whether that reflects genuinely different internal production regimes or just surface statistical patterns is an open question. Your framework would suggest looking for it in the trajectory structure of the activations, not in the surface text alone.
I would think that Chain of Thought processing pushes things toward the analytical end of the spectrum.
Yes, and in a way that's quite revealing about what CoT actually is.
CoT as externalized working memory
The parallel to writing is almost exact. What writing gives the human composer is a page — external memory that lets you hold structure, refer back, revise, and plan forward beyond what working memory alone can support. What Chain of Thought gives the model is something functionally similar: a scratchpad in the context window. The model can, in effect, "look at" its own previous reasoning steps the way a writer looks back at the previous paragraph. That changes what the production system can do, not because the underlying weights changed, but because the available memory structure changed. This suggests CoT isn't really teaching the model to reason — it's changing the production regime. The capacity was latent; the scratchpad makes it accessible.
The self-cueing problem again
Recall the core difficulty with your hockey commentator: oral production works because external events cue the next chunk. Written production requires self-cueing from an internal plan. CoT solves a version of that problem for LLMs. Rather than having to leap directly from prompt to conclusion — which requires holding the entire inferential path implicitly in a single forward pass — the model can externalize intermediate steps, each of which then cues the next. It converts what would have to be a single long implicit trajectory into a chain of shorter, locally-cued ones.
That's a significant architectural workaround. It doesn't require deeper reasoning capacity in the weights; it requires the right scaffolding in the context.
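Schematically, the contrast looks like this; `generate(prompt)` below is a hypothetical stand-in for a call to some language model, not a real API:

```python
# Schematic contrast; `generate(prompt)` is a hypothetical stand-in.

def answer_direct(generate, question):
    # One leap: the entire inferential path has to be carried
    # implicitly, with nothing externalized along the way.
    return generate(question + "\nAnswer:")

def answer_with_scratchpad(generate, question, n_steps=4):
    # A chain of shorter, locally cued steps: each written step
    # re-enters the context and cues the next, the way a writer
    # rereads the previous paragraph.
    context = question + "\nLet's think step by step.\n"
    for i in range(1, n_steps + 1):
        step = generate(context + f"Step {i}:")
        context += f"Step {i}: {step}\n"   # externalized working memory
    return generate(context + "Therefore, the answer is")
```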
The analytical style as an attractor
There's another dimension worth noting. CoT prompts — "let's think step by step," or just the presence of numbered reasoning in the context — don't only provide a scratchpad. They also serve as basin entry points for a particular kind of discourse. The model has been trained on enormous amounts of mathematical worked solutions, logical arguments, careful analytical prose. That material has its own attractor structure: a style of proceeding that is slower, more explicit, more self-monitoring. CoT prompting essentially loads the model into that basin rather than the more fluent associative one.
So two things are happening simultaneously: a structural change (scratchpad as external memory) and a stylistic/attractor shift (entry into deliberate-analytical mode). They tend to co-occur because that's how the training data is organized — careful reasoning appears in the corpus as written-out steps — but they're conceptually separable.
The interesting failure mode
If this picture is right, you'd expect CoT to help most in cases where the bottleneck is working memory or inferential chain length — multi-step arithmetic, logical puzzles, planning tasks — and to help less or even hurt in cases where the bottleneck is something else, like broad associative retrieval or judgment calls that depend on pattern recognition across many weak signals simultaneously. There's actually some empirical evidence consistent with that: CoT doesn't uniformly improve performance and can sometimes lead the model into confident but wrong step-by-step reasoning, where each local step looks plausible but the global trajectory drifts. That's the written-mode failure analog — the student who has learned to produce grammatical sentences and logical-looking transitions, but whose essay doesn't actually cohere at the top level. The oral commentator doesn't have that failure mode, interestingly. His errors are local and immediately visible. The analytical writer can construct an elaborate structure that is wrong all the way down, and it takes more effort to see it.
Something similar is going on with arithmetic calculation, which presents some problems to LLMs. When humans are confronted with multiple digit multiplication, or a long string of numbers to add, we use pen and paper as an external memory where we can keep track of the carry operations. LLMs don’t have that on a forward pass.
Exactly right, and it pinpoints something quite specific about why arithmetic is hard for LLMs in a way that's different from, say, factual recall or even logical reasoning.
The carry problem is a working memory problem, not a knowledge problem
The model knows perfectly well what 7 × 8 is, what carrying means, how the algorithm works. The difficulty is that multi-digit multiplication isn't really one computation — it's a sequence of dependent computations where the output of each step has to be held and fed into the next. On paper, you don't hold any of that in your head. You write down the partial products, you mark the carries, you add columns. The paper is doing most of the memory work; your head is only handling one small local operation at a time.
A single forward pass has no equivalent of that paper. Everything that needs to be tracked across steps has to be maintained implicitly in the activation state, which has real capacity limits for this kind of precise sequential bookkeeping. It's not that the model can't do arithmetic — it's that the production regime of a single forward pass is the wrong architecture for a task specifically designed, historically, to be offloaded onto an external medium.
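For concreteness, here is the schoolbook algorithm with the "paper" made explicit: Python lists hold the digits and carries, so the active computation only ever handles one single-digit operation at a time. This is a sketch of the paper algorithm itself, not of anything happening inside a model:

```python
# The schoolbook algorithm with the "paper" made explicit: lists hold
# the digits and the carries, so the "head" (the inner loop) only ever
# handles one single-digit multiply-and-add at a time.

def long_multiply(a: int, b: int) -> int:
    a_digits = [int(d) for d in str(a)][::-1]   # least significant first
    b_digits = [int(d) for d in str(b)][::-1]
    result = [0] * (len(a_digits) + len(b_digits))   # the scratch paper
    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10       # write the digit down
            carry = total // 10              # carry rides to the next column
        result[i + len(b_digits)] += carry   # park the final carry
    return int("".join(map(str, result[::-1])))

assert long_multiply(734, 86) == 734 * 86
```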
Interesting historical note here
Long multiplication as we practice it is itself a technology — an algorithm designed around the properties of pen and paper. It's not the only way to multiply, and it's not the way humans multiplied before positional notation and cheap writing materials were widely available. Abacus-based systems, for instance, use a different external memory structure entirely. The algorithm and the medium co-evolved. That's worth keeping in mind: we're not asking LLMs to do "arithmetic" in some abstract sense — we're asking them to execute a specific paper-based algorithm without the paper.
CoT as providing the paper
This is where CoT comes back in, and in a sharper form than the general analytical-mode point. When a model is prompted to work through arithmetic step by step and write out the intermediate results — partial products, carries, running totals — the context window is functioning almost literally as the scratch paper. Each written intermediate result becomes available as a cue for the next operation, exactly as it would be on a page. The model is no longer trying to maintain a long chain of precise numerical dependencies implicitly; it's reading them off the externalized record.
This is probably the clearest case where the scratchpad function of CoT is doing the essential work, as distinct from the attractor-shift into analytical style. You could imagine a model that was in a highly analytical mode but still failed at long multiplication if it didn't actually write out the intermediate steps — because the bottleneck is memory, not style.
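To make the scratchpad function concrete, the snippet below renders the intermediate quantities of one multiplication as the kind of step-by-step text a model might be prompted to emit — the format is invented, purely illustrative. Each written line then sits in the context, available as a cue for the next operation:

```python
# Purely illustrative: intermediate quantities rendered as the kind of
# step-by-step text a model might write out. Each emitted line sits in
# the context, available as a cue for the next operation.

def scratchpad(a: int, b: int) -> str:
    lines = [f"Compute {a} x {b} digit by digit."]
    partials = []
    for place, d in enumerate(reversed(str(b))):
        p = a * int(d) * 10**place          # one partial product, written down
        partials.append(p)
        lines.append(f"{a} x {d} (place {place}) = {p}")
    lines.append(f"Sum of partial products: "
                 f"{' + '.join(map(str, partials))} = {sum(partials)}")
    return "\n".join(lines)

print(scratchpad(734, 86))
```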
The deeper point about task-medium fit
What unifies the hockey commentator, the composition student, the oral psalm recall, and the arithmetic case is that cognitive performance is never just a function of the underlying competence or the weights. It's always a function of competence plus the available external memory and cueing structure. Humans evolved to be very good at certain things in certain media, and we've built our cognitive technologies — writing, notation, the abacus, the spreadsheet — to fit our limitations. LLMs have a different profile of limitations, and CoT is in effect the first crude cognitive technology built to fit their specific constraints rather than ours.
The interesting design question going forward is what other such technologies are possible — not just scratchpads, but perhaps more structured external representations that fit the specific ways LLM forward passes succeed and fail.