Monday, January 16, 2023

At the most general level, what are transformers doing?

By transformer I mean the computer architecture behind large language models such as ChatGPT. By “most general level” I mean just that. I have no intention of getting into the technical details of transformers.

Let us assume that the contents of the human mind are organized in some high-dimensional space that is realized in the intricate interconnections of 86 billion neurons, where each neuron is linked to roughly 10,000 other neurons on average. An individual thought, whatever that is in neural terms, might thus be spread across hundreds of thousands of neurons and millions of synapses, all active at the same time, all present in us in an instant. But when we go to communicate such thoughts to others through language, we cannot convey them instantaneously, the way we hold them in our minds. We have to string them out in language, one word at a time.

Let me give you an example of the sort of experience behind this intuition.

I’ve been thinking about something and get an idea for a blog post. OK, I’ll discuss this, then that, look at the other thing and wrap it all up. This is going to be a short one, take me maybe a half-hour, forty-five minutes to write up. I sit down at my computer and three hours later I’m still writing, though I sense that I’ll be able to finish soon enough.

And I do. But how could I misjudge how long it would take me to express the idea in writing? After all, I have decades of experience with writing. With all that experience I’m still a poor judge of how much time it will take me to linearize the multidimensional-all-at-onceness that characterizes thoughts in my head.

Going from all-at-once to one-at-a-time, that’s the job that language structure does. Phonology organizes sound “atoms” into strings that express words that are readily distinguishable from one another. Syntax organizes words into strings according to the relationships between the concepts expressed in those words. Finally, at the highest level, discourse links sentences into larger structural units. I’m sure that there’s more to the distinction between sentence syntax and discourse structure than the length of the strings, but I’m not sure how to characterize that distinction. I would think that the distinction is adequately accounted for in the literature, but I can’t offhand point to it. But we need not worry about that now. My point is simply that there’s a lot of linguistic structure that is called up in linearizing thought.
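If it helps to make “linearization” concrete, here is a toy sketch in Python – purely illustrative, with made-up labels, and no claim about how the brain or any parser actually works – in which a nested structure of sentences and words gets flattened into a single string:

```python
# Toy illustration: a nested "discourse" structure flattened into a linear
# string of words. The sentences and role labels are invented for this sketch.
discourse = [
    # each sentence is a list of (word, role) pairs
    [("the", "det"), ("cat", "subject"), ("slept", "verb")],
    [("it", "subject"), ("dreamed", "verb"), ("of", "prep"), ("mice", "object")],
]

def linearize(disc):
    """Flatten the hierarchy into one word at a time."""
    words = []
    for sentence in disc:
        for word, _role in sentence:
            words.append(word)
    return " ".join(words)

print(linearize(discourse))
# the cat slept it dreamed of mice
```

The point of the sketch is only that the hierarchy is lost in the output string; the listener has to rebuild it.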

And, of course, it works both ways. The speaker has to linearize their thoughts into a string, and the listener has to take that string, word by word, and project those meanings into the multidimensional all-at-onceness that is thought. That is, the listener has to take a linear string of word meanings, and project it into a high-dimensional “hyperplane” (manifold?) of thought.
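One crude way to picture that projection – a sketch only, not a claim about how comprehension actually works – is the familiar trick of mapping each word to a vector and combining the vectors into a single point in a high-dimensional space. The tiny vocabulary and random eight-dimensional embeddings below are invented for illustration; in a real model the embeddings are learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny vocabulary with random 8-dimensional embeddings.
vocab = ["the", "cat", "sat", "on", "mat"]
embedding = {w: rng.normal(size=8) for w in vocab}

def project(sentence):
    """Map a linear string of words into one point in the embedding space."""
    vectors = [embedding[w] for w in sentence.split()]
    return np.mean(vectors, axis=0)   # crude: average the word vectors

thought_vector = project("the cat sat on the mat")
print(thought_vector.shape)  # (8,) -- one high-dimensional point for the whole string
```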

Now, the transformer. As all the accounts say, the job of the transformer is to predict the next word in a string that it is “consuming” – a word that I like in this context for some reason. The transformer contains a very large number of parameters arranged in layers. In the case of GPT-3, which is the basis for ChatGPT, we have 175 billion parameters, arranged into I don’t know how many layers, but much closer to 100 than 1000. Those parameters in those layers constitute the language model. During training the model predicts the next word, the prediction is compared with the word that actually occurs in the text, and the weights on the parameters are adjusted to reduce the error – whether the prediction was right or wrong.
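To make that training loop concrete, here is a toy next-word predictor in Python. It is a single matrix of bigram weights, nothing remotely like GPT-3’s architecture, but the loop is the same in spirit: predict the next word, compare with the word that actually occurs, and nudge the parameters to reduce the error:

```python
import numpy as np

rng = np.random.default_rng(1)

corpus = "the cat sat on the mat the cat slept".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# A single matrix of "parameters": row = current word, column = candidate next word.
W = rng.normal(scale=0.1, size=(V, V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.5
for epoch in range(200):
    for cur, nxt in zip(corpus, corpus[1:]):
        probs = softmax(W[idx[cur]])          # predicted distribution over the next word
        target = np.zeros(V)
        target[idx[nxt]] = 1.0                # the word that actually occurred
        W[idx[cur]] -= lr * (probs - target)  # adjust weights to reduce prediction error

# After training, the model's best guess after "the" is the word that most
# often followed "the" in the corpus.
print(vocab[int(np.argmax(softmax(W[idx["the"]])))])
```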

This business of predicting the next word is thus a technique for creating the model. It’s the structure of the model that’s important. That’s where the “knowledge” is. It’s that structure we need to think about.

What I’m thinking is that when the number of parameters is large enough and the text corpus consumed by the engine is large enough, the resulting language model is forced to differentiate into two or more levels, call them that for now. These levels are inextricably mixed in the basic structure of parameters and layers, but they are functionally distinct. Think of them as hyper-layers, if you will. My work with ChatGPT over the past month and a half has convinced me that there is structure discernible in the texts it produces that is organized at a higher-level hyper-layer. This is particularly true of my work on stories, where it is clear that ChatGPT has a story grammar that can be roughly characterized in terms of frames, slots, and fillers. We need not have access to the parameters and layers of the language model to do this. In fact, we aren’t going to find this structure if we go directly to the language model. However, if we do our analytical and descriptive work well, we will discover patterns that give us clues to what is going on in those 175 billion parameters organized in roughly a hundred layers.
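Here is the sort of thing I mean by frames, slots, and fillers, rendered as a toy data structure. The slot names and the fillers are invented for illustration; the claim is only that ChatGPT’s story behavior can be described at roughly this level of organization, not that anything like this dictionary exists inside the model:

```python
# Toy "story grammar" frame: slots the model reliably fills, fillers drawn
# from whatever the prompt supplies. Slot names and fillers are hypothetical.
story_frame = {
    "protagonist": None,
    "antagonist": None,
    "setting": None,
    "problem": None,
    "resolution": None,
}

def fill_frame(frame, **fillers):
    """Return a new frame with the supplied fillers slotted in."""
    filled = dict(frame)
    for slot, filler in fillers.items():
        if slot in filled:
            filled[slot] = filler
    return filled

story = fill_frame(
    story_frame,
    protagonist="a young princess",
    antagonist="a dragon",
    setting="a kingdom by the sea",
    problem="the dragon steals the kingdom's water",
    resolution="the princess outwits the dragon",
)
print(story)
```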

And that differentiation into structural levels is forced by the requirements of linearizing thoughts into strings of words. The individual “neurons” in these models are not much like real neurons, and the layer structure of these models bears no more than a remote resemblance to the structure of the cerebral cortex, but it’s not the physical structure that’s important, not quite. It’s the functional structure. The functional structure of artificial neural nets is somehow able to capture non-trivial aspects of the human mind as implemented in the brain.
