A useful explanatory section from Ali Minai's wonderful essay, Thinking Through the Risks of AI, 3 Quarks Daily, April 3, 2023:
As their name implies, LLMs focus on language. In particular, given a prompt – or context – an LLM tries to generate a sequence of sensible continuations. For example, given the context “It was the best of times; it was the”, the system might generate “worst” as the next word, and then, with the updated context “It was the best of times; it was the worst”, it might generate the next word, “of” and then “times”. However, it could, in principle, have generated some other plausible continuation, such as “It was the best of times; it was the beginning of spring in the valley” (though, in practice, it rarely does because it knows Dickens too well). This process of generating continuation words one by one and feeding them back to generate the next one is called autoregression, and today’s LLMs are autoregressive text generators (in fact, LLMs generate partial words called tokens which are then combined into words, but that need not concern us here.) To us – familiar with the nature and complexity of language – this seems to be an absurdly unnatural way to produce linguistic expression. After all, real human discourse is messy and complicated, with ambiguous references, nested clauses, varied syntax, double meanings, etc. No human would concede that they generate their utterances sequentially, one word at a time.
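The autoregressive loop described above can be sketched in a few lines. This is only a toy under loud assumptions: the "model" is a table of word-pair counts built from a single made-up sentence, it looks only at the last word rather than the whole context, and it always picks the most frequent follower. The names `train_bigram` and `generate` are hypothetical, not anything from the essay or from a real LLM.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def generate(follows, context, max_new_words):
    """Autoregressive loop: choose a next word, append it, feed it back in."""
    words = context.split()
    for _ in range(max_new_words):
        # Toy simplification: condition only on the last word.
        # A real LLM conditions on the whole preceding context.
        candidates = follows.get(words[-1])
        if not candidates:
            break  # nothing ever followed this word in the training text
        next_word = candidates.most_common(1)[0][0]
        words.append(next_word)  # the output becomes part of the next input
    return " ".join(words)

corpus = "it was the best of times it was the worst of times"
model = train_bigram(corpus)
print(generate(model, "it was the best of times it was the worst", 2))
```

Each pass through the loop produces one word and then extends the context with it, which is the feedback step the essay calls autoregression.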
Rather, we feel that our complex expressions reflect the complex structure of the thoughts we are trying to express. And what of grammar and syntax – things that we learned implicitly when we learned the language we speak, perhaps building on a universal grammar innate in the human brain as the great linguist, Noam Chomsky, has proposed? And yet, ChatGPT, generating one word (or token) at a time, ends up producing text that is every bit as meaningful, complex, nuanced, syntactically deep and grammatically correct as human expression. This fact – not its hallucinations and logical challenges – is the most interesting thing about ChatGPT. It should amaze us, and we should ask what it might tell us about human language and thought.
Clearly, one thing it tells us is that, after training a very large adaptive system on an extremely large amount of real-world text data, all the complexities of syntax, grammar, inference, reference, etc., can be captured in a simple sequential autoregressive generation process. In a sense, the deep structure of language can convincingly be “flattened” into the simple process of generating word sequences. Two crucial attributes of LLMs make this possible: Extended context, and adaptive attention. Unlike the example given above, an LLM does not generate the next word by looking at just a few preceding words; it looks at several thousand preceding words – at least once the pump is primed sufficiently. And, very importantly, it does not treat all these thousands of words as a simple “bag of words” or even a simple sequence of words; it learns to discern which ones to attend to in what degree at each point in the generative process, and use that to generate the continuation. This is adaptive attention.
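One common way to realize "which ones to attend to in what degree" is scaled dot-product attention, the form used in transformer models. The essay does not spell out the mechanism, so treat the following as a minimal sketch under that assumption. The word vectors are invented for illustration, and a real model learns separate query, key and value projections rather than using hand-written numbers.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each context word by its similarity to the query, then blend."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    blended = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, blended

# Hypothetical 2-dimensional vectors for three context words.
context_words = ["best", "of", "times"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [3.0, 0.0]  # what the current position is "looking for"

weights, blended = attend(query, keys, values)
for word, weight in zip(context_words, weights):
    print(f"{word}: {weight:.2f}")
```

The weights are not fixed in advance; they are recomputed from the query at every position, so the same context word can matter a great deal at one step and hardly at all at the next.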
The key thing that makes all this work is the depth of the system. As most readers will know by now, LLMs are neural networks – systems with tens of millions of relatively simple but nonlinear interconnected processors called neurons, inspired by the networks of neurons in the brain. The artificial neurons in an LLM network are arranged in layers, with the output from each layer moving sequentially to the next layer (or other higher layers), which is why these are called feed-forward networks (except for the final output being fed back into the system as input for the next word). The exact architecture of ChatGPT is not known publicly but it certainly has several hundred – perhaps more than a thousand – layers of neurons. Some layers operate in parallel with each other while others are connected serially. The number of pairwise connections between neurons is in excess of 175 billion in the original version based on GPT-3. The strengths of these connections determine what happens to information as it moves through the network, and, therefore, what output is produced for a given input. The network is trained by adjusting these connection strengths, or weights, until the system produces correct responses on its training data. While layers can perform various types of computations, such as merging the signals from several prior layers, multiplying them together, or transforming them in nonlinear ways, there are at least 96 places in the network where adaptive attention is applied to all or some of the data moving through the network. Thus, in determining the next word to generate, ChatGPT takes the initial context input through many, many stages of analysis – implicitly inferring its syntactic and semantic organization, detecting dependencies, assigning references, etc. It is this extensively dissected, modulated, squeezed, recombined and analyzed version of the input that is used finally to generate the output.
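To make "layers", "connection strengths" and "feed-forward" concrete, here is a deliberately tiny sketch: two layers, a handful of made-up weights, and a tanh nonlinearity chosen purely for illustration. Nothing here reflects ChatGPT's actual architecture, which the essay notes is not public; `layer`, `forward` and `count_connections` are hypothetical helper names.

```python
import math

def layer(inputs, weights, biases):
    """One layer: each neuron sums its weighted inputs, then applies tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Feed-forward pass: the output of each layer is the input to the next."""
    signal = inputs
    for weights, biases in layers:
        signal = layer(signal, weights, biases)
    return signal

def count_connections(layers):
    """Number of pairwise connections (weights) in the whole network."""
    return sum(len(row) for weights, _ in layers for row in weights)

# Made-up weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
network = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.0, 0.0]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]

print(forward([1.0, 2.0], network))
print(count_connections(network))
```

Training, in this picture, means nudging the numbers in `network` until the outputs match the training data; the toy above has 9 connections where the essay cites more than 175 billion.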
The output layer produces a probability distribution for the next word over the whole vocabulary, from which the actual output word is sampled. That is why ChatGPT gives a fresh answer to the same question each time.
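Sampling from a distribution, rather than always taking the single most likely word, is easy to illustrate. The probabilities below are invented for the context “It was the best of times; it was the”, and the three-word vocabulary is a stand-in for the tens of thousands of tokens a real model scores; `sample_next_word` is a hypothetical helper.

```python
import random
from collections import Counter

# Hypothetical next-word probabilities (made-up numbers that sum to 1).
next_word_probs = {"worst": 0.90, "beginning": 0.07, "age": 0.03}

def sample_next_word(probs, rng):
    """Draw one word at random, in proportion to its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the demonstration is repeatable
draws = Counter(sample_next_word(next_word_probs, rng) for _ in range(1000))
print(draws.most_common())
```

Most draws come out as “worst”, but not all of them, and once a less likely word is chosen it becomes part of the context for everything that follows. That is how the same question can lead to a different answer each time.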