Sunday, August 9, 2020

GPT-3 for (not so) Dummies [performance rather than competence]

Noam Chomsky famously distinguished between linguistic competence and performance in Aspects of the Theory of Syntax (1965). The distinction is a bit obscure. Here’s what David Hays wrote in an article we co-authored:
To describe an assembled bicycle is one thing: to describe the assembly of bicycles is another. The assembler must know what is in the blueprint (competence), but further needs the skill to take the parts in order, place them deftly, fasten them neatly (performance). In actuality, of course, the assembler may never have seen the blueprint, nor need the performance of a speaker or hearer include in any physical sense the grammar that the linguist offers as the blueprint of a language. [1]
The distinction allowed Chomsky to treat syntax as being formally well-formed, in the manner of a logical or mathematical expression, while making room for the fact that actual speech is often ill-formed, full of interruptions and hesitations, and incomplete. Those imperfections belong to the realm of performance while syntax itself is in the realm of competence.

What makes the distinction obscure is that Chomsky did not offer, nor was he even interested in, a theory of performance. Competence is all he was interested in, and his account of that competence took a form that, at first glance, seemed like an account of performance. But his generative grammar, with its ordering of rules, is a static system, and that ordering is about logical priority, not temporal process. This becomes clear, however, only when you attempt to specify a computational process that applies the grammar to a language string.

But it’s not Chomsky’s linguistics that interests me. It’s GPT-3. GPT-3’s operations have been fairly well described in the literature at various levels of technical detail [2]. That is a matter of performance. But what it does, that is a bit obscure. It somehow embodies a “theory” of discourse structure, but just what that theory might be, that is nowhere discussed in the literature [3]. GPT-3, like much NLP, favors empiricism over theory. If it works, we’ll keep it; just why it works, that is not our concern.

The rest of this post is an account of GPT-3’s performance that I posted to the Humanist list [4].

A brief and informal account of what GPT-3 does

Brigitte Rath points out [Humanist 34.214]:
Ferdinand de Saussure's famous structuralist model of language is also relational, but it is heterogeneous: all signifiers -- "sound images" for Saussure -- are of the same kind, and within the homogeneous set of signifiers -- all of them "sound images" --, each signifier is defined precisely by being different from all others. The same holds for all signifieds, concepts for Saussure. A sign is formed when a signifier is connected to a signified -- a sound image to a concept -- and thus when *categorically different* units, each defined differentially within its own homogeneous system, are brought together. Signification arises out of a *heterogeneous* system.
First, as far as I know, Saussure never studied semantics very much, nor did other structural linguists. Lévi-Strauss’s work on myth (the four volumes of Mythologiques in particular) may be the most significant body of structuralist work we have on relationships among signifieds. But while that work has been much cited, its substance has been all but forgotten. For that matter, linguists paid very little attention to semantics until well into the post-1973 era of the cognitive sciences (’73 is when Longuet-Higgins coined the term “cognitive science”).

We need to think very carefully about what GPT-3 does because, while it is true that it has no access to signifieds (and I really have to use this terminology, otherwise all is lost), it has nonetheless somehow managed, in effect, to infer a structure of what we might call virtual or ghost signifieds (my terms). What makes its output so uncanny is that we can easily read it as though it were created by a being in more or less full possession of the relational network of signifieds. How did it do that? It did it by learning to predict what comes next in text created by people, that is, by beings in full possession of the relational network of signifieds. We need to keep that in mind, always.

How does GPT-3 work? Roughly:

1. First the signifier tokens in the corpus are replaced by vectors of numbers which position those signifiers in a high-dimensional space of “virtual” semantics (actually, it is vector or distributional semantics). [5]

2. Then GPT-3 trains on the corpus (300 billion tokens) by moving through it from beginning to end, attempting to predict the next word at each step of the way. GPT-3 has 175B parameters distributed over the 96 layers of its neural net. These parameters are initialized with random weights. Those values are then modified during training. GPT-3 ‘reads’ a word and guesses the next one. Initially, the guess will be wrong, so the parameter weights are adjusted accordingly (back-propagation). It guesses again; if it’s wrong, the weights are revised; if it is right, it goes on to guess the next word. And so on. When it has gone through the entire corpus, the parameter weights reflect what it has learned about the distribution of signifiers in a very large body of real text. Remember, though, that the actual distribution was created by people who possess the corresponding signifieds. So GPT-3’s parameter weights now somehow reflect or embody the relational structure of those signifieds.

3. A user then feeds GPT-3 a prompt and GPT-3 continues the language in the prompt by using its stored weights. It is, in effect, guessing the next word, and the next, and so on. The linear structure of signifiers that it creates as output reflects the multidimensional pattern of relations among signifieds as stored in its parameter weights.
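To make those three steps concrete, here is a minimal sketch in Python. It is emphatically not GPT-3: GPT-3 is a 96-layer transformer with 175 billion parameters and a long context window, while this toy conditions on only the previous token. The corpus, the matrices E and W, and all the numbers below are made up for illustration; only the overall idea (embed tokens as vectors, predict the next token, compare with what actually comes next, adjust the weights, then generate from a prompt) corresponds to the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: replace signifier tokens with vectors. Toy corpus and vocabulary
# (GPT-3's corpus is ~300 billion tokens; this one is twelve).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
tok2id = {t: i for i, t in enumerate(vocab)}
ids = [tok2id[t] for t in corpus]
V, D = len(vocab), 8                 # vocabulary size, embedding dimension

# Parameters, initialized with random weights, to be adjusted during training.
E = rng.normal(0.0, 0.1, (V, D))     # one vector per token
W = rng.normal(0.0, 0.1, (D, V))     # maps a vector to scores over the vocabulary

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Step 2: move through the corpus, guess the next word, and nudge the weights
# when the guess is off (the same gradient idea as back-propagation).
lr = 0.1
for epoch in range(500):
    for cur, nxt in zip(ids[:-1], ids[1:]):
        h = E[cur]                     # embed the current token
        probs = softmax(h @ W)         # predicted distribution over the next token
        grad = probs.copy()
        grad[nxt] -= 1.0               # cross-entropy gradient: prediction minus target
        dW = np.outer(h, grad)
        dE = W @ grad
        W -= lr * dW                   # adjust the output weights
        E[cur] -= lr * dE              # adjust the embedding of the current token

# Step 3: feed a prompt and keep guessing the next word.
def generate(prompt, n=6):
    out = prompt.split()
    cur = tok2id[out[-1]]
    for _ in range(n):
        probs = softmax(E[cur] @ W)
        cur = int(rng.choice(V, p=probs))   # sample the next token
        out.append(vocab[cur])
    return " ".join(out)

print(generate("the cat"))   # e.g. "the cat sat on the mat the dog"
```

The continuation you get depends on the random seed and on sampling; the point of the sketch is simply that the weights end up encoding the distribution of the training tokens, and generation is nothing more than repeatedly guessing what comes next from those weights.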

While the relationship between a signifier and a signified is arbitrary, there is nothing at all arbitrary about the relationships among a string of signifiers in a text. Those relationships are ultimately governed by the system of signifieds as expressed through the mechanisms of syntax and pragmatics.

I strongly suspect that the mere fact that GPT-3’s methods work so well (and not only GPT-3’s) gives us strong clues about the nature of the system of signifieds. That gives this work some philosophical and psychological heft.

I’ll also say that I do not at all believe that GPT-3 is only one, two, or three steps away from becoming that magical, mystical being, the AGI (artificial general intelligence). The fact is, what it “knows”, it knows quite rigidly. The ONLY thing it can do is to make predictions about what comes next. And I am skeptical about the value of trying models with even more parameters. The models may gain more power for a while, but they’ll approach an asymptote; as they do, however, they’ll keep consuming electricity at a ferocious rate. One widely cited estimate puts the compute cost of GPT-3’s training run at $4.6 million (US dollars).

As a final note, I’m pretty sure that many AI researchers were inspired by the computer in Star Trek. I know that David Ferrucci, of Watson fame, was. Isn’t the Star Trek Computer (STC) a general artificial intelligence? I think so. Is it some kind of super-intelligence, more intelligent than its human users? No, it isn’t. It has superior access to a wide range of information and is a whizz at calculating, correlating, and so forth. But it doesn’t conduct original research. Is the STC malevolent? Not at all. It seems to me that the STC is just what we want from an advanced artificial intelligence. Maybe we’ll get it one day.

References

[1] William Benzon and David Hays, “Computational Linguistics and the Humanist”, Computers and the Humanities, Vol. 10, 1976, pp. 265-274, https://www.academia.edu/1334653/Computational_Linguistics_and_the_Humanist.

[2] There is a blither of material on the web about how transformers – the generic technology behind GPT-3 – work. I have found these posts by Jay Alammar useful: The Illustrated GPT-2 (Visualizing Transformer Language Models), https://jalammar.github.io/illustrated-gpt2/. How GPT3 Works - Visualizations and Animations, https://jalammar.github.io/how-gpt3-works-visualizations-animations/.

[3] This is well-discussed, pro and con, in Y Combinator’s Hacker News, https://news.ycombinator.com/item?id=23623845.

[4] Humanist 34.221: on GPT-3 and imitation, https://dhhumanist.org/volume/34/221.

[5] This is not the place to explain how this is done. The basic idea was invented by Gerard Salton in the 1960s and ‘70s and used in document retrieval. I’ve written a blog post on the subject: Notes toward a theory of the corpus, Part 2: Mind [#DH], https://new-savanna.blogspot.com/2018/12/notes-toward-theory-of-corpus-part-2.html. Michael Gavin has written a very useful and more detailed article on the subject: Is there a text in my data? (Part 1): On Counting Words, Cultural Analytics, January 25, 2020, https://doi.org/10.22148/001c.11830.
