Monday, July 27, 2020

1. No meaning, no how: GPT-3 as Rubicon and Waterloo, a personal view

I say that not merely because I am a person and, as such, have a point of view on GPT-3 and related matters. I say it because the discussion is informal, without journal-class discussion of this, that, and the other, along with the attendant burden of citation, though I will offer a few citations. Moreover, I’m pretty much making this up as I go along. That is to say, I am trying to figure out just what it is that I think, and I see value in doing so in public.

What value, you ask? It commits me to certain ideas, if only at a certain time. It lays out a set of priors and thus serves to sharpen my ideas as developments unfold and I, inevitably, reconsider.

GPT-3 represents an achievement of a high order; it deserves the attention it has received, if not the hype. We are now deep in “here be dragons” territory and we cannot go back. And yet, if we are not careful, we’ll never leave the dragons, we’ll always be wild and undisciplined. We will never actually advance; we’ll just spin faster and faster. Hence GPT-3 is both a Rubicon, the crossing of a threshold, and a potential Waterloo, a battle we cannot win.

Here’s my plan: First we take a look at history, at the origins of machine translation and symbolic AI. Then I develop a fairly standard critique of statistical language models such as GPT-3, which I follow with some remarks by Martin Kay, one of the Grand Old Men of computational linguistics. Then I look at the problem of common sense reasoning and conclude by looking ahead to the next post in this series, in which I offer some speculations on why (and perhaps even how) these models can succeed despite their severe and fundamental shortcomings.

Background: MT and Symbolic computing

It all began with a famous memo Warren Weaver wrote in 1949. Weaver was director of the Natural Sciences division of the Rockefeller Foundation from 1932 to 1955. He collaborated with Claude Shannon on a book that popularized Shannon’s seminal work in information theory, The Mathematical Theory of Communication. Weaver’s 1949 memorandum, simply entitled “Translation” [1], is regarded as the catalytic document in the origin of machine translation (MT) and hence of computational linguistics (CL) and, heck, why not? artificial intelligence (AI).

Let’s skip to the fifth section of Weaver’s memo, “Meaning and Context” (p. 8):
First, let us think of a way in which the problem of multiple meaning can, in principle at least, be solved. If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words. “Fast” may mean “rapid”; or it may mean "motionless"; and there is no way of telling which.

But if one lengthens the slit in the opaque mask, until one can see not only the central word in question, but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word. The formal truth of this statement becomes clear when one mentions that the middle word of a whole article or a whole book is unambiguous if one has read the whole article or book, providing of course that the article or book is sufficiently well written to communicate at all.
It wasn’t until the 1960s and ‘70s that computer scientists would make use of this insight; Gerard Salton was the central figure, and he was interested in document retrieval [2]. Salton represented documents as vectors of words and then queried a database of such representations using a vector composed from user input. Documents were retrieved as a function of the similarity between the input query vector and the stored document vectors.
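To give a concrete flavor of Salton’s approach, here is a minimal sketch of vector-space retrieval in Python. The document texts and function names are my own inventions for illustration, and real systems of that lineage added refinements such as tf-idf term weighting:

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents):
    """Rank stored documents by similarity to the query, best match first."""
    q = vectorize(query)
    return sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)

docs = [
    "the boat overturned and capsized in the storm",
    "machine translation of natural language",
    "statistics of words in large text databases",
]
best = retrieve("translation of language", docs)[0]
print(best)  # the machine-translation document ranks first
```

The point of the sketch is simply that nothing in it knows what any word means; similarity falls out of shared spellings alone.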

Work on MT went a different way. Various approaches were used, but at some relatively early point researchers were writing formal grammars of languages. In some cases these grammars were engineering conveniences while in others they were taken to represent the mental grammars of humans. In any event, that enterprise fell apart in the mid-1960s. The prospects for practical results could not justify federal funding, and the government had little interest in supporting purely scientific research into the nature of language.

But such research continued nonetheless, sometimes under the rubric of computational linguistics (CL) and sometimes as AI. I encountered CL in graduate school in the mid-1970s when I joined the research group of David Hays in the Linguistics Department of the State University of New York at Buffalo – I was actually enrolled as a graduate student in English; it’s complicated.

Many different semantic models were developed, but I’m not interested in anything like a review of that work, just a little taste. In particular I am interested in a general type of model known as a semantic or cognitive network. Hays had been developing such a model for some years in conjunction with several graduate students [3]. Here’s a fragment of a network from a system developed by one of those students, Brian Phillips, to tell whether or not stories of people drowning were tragic [4]. Here’s a representation of capsize:
Notice that there are two kinds of nodes in the network, square ones and smaller round ones. The square ones represent scenes while the round ones represent individual objects or events. Thus the square node at the upper left indicates a scene with two sub-scenes – I’m just going to follow out the logic of the network without explaining it in any detail. The first asserts that there is a boat that contains one Horatio Smith. The second asserts that the boat overturns. And so forth through the rest of the diagram.

This network represents semantic structure. In the terminology of semiotics, it represents a network of signifieds. Though Phillips didn’t do so, it would be entirely possible to link such a semantic network with a syntactic network, and many systems of that era did so.
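To make the flavor of such networks concrete, here is a hedged sketch in Python of how the capsize fragment described above might be encoded. The node names, relation labels, and dictionary layout are my own guesses from the prose description, not Phillips’s actual notation:

```python
# Square (scene) nodes group assertions; round nodes are entities or events.
# "s1" and "s2" stand in for the two sub-scenes described in the text.
network = {
    "capsize": {"type": "scene", "parts": ["s1", "s2"]},
    "s1": {"type": "scene", "asserts": ("contain", "boat", "smith")},  # boat contains Smith
    "s2": {"type": "scene", "asserts": ("overturn", "boat")},          # the boat overturns
    "boat": {"type": "entity"},
    "smith": {"type": "entity", "name": "Horatio Smith"},
}

def entities_in(scene, net):
    """Collect the entity (round) nodes mentioned under a scene node."""
    node = net[scene]
    found = []
    for part in node.get("parts", []):       # recurse into sub-scenes
        found.extend(entities_in(part, net))
    for arg in node.get("asserts", ())[1:]:  # arguments of the assertion
        if net.get(arg, {}).get("type") == "entity":
            found.append(arg)
    return found

print(entities_in("capsize", network))  # the boat figures in both sub-scenes
```

Even this toy version shows the symbolic character of the enterprise: every node and relation had to be hand-coded, which is part of why such models grew without bound.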

Such networks were symbolic in the (obvious) sense that the objects in them were considered to be symbols, not sense perceptions or motor actions nor, for that matter, neurons, whether real or artificial. The relationship between such systems and the human brain was not explored, either in theory or in experimental observation. It wasn’t an issue.

That enterprise collapsed in the mid-1980s. Why? The models had to be hand-coded, which took time. They were computationally expensive, and so-called common sense reasoning proved to be endless, making the models larger and larger. (I discuss common sense below and I have many posts at New Savanna on the topic [5].)

Oh, the work didn’t stop entirely. Some researchers kept at it. But interests shifted toward machine learning techniques and toward artificial neural networks. That is the line of evolution that has, three or four decades later, resulted in systems like GPT-3, which also owe a debt to the vector semantics pioneered by Salton. Such systems build huge language models from huge databases – GPT-3 is based on 500 billion tokens [6] – and contain no explicit models of syntax or semantics anywhere, at least none that researchers can recognize.

Researchers build a system that constructs a language model (“learns” the language), but the inner workings of that model are opaque to the researchers. After all, the system built the model, not the researchers. They only built the system.

It is a strange situation.

No words, only signifiers

Let us start with the basics. There are no “words” as such in the database on which GPT-3’s language model is based. Words have spelling, pronunciation, meaning (often various meanings), grammatical usage, connotations, and implications. All that exists in the database are spellings, bare naked signifiers. No signifieds, that is to say, no semantics and, more generally, no concepts or even percepts. And certainly no referents, things and situations in the world to which words refer. The database is utterly empty signification, and yet it exhibits order, and the structure in GPT-3’s language model derives from that order.

Remember, the database consists of words people wrote while negotiating their way in the world. In their heads they’ve got a highly structured model of the world, a semantics (which rides on perception and cognition). Let us say, for the moment, that the semantic model is multidimensional. Linguistic syntax maps that multidimensional semantics onto a one-dimensional string which can be communicated through speech, gesture, or writing.

GPT-3 has access only to those strings. It ‘knows’ nothing of the world, nor of syntax, much less semantics, cognition, and perception. What it is ‘examining’ in those strings, however, reflects the interaction of human minds and the world. While there is an obvious sense in which the structure in those strings comes from the human mind, we also have to take the world into account. For the people who created those strings were not just spinning language out for the fun of it – oh, some of them were. But even poets, novelists, and playwrights attend to the world’s structure in their writing.

What GPT-3 recovers and constructs from the data it ingests is thus a simulacrum of the interaction between people and the world. There is no meaning there. Only entanglement. And yet what it does with that entanglement is so very interesting and has profound consequences – but more of that later.

* * * * *

Now, we as users, as clients of such systems, are fooled by our semiotic naivety. Even when we’ve taken semiotics 101, we look at a written signifier and we take it for a word, automatically and without thought, with its various meanings and implications. But it isn’t a word, not really.

Yes, in normal circumstances – talking with one another, reading various documents – it makes sense for us to treat signifiers as words. As such those signifiers are linked to signifieds (semantics, concepts, percepts) and referents (things and situations in the world). But output from GPT-3 is not normal circumstances. It’s working from a huge database of signifiers, but no matter how you bend, fold, spindle, or mutilate those signifiers, you’re not going to get a scintilla of meaning. Any meaning you see is meaning you put there.

Where did those signifiers come from? That’s right, those millions if not billions of people writing away. Writing about the world. So there is in fact something about the world intertwined with those signifiers, just as there is something about the structure of the minds that composed them. The structure of minds and of the world have become entangled and projected onto one freakishly long and entangled pile of strings. That is what GPT-3 works with to generate its language model.

Let me repeat this once again, obvious though it is: Those words in the database were generated by people conveying knowledge of, attempting to make sense of, the world. Those strings are coupled with the world, albeit asynchronously. Without that coupling, that database would collapse into an unordered pile of bare naked signifiers. It is that coupling with the world that authorizes our treatment of those signifiers as full-on words.

We need to be clear on the distinction between the language system as it exists in the minds of people and the many and various texts those people generate as they employ that system to communicate about and make sense of the world. It would be a mistake to think that the GPT-3 language model is only about what is inside people’s heads. It is also about the world, for those people use what is in their heads to negotiate their way in the world. [I intend to “cash out” on my insistence on this point in the next post.]

Martin Kay, “an ignorance model”

With that in mind, let’s consider what Martin Kay has to say about statistical language processing. Martin Kay is one of the Grand Old Men of computational linguistics. He was originally trained in Great Britain by Margaret Masterman, a student of Ludwig Wittgenstein, and moved to the United States in the 1950s, where he worked with my teacher and colleague, David Hays. Before he came to SUNY Buffalo, Hays had run the RAND Corporation’s program in machine translation.

In the early 2000s the Association for Computational Linguistics gave Kay a lifetime achievement award and he delivered some remarks on that occasion [7]. At the end he says (p. 438):
Statistical NLP has opened the road to applications, funding, and respectability for our field. I wish it well. I think it is a great enterprise, despite what I may have seemed to say to the contrary.
Prior to that he had this to say (437):
Symbolic language processing is highly nondeterministic and often delivers large numbers of alternative results because it has no means of resolving the ambiguities that characterize ordinary language. This is for the clear and obvious reason that the resolution of ambiguities is not a linguistic matter. After a responsible job has been done of linguistic analysis, what remain are questions about the world. They are questions of what would be a reasonable thing to say under the given circumstances, what it would be reasonable to believe, suspect, fear, or desire in the given situation. If these questions are in the purview of any academic discipline, it is presumably artificial intelligence. But artificial intelligence has a lot on its plate and to attempt to fill the void that it leaves open, in whatever way comes to hand, is entirely reasonable and proper. But it is important to understand what we are doing when we do this and to calibrate our expectations accordingly. What we are doing is to allow statistics over words that occur very close to one another in a string to stand in for the world construed widely, so as to include myths, and beliefs, and cultures, and truths and lies and so forth. As a stop-gap for the time being, this may be as good as we can do, but we should clearly have only the most limited expectations of it because, for the purpose it is intended to serve, it is clearly pathetically inadequate. The statistics are standing in for a vast number of things for which we have no computer model. They are therefore what I call an “ignorance model.”
An “ignorance model” – that, more or less, is what I said in the previous section when I said that GPT-3 had no access to semantics.

There are some thinkers who believe it is necessary to somehow integrate these statistical methods with the “old school” symbolic methods pioneered by Kay, by David Hays, and by many others. That’s what I think, and have thought for some time. I know that Gary Marcus and Rodney Brooks have a start-up built on that premise, Robust AI [8]. So does David Ferrucci, who headed IBM’s Watson project when it won at Jeopardy: Elemental Cognition [9].

Common sense reasoning about the world

As I’ve already observed, one of the problems that sank symbolic processing in the mid-1980s was that of common sense reasoning. This has become an issue with current work as well. While GPT-3 does exhibit impressive performance in this area, there are signs that not all is as it seems.

Kevin Lacker observes [10]:
Traditionally, artificial intelligence struggles at “common sense”. But GPT-3 can answer a lot of common sense questions.
Q: What is your favorite animal?
A: My favorite animal is a dog.
Q: Why?
A: Because dogs are loyal and friendly.
Q: What are two reasons that a dog might be in a bad mood?
A: Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many legs does a frog have?
A: A frog has four legs.
Q: Are there any animals with three legs?
A: No, there are no animals with three legs.
Q: Why don't animals have three legs?
A: Animals don't have three legs because they would fall over.
Ten years ago, if I had this conversation, I would have assumed the entity on the other end was a human. You can no longer take it for granted that an AI does not know the answer to “common sense” questions.
Yes. But Lacker goes on:
We need to ask questions that no normal human would ever talk about.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your foot has two eyes.
Q: How many eyes does a spider have?
A: A spider has eight eyes.
Q: How many eyes does the sun have?
A: The sun has one eye.
Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.
Now we’re getting into surreal territory. GPT-3 knows how to have a normal conversation. It doesn’t quite know how to say “Wait a moment... your question is nonsense.” It also doesn’t know how to say “I don’t know.”
That is, if you probe a bit, GPT-3 does have problems with common sense reasoning.

A lot of common sense reasoning takes place “close” to the physical world. I have come to believe, though I will not argue it here, that much of our basic (‘common sense’) knowledge of the physical world is grounded in analogue and quasi-analogue representations [11]. This gives us the power to generate language about such matters on the fly. Old school symbolic machines did not have this capacity, nor do current statistical models such as GPT-3.

But then how can a system generate analog or quasi-analog representations of the world unless it has direct access to the world? The creators of GPT-3 acknowledge this as a limitation: “Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world [BHT+20]” [6]. Perhaps only robots will be able to develop robust common sense knowledge and reasoning.

And yet GPT-3 seems so effective. How can that be?

This is a critique from first principles and, as such, it seems to me unassailable. Equally unassailable, however, are the empirical facts on the ground: these systems do work. And here I’m not talking only about GPT-3 and its immediate predecessors. I’ve done much of my thinking about this in connection with other kinds of systems based on distributional semantics, such as topic modeling.

Thus I have little choice, it seems, but to hazard an account of just why these models are effective. That’s my task for the next post in this series, tentatively entitled, “The brain, humans, and the computer, GPT-3: a crude, but useful, analogy”. Note that I do not mean to explicate the computational processes used in GPT-3, not at all. Rather, I am going to speculate about what there is in the nature of the mind, and perhaps even of the world, that allows such mechanisms to succeed.

It is common to think of language as loose, fuzzy, and imprecise. And so it is. But that cannot and is not all there is to language. In order for language to work at all there must be a rigid and inflexible aspect to it. That is what I’ll be talking about in the next post. I’ll be building on theoretical work by Sydney Lamb, Peter Gärdenfors, and a comment Graham Neubig made in a discussion about semantics and machine learning.

* * * * *

Posts in this series are gathered under this link: Rubicon-Waterloo.

Appendix: The road ahead

This series is based on a comment I made at Marginal Revolution. Here is a revised version of that comment. The material highlighted in yellow is the material I’ve covered in this post.
Yes, GPT-3 [may] be a game changer. But to get there from here we need to rethink a lot of things. And where that's going (that is, where I think it best should go) is more than I can do in a comment.

Right now, we're doing it wrong, headed in the wrong direction. AGI, a really good one, isn't going to be what we're imagining it to be, e.g. the Star Trek computer.

Think AI as platform, not feature (Andreessen). Obvious implication, the basic computer will be an AI-as-platform. Every human will get their own as a very young child. They'll grow with it; it'll grow with them. The child will care for it as with a pet. Hence we have ethical obligations to them. As the child grows, so does the pet – the pet will likely have to migrate to other physical platforms from time to time.

Machine learning was the key breakthrough. Rodney Brooks' Genghis, with its subsumption architecture, was a key development as well, for it was directed at robots moving about in the world. FWIW Brooks has teamed up with Gary Marcus and they think we need to add some old school symbolic computing into the mix. I think they're right.

Machines, however, have a hard time learning the natural world as humans do. We're born primed to deal with that world with millions of years of evolutionary history behind us. Machines, alas, are a blank slate.

The native environment for computers is, of course, the computational environment. That's where to apply machine learning. Note that writing code is one of GPT-3's skills.

So, the AGI of the future, let's call it GPT-42, will be looking in two directions, toward the world of computers and toward the human world. It will be learning in both, but in different styles and to different ends. In its interaction with other artificial computational entities GPT-42 is in its native milieu. In its interaction with us, well, we'll necessarily be in the driver's seat.

Where are we with respect to the hockey stick growth curve? For the last three-quarters of a century, since the end of WWII, we've been moving horizontally, along a plateau, developing tech. GPT-3 is one signal that we've reached the toe of the next curve. But to move up the curve, as I've said, we have to rethink the whole shebang.

We're IN the Singularity. Here be dragons.

[Superintelligent computers emerging out of the FOOM is bullshit.]
* * * * *

ADDENDUM: A friend of mine, David Porush, has reminded me that Neal Stephenson has written of such a tutor in The Diamond Age: Or, A Young Lady's Illustrated Primer (1995). I then remembered that I have played the role of such a tutor in real life, The Freedoniad: A Tale of Epic Adventure in which Two BFFs Travel the Universe and End up in Dunkirk, New York.

[1] Warren Weaver, “Translation”, Carlsbad, NM, July 15, 1949, 12 pp. Online:

[2] David Dubin, The Most Influential Paper Gerard Salton Never Wrote, Library Trends, Vol. 52, No. 4, Spring 2004, pp. 748-764.

[3] For a basic account of cognitive networks, see David G. Hays. Networks, Cognitive. In (Allen Kent, Harold Lancour, Jay E. Daily, eds.): Encyclopedia of Library and Information Science, Vol 19. Marcel Dekker, Inc., NY 1976, 281-300.

[4] Brian Phillips. A Model for Knowledge and Its Application to Discourse Analysis, American Journal of Computational Linguistics, Microfiche 82, (1979).

[5] My various posts on common sense are at this link:

[6] Tom B. Brown, Benjamin Mann, Nick Ryder, et al. Language Models are Few-Shot Learners, arXiv:2005.14165v4 [cs.CL]

[7] Martin Kay, A Life of Language, Computational Linguistics, Volume 31, Issue 4, December 2005, pp. 425-438.

[8] Robust AI,

[9] Elemental Cognition,

[10] Kevin Lacker's blog, Giving GPT-3 a Turing Test,

[11] For a superb analog model see William Powers, Behavior: The Control of Perception (Aldine) 1973. Don’t let the publication date fool you; Powers develops his model with a simplicity and elegance that makes it well worth our attention even now, almost 50 years later. Hays integrated Powers’ model into his cognitive network model, see David G. Hays, Cognitive Structures, HRAF Press, 1981. Also, see my post, Computation, Mind, and the World [bounding AI], New Savanna, blog post, December 28, 2019,
