Wednesday, June 3, 2026

Correcting Cowen’s misleading presentation of large language models [MR #10]

Surprise! There’s been a change of plans. The last time I posted about Cowen’s monograph on marginalism I figured I had one more (longish) blog post to write, one about the fourth and final chapter, “Why Marginalism Will Dwindle, and What Will Replace It?” But the more I thought about it, the longer and more convoluted it got. So I’ve decided to simplify things by writing three posts, each substantial, but focused, instead of a long rambling affair like the one I did on biology. So, I’m writing one post about large language models (this post), which Cowen brings up at the end of the chapter. Then I’m writing one about high dimensional models in economics, which Cowen introduces early in the chapter. My final post will be a general response to Cowen’s ideas about where this is all headed.

In this post I want to do three things: 1) First I’ll talk about the surprising nature of the success achieved by GPT-3 and then ChatGPT. 2) Then I will present three passages from Cowen’s text and comment on them. 3) Finally, I want to give a brief rundown of the tradition of statistical work that stands behind LLMs.

Surprise!

OpenAI released GPT-3 in 2020 to a limited audience of insiders, who recognized that it represented a breakthrough. This level of performance came as a surprise. No one predicted it. GPT-3 was scaled up from GPT-2, which was in turn scaled up from GPT-1, but no one was making explicit predictions about the level of performance to be achieved at each step. These were experiments: “Let’s try it and see what happens.” That’s fine. That’s a good way to make progress, to try things out and see what happens. But don’t mistake a lucky trial for genuine knowledge.

Cowen mentioned GPT-3 on Marginal Revolution on July 19, and then published a Bloomberg column on it on July 21, which he excerpted in Marginal Revolution the next day: “...think of GPT-3 as giving computers a facility with words that they have had with numbers for a long time, and with images since about 2012.” I published a working paper in August, GPT-3: Waterloo or Rubicon? Here be Dragons, in which I both acknowledged the breakthrough and cautioned against becoming too satisfied with the technology that occasioned it.

Two and a half years later, in November of 2022, OpenAI released ChatGPT to the general public. It spread like wildfire. Now the proverbial everyone witnessed what only a small group had witnessed in the summer of 2020. The machine speaks. Sorta’. But more convincingly than any machine had spoken before and in a way that had unimaginable implications for the future.

A threshold HAS been crossed, but it is not, so far as I can see, a threshold in our understanding, either of AI or anything else. It is a threshold in performance along a continuous line of scientific understanding and engineering design and construction, something I have documented in some detail in a recent working paper, The Origins of LLMs. As far as I can tell, there has been no paradigm shift, in Thomas Kuhn’s sense, no rank shift, in terms of cognitive rank theory. There were no fundamentally new ideas in the world by, say, late July of 2020 as a consequence of consolidating GPT-3 and making it available in limited release.

“What about the scaling hypothesis,” you might ask. “Isn’t that new?” Ilya Sutskever first explored the idea in 2014. Rich Sutton’s famous 2019 essay, The Bitter Lesson, generated broad discussion about the issue. Then OpenAI published a paper in 2020 that cemented matters, “Scaling Laws for Neural Language Models.”

Given the nature of computing, scaling up is not trivial. Hundreds if not thousands of technical details need to be worked out as the size of the training corpus increases by factors of 10 or more, time after time, and as more and more GPUs are ganged together to assemble the computing power needed. The scaling hypothesis gave researchers a reason to expect improved performance with scaling, but without having to make fundamental breakthroughs in understanding: not of machine learning, not of artificial neural nets, and certainly not of language and cognition. Consequently our sense of possibility has expanded enormously. Our knowledge and deep understanding have remained the same, and the scaling hypothesis made it easy to believe that that was just fine.

Passages from Cowen’s Text

Unfortunately Cowen seems to have bought this story. Not only that, but he doesn’t even acknowledge that there is considerable current debate about whether or not LLMs will be sufficient to achieve AGI (artificial general intelligence) when they are scaled up enough. The most visible opponent of this idea is Gary Marcus, a student of Steven Pinker, who argues that we need to incorporate insights and technology from “old school” symbolic computing (sometimes known as GOFAI, good old-fashioned AI). Marcus is certainly not alone; there are many others. But I don’t want to reprise that debate. I just want to mention that it exists and that Cowen completely ignores it.

What I would like to do in this section is quote some passages from his text and comment on them.

pp. 106-107:

Suffice to say, LLM construction has for the most part ignored linguists and philosophers, and that also means ignoring their intuitions. LLM construction also ignored a lot of people in the AI field who insisted neural nets were a dead end. Instead, in a relatively short number of years humans invented new ways of modeling language and reasoning through language. That research program has proven wildly successful, as we have much better models of language and reasoning than almost anyone had been expecting.

That first sentence is true, sorta’. It is also misleading. As I have documented in that working paper, The Origins of LLMs, this technology is based on a continuous line of statistical thinking that extends back to the 1950s (I take a brief look at this in the next section). It is the syntacticians, semanticists, and cognitive scientists who have been ignored. The second sentence is a bit of an exaggeration. AlexNet put neural nets firmly back on the agenda in 2012.

The big problem is Cowen’s use of “model” in the last two sentences. Large language models are not causal models like those economists use. They don’t tell us anything about how language and thought work. They are algorithmic models. They are about turning input into output; just how that is done is a mystery. Until we understand the internal operations of LLMs they tell us almost nothing about language and reasoning. They give a boost to the idea that some kind of statistical process is involved, but that’s it.

This situation is deeply paradoxical. These algorithmic models perform much better than the computer models created during the “classical” era of cognitive science, the 1960s and 1970s, models that were based on linguistic theory. We knew how those models worked. We don’t know how these models work. We have purchased performance at the cost of ignorance – a formulation I have from the late Martin Kay.

The Marginal Revolution: Rise and Decline, and the Pending AI Revolution, p. 107:

The classic breakthrough paper behind LLMs was a 2017 study titled “Attention is All You Need,” where in this context attention is defined by GPT-4 as “a mechanism that learns to focus selectively on parts of an input sequence, giving it ‘attention,’ while encoding a sentence or piece of information. This allows the model to treat different words or characters with different levels of importance, providing a ‘weight’ that aids in better understanding and decoding of information.” The paper was not titled “More Linguists are All You Need,” or for that matter “Marginalism is All You Need.” In other works, given some of the most complex human systems, we came up with ways of understanding them that were new. To be clear, neural nets were not new, since the ideas and also the practice (in much weaker form) have been around for decades. High-powered, well-functioning neural nets, however, are new in the contexts of providing excellent results for general linguistic ability and general reasoning.

Once again Cowen trips over that paradox. Until we understand how LLMs work, we have gained no new understanding. Just as he rejected the knowledge and intuitions of linguists in that first passage, so he doubles down on that in this paragraph.

Now Cowen turns his attention to chess, which has interested him since childhood, when he became something of a chess prodigy. It is through chess that he first became interested in AI. He saw a chess computer back in 1975, when he was 13, and has been following AI and AI-in-chess since then. While he knows that chess has been a central domain in AI since the beginning, he tells us nothing about that history in the following passage, pp. 108-109:

We even are finding new ways to model the game of chess, and we are doing so without any particular chess understanding. As of 2024, it is possible to produce “Grandmaster-Level Chess Without Search,” courtesy of Google DeepMind.

What exactly does that mean? One intuitive way of expressing the result is that AIs can play top-level chess without understanding anything about chess, and without searching through trees of chess moves. A typical top chess engine, such as Stockfish, will search different parts of the decision tree, make an evaluation of different possible positions, and then choose the best move. It is not hard to set up or access different ways of watching to evaluate the most desirable parts of the decision tree. The quality of the chess engine depends on its computing power, the quality of its pruning algorithms, the complex heuristics it has been fed, whether the right “fixes” have been applied to it, and more. Nonetheless in very broad terms it can be said that the engine evaluates sequences of moves, in broad terms, as a skilled human would do. It searches through decision trees.

The new innovation from DeepMind dispenses with all of this. It is a transformer model that was fed 15 billion pieces of data, namely individual chess positions graded by Stockfish as to which player is better or worse and by how much. In AI lingo, it can be said that the large and for the time being unprecedented number of chess data points – 15 billion – represents a major investment in scaling. In the vernacular, it could be said that more is being stuffed down the throat of the beast.

Transformers then do as transformers will do, namely they use those data points to figure out which features of positions are appealing ones, or not, and those “judgments” get coded into neural nets, which then generate the subsequent decisions. In blitz play, this creature can achieve an Elo chess rating of 2895, which makes it competitive with the top humans. It also can solve a significant fraction of difficult chess puzzles. It is not as good as Stockfish 16, which beats the top humans virtually all of the time, but this particular technique is being realized for the first time. If DeepMind decides it merits further investment, presumably later generations of this technique will be stronger yet. As can usually be said about AI, you are currently witnessing the weakest version of the thing you ever will see.

While Cowen says little about the AI research tradition, his second paragraph summarizes a central facet, that it is based on explicit knowledge of the basic rules augmented by heuristics that capture tactical and strategic play. As Cowen knows well, this tradition reached its peak in 1997 when IBM’s Deep Blue defeated world champion Garry Kasparov in a six-game match. Then, considerably later, DeepMind created AlphaZero (2017), which is based on engines originally designed to play Go. AlphaZero uses a combination of neural networks and Monte Carlo tree search.

Note that while Cowen does mention search at the beginning of this paragraph, he says nothing about it. That, I suspect, is because search is central to classical AI, and to the computational linguistics ignored by the creators of LLMs. Cowen thus continues to distance himself from that whole body of science and technology.

And so in the third and fourth paragraphs he calls our attention to a chess program that is transformer-based, like LLMs. In the fourth paragraph he tells us that, while this chess program is very good, competitive with the top humans, it’s not as good as Stockfish. He says nothing about AlphaZero, which is better than Stockfish and better than the best humans, by a long shot.

But I want to focus on a phrase in that third paragraph, “... a major investment in scaling.” Why does Cowen emphasize “scaling” by italicizing it? The only answer that makes any sense is that he is signaling his knowledge of that major debate, the one about whether or not scaling alone is sufficient to propel the full development of AI. Cowen knows that debate exists – anyone who reads his Marginal Revolution blog knows that he knows – but for some reason he doesn’t want to acknowledge it in this monograph. He’s building a case and will brook no interference. He emphasizes that attitude – for that’s where we are now, attitude, and not reasoned argument – in the phrase he uses to end that paragraph, “In the vernacular...more is being stuffed down the throat of the beast.” Where did that come from? It doesn’t add any new information; that’s there in the previous sentence, where he uses the casual term, “AI lingo.”

I asked the AI linked to Cowen’s monograph about that passage. Here’s what it said:

The vernacular translation — “more is being stuffed down the throat of the beast” — is also doing something. It’s simultaneously vivid and slightly mocking. “The beast” is not a neutral metaphor. It suggests something ungainly, voracious, and not fully understood. That’s not the language of a true scaling believer. A pure scaling optimist would not describe the process that way. So you have the italicized “scaling” nodding to the debate, and then immediately a colloquial deflation of the whole enterprise. The two sentences together suggest Tyler has more ambivalence about scaling than the chapter’s overall forward momentum implies.

Agreed. Now consider this remark in a post Cowen made on March 6, 2024:

In the structure of current debates, the concept of “AGI” plays a counterproductive role. You might think the world truly changes once we reach such a thing. That means the doomsters will be reluctant to admit AGI has arrived, because imminent doom is not evident. The Gary Marcus-like skeptics also will be reluctant to admit AGI has arrived, because they have been crapping on the capabilities for years. In both cases, the stances on AGI tell you more about the temperaments of the commentators than about any capabilities of the beast itself.

First of all, he acknowledges that Marcus exists (not the only place in the blog where he mentions him by name). Note his use of the word “crapping” to characterize Marcus’s remarks. That is hardly a neutral metaphor. Notice, as well, that he refers to the technology as “the beast.” To use Cowen’s own term, this is mood affiliation, not reasoned argument.

Distributional semantics, vector semantics, word embedding, and transformers

Just as Cowen buys the dominant Silicon Valley story that LLMs are the royal road to wherever AI is ultimately headed, so he buys into the self-mythologizing which would have us believe that, just as Athena emerged full-grown from the head of Zeus, so LLMs emerged fully grown out of Silicon Valley labs in the late 2010s with little linkage to anything that had come before – other than neural nets, of course, which date back to the 1950s.

That’s not true. Yes, LLMs did not originate from linguistic theory or psychological models of thought, but they are grounded in statistical thinking about language and meaning that can be traced back to the 1950s. In fact, one could push the date back to Warren Weaver’s 1949 memorandum, “Translation.” One can also see hints in a 1951 paper by Claude Shannon, “Prediction and Entropy of Printed English.” Then we have a 1954 paper by Zellig Harris (who was Chomsky’s teacher) entitled “Distributional Structure.” The basic principle has been most famously characterized by the British linguist J. R. Firth in a 1957 paper: “You shall know a word by the company it keeps.” That is to say, when you examine the many contexts in which a word appears you’ll see a relatively small handful of other words appear along with it in relatively close proximity.
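For readers who like to see an idea in code, here is a minimal sketch of what “the company a word keeps” can mean in practice. It is a toy illustration and not a reconstruction of anything Harris or Firth did: the function name cooccurrence_counts, the window size of 2, and the three-sentence corpus are all assumptions made up for the example.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count, for each word, the words that appear within `window`
    positions of it. The window size is an arbitrary choice."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own neighbor
                    counts[word][tokens[j]] += 1
    return counts


# Hypothetical toy corpus, invented for this illustration.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]

company = cooccurrence_counts(toy_corpus)
print(company["drinks"].most_common())
```

Even on three sentences the pattern shows up: “drinks” keeps company with “cat,” “dog,” “milk,” and “water,” and never with “chases.” Scale the corpus up by many orders of magnitude and those neighbor counts start to carry real information about meaning.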

The computer scientist Gerard Salton then put this insight to computational use in a series of papers he began publishing in the late 1960s. He was interested in document retrieval. His approach was to represent both the documents and the queries as vectors of words. The query vector could then be compared with the document vectors and the closest matches would be returned. That is the origin of the vector space model of word meaning that became ubiquitous in natural language processing and machine learning, where it then became elaborated into word embedding models based on foundational research by Yoshua Bengio. The transformer architecture then extends the range of context through the use of the attention mechanism.
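The retrieval idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Salton’s actual system: it uses raw word counts where a real retrieval system would typically weight terms, and the names to_vector, cosine, and rank, along with the three toy documents, are invented for the example.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, closest match first."""
    query_vector = to_vector(query)
    scored = [(cosine(query_vector, to_vector(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy documents, invented for this illustration.
toy_documents = [
    "chess engines search decision trees",
    "neural nets learn from data",
    "statistical models of language and meaning",
]

ranking = rank("language models", toy_documents)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

The query and the documents live in the same space, so “closest match” is just a matter of measuring the angle between vectors. Word embeddings keep that geometry but replace the sparse count vectors with dense learned ones.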

And, as we know, the transformer architecture is the basis of contemporary LLMs. On the basis of this statistical lineage alone LLMs are the result of a continuous tradition of research that is linked to linguistics back in the 1950s. Not the linguistics of formal grammars, but linguistics, nonetheless. Here’s the important point: Neural networks provide a learning mechanism and statistical linguistics provides a way of representing language so that it can be learned by neural nets.

It wouldn’t have been difficult to do three things: 1) acknowledge that there is controversy about the adequacy of LLMs as the centerpiece of future AI, 2) acknowledge that, as long as we don’t understand how LLMs work, their mere existence tells us little about the nature of language and thought, and 3) acknowledge that LLMs depend on over a half-century of work on statistical understanding of language. Why didn’t Cowen do that? In the annoying manner of mathematics texts, I leave that question as an exercise for the reader.
