NEW SAVANNA: scaling_in

Showing posts with label scaling_in_ML.

Thursday, January 8, 2026

The compute theory of everything

Samuel Albanie, Reflections on 2025, December 30, 2025. The first section, of three, is entitled "The Compute Theory of Everything." Here's an excerpt:

I have come to believe that every engineer must walk the road to Damascus in their own time. One does not simply adopt the Compute Theory of Everything by hearing others discuss it. You have to be viscerally shocked by the pyrotechnics of scale in a domain you know too well to be easily impressed.

For many senior engineers, that shock arrived in 2025. I have watched colleagues who were publicly sceptical through 2023 and 2024 quietly start to integrate these systems into their daily work. The “this is just a stochastic parrot” grimace has been replaced by the “this stochastic parrot just fixed my RE2 regex”. They still say “this can’t do what I do”, but the snorts of laughter have been replaced with a thoughtful silence and the subtle refreshing of their LinkedIn profile.

My own conversion came earlier. It is a privilege of my career that I was working in one of the first fields to get unceremoniously steamrollered by scaling: Computer Vision. During a glorious period at the VGG in Oxford, I spent months crafting bespoke, artisanal architectural inductive biases. They were beautiful, clever, and they had good names. And then, in early 2021, my approach was obliterated by a simple system that worked better because it radically scaled up pretraining compute. I spent a full afternoon walking around University Parks in shock. But by the time I reached the exit, the shock had been replaced by the annoying zeal of a convert.

Returning to my desk, it did not take long to discover that the Compute Theory of Everything is 50 years old and has been waiting patiently in a Stanford filing cabinet since the Ford administration.

In 1976, Hans Moravec wrote an essay called “The Role of Raw Power in Intelligence”, a document that possesses both the punch and the subtlety of a hand grenade. It is the sort of paper that enters the room, clears its throat, and informs the entire field of Artificial Intelligence that their fly is down. Moravec’s central thesis is that intelligence is not a mystical property of symbol manipulation, but a story about processing power, and he would like to explain this to you, at length, using log scales and a tone of suppressed screaming.

He starts with biology, noting that intelligence has evolved somewhat independently in at least four distinct lineages: in cephalopods, in birds, in cetaceans, and in primates. He spends several pages on the brainy octopus covering the independent evolution of copper-based blood and the neural architecture of the arms, citing a documentary in which an octopus figures out how to unscrew a bottle to retrieve a tasty lobster from inside. One gets the impression he prefers the octopus to many of his colleagues. The evolutionary point is that intelligence is not a fragile accident of primate biology. It is a recurring architectural pattern the universe stumbles upon whenever it leaves a pile of neurons unattended. The octopus and the crow did not copy each other’s homework. Instead, they converged on the answer because the answer works. The question is: what is the underlying resource?

Moravec’s answer is: it’s the compute, stupid.

To make his point, he compares the speed of the human optic nerve (approximately ten billion edge-detection operations per second) to the PDP-10 computers then available at Stanford. The gap is a factor of more than a million. He calls this deficit “a major distorting influence in current work, and a reason for disappointing progress.” He accuses the field of wishful thinking, scientific snobbery, and (my favourite) sweeping the compute deficit under the rug “for fear of reduced funding.” It is the sound of a man who has checked the numbers, realized the Emperor has no clothes, and is particularly annoyed that the Emperor has neither a GPU nor a meaningful stake in God’s Chosen Company: Nvidia (GCCN).

This leads to his aviation aphorism that has become modestly famous, at least among the demographic that reads 1976 robotics working papers for recreational purposes: “With enough power, anything will fly.” Before the Wright brothers, serious engineers built ornithopters (machines that flapped their wings, looked elegant, and stayed resolutely on the ground). Most failed. Some fatally. The consensus was that AI was a matter of knowledge representation and symbolic reasoning, and that people who talked about “raw power” were missing the point and possibly also the sort of people who enjoy watching videos of monster truck rallies (a group that includes your humble author). Moravec’s point was that the Symbolic AI crowd were busy building ornithopters, obsessing over lift-to-drag ratios, while the solution was to strap a massive engine to a plank and give researchers the chance to brute-force the laws of physics into submission.

Twenty-two years later, he published an update, “When Will Computer Hardware Match the Human Brain?”, which opens with a sentence that has aged like a 1998 Pomerol:

“The performance of AI machines tends to improve at the same pace that AI researchers get access to faster hardware.”

He plots curves, whips up a Fermi estimate that human-level cognition requires on the order of 100 million MIPS, and predicts this capability will be available in affordable machines by the 2020s. The paper includes a chart in which various organisms and machines are arrayed by estimated computational throughput. The spider outperforms the nematode by a humiliating margin. Deep Blue appears as a reference point for what IBM’s R&D budget bought you in 1997, which was the ability to defeat Garry Kasparov at chess while remaining unable to recognise a photograph of a chess piece. The figure is instructive, but after staring at it for a few minutes, it can start to grate on one’s sensibilities. Perhaps because it treats the human soul as an arithmetic problem. Philosophy on two axes.
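To put the two figures quoted in the excerpt on a common footing, here is a small unit-conversion sketch. It only uses the numbers in the text above; treating "more than a million" as exactly one million is my simplification, so the PDP-10 figure is an implied upper bound, not a measured rate.

```python
# Unit conversions for the two figures quoted in the excerpt above.
MIPS = 1_000_000  # one MIPS = one million instructions per second

# "on the order of 100 million MIPS" for human-level cognition
human_level_ips = 100_000_000 * MIPS

# "approximately ten billion edge-detection operations per second"
optic_nerve_ops = 10_000_000_000

# "a factor of more than a million": taken here as exactly 1e6 (an assumption),
# which makes the result an upper bound on the implied PDP-10 rate.
gap_factor = 1_000_000
pdp10_upper_bound_ops = optic_nerve_ops / gap_factor

print(f"{human_level_ips:.0e} instructions/s")       # 1e+14
print(f"{pdp10_upper_bound_ops:.0e} ops/s at most")  # 1e+04
```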

There's more on the compute theory of everything, which is worth your while.

Let me add that, for myself, sure, we need enough compute. That's necessary, but not sufficient. LLMs are a limited architecture. Throwing more compute at them isn't going to solve all the problems. I've got a working paper that's relevant: What Miriam Yevick Saw: The Nature of Intelligence and the Prospects for A.I., A Dialog with Claude 3.5 Sonnet. Here's Claude's summary:

Gary Marcus vindicated on the limits of scaling?

He seems to think so, and I agree. Though I also believe that LLMs probably have won a permanent place in the repertoire of techniques for AI devices. We just have to figure out how best to use them.

Here’s Marcus’s most recent post: Satya Nadella and the three stages of scientific truth. You know the three stages: First the idea is ridiculed, which happened with Marcus’s 2022 paper in which he declared that LLMs would hit a wall. In the second stage, the idea is opposed. In the third stage the idea wins, as though we’d known it all along.

Marcus quotes Microsoft’s CEO Satya Nadella:

So now in fact there is a lot of debate. In fact just in the last multiple weeks there is a lot of debate or have we hit the wall with scaling laws. Is it gonna continue? Again, the thing to remember at the end of the day these are not physical laws. There are just empirical observations that hold true just like Moore’s law did for a long period of time and so therefore it’s actually good to have some skepticism some debate because that I think will motivate more innovation on whether its model architectures or whether its data regimes or even system architecture.

Marcus notes that Marc Andreessen and Alexandr Wang have made similar statements.

Monday, November 11, 2024

Tom Dietterich on the current evolution of AI

Posted in a Substack conversation here:

An alternative view of what is happening is that we have been passing through three different phases of LLM-based development.

In Phase 1, "scaling is all you need" was the dominant view. As data, network size, and compute scaled, new capabilities (especially in-context learning) emerged. But each increment in performance required exponentially more data and compute.

In Phase 2, "scaling + external resources is all you need" became dominant. It started with RAG and toolformer, but has rapidly moved to include invoking python interpreters and external problem solvers (plan verifiers, wikipedia fact checking, etc.).

In Phase 3, "scaling + external resources + inference compute is all you need". I would characterize this as the realization that the LLM only provides part of what is needed for a complete cognitive system. OpenAI doesn't call it this, but we could view o1 as adopting the impasse mechanism of SOAR-style architectures. If the LLM has high uncertainty after a single forward pass through the model, it decides to conduct some form of forward search combined with answer checking/verification to find the right answer. In SOAR, this generates a new chunk in memory, and perhaps in OpenAI, they will salt this away as a new training example for periodic retraining. The cognitive architecture community has a mature understanding of the components of the human cognitive architecture and how they work together to achieve human general intelligence. In my view, they give us the best operational definition of AGI. If they are correct, then building a cognitive architecture by combining LLMs with the other mechanisms of existing cognitive architectures is likely to produce "AGI" systems with capabilities close to human cognitive capabilities.
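The Phase 3 control flow described above (answer directly when confident, otherwise fall back to search plus verification) can be sketched as follows. This is a minimal illustration of the idea only: it is not how o1 or SOAR is implemented, and `propose`, `sample_candidates`, `verify`, the threshold, and the budget are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Tuple


def answer_with_impasse(
    question: str,
    propose: Callable[[str], Tuple[str, float]],
    sample_candidates: Callable[[str, int], List[str]],
    verify: Callable[[str, str], bool],
    confidence_threshold: float = 0.9,
    search_budget: int = 8,
) -> Tuple[Optional[str], bool]:
    """Return (answer, used_search). Every callable is a hypothetical stand-in."""
    # One cheap pass: a proposed answer plus some confidence estimate.
    answer, confidence = propose(question)
    if confidence >= confidence_threshold:
        return answer, False
    # Impasse: spend extra inference compute on search plus verification.
    for candidate in sample_candidates(question, search_budget):
        if verify(question, candidate):
            return candidate, True
    return None, True  # budget exhausted without a verified answer


# Toy stand-ins so the sketch runs end to end.
def toy_propose(question: str) -> Tuple[str, float]:
    return "5", 0.4  # a low-confidence (and wrong) first guess


def toy_sample(question: str, n: int) -> List[str]:
    return [str(i) for i in range(n)]


def toy_verify(question: str, answer: str) -> bool:
    return int(answer) ** 2 == 49


result = answer_with_impasse(
    "Which non-negative integer squares to 49?", toy_propose, toy_sample, toy_verify
)
print(result)  # ('7', True)
```

A verified answer found this way is the kind of thing that, in the passage's speculation, might be kept as a new training example.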

Thursday, October 3, 2024

Problems with so-called AI scaling laws

Arvind Narayanan and Sayash Kapoor, AI Scaling Myths, AI Snake Oil, June 27, 2024. The introduction:

So far, bigger and bigger language models have proven more and more capable. But does the past predict the future?

One popular view is that we should expect the trends that have held so far to continue for many more orders of magnitude, and that it will potentially get us to artificial general intelligence, or AGI.

This view rests on a series of myths and misconceptions. The seeming predictability of scaling is a misunderstanding of what research has shown. Besides, there are signs that LLM developers are already at the limit of high-quality training data. And the industry is seeing strong downward pressure on model size. While we can't predict exactly how far AI will advance through scaling, we think there’s virtually no chance that scaling alone will lead to AGI.

Under the heading, "Scaling “laws” are often misunderstood", they note:

Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence. Of course, perplexity is more or less irrelevant to end users — what matters is “emergent abilities”, that is, models’ tendency to acquire new capabilities as size increases.

Emergence is not governed by any law-like behavior. It is true that so far, increases in scale have brought new capabilities. But there is no empirical regularity that gives us confidence that this will continue indefinitely.

Why might emergence not continue indefinitely? This gets at one of the core debates about LLM capabilities — are they capable of extrapolation or do they only learn tasks represented in the training data? The evidence is incomplete and there is a wide range of reasonable ways to interpret it. But we lean toward the skeptical view.
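For readers unfamiliar with the quantity that scaling laws track, here is a minimal sketch of perplexity under its standard definition: the exponential of the average negative log-probability a model assigns to each actual next token. The probabilities below are made up for illustration and do not come from any real model.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Made-up probabilities that two hypothetical models assign to the true next tokens.
weaker_model = [0.25, 0.25, 0.25, 0.25]
stronger_model = [0.5, 0.5, 0.5, 0.5]

print(perplexity(weaker_model))    # about 4.0
print(perplexity(stronger_model))  # about 2.0
```

Lower perplexity means better next-token prediction. As the excerpt notes, it says nothing directly about which new capabilities, if any, appear.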

There is much more under the following headings:

• Trend extrapolation is baseless speculation
• Synthetic data is not magic
• Models have been getting smaller but are being trained for longer
• The ladder of generality

These remarks are from the section on models getting smaller:

In other words, there are many applications that are possible to build with current LLM capabilities but aren’t being built or adopted due to cost, among other reasons. This is especially true for “agentic” workflows which might invoke LLMs tens or hundreds of times to complete a task, such as code generation.

In the past year, much of the development effort has gone into producing smaller models at a given capability level. Frontier model developers no longer reveal model sizes, so we can’t be sure of this, but we can make educated guesses by using API pricing as a rough proxy for size. GPT-4o costs only 25% as much as GPT-4 does, while being similar or better in capabilities. We see the same pattern with Anthropic and Google. Claude 3 Opus is the most expensive (and presumably biggest) model in the Claude family, but the more recent Claude 3.5 Sonnet is both 5x cheaper and more capable. Similarly, Gemini 1.5 Pro is both cheaper and more capable than Gemini 1.0 Ultra. So with all three developers, the biggest model isn’t the most capable!

Training compute, on the other hand, will probably continue to scale for the time being. Paradoxically, smaller models require more training to reach the same level of performance. So the downward pressure on model size is putting upward pressure on training compute.

Check out the newsletter, AI Snake Oil, and the book of the same title.

Monday, July 1, 2024

Scalable LLMs without Matrix Multiplication

Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, et al., Scalable MatMul-free Language Modeling, arXiv:2406.02528v5 [cs.CL]

Abstract: Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at this https URL.
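The abstract does not say how the multiplications are removed. One ingredient in this line of work is restricting dense-layer weights to the ternary values -1, 0, and +1, so that a matrix-vector product reduces to additions and subtractions. The sketch below illustrates only that idea; it is not the authors' implementation, and the function name and toy values are mine.

```python
from typing import List


def ternary_matvec(weights: List[List[int]], x: List[float]) -> List[float]:
    """Compute W @ x where every weight is -1, 0, or +1, using only add/subtract."""
    out = []
    for row in weights:
        if len(row) != len(x):
            raise ValueError("row length must match input length")
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1 weight: add the input
            elif w == -1:
                acc -= xi   # -1 weight: subtract the input
            elif w != 0:
                raise ValueError("weights must be -1, 0, or +1")
            # 0 weight: skip the input entirely
        out.append(acc)
    return out


W = [[1, 0, -1], [-1, 1, 1]]  # toy ternary weight matrix
x = [2.0, 3.0, 5.0]
print(ternary_matvec(W, x))  # [-3.0, 6.0]
```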

Friday, June 21, 2024

Scaling the size of LLMs yields sharply diminishing persuasive returns

We find that:

➡️ current frontier models (GPT-4, Claude-3) are barely more persuasive than models smaller in size by an order of magnitude or more, and

➡️ mere task completion (coherence, staying on topic) appears to account for larger models' persuasive advantage.
— Kobi Hackenburg (@KobiHackenburg) June 21, 2024

Saturday, April 13, 2024

Dario Amodei on "AGI" and the exponential curve [Beware the intellectual monoculture]

Ezra Klein, What if Dario Amodei Is Right About A.I.? NYTimes, Apr. 12, 2024.

AGI

Let's skip over a lot of stuff to get to AGI:

EZRA KLEIN: You don’t love the framing of artificial general intelligence, what gets called A.G.I. Typically, this is all described as a race to A.G.I., a race to this system that can do kind of whatever a human can do, but better. What do you understand A.G.I. to mean, when people say it? And why don’t you like it? Why is it not your framework?

DARIO AMODEI: So it’s actually a term I used to use a lot 10 years ago. And that’s because the situation 10 years ago was very different. 10 years ago, everyone was building these very specialized systems, right? Here’s a cat detector. You run it on a picture, and it’ll tell you whether a cat is in it or not. And so I was a proponent all the way back then of like, no, we should be thinking generally. Humans are general. The human brain appears to be general. It appears to get a lot of mileage by generalizing. You should go in that direction.

And I think back then, I kind of even imagined that that was like a discrete thing that we would reach at one point. But it’s a little like, if you look at a city on the horizon and you’re like, we’re going to Chicago, once you get to Chicago, you stop talking in terms of Chicago. You’re like, well, what neighborhood am I going to? What street am I on?

And I feel that way about A.G.I. We have very general systems now. In some ways, they’re better than humans. In some ways, they’re worse. There’s a number of things they can’t do at all. And there’s much improvement still to be gotten. So what I believe in is this thing that I say like a broken record, which is the exponential curve. And so, that general tide is going to increase with every generation of models.

And there’s no one point that’s meaningful. I think there’s just a smooth curve. But there may be points which are societally meaningful, right? We’re already working with, say, drug discovery scientists, companies like Pfizer or Dana-Farber Cancer Institute, on helping with biomedical diagnosis, drug discovery. There’s going to be some point where the models are better at that than the median human drug discovery scientists. I think we’re just going to get to a part of the exponential where things are really interesting.

Just like the chat bots got interesting at a certain stage of the exponential, even though the improvement was smooth, I think at some point, biologists are going to sit up and take notice, much more than they already have, and say, oh, my God, now our field is moving three times as fast as it did before. And now it’s moving 10 times as fast as it did before. And again, when that moment happens, great things are going to happen.

And we’ve already seen little hints of that with things like AlphaFold, which I have great respect for. I was inspired by AlphaFold, right? A direct use of A.I. to advance biological science, which it’ll advance basic science. In the long run, that will advance curing all kinds of diseases. But I think what we need is like 100 different AlphaFolds. And I think the way we’ll ultimately get that is by making the models smarter and putting them in a position where they can design the next AlphaFold.

I like the cities analogy. And, while he doesn't say much about that exponential curve here, he has earlier.

Scaling

As far as I can tell he thinks scaling will take us "to infinity and beyond," to quote Buzz Lightyear. Color me skeptical. I think scaling will top out at some point in the next decade or two. Just what range of behaviors AI will represent at that point, I don't know. Scaling up machine learning has taken us to a new region of the space, but I don't see any reason to believe that it exhausts the space.

Here's what bothers me, the belief in scaling (from earlier in the dialog):

DARIO AMODEI: Yes, we’re going to have to make bigger models that use more compute per iteration. We’re going to have to run them for longer by feeding more data into them. And that number of chips times the amount of time that we run things on chips is essentially dollar value because these chips are — you rent them by the hour. That’s the most common model for it. And so, today’s models cost of order $100 million to train, plus or minus factor two or three.

The models that are in training now and that will come out at various times later this year or early next year are closer in cost to $1 billion. So that’s already happening. And then I think in 2025 and 2026, we’ll get more towards $5 or $10 billion.

EZRA KLEIN: So we’re moving very quickly towards a world where the only players who can afford to do this are either giant corporations, companies hooked up to giant corporations — you all are getting billions of dollars from Amazon. OpenAI is getting billions of dollars from Microsoft. Google obviously makes its own.

You can imagine governments — though I don’t know of too many governments doing it directly, though some, like the Saudis, are creating big funds to invest in the space. When we’re talking about the model’s going to cost near to $1 billion, then you imagine a year or two out from that, if you see the same increase, that would be $10-ish billion. Then is it going to be $100 billion? I mean, very quickly, the financial artillery you need to create one of these is going to wall out anyone but the biggest players.

DARIO AMODEI: I basically do agree with you. I think it’s the intellectually honest thing to say that building the big, large scale models, the core foundation model engineering, it is getting more and more expensive. And anyone who wants to build one is going to need to find some way to finance it. And you’ve named most of the ways, right? You can be a large company. You can have some kind of partnership of various kinds with a large company. Or governments would be the other source.

I think one way that it’s not correct is, we’re always going to have a thriving ecosystem of experimentation on small models. For example, the open source community working to make models that are as small and as efficient as possible that are optimized for a particular use case. And also downstream usage of the models. I mean, there’s a blooming ecosystem of startups there that don’t need to train these models from scratch. They just need to consume them and maybe modify them a bit.

$100 billion (Klein's number, not Amodei's) to train one model? That's a lot of money, and at the moment those decisions are being made by a relatively small group of people whose ideas are dominated by the bigger-is-better-Foundation-model culture that dominates A.I. these days. That makes me very uncomfortable.
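Amodei's accounting in the dialog above is simple: the number of chips times the hours they run, priced at an hourly rental rate. Here is a back-of-the-envelope sketch of that arithmetic. The chip count, run length, and rate are invented purely so the product lands near his "of order $100 million"; they are not reported figures. The tenfold steps follow Klein's "if you see the same increase" extrapolation.

```python
def training_cost_usd(num_chips: int, hours: float, usd_per_chip_hour: float) -> float:
    """Rental-style cost: chips x hours x hourly rate."""
    return num_chips * hours * usd_per_chip_hour


# Hypothetical inputs (assumptions, not reported figures):
# 10,000 chips, 100 days of training, $4 per chip-hour.
cost = training_cost_usd(num_chips=10_000, hours=100 * 24, usd_per_chip_hour=4.0)
print(f"${cost:,.0f}")  # $96,000,000

# Klein's extrapolation in the dialog: about $1 billion, then tenfold steps.
klein_steps = [1_000_000_000 * 10**k for k in range(3)]
print([f"${c:,}" for c in klein_steps])
# ['$1,000,000,000', '$10,000,000,000', '$100,000,000,000']
```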

Too much power

Judging from some remarks Amodei makes later in the dialog, it makes him uncomfortable as well:

DARIO AMODEI: ...if these predictions on the exponential trend are right, and we should be humble — and I don’t know if they’re right or not. My only evidence is that they appear to have been correct for the last few years. And so, I’m just expecting by induction that they continue to be correct. I don’t know that they will, but let’s say they are. The power of these models is going to be really quite incredible.

And as a private actor in charge of one of the companies developing these models, I’m kind of uncomfortable with the amount of power that that entails. I think that it potentially exceeds the power of, say, the social media companies maybe by a lot.

You know, occasionally, in the more science fictiony world of A.I. and the people who think about A.I. risk, someone will ask me like, OK, let’s say you build the A.G.I. What are you going to do with it? Will you cure the diseases? Will you create this kind of society?

And I’m like, who do you think you’re talking to? Like a king? I just find that to be a really, really disturbing way of conceptualizing running an A.I. company. And I hope there are no companies whose C.E.O.s actually think about things that way.

I mean, the whole technology, not just the regulation, but the oversight of the technology, like the wielding of it, it feels a little bit wrong for it to ultimately be in the hands — maybe I think it’s fine at this stage, but to ultimately be in the hands of private actors. There’s something undemocratic about that much power concentration.

EZRA KLEIN: I have now, I think, heard some version of this from the head of most of, maybe all of, the A.I. companies, in one way or another. And it has a quality to me of, Lord, grant me chastity but not yet.

Which is to say that I don’t know what it means to say that we’re going to invent something so powerful that we don’t trust ourselves to wield it. I mean, Amazon just gave you guys $2.75 billion. They don’t want to see that investment nationalized.

No matter how good-hearted you think OpenAI is, Microsoft doesn’t want GPT-7, all of a sudden, the government is like, whoa, whoa, whoa, whoa, whoa. We’re taking this over for the public interest, or the U.N. is going to handle it in some weird world or whatever it might be. I mean, Google doesn’t want that.

And this is a thing that makes me a little skeptical of the responsible scaling laws or the other iterative versions of that I’ve seen in other companies or seen or heard talked about by them, which is that it’s imagining this moment that is going to come later, when the money around these models is even bigger than it is now, the power, the possibility, the economic uses, the social dependence, the celebrity of the founders. It’s all worked out. We’ve maintained our pace on the exponential curve. We’re 10 years in the future.

Interpretability

DARIO AMODEI: And one of the things we and others have found is that, sometimes, there are specific neurons, specific statistical indicators inside the model, not necessarily in its external responses, that can tell you when the model is lying or when it’s telling the truth.

And so at some level, sometimes, not in all circumstances, the models seem to know when they’re saying something false and when they’re saying something true. I wouldn’t say that the models are being intentionally deceptive, right? I wouldn’t ascribe agency or motivation to them, at least in this stage in where we are with A.I. systems. But there does seem to be something going on where the models do seem to need to have a picture of the world and make a distinction between things that are true and things that are not true.

If you think of how the models are trained, they read a bunch of stuff on the internet. A lot of it’s true. Some of it, more than we’d like, is false. And when you’re training the model, it has to model all of it. And so, I think it’s parsimonious, I think it’s useful to the model’s picture of the world for it to know when things are true and for it to know when things are false.

And then the hope is, can we amplify that signal? Can we either use our internal understanding of the model as an indicator for when the model is lying, or can we use that as a hook for further training? And there are at least hooks. There are at least beginnings of how to try to address this problem.

EZRA KLEIN: So I try as best I can, as somebody not well-versed in the technology here, to follow this work on what you’re describing, which I think, broadly speaking, is interpretability, right? Can we know what is happening inside the model? And over the past year, there have been some much hyped breakthroughs in interpretability.

And when I look at those breakthroughs, they are getting the vaguest possible idea of some relationships happening inside the statistical architecture of very toy models built at a fraction of a fraction of a fraction of a fraction of a fraction of the complexity of Claude 1 or GPT-1, to say nothing of Claude 2, to say nothing of Claude 3, to say nothing of Claude Opus, to say nothing of Claude 4, which will come whenever Claude 4 comes.

We have this quality of like maybe we can imagine a pathway to interpreting a model that has a cognitive complexity of an inchworm. And meanwhile, we’re trying to create a superintelligence. How do you feel about that? How should I feel about that? How do you think about that?

DARIO AMODEI: I think, first, on interpretability, we are seeing substantial progress on being able to characterize, I would say, maybe the generation of models from six months ago. I think it’s not hopeless, and we do see a path. That said, I share your concern that the field is progressing very quickly relative to that.

And we’re trying to put as many resources into interpretability as possible. We’ve had one of our co-founders basically founded the field of interpretability. But also, we have to keep up with the market. So all of it’s very much a dilemma, right? Even if we stopped, then there’s all these other companies in the U.S.. And even if some law stopped all the companies in the U.S., there’s a whole world of this.

There's much more in the discussion. Persuasion is one (scary) topic. Energy usage is another. Copyright and economic displacement too.

Thursday, July 21, 2022

Effects of scaling model parameters, but not number of tokens

Rohin Shah, [AN #173] Recent language model results from DeepMind, LessWrong, July 20, 2022.

Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Jack W. Rae et al) (summarized by Rohin): This paper details the training of the Gopher family of large language models (LLMs), the biggest of which is named Gopher and has 280 billion parameters. The algorithmic details are very similar to the GPT series (AN #102): a Transformer architecture trained on next-word prediction. The models are trained on a new data distribution that still consists of text from the Internet but in different proportions (for example, book data is 27% of Gopher’s training data but only 16% of GPT-3’s training data). [...]

The most interesting aspect of the paper (to me) is that the entire Gopher family of models were all trained on the same number of tokens, thus allowing us to study the effect of scaling up model parameters (and thus training compute) while holding data constant. Some of the largest benefits of scale were seen in the Medicine, Science, Technology, Social Sciences, and the Humanities task categories, while scale has not much effect or even a negative effect in the Maths, Logical Reasoning, and Common Sense categories. Surprisingly, we see improved performance on TruthfulQA (AN #165) with scale, even though the TruthfulQA benchmark was designed to show worse performance with increased scale.

There's more at the link.

Chart: Gopher vs. SOTA, from the original paper, p. 9.

Tuesday, July 5, 2022

The limitations of scale in deep learning

The power of scale is a blessing for ML. Indeed, scale has been the tool behind most major advances, starting with AlexNet. But it is also a curse: it is physically impossible to maintain our current rate of progress if scale is our only tool.
— Tom Goldstein (@tomgoldsteincs) July 5, 2022

Friday, July 1, 2022

Just what is intelligence, anyhow? [Is it simply a matter of scale?]

The Aaronson/Pinker debate on AI scaling generated a lot of commentary, including some from me and some from NYU’s Ernie Davis, who works closely with Gary Marcus. I’ve gathered some of those together in this post. But first....

FYI: "If we do try to define “intelligence” in terms of mechanism rather than magic, it seems to me it would be something like “the ability to use information to attain a goal in an environment.”" this is how John McCarthy defined intelligence decades ago (almost verbatim)
— Madame Pratolungo joining Mastodon (@MadamePratolung) July 1, 2022

What’s interesting is that that definition defines intelligence as a relation between some device (natural or artificial) and the environment in which it operates. That relationship has been dogging AI for some time.

Moravec’s paradox

Here is my first contribution to the debate (comment #81):

There is a song lyric, "Fools rush in, where angels fear to tread." Call me a fool.

Scott #33:

...stepping back: my exchanges with you, Steve, and others have been useful for me, in clarifying how “the power or powerlessness of pure intellectual ability to shape the world” is really at the heart of the entire AGI debate.

Well, yes, though the first time I read that I gave it a very reductive reading where "pure intellectual ability" was something like "computational horsepower". However, the relationship between computational horsepower and pure intellectual ability (whatever that might be) is at best unspecified. Still, computational horsepower is certainly at the center of current debates about scaling. And it's quite clear that the abundance of relatively cheap compute has been extraordinarily important.

Take chess, which has been at the center of AI since before the 1956 Dartmouth conference. Chess is a rather special kind of problem. From an abstract point of view it is no more difficult than tic-tac-toe. Both are finite games played on a very simple physical platform. However, the chess tree is so very much larger than the tic-tac-toe tree that playing the game is challenging for even the most practiced adults, while tic-tac-toe challenges no one over the age of, what? seven?

However, the fact that the chess tree is generated from a relatively simple basic structure (64 squares, 32 pieces, highly restrictive rules) means that compute can be thrown at the problem in a relatively straightforward way. And the availability of compute has been important in the conquest of chess. It's certainly not the only thing, but without it, we'd be stuck where we were well before Deep Blue beat Kasparov.

In contrast, image recognition, machine translation, and common sense knowledge are quite different in character from chess. The number of possible images is unbounded and they're in all forms. In language, the number of word types may be finite, but it's not well-defined, and the number of different texts is unbounded. The same goes for common sense. Throwing more and more compute at these problems helps, but computational approaches to them, and to others like them, have not produced computers that perform at the Kasparov level, and better, in those respective domains.

This has been known for a long time; it has a name: Moravec’s paradox. I think we should keep it in mind during these discussions.

Note that Moravec’s paradox is about the nature of the environment in which computation is tasked with achieving goals. Some environments are more amenable to computational regimes we understand than others.
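To put a number on the tic-tac-toe side of that comparison, here is a minimal Python sketch that walks the entire game tree by brute force. The function names (winner, count_games) are illustrative, not from any library. The same exhaustive walk is hopeless for chess, where Shannon's well-known rough estimate puts the game tree on the order of 10^120 games.

```python
# All eight winning lines on a 3x3 board, cells indexed 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def count_games(board=None, player="X"):
    """Count every distinct complete game (every leaf of the game tree)."""
    if board is None:
        board = [" "] * 9
    # A game ends on a win or when the board is full.
    if winner(board) is not None or " " not in board:
        return 1
    total = 0
    for i in range(9):
        if board[i] == " ":
            board[i] = player
            total += count_games(board, "O" if player == "X" else "X")
            board[i] = " "  # undo the move before trying the next one
    return total


if __name__ == "__main__":
    print(count_games())  # 255168
```

A few hundred thousand games is an afternoon's bookkeeping for a person and an instant for a laptop. The point of the comparison is that chess has the same kind of tree, only far too large to walk.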

Ernie Davis on computational speed and intelligence

Here is his comment #18, in full:

Let me suggest the following thought experiment. Suppose we take some mediocre, stick-in-the-mud scientist from 1910 who rejected not just special relativity but also atomic theory, the kinetic theory of heat, and Darwinian evolution — there were, of course, quite a few such. Now speed him up by a factor of 1000. One’s intuition is that result would be thousands of mediocre papers, and no great breakthroughs. On the other hand, it doesn’t seem right to say that Einstein, Planck and so on were 1000 times more intelligent than him; in terms of measures like IQ, they may not have been at all less intelligent than him. So I am really doubtful that this speeding up process has much to do with genius in the sense of Einstein et al. And therefore I think your intuition about speeding up Einstein by a factor of 1000 is also wrong. Had we speeded up Einstein by a factor of 1000 during his lifetime starting in 1905, we might have gotten the great papers of 1905 within a day (as fast as he could physically write them) and general relativity within a week, (ignoring the fact that that involved interactions with non-speeded up people) but I don’t think you can be confident about how much more we would have gotten.
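The arithmetic behind those figures is easy to check. A minimal sketch, assuming a flat 1000x speedup and a 365.25-day year. The function name wall_clock_days is just for illustration, and the ten-year figure is the gap between the 1905 papers and the completion of general relativity in 1915.

```python
SPEEDUP = 1000          # the speedup factor used in the thought experiment
DAYS_PER_YEAR = 365.25  # assumption: average calendar year


def wall_clock_days(subjective_years, speedup=SPEEDUP):
    """Days of ordinary time needed for `subjective_years` of sped-up thinking."""
    return subjective_years * DAYS_PER_YEAR / speedup


if __name__ == "__main__":
    # One year of work (the 1905 papers) fits in well under a day.
    print(wall_clock_days(1) * 24, "hours")
    # Ten years of work (1905 to 1915) fits in under a week.
    print(wall_clock_days(10), "days")
```

So "within a day" and "within a week" are what a factor of 1000 gives you. The calculation says nothing about what comes after, which is Davis's point.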

And some passages from his comment #24:

On the last point: I think that the terminology does matter, because the view that “intelligence” is a well-defined, scalar, characteristic of minds, shown in its highest degree by people of exceptional intellectual accomplishment, is an error, and not an innocuous one. There is really very little reason to think that the qualities of mind that made Jane Austen exceptional had anything at all in common with the quality of mind that made Ramanujan exceptional; or the qualities of mind that made Chopin, Emily Dickinson, William James, or Rachel Carson exceptional. [...]

Of course, if you take all of human history and, so to speak, videotape it and then run the video tape at 1000 x speed, then things happen 1000 times as fast. So what?

Indeed, so what?

What if searching for ideas is like searching for diamonds?

This is an idea I explored more extensively in a post from 2020, Stagnation, Redux: It’s the way of the world [good ideas are not evenly distributed, no more so than diamonds]. I subsequently incorporated that post into a working paper, What economic growth and statistical semantics tell us about the structure of the world.

Comment #108:

I would like to elaborate on the comment Ernie Davis made at #18, because I suspect he’s correct. I suspect that 1000X Einstein would have given us his great work rather quickly but that [he] would [then] have proceeded out into the same intellectual desert the real Einstein explored, but managed to explore it much more thoroughly, with, alas, the same success.

Just how are ideas distributed in idea space? (Is that even a coherent question?)

Let me suggest an analogy: diamonds. We know that they are not evenly distributed on or near the earth’s surface. Most of them seem to be in kimberlite (a type of rock) and that’s where diamond mines are located. Even there, they are few, far between, and irregularly located. So it takes a great deal of labor to find each diamond.

Now, imagine we have a robot that can find diamonds at 1000 times the rate human miners can, but only costs, say, 10 times or even 100 times more per hour. Such robots would be very valuable. Now, let’s place a bunch of these 1000X robots on some arbitrary chunk of land and let them dig and sort away. What are they going to find? Probably nothing. Why? Because there are no diamonds there. They may be very good at excavating, moving, crushing, and sorting through earth, but if there are no diamonds there, the effort is wasted.

Perhaps ideas and idea space are like that. The ideas are unevenly distributed. We have no maps to guide us to them. But we have theories, and hunches, an intellectual style. Think of them collectively as a mapping procedure. So, Einstein had his intellectual style, his mapping procedure. That led to roughly a decade of important discoveries in his 20s and 30s, like diamond miners working in kimberlite. And then, nothing, like diamond miners working, say, in the middle of Vermont. Nice country, but no diamonds.

As for idea space, we can imagine it by analogy with chess space. But we know how to construct chess space, though it is too large for anything approaching a complete construction. And that knowledge allows us to construct useful procedures for searching it. We haven’t a clue about how to construct idea space, much less how to search it effectively. If speed is all we’ve got, it’s not clear how much that gets us in the general case.

It’s not at all obvious that we need the notion of idea space in the case of Einstein, and similar cases. Einstein’s just searching the world for a fit between his best thinking and natural phenomena. Chess space, of course, is different. It is entirely artificial; we created it when we created the game. The world Einstein explored pre-existed him (and us).
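The diamond analogy comes down to a product: finds are roughly search rate times hours times the local density of diamonds, and a density of zero cancels any rate. Here is a toy Monte Carlo sketch of that point. Every number in it (cells examined per hour, hours, densities) is made up for illustration, and the function name dig is hypothetical.

```python
import random


def dig(cells_per_hour, hours, density, rng):
    """Count diamonds found when each examined cell of earth
    holds a diamond with probability `density`."""
    cells = cells_per_hour * hours
    return sum(rng.random() < density for _ in range(cells))


rng = random.Random(0)  # fixed seed so the run is repeatable

HOURS = 100
KIMBERLITE = 0.01  # assumed density where diamonds actually occur
VERMONT = 0.0      # assumed density on an arbitrary chunk of land

HUMAN_RATE = 10                 # cells examined per hour (made up)
ROBOT_RATE = 1000 * HUMAN_RATE  # the 1000X robot

miner_in_kimberlite = dig(HUMAN_RATE, HOURS, KIMBERLITE, rng)
robot_in_kimberlite = dig(ROBOT_RATE, HOURS, KIMBERLITE, rng)
robot_in_vermont = dig(ROBOT_RATE, HOURS, VERMONT, rng)

if __name__ == "__main__":
    print("human miner, kimberlite:", miner_in_kimberlite)
    print("1000X robot, kimberlite:", robot_in_kimberlite)
    print("1000X robot, Vermont:   ", robot_in_vermont)  # always 0
```

Speed pays off handsomely inside the kimberlite and not at all outside it. What the sketch cannot supply is the map, and that is the part we lack for idea space.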

Five factors of genius/intelligence

Further response to Davis, #18:

Suppose we take some mediocre, stick-in-the-mud scientist from 1910 who rejected not just special relativity but also atomic theory, the kinetic theory of heat, and Darwinian evolution — there were, of course, quite a few such. Now speed him up by a factor of 1000. One’s intuition is that result would be thousands of mediocre papers, and no great breakthroughs. On the other hand, it doesn’t seem right to say that Einstein, Planck and so on were 1000 times more intelligent than him; in terms of measures like IQ, they may not have been at all less intelligent than him.

Speed is one thing. And IQ is another. Einstein had something else. I suppose we could call it genius, in fact we do, don’t we? But that doesn’t tell us much.

For the sake of argument – I’m just making this up as I type – let’s say one aspect of that something else is intellectual technique. Einstein had more effective intellectual tactics and strategies than those standard investigators. Intellectual technique may, in turn, have a genetic aspect that’s not covered by IQ, but almost certainly has a learned aspect as well.

So now we have four things: 1) speed/compute, 2) IQ, 3) an inherited component of technique, and 4) a learned component of technique.

I’m going to posit one more thing, again, thinking off the top of my head. We might call it luck. Or, if we’re thinking in terms of something like idea space, we could call it initial position. By virtue of 1, 2, 3 and perhaps 4 as well, the so-called genius is at a position in idea space that allows them to make major discoveries by deploying their cumulative capabilities. The point of this initial-position factor is to allow for the possibility of a cohort of thinkers more or less equally endowed with 1, 2, 3, and 4, but having very different initial positions. As a consequence, some are able to achieve major discoveries quickly, while others take more time, and still others never get there. Their capabilities are comparable, but their outcomes are not.

To invoke the diamond mining metaphor I introduced in comment #108, we have two equally skilled geologists/prospectors. One just happens to be located within 100 miles of a major kimberlite deposit while the other is over 3000 miles away from such a deposit. If they start walking from where they are, who’s going to find diamonds first?

In the case of AI, we know a great deal about compute/speed; we have that under control. I’m not sure just how the distinction between innate and learned techniques applies to machines; perhaps it corresponds to hardware and software. In any case, we do have a large repertoire of techniques of various kinds. In some areas we can produce a combination of compute and technique that allows the machine to outperform the best human. In other areas we have machines that do things that are amazing in comparison with what machines did, say, a decade ago, but which are no more than standard human performances, with various failings here and there. And so on. As for starting position, I think it’s up to us to position the AI properly, at least at the start.

[But if and when it FOOMs, it’s on its own. I’m not holding my breath on this one.]

Figure 5 in the 2020 New Savanna post gives a visual illustration of the initial position idea.

We now have a total of five factors:

1) compute,
2) IQ,
3) an inherited component of technique,
4) a learned component of technique, and
5) initial position or luck.

The seductiveness of scale

The hope of the scaling side of the current debate is that we can get all the way to AGI – whatever that is – by throwing more compute at the problem. Well, it’s not that simple: the compute has to be channeled through an appropriate machine-learning architecture, which then chews its way through a huge pile of (appropriately curated) data. That is, architecture+data will cover the ground I’ve indicated in factors 2-5 above, thereby relieving us of the need and responsibility to think about those things.

It’s a seductive prospect. Why? In part because it is easy to understand, even by people who have little or no technical knowledge of computing, cognitive science, and AI. Everyone knows and understands, “bigger is better.”

GOFAI (good old-fashioned artificial intelligence) was mostly about technique, factors 3 and 4. That technique was generally taken to be mediated by symbolic systems and, as a practical matter, it required that ‘knowledge’ be painstakingly hand-crafted into systems. While I can understand the desire to avoid hand-crafted knowledge – there’s so very much of it and the crafting is tedious and error-prone – I don’t think symbolic computation can be avoided. Can it be architected, as it were, into a learning regime? We know one case where it has been, the human case, but that case tells us that learning requires a lot of close interaction between teachers and students, in both formal and informal settings. It’s not at all clear to me that such interaction can be architected.

More later.

Addendum, 7.1.22, on superintelligence: Alex, comment #171:

I think Pinker’s definition of intelligence, “the ability to use information to attain a goal in an environment”, is reasonable, but it doesn’t give us any meaningful way to compare or rank intelligences (so how can we meaningfully discuss “superintelligence”?). Of course, you chose compute time as the metric, but I think that dodges the more meaningful aspects of intelligence. I think a metric like computational complexity – or even Kolmogorov complexity – is more appealing to me, but whatever the metric, I think it has to capture the mechanism of thought in some way, not just the output. [...]

As a final note, I think “intelligence” is a crude word that tries to capture too many aspects of behavior (many of them human-relatable, but not of great importance to discussion). My comment here has been an attempt to break up “intelligence” into constituent parts to focus discussion: clock speed, memory, algorithmic/time complexity, size/space complexity. There are surely more parts of “intelligence”, some parts that are combinations of simpler parts.

Scott, comment #172:

Fundamentally, I care, not about the definitions of words like “superintelligence,” but about what will actually happen in the real world once AIs become much more powerful. [...] So OK then, what happens when we can launch a billion processes in datacenters, each one with the individual insight of a Terry Tao or Edward Witten (or the literary talent of Philip Roth, or the musical talent of the Beatles…), and they can all communicate with one another, and they can work at superhuman speed? Is it not obvious that all important intellectual and artistic production shifts entirely to AIs, with humans continuing to engage in it (if they do) only as a hobby? That’s the main question I care about when I discuss “superintelligence,” and I’m still waiting for anyone to explain why I’m wrong about it.

Thursday, June 30, 2022

Steven Pinker and Scott Aaronson debate scaling

Scott Aaronson has hosted Steven Pinker for a discussion at Shtetl-Optimized.

Pinker on AGI:

Regarding the second, engineering question of whether scaling up deep-learning models will “get us to Artificial General Intelligence”: I think the question is probably ill-conceived, because I think the concept of “general intelligence” is meaningless. (I’m not referring to the psychometric variable g, also called “general intelligence,” namely the principal component of correlated variation across IQ subtests. This is a variable that aggregates many contributors to the brain’s efficiency such as cortical thickness and neural transmission speed, but it is not a mechanism (just as “horsepower” is a meaningful variable, but it doesn’t explain how cars move.) I find most characterizations of AGI to be either circular (such as “smarter than humans in every way,” begging the question of what “smarter” means) or mystical—a kind of omniscient, omnipotent, and clairvoyant power to solve any problem. No logician has ever outlined a normative model of what general intelligence would consist of, and even Turing swapped it out for the problem of fooling an observer, which spawned 70 years of unhelpful reminders of how easy it is to fool an observer.

If we do try to define “intelligence” in terms of mechanism rather than magic, it seems to me it would be something like “the ability to use information to attain a goal in an environment.” (“Use information” is shorthand for performing computations that embody laws that govern the world, namely logic, cause and effect, and statistical regularities. “Attain a goal” is shorthand for optimizing the attainment of multiple goals, since different goals trade off.) Specifying the goal is critical to any definition of intelligence: a given strategy in basketball will be intelligent if you’re trying to win a game and stupid if you’re trying to throw it. So is the environment: a given strategy can be smart under NBA rules and stupid under college rules.

Since a goal itself is neither intelligent or unintelligent (Hume and all that), but must be exogenously built into a system, and since no physical system has clairvoyance for all the laws of the world it inhabits down to the last butterfly wing-flap, this implies that there are as many intelligences as there are goals and environments. There will be no omnipotent superintelligence or wonder algorithm (or singularity or AGI or existential threat or foom), just better and better gadgets.

Aaronson responds:

Basically, one side says that, while GPT-3 is of course mind-bogglingly impressive, and while it refuted confident predictions that no such thing would work, in the end it’s just a text-prediction engine that will run with any absurd premise it’s given, and it fails to model the world the way humans do. The other side says that, while GPT-3 is of course just a text-prediction engine that will run with any absurd premise it’s given, and while it fails to model the world the way humans do, in the end it’s mind-bogglingly impressive, and it refuted confident predictions that no such thing would work.

Though I’m with Pinker on the definition of AGI, I take the second of the two positions Aaronson sets out, which I assume is Aaronson’s own; the first is Pinker’s. That’s why I wrote GPT-3: Waterloo or Rubicon? Here be Dragons (Version 4.1).

Aaronson continues:

I freely admit that I have no principled definition of “general intelligence,” let alone of “superintelligence.” To my mind, though, there’s a simple proof-of-principle that there’s something an AI could do that pretty much any of us would call “superintelligent.” Namely, it could say whatever Albert Einstein would say in a given situation, while thinking a thousand times faster. Feed the AI all the information about physics that the historical Einstein had in 1904, for example, and it would discover special relativity in a few hours, followed by general relativity a few days later. Give the AI a year, and it would think … well, whatever thoughts Einstein would’ve thought, if he’d had a millennium in peak mental condition to think them.

If nothing else, this AI could work by simulating Einstein’s brain neuron-by-neuron—provided we believe in the computational theory of mind, as I’m assuming we do. It’s true that we don’t know the detailed structure of Einstein’s brain in order to simulate it [...]. But that’s irrelevant to the argument. It’s also true that the AI won’t experience the same environment that Einstein would have—so, alright, imagine putting it in a very comfortable simulated study, and letting it interact with the world’s flesh-based physicists. A-Einstein can even propose experiments for the human physicists to do—he’ll just have to wait an excruciatingly long subjective time for their answers. But that’s OK: as an AI, he never gets old.

Next let’s throw into the mix AI Von Neumann, AI Ramanujan, AI Jane Austen, even AI Steven Pinker—all, of course, sped up 1,000x compared to their meat versions, even able to interact with thousands of sped-up copies of themselves and other scientists and artists. Do we agree that these entities quickly become the predominant intellectual force on earth—to the point where there’s little for the original humans left to do but understand and implement the AIs’ outputs (and, of course, eat, drink, and enjoy their lives, assuming the AIs can’t or don’t want to prevent that)?

Eh. Now that I have an explicit definition of artificial minds, I have no need for a definition of artificial intelligence. While my primer (Relational Nets Over Attractors, A Primer: Part 1, Design for a Mind) is mostly about the human mind and the human brain, the fact that I was able to propose a substrate-neutral definition of “mind” has the side-effect that I can talk about artificial minds as mechanisms, not magic, to use Pinker’s formulation.

Aaronson also notes:

I should clarify that, in practice, I don’t expect AGI to work by slavishly emulating humans—and not only because of the practical difficulties of scanning brains, especially deceased ones. Like with airplanes, like with existing deep learning, I expect future AIs to take some inspiration from the natural world but also to depart from it whenever convenient. The point is that, since there’s something that would plainly count as “superintelligence,” the question of whether it can be achieved is therefore “merely” an engineering question, not a philosophical one.

That is consistent with the view I have articulated in the primer.

Aaronson has more to say, as does Pinker. As of this moment, the dialog has attracted 100 comments (including two from me). It’s worth exploring.

Thursday, June 16, 2022

A shrewd take on the AI scaling debate

Some random musings on how the whole "AI scaling" debate. It is clear that neural nets can model any predictive distribution p(x_future|x_past) given enough parameters and data, since they are universal approximators;
— Kevin Patrick Murphy (@sirbayes) June 15, 2022

He continues:

so in that trivial sense "scale is all you need". But this will not be efficient in handling the combinatorial explosion of "edge cases" for which we do not have data (eg on the internet) that can simply be memorized.

BB Note: Come to think of it, one might argue that Kuhn's account of scientific revolutions is that one spots catalytic "edge cases" (the ones that Kuhn calls anomalies) and uses them to leverage a new paradigm into being.

To perform "strong generalization" - ie make reliable predictions under interventions and changing distributions - you have to learn (some approximation to) the underlying data generating mechanism in latent space, not just the induced marginals in visible space.

BB: By "underlying data generating mechanism" I assume he means us human beings, for we generated the texts in the corpus on which a given LLM is trained. And we definitely use symbolic means, though, as I have asserted, symbolic means ultimately grounded in Geoffrey Hinton's "big vectors of neural activity."

So while we can in principle learn everything just by optimizing predictions p(x), I claim it will be much more efficient (in terms of data and compute) to optimize over the space of plausible models of the world p(x|z).

Given a model with latent variables, we can make predictions in observation space, but we can also do counterfactual reasoning, and can come up with compressed and meaningful explanations of observed data (eg do scientific discovery).
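In standard latent-variable notation (a gloss, not Murphy's wording), the "induced marginals in visible space" are what is left once the latent z has been integrated out. The second line is the predictive form, written under the added assumption that future and past observations are conditionally independent given z:

```latex
\[
  p(x) = \int p(x \mid z)\, p(z)\, dz
\]
\[
  p(x_{\text{future}} \mid x_{\text{past}})
    = \int p(x_{\text{future}} \mid z)\, p(z \mid x_{\text{past}})\, dz
\]
```

Optimizing predictions alone targets the left-hand sides directly. Modeling the mechanism means committing to the pieces on the right, which is what makes interventions and counterfactuals expressible at all.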

So I disagree that "scale is all you need". Instead we also need models and assumptions about data generating mechanisms - but these need to be checked against reality by performing experiments.

End of rant :)
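A toy illustration of the distinction the thread turns on, offered as a sketch and not as Murphy's own example. One predictor memorizes the observed cases (nearest-neighbor lookup). The other fits a model of the generating mechanism (a least-squares line). Both are then asked about an input far outside the training range. The linear mechanism and all the names here are assumptions made for the demonstration.

```python
def mechanism(x):
    """The hidden data-generating rule (assumed linear for this toy)."""
    return 3.0 * x + 1.0


def nearest_neighbor_predict(train, x):
    """Memorization: return the y of the closest x seen in training."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def fit_line(train):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Training data covers only x in [0, 1].
train = [(i / 10, mechanism(i / 10)) for i in range(11)]
slope, intercept = fit_line(train)

x_shifted = 10.0  # an "edge case" well outside the training distribution
memorized = nearest_neighbor_predict(train, x_shifted)
modeled = slope * x_shifted + intercept
truth = mechanism(x_shifted)

if __name__ == "__main__":
    print("truth:    ", truth)
    print("memorized:", memorized)  # stuck at the value seen for x = 1.0
    print("modeled:  ", modeled)
```

Inside the training range the two predictors agree. Only the shift exposes the difference, and the fitted line succeeds only because its linear assumption happens to match the mechanism. That is the sense in which models and assumptions still have to be checked against reality.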