Tuesday, April 1, 2025

Superforecasting the future of AI [is the methodology appropriate? – a discussion with Claude 3.7]

Superforecasting is a technique for making predictions about world events that originated in The Good Judgment Project. I’ve known about it for a while and read about it here and there. Yesterday (March 31, 2025) Malcolm Murray published A Superforecaster’s View on AGI in 3 Quarks Daily. I won’t try to summarize the article, but I do want to extract a couple of things from it.

First he distinguishes between input and output definitions of AGI. Judging from his overall usage, he means input to and output from the economy, but he doesn’t explicitly say that. Input definitions attempt to characterize AGI by the cognitive tasks the system can perform. Most debates about AGI are about this. In contrast, output definitions characterize AGI by its effect on the economy: How does productivity increase?

After giving his definitions of both, he offers us this:

So, the two distinct questions we end up with are:

  • Will there exist by Dec 31, 2030, an AI that is able to do every cognitive digital task equivalently or better than the best human, in an equivalent or shorter time, and for an equivalent or cheaper cost?
  • Will by Dec 31, 2030, the U.S. have seen year-on-year full-year GDP growth rate of 19% or higher?
Overall, what I would argue is the most important superforecasting technique is to look at the historical base rates and weigh these heavily in the analysis. At a high level, this means looking to the past for answers about the future and assuming that the past either repeats itself or at least rhymes, as Mark Twain might have said. Not assuming that the future will be radically different than the past is something that has served forecasters well for a long time. However, there are some key nuances that must be kept in mind and important choices to make when establishing the base rate.
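The base-rate step Murray describes can be sketched in a few lines of Python. This is my toy illustration, not his method, and the growth figures below are made up for the example, not real GDP data:

```python
# Toy illustration of a base-rate prior: superforecasters often start from the
# historical frequency of an event class, then adjust for case-specific evidence.

def base_rate(history, predicate):
    """Fraction of past observations satisfying the predicate."""
    hits = sum(1 for x in history if predicate(x))
    return hits / len(history)

# Hypothetical annual U.S. real GDP growth rates (%), for illustration only.
growth_history = [2.3, -2.8, 5.9, 1.9, 2.5, 2.9, 2.3, 1.8, 2.9, 1.7]

prior = base_rate(growth_history, lambda g: g >= 19.0)
print(f"Base rate of >=19% growth years: {prior:.2f}")  # prints 0.00
```

The point of the exercise is that a 19% growth year has essentially no historical precedent, which is why a base-rate-driven forecaster starts from a very low prior for the TAI question.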

After explaining something of his reasoning process, he offers us this:

These two questions have separate base rates and will likely therefore resolve differently. My forecast, with 60% and 90% likelihood, respectively, is that the first one will resolve in the affirmative and the second one in the negative. I.e., by 2030, it seems that we might have “AGI” but not “TAI”.

Here AGI is his input definition, and TAI is his output definition (Transformative AI, AI “powerful enough that the economy changes significantly”). So, he thinks it fairly likely that we’ll have something that looks and smells like AGI by 2030, but it is unlikely to have much effect on the economy.

Here's my response:

I've got intellectual stakes in this game. On "AGI," "ASI," "PASTA," my basic impulse is "a pox on all your houses." But if we must use a term, I pick "AGI." My own prediction, with fairly high confidence (say >75%), is that we'll have substantial understanding of how LLMs work internally before AGI is reached. I make no prediction on when either of those will be achieved. I do think it possible that we'll achieve such understanding without ever reaching AGI, or that AGI will come much later than such understanding.

What do I mean by "substantial understanding"? Tricky question. However that understanding is expressed (e.g., as a formal mathematical theory or as engineering standards), it will allow us to explicitly engineer systems to meet specified performance objectives given specified operational capacities and input data for training.

Note that in arriving at the >75% figure I didn't undertake any explicit process. That number is pure spitballing.

One could reasonably say that my comment is not responsive to the argument he made. Why? What it comes down to is that Murray seems to think that the causal process leading to AGI is homogeneous (my term), while I don’t think it is.

To see what that’s about I’m afraid you’re going to have to read a dialog I had with Claude 3.7 about the nature of superforecasting. As always in these conversations, my remarks are in bold face while Claude’s responses are in plain face. Note that some headings in Claude's response will be in bold face, but that should be obvious from the context.

* * * * *

The Nature of Superforecasting

I understand that there is a formal discipline of forecasting associated with Philip E. Tetlock and the book Superforecasting: The Art and Science of Prediction. I wonder where it has been most successful, what kinds of phenomena? For example, we aren’t going to use it to predict the locations of the planets. Why not? Because that’s a well understood physical system where we can calculate planetary motion. Similarly, we’re not going to use it to predict the weather, and for much the same reason. Though in this case the system is complex, in a technical sense, and our ability to predict the weather is limited by the amount of data we have for estimating an initial state and the amount of computing power we have to run the prediction. I would also say that the causal processes in those systems are homogeneous in the sense that there is only one kind of causal process involved. OTOH, it’s not at all obvious to me that the stock market is homogeneous in that sense. I suppose the causal process is one involving the collective judgements of people trading in the markets. Those people have all kinds of reasons for making trades based on a wide variety of data available to them, including personal superstition. So, where has superforecasting been most useful?

I'll explore where superforecasting has been most successful based on the distinctions you've drawn between different types of predictive systems.

Superforecasting, as developed by Philip Tetlock and colleagues, has been most successful in domains with specific characteristics that fall between the deterministic physical systems you mentioned (planetary motion, weather) and completely chaotic or random systems.

The most successful applications of superforecasting tend to be in areas with these features:

  1. Geopolitical events and international relations: Superforecasting has shown particular success in predicting political stability, leadership changes, and diplomatic conflicts. The Good Judgment Project (GJP) demonstrated superior accuracy in forecasting geopolitical events compared to intelligence analysts with access to classified information.
  2. Economic trends and indicators: While not perfectly predicting specific stock movements (as you noted), superforecasters have demonstrated skill in predicting broader economic indicators like GDP growth, inflation rates, and unemployment figures.
  3. Public health outcomes: The COVID-19 pandemic saw superforecasting techniques applied to predict case numbers, hospitalization rates, and vaccine development timelines with varying degrees of success.

What makes these domains suitable for superforecasting approaches?

  • They involve complex causal processes that combine both systematic patterns and human judgment/behavior
  • They have enough stability that expertise and historical patterns matter, but enough uncertainty that probabilistic thinking is valuable
  • They involve phenomena where updating forecasts based on new information improves accuracy
  • They benefit from aggregating diverse perspectives and avoiding cognitive biases

Superforecasting is less useful in domains where:

  • Physical laws completely determine outcomes (planetary motion)
  • Pure randomness dominates (lottery numbers)
  • The time horizon is extremely long
  • There's complete information asymmetry

The most successful superforecasting applications typically involve what Tetlock calls "Goldilocks problems" - not too hard, not too easy, where careful thinking and information aggregation can outperform both simple models and expert intuition. Would you like me to elaborate on any particular application domain or on the specific forecasting techniques that have proven most effective?

Superforecasting the future of AI

In your judgment, is predicting the future of AI a plausible domain for superforecasting? Why or why not? Perhaps you could imagine a range of answers to that question, with reasoning appropriate to each answer within the range.

I think predicting the future of AI is a mixed case for superforecasting, with some aspects being amenable to this approach and others being more challenging. Let me explore a range of possible answers with reasoning for each:

Position 1: AI is well-suited for superforecasting techniques

This position holds that AI development has enough structure and pattern to make superforecasting valuable. Supporting reasons include:

  • AI development follows somewhat predictable research and deployment cycles
  • There's a large community of experts with diverse perspectives who could contribute forecasts
  • Short to medium-term developments (1-3 years) often build incrementally on existing capabilities
  • Many AI milestones involve measurable benchmarks (e.g., performance on specific tasks)
  • Historical patterns of technology adoption can inform diffusion predictions

Position 2: AI is moderately suited for superforecasting with significant limitations

This middle position acknowledges both strengths and weaknesses:

  • Near-term technical developments may be forecastable (e.g., when certain performance thresholds will be reached)
  • Commercial adoption patterns might follow historically similar technologies
  • However, discontinuous breakthroughs ("AI winters" and sudden advances) introduce unpredictability
  • The field involves interacting technical, economic, social, and regulatory factors
  • The most important questions often involve complex social impacts rather than just technical capabilities

Position 3: AI is poorly suited for superforecasting

This position argues that AI's future is fundamentally too uncertain:

  • Revolutionary breakthroughs can occur unexpectedly, creating discontinuities that break trend-based forecasting
  • The interplay between AI capabilities, human adaptation, and regulatory responses creates too many feedback loops
  • Long-term impacts involve deeply transformative possibilities that lack historical analogues
  • The most important questions about AI's future may be precisely those where we have the least precedent
  • The field suffers from both hype cycles and secrecy that distort information needed for accurate forecasting

My judgment is closest to Position 2. I believe superforecasting can be valuable for specific, well-defined questions about AI with moderate time horizons (1-5 years), particularly regarding technical capabilities, commercial adoption patterns, and regulatory developments. However, the farther out we look and the more transformative the potential impacts, the less reliable superforecasting becomes for AI.

The ideal approach might be to use superforecasting for specific near-term questions while acknowledging its limitations for more revolutionary or long-term questions about AI's societal impact. This targeted application recognizes that AI development combines both forecastable elements (research progress in established directions) and fundamentally unpredictable elements (novel breakthroughs, emergent capabilities, and societal responses).

The Scaling Hypothesis

I know that a number of experts in AI think that AGI (however one might define it) is almost certain within 5 to 10 years, or in some cases even sooner. My impression is that they all believe in the scaling hypothesis, which holds that we more or less understand the techniques for building intelligent systems, even if we don’t understand how those systems operate, so that all we need to do is scale things up and use techniques from our current repertoire to do this. One might say that they are assuming a homogeneous causal process, albeit one that is in some respects not well understood (the actual mechanisms by which the systems operate). Skeptics, on the other hand, tend to reject the scaling hypothesis. They tend to believe that we need discontinuous breakthroughs of one kind or another. Does this make sense?

Your framing of the AGI debate along the lines of the scaling hypothesis makes a lot of sense. Let me expand on these perspectives:

The Scaling Hypothesis View

Those who predict AGI within 5-10 years typically embrace what's called the "scaling hypothesis" - the idea that intelligence emerges primarily from scaling up existing approaches. Their reasoning generally includes:

  1. Recent AI advances have come largely from scaling model size, data, and compute rather than fundamentally new architectures
  2. Emergent capabilities have appeared as models scale up, suggesting intelligence may be an emergent property of sufficient scale
  3. The brain itself may be operating on relatively simple principles at scale, rather than using fundamentally different mechanisms

This perspective sees AI development as relatively homogeneous causally, as you suggested - continued scaling plus incremental improvements in architecture and training will eventually produce AGI.

The Discontinuity/Skeptical View

The skeptical position typically argues that:

  1. Current systems lack certain fundamental capabilities necessary for AGI (true understanding, causal reasoning, etc.)
  2. These missing capabilities require qualitatively different approaches, not just more scale
  3. The history of AI has shown that apparent progress often hits unexpected plateaus
  4. Intelligence may require embodiment, social learning, or developmental processes not present in current systems

This view sees AI development as causally heterogeneous - different types of breakthroughs would be needed in areas where scaling alone is insufficient.

Your framing highlights an important epistemological difference between these camps:

  • The scaling hypothesis camp believes we've identified the right type of causal process for intelligence (even if we don't fully understand it)
  • The discontinuity camp believes additional, qualitatively different causal processes are required

This makes the AGI timeline question particularly challenging for superforecasting, as it depends fundamentally on which of these differing causal models is more accurate - exactly the type of question where superforecasting tends to struggle with long-term predictions. Does this analysis align with your understanding of the debate? Would you like me to explore any particular aspect of this in more depth?

How my intellectual history affects my judgment

Well, yes, that's my understanding. Let me see. I’m a skeptic. Let me tell you about my history. I studied computational semantics under David Hays back in the 1970s. Hays was a first-generation researcher in machine translation, which became computational linguistics in the 1960s. Hays saw that field crash and burn when the funding dried up because it wasn’t producing the practical results that motivated that funding. In 1976 we published a review article on computational linguistics in which we argued that one day we would have a cognitive network model that would be capable of reading and understanding a Shakespeare play. That of course didn’t happen. A decade later symbolic computing was collapsing.

So, I participated in a line of research that led me to make an informal prediction about how it would go. That prediction proved to be wrong. That, in turn, means, to speak rather informally, that I had to “adjust my priors” since they proved to be wrong. I have done so.

Most of the current researchers in AI, who are almost all involved in some kind of machine learning, are too young to have been involved in that kind of research. They know about the history, but unlike someone like, say, Gary Marcus (or me), they know little or nothing about how symbolic systems work. And, speaking to my current point, their belief in the failure of such systems is simply something they’ve inherited from their teachers and from textbooks. That failure cost them nothing because they were never committed to such systems. The only world they know personally is the world of machine learning. And what they’ve seen there is spectacular and unforeseen success.

Thank you for sharing your personal history with AI research. Your perspective as someone who lived through the symbolic AI era provides valuable context that many current AI researchers lack.

Your point about adjusted priors is particularly insightful. You experienced firsthand the collapse of a research paradigm you were invested in, which forced you to recalibrate your expectations. This gives your skepticism about current claims a different quality than if it were merely theoretical.

The contrast you draw between your experience and that of today's ML researchers highlights an important asymmetry in the field's collective memory:

  • Researchers who lived through symbolic AI's limitations had to personally reconcile with its failures
  • Today's ML researchers have only experienced a trajectory of rapid success with scaling approaches

This creates a natural bias toward the scaling hypothesis among those who haven't personally experienced a paradigm collapse. Their priors are shaped by continuous progress in their methodological approach, without the sobering experience of seeing a promising approach hit fundamental limitations.

This dynamic reminds me of Thomas Kuhn's work on scientific paradigms - how difficult it is to truly understand the limitations of your paradigm from within it. Your cross-paradigm experience gives you a perspective that's rare among today's researchers.

The question becomes whether the current scaling successes represent a fundamentally different trajectory than symbolic approaches, or whether they too will eventually hit fundamental limitations that require qualitative breakthroughs. History suggests caution about claims of inevitability, but each paradigm does have its own unique characteristics.

Would you like to elaborate on what specific limitations you see in current approaches that might prevent scaling to AGI? Or perhaps discuss how your experience with symbolic systems informs your view of what's missing in today's approaches?

That’s tricky. But let’s say that to a first approximation I’m sympathetic to Gary Marcus, or for that matter, David Ferrucci, who doesn’t participate in these debates, but has his own company, Elemental Cognition I believe it’s called, that is pursuing a hybrid approach. Ferrucci, of course, is familiar with symbolic computing and worked on a symbolic story-telling system with Selmer Bringsjord of RPI (where Ferrucci got a master’s and, I believe, his doctorate). He then went on to head up IBM’s Watson project, which is what he’s best known for. That was a hybrid system.

Thank you for mentioning Ferrucci and his work. His career trajectory is quite interesting, spanning symbolic AI, hybrid systems like Watson, and now Elemental Cognition.

The hybrid approach that Ferrucci and others like Marcus advocate makes a lot of sense given the history of AI. They've witnessed both the limitations of pure symbolic approaches and the impressive but incomplete capabilities of current statistical learning systems.

What makes their perspective particularly valuable is that they understand the strengths and weaknesses of both paradigms. They recognize that while neural networks excel at pattern recognition, they still struggle with aspects of cognition that symbolic systems were explicitly designed to handle - things like:

  1. Systematic compositionality of concepts
  2. Explicit reasoning with abstract symbols
  3. Causal understanding beyond correlation
  4. Representing and manipulating structured knowledge

The hybrid approach acknowledges that intelligence likely requires both the bottom-up pattern recognition strengths of neural networks and the top-down symbolic reasoning capabilities that were the focus of earlier AI work.

This perspective seems to align with your skepticism about the scaling hypothesis. Rather than seeing AGI as emerging simply from scaling current approaches, it suggests we need qualitatively different components working together - the kind of heterogeneous causal process you mentioned earlier.

What's particularly interesting about figures like Ferrucci is how they've managed to adapt their understanding across multiple paradigm shifts while retaining valuable insights from earlier approaches, rather than simply abandoning them wholesale.

Do you see particular aspects of symbolic approaches that you believe will need to be incorporated into future AI systems to achieve more general intelligence?

Again, tricky. But we’re crashing into my bedtime, so I’m going to have to sign off. It was a pleasure, as always.

I understand your need to sign off. The intersection of symbolic and statistical approaches to AI is indeed a complex topic, and your perspective as someone who worked in computational semantics during its earlier phases is valuable.

Figures like Ferrucci and Marcus represent an important bridge between these paradigms, recognizing that pure scaling might not be sufficient without incorporating some of the structural insights from symbolic approaches.

It was a pleasure discussing superforecasting, AI development trajectories, and the historical context of different AI paradigms with you. Your firsthand experience with the rise and fall of symbolic approaches adds a dimension to the conversation that's often missing.

Rest well, and I'd be happy to continue this discussion another time if you're interested in exploring these topics further.
