Wednesday, June 18, 2025

Large Language Models and Emergence

David C. Krakauer, John W. Krakauer, and Melanie Mitchell, Large Language Models and Emergence: A Complex Systems Perspective, June 16, 2025, https://arxiv.org/pdf/2506.11135

Abstract: Emergence is a concept in complexity science that describes how many-body systems manifest novel higher-level properties, properties that can be described by replacing high-dimensional mechanisms with lower-dimensional effective variables and theories. This is captured by the idea “more is different”. Intelligence is a consummate emergent property manifesting increasingly efficient—cheaper and faster—uses of emergent capabilities to solve problems. This is captured by the idea “less is more”. In this paper, we first examine claims that Large Language Models exhibit emergent capabilities, reviewing several approaches to quantifying emergence, and secondly ask whether LLMs possess emergent intelligence.

From the conclusion:

We argued that in LLMs, the term emergence should be used not merely to signify surprising or unpredictable task performance, or abrupt changes in performance, but requires at minimum the identification of relevant coarse-grained variables that form effective mechanisms—reduced “internal degrees of freedom”—for this behavior, mechanisms that can explain or predict the behavior of the system at this higher level, screening off details of lower level mechanisms such as weights and activations. More quantitative evidence for emergence includes the kinds of principles related to emergence in physical systems, such as breaking of scaling through reorganization, evidence for the use of novel bases and manifolds formed through compression of regularities, and new forms of abstraction that lead to demonstrable efficiencies in prediction, problem solving, generalization, and analogy-making. Identifying such principles would be an important step in understanding the seemingly novel capabilities that arise in LLMs.

Three types of emergence claims have been made for LLM capabilities: (1) sharp improvements in specific capabilities that occur as the system or training data is scaled; (2) capabilities are identified that the LLMs were not specifically trained for; and (3) internal “world models” emerging from autoregressive token prediction. Each of these cases, and particularly the last, presents provocative evidence for emergence, but in all cases that evidence is incomplete. Cases (1) and (2) rely on several assumptions: that the capabilities tested are genuinely new, general, and don’t rely on memorized training data or other shortcuts; that these capabilities are not present in simpler models; and that the capabilities are unexpected or unpredictable given the training data and the models’ size. None of these assumptions has been conclusively verified. As for case (3), the complexity framework of [60] provides a principled approach to thinking about “world models” as these relate to discrete-time stochastic processes. To the extent that an LLM is effective at next-token prediction, and to the degree to which the model can be shown to exploit a minimum of information, it might be described as a world model. However, the recent work by [61] demonstrates that recovering an accurate world model is very difficult, since next-token prediction is a fragile metric.
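One caveat about case (1) is worth making concrete. Whether a capability looks like a sharp jump can be an artifact of how it is scored. Here is a toy calculation of my own (not from the paper, and with numbers invented purely for illustration): if per-token accuracy improves smoothly as a model is scaled, an all-or-nothing exact-match score on a multi-token answer will still look abrupt.

```python
# Toy illustration (my own, not from the paper): an apparently "emergent" jump
# can be an artifact of the evaluation metric rather than of the model itself.
# Suppose per-token accuracy improves smoothly with scale; exact-match accuracy
# on a 10-token answer (all tokens must be right) then looks abrupt.
import math

def per_token_accuracy(log_scale: float) -> float:
    """Hypothetical smooth improvement with log model scale (a logistic curve)."""
    return 1.0 / (1.0 + math.exp(-(log_scale - 9.0)))

answer_length = 10  # number of tokens that must all be correct for an "exact match"

print(f"{'log10(params)':>14} {'per-token acc':>14} {'exact-match acc':>16}")
for log_scale in [6, 7, 8, 9, 10, 11, 12]:
    p = per_token_accuracy(log_scale)
    exact = p ** answer_length          # gradual metric vs. all-or-nothing metric
    print(f"{log_scale:>14} {p:>14.3f} {exact:>16.3f}")

# The per-token column changes gradually; the exact-match column sits near zero
# and then rises steeply, which could be (mis)read as an emergent capability.
```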

That point about world models is particularly important to me, because a world model is exactly what old-style semantic and cognitive networks were built around. The network provides the world model (and should be linked to sensory and motor systems, as it was in the model David Hays developed in the mid-1970s), from which text can be generated through linguistic processes. LLMs conflate the two, text and cognition, into a single distributed representation.
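The conclusion's requirement of coarse-grained variables that "screen off" lower-level detail can also be made concrete with a toy sketch of my own (again, not from the paper): a four-state Markov chain that lumps exactly into two macro-states. The two-state description predicts the higher-level behavior on its own, with no reference to the individual micro-states, which is the minimal sense in which an effective variable, or a compact world model of a discrete-time stochastic process, earns its keep. The transition probabilities are invented.

```python
# Toy illustration (my own, not from the paper): a coarse-grained variable that
# screens off micro-level detail.  A 4-state Markov chain is lumped into two
# macro-states; because the chain is exactly lumpable, the macro trajectory is
# itself Markov, so the macro variable alone predicts behavior at the higher level.
import numpy as np

rng = np.random.default_rng(0)

# Micro-level transition matrix (micro-states 0,1 belong to macro-state A; 2,3 to B).
P_micro = np.array([
    [0.3, 0.4, 0.1, 0.2],
    [0.5, 0.2, 0.2, 0.1],
    [0.2, 0.2, 0.3, 0.3],
    [0.1, 0.3, 0.4, 0.2],
])
macro_of = np.array([0, 0, 1, 1])   # map micro-state -> macro-state (A=0, B=1)

# Effective (coarse-grained) transition matrix implied by the lumping:
# every micro-state in A sends probability 0.7 to A and 0.3 to B;
# every micro-state in B sends 0.4 to A and 0.6 to B.
P_macro = np.array([[0.7, 0.3],
                    [0.4, 0.6]])

# Simulate the micro chain, but record only the macro trajectory.
steps, state = 100_000, 0
macro_traj = np.empty(steps, dtype=int)
for t in range(steps):
    macro_traj[t] = macro_of[state]
    state = rng.choice(4, p=P_micro[state])

# Estimate macro-level transition frequencies from the macro trajectory alone.
counts = np.zeros((2, 2))
for a, b in zip(macro_traj[:-1], macro_traj[1:]):
    counts[a, b] += 1
P_macro_est = counts / counts.sum(axis=1, keepdims=True)

print("effective macro model:\n", P_macro)
print("estimated from the macro trajectory alone:\n", P_macro_est.round(3))
# The two agree: the two macro-states form an effective mechanism that predicts
# the system's higher-level behavior while screening off the micro-states
# (the analogue, in this toy, of weights and activations).
```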

Later:

There are three possible roles of language as it relates to training an LLM: (1) language itself provides a more or less complete and compressed representation of the world (including non-linguistic modalities); (2) spoken or written language mirrors an internal “language of thought”; and (3) language is a non-supervised “programming language”. If language does provide a complete representation of the world, then training on more language data would indeed enable an increasingly expansive and detailed representation of natural and cultural patterns and processes. If natural language is the language of thought (“mentalese”) then training on more language data would fill out the numerous ways that humanity has historically reasoned about regularities in the world. And if language is a programming language, by combining detailed instruction tuning with next word prediction it can exploit principles of computational universality to implement any computable function.

We do not have definitive evidence for any of these three claims, but they play a crucial role in any statement relating to how surprising the behavior of an LLM will be deemed.
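Role (3) can be made a bit more concrete with a deliberately crude sketch of my own (not from the paper). If a predictor has internalized instruction-shaped regularities, then completing a prompt is indistinguishable from evaluating a small program. In the sketch the "predictor" is just a pair of hand-coded rules standing in for behavior an instruction-tuned model might learn from data; the prompts and rules are invented.

```python
# Toy sketch (my own): reading role (3) literally, a next-token predictor that has
# learned instruction-shaped regularities acts like an interpreter: the prompt is
# the "program" and prediction is evaluation.  Here the "predictor" is hand-coded
# rules, standing in for behavior an instruction-tuned LLM might learn from data.
import re

def toy_next_tokens(prompt: str) -> str:
    """Emit completion tokens for a few instruction-shaped prompts."""
    if m := re.match(r"add (\d+) and (\d+)", prompt):
        return str(int(m.group(1)) + int(m.group(2)))
    if m := re.match(r"reverse the word '(\w+)'", prompt):
        return m.group(1)[::-1]
    return "<unknown instruction>"

for program in ["add 17 and 25", "reverse the word 'emergence'"]:
    print(f"{program!r} -> {toy_next_tokens(program)}")
# Completing the prompt is indistinguishable from running a (tiny) program:
# this is the sense in which instruction-conditioned prediction can, in principle,
# implement computable functions.
```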

The final paragraph:

Human intelligence is a low-bandwidth phenomenon, and is as much if not more about the scaling down of effort as the scaling up of capability [72]. As Einstein wrote, “The grand aim of all science is to cover the greatest number of empirical facts by logical deduction from the smallest number of hypotheses or axioms.” [73] We know that for any elegant algorithm there is an alternative brute force solution that does the job. It might even be the case that there are uncountable problems that require brute force and that this is a domain where LLMs and their cognitively alien relatives, including SAT solvers, will provide extraordinary utility [74]. What Donald Knuth said of programs might also be applied to intelligence: “Programs are meant to be read by humans and only incidentally for computers to execute.” [75]. Similarly, intelligence is a property of understanding and only incidentally a matter of capability.
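That remark about elegant algorithms and brute force invites a small contrast of my own devising (not from the paper): Euclid's algorithm versus divisor enumeration for the greatest common divisor. Both do the job; only one compresses the problem in the "less is more" sense.

```python
# Toy contrast (my own example) for the point above: the same job done by an
# elegant, compressed rule and by sheer enumeration.  Both are correct; only one
# "covers the greatest number of facts from the smallest number of hypotheses".
def gcd_brute_force(a: int, b: int) -> int:
    """Try every candidate divisor from min(a, b) downward (positive ints only)."""
    for d in range(min(a, b), 0, -1):
        if a % d == 0 and b % d == 0:
            return d
    return 1

def gcd_euclid(a: int, b: int) -> int:
    """Euclid's algorithm: the elegant, low-effort route to the same answer."""
    while b:
        a, b = b, a % b
    return a

assert gcd_brute_force(1071, 462) == gcd_euclid(1071, 462) == 21
print(gcd_euclid(1071, 462))  # 21, reached in a handful of steps rather than hundreds
```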

I really like the first sentence of that last paragraph. It "resonates" with the definition of intelligence I gave in What Miriam Yevick Saw: The Nature of Intelligence and the Prospects for A.I.:

Intelligence is the capacity to assign computational capacity to propositional (symbolic) and/or holographic (neural) processes as the nature of the problem requires.

As Yevick herself observed:

If we consider that both of these modes of identification enter into our mental processes, we might speculate that there is a constant movement (a shifting across boundaries) from one mode to the other: the compacting into one unit of the description of a scene, event, and so forth that has become familiar to us, and the analysis of such into its parts by description. Mastery, skill and holistic grasp of some aspect of the world are attained when this object becomes identifiable as one whole complex unit; new rational knowledge is derived when the arbitrary complex object apprehended is analytically described.
