Just read Gary Marcus, Does AI Really Need a Paradigm Shift. The BIG-Bench paper, Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models, speaks to his concerns. In particular, this passage struck me:
Limitations that we believe will require new approaches, rather than increased scale alone, include an inability to process information across very long contexts (probed in tasks with the keyword context length), a lack of episodic memory into the training set (not yet directly probed), an inability to engage in recurrent computation before outputting a token (making it impossible, for instance, to perform arithmetic on numbers of arbitrary length), and an inability to ground knowledge across sensory modalities (partially probed in tasks with the keyword visual reasoning).
It's the "inability to engage in recurrent computation before outputting a token" that has my attention, as I've been thinking about that one for awhile, though that is a more succinct formulation of the problem than any that I’ve come up with. I note that our capacity for arithmetic computation is not part of our native endowment. It doesn't exist in pre-literate cultures and our particular system originated in India and China and made its way to Europe via the Arabs. We owe the words "algebra" and "algorithm" to that process.
Think of that capacity as a very specialized form of language, which it is. That is to say, it piggy-backs on language. That capacity for recurrent computation is part of the language system. Language involves both a stream of signifiers and a stream of signifieds. I think you'll find that the capacity for recurrent computation is required to manage those two streams. And that's where you'll find operations over variables and an explicit type/token distinction [which Marcus mentions in his post].
Of course, linguistic fluency is one of the most striking characteristics of these LLMs. So one might think that architectural weakness – for that is what it is – has little or no effect on language, whatever its effect on arithmetic. But I suspect that's wrong. We know that the linguistic fluency has a relatively limited span. I'm guessing effectively and consistently extending that span is going to require the capacity for recurrent computation. It's necessary to keep focused on the unfolding development of a single topic. That problem isn't going to be fixed by allowing for wider attention during the training process, though that might produce marginal improvements.
The problem is architectural and requires an architectural fix, both for the training engine and the inference engine. Simply scaling up – more parameters, more data – won't fix the problem.
Addendum, 6.12.22 11:36 AM: From, David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models, 2019. URL https://arxiv.org/abs/1904.01557.
The Transformer model has a performance of 90% or more on the “add or subtract several numbers" module and the “multiply or divide several numbers" module (which is just addition and subtraction in log space). However on the mixed arithmetic module (mixing all four operations together with parentheses), the performance drops to around 50%. (Note the distribution of the value of the expression is the same for all these modules, so it is not the case that difficulty increases due to different answer magnitudes.) We speculate that the difference between these modules in that the former can be computed in a relatively linear/shallow/parallel manner (so that the solution method is relatively easier to discover via gradient descent), whereas there are no shortcuts to evaluating mixed arithmetic expressions with parentheses, where intermediate values need to be calculated. This is evidence that the models do not learn to do any algebraic/algorithmic manipulation of values, and are instead learning relatively shallow tricks to obtain good answers on many of the modules. The same holds true for other modules that require intermediate value calculation, such as evaluating polynomials, and general composition.