Friday, May 13, 2022

Arithmetic and Machine Learning, Part 2

Continuing from my post of 4.26.22, “Why is simple arithmetic difficult for deep learning systems?”, I posted the following to LessWrong on 5.11.22:

I’ve been thinking about ordinary arithmetic computation in this context. We know that models have trouble with it. The issue interests me because arithmetic calculation has well-understood procedures. We know how people do it. And by that I mean that there’s nothing important about the process that’s hidden, unlike our use of ordinary language. The mechanisms of both sentence-level grammar and discourse structure are unconscious.

It’s pretty clear to me that arithmetic requires episodic structure, to introduce a term from old symbolic-systems AI and computational linguistics. That’s obvious from the fact that we don’t teach it to children until grammar school, which is roughly when episodic-level cognition kicks in (see the paper Hays and I did, Principles and Development of Natural Intelligence).

Arithmetic is not like ordinary language, which comes to us naturally without much specific training. Fluency in arithmetic requires years of drill. First the child must learn to count; that gives numbers meaning. Once that is well in hand, children are drilled in arithmetic tables for the elementary operations, and so forth. Once this is going smoothly, one learns the procedures for multiple-digit addition and subtraction, multiple-operand addition, and then multiplication and division. Multiple-digit division is the most difficult because it requires guessing, which is then checked by actual calculation (multiplication followed by subtraction).

Why do such intellectually simple procedures require so much drill? Because each individual step must be correct. You can’t just go straight ahead. One mistake anywhere, and the whole calculation is thrown off.

Whatever a model is doing in inference mode, I doubt it’s doing anything like what humans do. Where would it pick that up on the web?

I don’t know what’s going on inside a model in inference mode, but I’d guess it’s something like this: The inference engine ‘consumes’ a prompt, which moves it to some position in its state space.

  1. It has a number of possibilities for moving to a new position.
  2. It picks one and emits a word.
  3. Is it finished? If so, stop. If not, return to 1.

And so it moves through its state space in a single unbroken traversal. You can’t do arithmetic that way. You have to keep track of partial results and stop to retrieve them so you can integrate them into the ongoing flow of the calculation.
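To make that single-pass picture concrete, here’s a minimal sketch of such a loop in Python. The model object and its next_token_distribution method are placeholders of my own, not any particular library’s API; the point is just that nothing in the loop reserves or retrieves a partial result.

    import random

    def sample(probs):
        """Draw a token index from a probability distribution (a list of floats)."""
        return random.choices(range(len(probs)), weights=probs, k=1)[0]

    def generate(model, prompt_tokens, max_tokens=200, eos_token=0):
        """One unbroken left-to-right traversal: consume the prompt, then emit
        tokens one at a time until the model decides it is finished."""
        state = list(prompt_tokens)   # the prompt moves the model to a position in its state space
        output = []
        for _ in range(max_tokens):
            probs = model.next_token_distribution(state)  # 1. possibilities for the next move
            token = sample(probs)                         # 2. pick one and emit a word
            if token == eos_token:                        # 3. finished? stop...
                break
            output.append(token)
            state.append(token)                           # ...otherwise return to step 1
        return output

There is no scratchpad anywhere in that loop, no place to write down a carry and come back to it later.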

So now the question is: What other kinds of tasks require the computational style that arithmetic does? Perhaps generating a long string of coherent prose does.

Let me think about that for awhile.

I’m still thinking. This is going to be rough and crude, but it’s what I need to do at the moment. Sorry.

Episodic structure involves localizing objects and events in time and space. So we’ve got “O & E” for objects and events and “T•S” for time and space, thus: [T•S(O & E)]. A string of them:

[T•S(O & E)] --> [T•S(O & E)] --> [T•S(O & E)]

Or we could simplify: E --> E --> E. Or just: E1, E2, E3.

So:

E1: Johnny went through the door.
E2: Johnny walked past the tree.
E3: Johnny crossed the street.
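Just to fix ideas, here’s one way such a string of episodes might be written down as a data structure in Python. The field names and the particular time and place labels are my own, purely illustrative:

    from dataclasses import dataclass

    @dataclass
    class Episode:
        """One [T•S(O & E)] unit: objects and events localized in time and space."""
        time: str
        place: str
        objects: list
        events: list

    # The Johnny story as a string of episodes (labels are illustrative only).
    story = [
        Episode(time="t1", place="the doorway", objects=["Johnny", "door"], events=["went through"]),
        Episode(time="t2", place="the yard",    objects=["Johnny", "tree"], events=["walked past"]),
        Episode(time="t3", place="the street",  objects=["Johnny"],         events=["crossed"]),
    ]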

In arithmetic, to add 15 and 7, we say (mentally) something like:

5 plus 7 equals 12
write 2, carry the 1
1 plus 1 equals 2
write 22

That looks something like:

E1: 5 + 7 = 12
E2 (reserve 1): write 2
E3 (retrieve 1): 1 + 1 = 2
E4 (2 concat 2): write 2_ = 22

My point is that intermediate results are being tracked at the episode level, not the proposition level.
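In code, the schoolbook procedure makes those reserved and retrieved results explicit. This is just a sketch of ordinary column-by-column addition, nothing more; the carry variable is the bit of working memory that has to survive from one episode to the next:

    def column_addition(a, b):
        """Schoolbook multi-digit addition: each column is an episode, and the
        carry is a partial result reserved in one episode and retrieved in the next."""
        digits_a = [int(d) for d in str(a)][::-1]   # least-significant digit first
        digits_b = [int(d) for d in str(b)][::-1]
        carry = 0
        written = []
        for i in range(max(len(digits_a), len(digits_b))):
            da = digits_a[i] if i < len(digits_a) else 0
            db = digits_b[i] if i < len(digits_b) else 0
            column_sum = da + db + carry            # e.g. 5 + 7 + 0 = 12
            written.append(column_sum % 10)         # write 2
            carry = column_sum // 10                # reserve the 1
        if carry:
            written.append(carry)
        return int("".join(str(d) for d in reversed(written)))

    print(column_addition(15, 7))   # 22, by the same steps as E1 through E4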

So that’s one thing. I’m also thinking about the fact that, in Vygotsky’s view, language involves internalizing an Other. And once we’ve done that and have it thoroughly routinized – leap of logic – we’re ready to learn to write and to do arithmetic. Why? Because we need that internalized Other to keep track of the distinction between the episode and the proposition(s) in the episode. The internalized Other marks the episode while we worry about the proposition(s).

Think about this: we need episodic structure to distinguish between signifier and signified. It’s once we’ve acquired episodic structure that we learn to read and write. Reading and writing forces awareness of the signifier/signified distinction on us because it confronts us with two different signifiers for the same signified.

And arithmetic calculation requires awareness of that distinction. “2 + 2”, “3 + 1”, “2 * 2”, and “9 – 5” (among many others), along with “4”, are all signifiers for the same cardinal value. How do we learn that strange fact? Through counting objects and working with collections of objects in conjunction with those number symbols. Counting is episodic. It allows us to see numerals as signifiers and counted objects as signifieds. Doing abstracted arithmetic forces us to treat the numerals as signifiers for imaginary objects.

These deep learning engines have no distinction between signifier and signified. They have no episodic structure. Theirs is a very thin and flat world.

More later.

Addendum, 5.14.22: What about “word problems,” as we used to call them – and, for all I know, still do? You know:

Jane went to the store with $20. She bought a hair brush for $5.95, plus 5% sales tax, took a ride on the ferris wheel for $4, lost 35¢ from her pocket, and found two dimes, three pennies, and a quarter on the sidewalk coming home. How much money did she have when she got home? Now if she puts that in a savings account where the interest compounds at a rate of 5% annually, what will it be worth in 10 years?

There’s nothing particularly hard or deep about such problems, but an AI that can’t handle them isn’t going to get anywhere near AGI. 
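For what it’s worth, here is the arithmetic the problem calls for, worked in cents so the intermediate values stay exact. The rounding conventions (sales tax rounded to the nearest cent, the final balance to the nearest cent) are assumptions of mine; the problem doesn’t specify them:

    # Jane's trip, in cents.
    start        = 2000                        # $20.00
    brush        = 595 + round(595 * 0.05)     # $5.95 plus 5% sales tax -> 625 cents
    ferris_wheel = 400
    lost         = 35
    found        = 2 * 10 + 3 * 1 + 25         # two dimes, three pennies, a quarter -> 48 cents

    at_home = start - brush - ferris_wheel - lost + found
    print(at_home / 100)                       # 9.88

    # Ten years in a savings account at 5%, compounded annually.
    future_value = at_home * 1.05 ** 10
    print(round(future_value) / 100)           # roughly 16.09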

* * * * *

Opening of the blog post linked in the tweet above:

We’ve trained a system that solves grade school math problems with nearly twice the accuracy of a fine-tuned GPT-3 model. It solves about 90% as many problems as real kids: a small sample of 9-12 year olds scored 60% on a test from our dataset, while our system scored 55% on those same problems. This is important because today’s AI is still quite weak at commonsense multistep reasoning, which is easy even for grade school kids. We achieved these results by training our model to recognize its mistakes, so that it can try repeatedly until it finds a solution that works.

Abstract for the underlying research paper:

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
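For concreteness, the sample-and-rank scheme the abstract describes looks roughly like this. The generator and verifier objects and their methods here are stand-ins of my own, not OpenAI’s actual code or API:

    def solve_with_verifier(problem, generator, verifier, n_candidates=100):
        """Sample many candidate solutions, score each with a trained verifier,
        and return the one the verifier ranks highest."""
        candidates = [generator.sample_solution(problem) for _ in range(n_candidates)]
        scores = [verifier.score(problem, c) for c in candidates]  # estimated probability of being correct
        best = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best]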

Comment: Offhand it looks like an ingenious work-around, but not a direct solution to the problem.
