Here's how Claude summarizes a discussion we had last evening.
* * * * *
Recently, chatbots have been performing impressively on mathematics benchmarks and coding challenges. Headlines tout AI systems solving competition problems, proving theorems, and writing working code. At first glance, this seems to vindicate the assumption that both mathematical and computational reasoning are fundamentally like chess—domains with verifiable answers where scaling up compute power and training data will inevitably lead to mastery.
But this conclusion deserves closer examination. Mathematics and programming do share important chess-like qualities. In math, formal proofs can be verified mechanically, solutions to well-defined problems can be checked objectively, and mathematical statements have definite truth values. In programming, code either runs or doesn't, test suites provide objective evaluation, solutions exist in a definable space, and correctness can be automatically verified. In these respects, both domains seem perfectly suited to AI's strengths.
Yet there's a crucial distinction that gets obscured when we focus on benchmark performance: the difference between verification and discovery.
Chess engines don't merely verify that moves are legal—they find good moves by searching the game tree. The tree structure makes this search tractable. For mathematics and programming, verification is relatively mechanical and chess-like. Checking that a proof is valid or that code passes its test suite is straightforward. But discovery—finding a proof, solving a novel problem, formulating a new approach, architecting a complex system—is much more open-ended.
This raises an important question about those impressive benchmarks: What are they actually testing?
A recent New York Times article shed light on this question for mathematics. Journalists interviewed mathematicians who decided to test AI systems not on standard benchmark problems, but on questions drawn from their ongoing research programs. One of them, Lauren Williams, offered a revealing three-part framework for mathematical research:
One, come up with the big question, whose study we hope will guide our field. Two, develop a framework for finding a solution, which involves dividing the big question into smaller more tractable questions—like our test questions. Three, find solutions to these smaller questions and prove they are correct.
Williams observed that for the most part, AI systems have been working at the third step. The same framework applies to software development. Step one is recognizing what problem needs solving, understanding user needs, architecting a system that will be maintainable and extensible. Step two is breaking down that architecture into components, modules, and functions. Step three is implementing those components—writing the code that passes the tests.
AI coding assistants excel at step three. Give them a well-specified function with clear inputs, outputs, and test cases, and they'll often produce working code quickly. But ask them to architect a complex system, identify the right abstractions, or recognize that a problem is better solved by rethinking the approach entirely, and their limitations become apparent.
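To make "well-specified function with clear inputs, outputs, and test cases" concrete, here is a hypothetical example of such a step-three task (the function and its spec are invented for illustration, not drawn from any benchmark):

```python
def longest_run(s: str) -> int:
    """Length of the longest run of one repeated character in s.

    Spec (fully determined by examples and edge cases):
      longest_run("") == 0
      longest_run("aab") == 2
      longest_run("abbbcc") == 3
    """
    best = run = 0
    prev = None
    for ch in s:
        run = run + 1 if ch == prev else 1  # extend or restart the run
        prev = ch
        best = max(best, run)
    return best

# The test suite *is* the verifier: passing it is the whole success
# criterion, which is what makes this step chess-like.
assert longest_run("") == 0
assert longest_run("aab") == 2
assert longest_run("abbbcc") == 3
```

Nothing here requires deciding whether `longest_run` is the right abstraction for the system, or whether the surrounding problem should be reframed; those are step-one and step-two questions, and they are exactly where the essay argues current systems fall short.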
This division maps onto a deeper pattern in what AI can and cannot do well. Step three is chess-like: well-defined smaller questions, verifiable solutions, answers that exist in a searchable space where pattern matching from training data provides genuine help. This is precisely where benchmarks live, testing AI on problem types it has seen before, where standard techniques apply and correctness can be verified.
Step one, however, is fundamentally different. It requires understanding what matters—in a field, in a codebase, for users. It demands seeing deep connections across domains, developing intuition about which directions will prove fertile. It requires conceptual creativity and world-modeling, not just pattern recognition. The real creativity in both mathematics and software engineering happens here, not in the mechanical execution of familiar techniques.
What the benchmarks miss is that even mathematics and programming—perhaps our most formal and verifiable intellectual domains—contain a fundamental divide between their mechanical parts and their creative parts. Current AI systems excel at the former while struggling with the latter.
This has implications beyond math and code. We learned from chess that computers can achieve superhuman performance at searching well-defined spaces with verifiable outcomes. But the field drew the wrong lesson, assuming this capability would automatically generalize to all forms of intelligence. The math and coding benchmark results initially seem to confirm this assumption—until we look more carefully at what's actually being tested.
The pattern holds across domains. AI systems perform well when problems resemble step three: well-defined, verifiable, solvable through pattern matching and local search. They struggle with step one: formulating the right questions, building conceptual frameworks, recognizing what matters. We keep mistaking competence at the former for the latter, then wondering why general intelligence remains elusive.
The benchmarks aren't lying, exactly. AI is getting genuinely better at certain kinds of mathematical problem-solving and code generation. But they're measuring proficiency at the chess-like parts of these domains while remaining largely silent about the parts that require understanding. And it’s in that silence that our assumptions about scaling toward AGI quietly live—untested and unexamined.