Siobhan Roberts, "These Mathematicians Are Putting A.I. to the Test," New York Times, Feb. 7, 2026.
Martin Hairer (Swiss Federal Institute of Technology in Lausanne), Mohammed Abouzaid (Stanford University), Lauren Williams (Harvard University) and Tamara Kolda (who runs MathSci.ai, a consultancy) are among a group of mathematicians who have published an article, "First Proof," about an "experiment that collects genuine test questions, drawn from unpublished research by the authors, in an effort to provide a meaningful measure of A.I.'s mathematical competency."
“While commercial A.I. systems are undoubtedly already at a level where they are useful tools for mathematicians,” the authors wrote, “it is not yet clear where A.I. systems stand at solving research-level math questions on their own, without an expert in the loop.”
A.I. companies use what some mathematicians describe as "contrived" or "restrictive" problems for evaluating and benchmarking how well L.L.M.s fare when operating without human help. Occasionally, mathematicians are invited to contribute such problems and are paid some $5,000 per problem.
From the conversation:
The paper is careful to clarify “what mathematics research is.” What is it?
MOHAMMED ABOUZAID Often in modern research, the key step is to identify the big motivating question, the direction from which the problem should be approached. It involves all kinds of preliminary work, and this is where mathematical creativity takes place.
Once problems are solved, mathematicians tend to evaluate the importance of research contributions in terms of the questions that arise. Sometimes, resolving a conjecture one way is seen as disappointing, because it forecloses the possibility of new questions to investigate.
LAUREN WILLIAMS Let me make a loose analogy. In experimental science, I might divide the components of research into three parts: One, come up with the big question, whose study we hope will shed light on our field. Two, design an experiment to answer the question. Three, perform the experiment and analyze the results.
I can similarly divide math research into parallel parts: One, come up with the big question, whose study we hope will guide our field. Two, develop a framework for finding a solution, which involves dividing the big question into smaller, more tractable questions — like our test questions. Three, find solutions to these smaller questions and prove they are correct.
All three parts are essential. In our First Proof project, we focused on the third component because it is the most measurable. We can query the A.I. model with small, well-defined questions, and then assess whether its answers are correct. If we were to ask an A.I. model to come up with the big question, or a framework, it would be much harder to evaluate its performance.
Note that this is roughly consistent with the accounts I gave of some of my own work in "Serendipity in the Wild: Three Cases, with Remarks on What Computers Can't Do," January 8, 2026. That the authors focused on the third component squares with my impression that the problems L.L.M.s solve successfully lie in well-specified, more or less closed domains. But, as Abouzaid noted, the creativity takes place before such problems have been identified.
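Williams's third component is the only one of the three that lends itself to a mechanical loop, which is worth seeing concretely. Here is a minimal sketch, in Python, of what such an evaluation might look like; everything in it is hypothetical, with query_model and expert_grade as invented placeholders for the system under test and for human grading, not anything specified in the First Proof paper:

    # Hypothetical sketch of the "third component" evaluation loop:
    # pose small, well-defined questions to a model, have an expert
    # check each answer, and report the fraction answered correctly.
    from dataclasses import dataclass

    @dataclass
    class TestQuestion:
        prompt: str  # a small, well-defined question drawn from unpublished work
        field: str   # e.g. "probability" or "combinatorics"

    def query_model(prompt: str) -> str:
        """Placeholder for a call to whatever A.I. system is under test."""
        raise NotImplementedError

    def expert_grade(question: TestQuestion, answer: str) -> bool:
        """Placeholder: a human expert judges whether the proof is correct."""
        raise NotImplementedError

    def run_evaluation(questions: list[TestQuestion]) -> float:
        """Return the fraction of questions the model answered correctly."""
        correct = sum(
            1 for q in questions if expert_grade(q, query_model(q.prompt))
        )
        return correct / len(questions)

The point is only that this component is measurable: a well-defined prompt goes in, an answer comes out, and a human can say yes or no. Williams's first two components, posing the big question and designing the framework, admit no such loop.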
MARTIN HAIRER One thing I noticed, in general, was that the model tended to give a lot of details on the things that were easy, where you would be like: "Yeah, sure, go a bit faster. I'm bored with what you're saying." And then it would give very little detail on the crux of the argument. Sometimes it would be like reading a paper by a bad undergraduate student, where they sort of know where they're starting from, they know where they want to go, but they don't really know how to get there. So they wander around here and there, and then at some point they just stick in "and therefore" and pray.
Sounds like the classic hand-waving — lacking rigor, skipping over complexities.
HAIRER Yeah, it’s pretty good at giving hand-wavy answers.
So, you weren’t impressed?
HAIRER No, I wouldn’t say that. At times I was actually quite impressed — for example, with the way it could string together a bunch of known arguments, with a few calculations in between. It was really good at doing that correctly.
In your dream world, what would the A.I. be doing for you?
HAIRER Currently the output of L.L.M.s is hard to trust. They display absolute confidence, but it requires a lot of effort to convince yourself whether their answers are correct or not; I find it intellectually painful. Again, it's like a graduate student where you don't quite know whether they are strong or whether they're just good at B.S. The ideal thing would be a model that you can trust.
KOLDA A.I. is touted as being like a colleague or a collaborator, but I don't find that to be true. My human colleagues have particular outlooks, and I especially enjoy when we debate different points of view. An A.I. has whatever viewpoint I tell it to have, which is not interesting at all!