Monday, August 25, 2025

LLMs are challenged by tic-tac-toe [plus P vs. NP]

Andrew Gelman has a short post about tic-tac-toe over at Statistical Modeling, Causal Inference, and Social Science (Aug. 24, 2025). It's one of those problems that are easy for humans (nine years old or older) but difficult for LLMs. The (hype-infested) industry seems to have bought into the idea that intelligence is a scalar quantity, the intellectual equivalent of horsepower. If that were so, then tic-tac-toe wouldn't be a problem. It's not a horsepower problem. Something else is going on.

The post has generated a fair amount of discussion. Bob Carpenter made a particularly interesting remark:

The linked post from Gary Smith concludes with:

If you know the answer, you don’t need to ask an LLM; if you don’t know the answer, you can’t trust one.

This is wrong. It misunderstands the asymmetry between generating a correct answer and verifying a correct answer. I use LLMs all the time to generate plotnine (Python’s knockoff of ggplot2) and pandas (Python’s knockoff of data frames and Tidyverse manipulation) code. In some sense I know these tools, but I can never remember the exact incantation to put the x axis on a log scale and remove the ticks and labels from the y axis. When I tell the chatbot what I want and it generates pandas and plotting code, I can verify that it’s correct. Doing this used to take me forever, as I would have to dig through the docs, StackOverflow answers, or tutorials. Now I put many more figures in things I’m writing.
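
The kind of incantation in question might look roughly like this in plotnine (a sketch with made-up data; exact theme arguments can vary across versions):

```python
# Sketch: x axis on a log scale, y-axis ticks and labels removed (plotnine).
import pandas as pd
from plotnine import ggplot, aes, geom_point, scale_x_log10, theme, element_blank

df = pd.DataFrame({"x": [1, 10, 100, 1000], "y": [2.0, 3.5, 5.1, 7.2]})

p = (
    ggplot(df, aes("x", "y"))
    + geom_point()
    + scale_x_log10()                            # log-scale x axis
    + theme(axis_text_y=element_blank(),         # drop y-axis labels
            axis_ticks_major_y=element_blank())  # drop y-axis ticks
)
p.save("figure.png")  # or display `p` directly in a notebook
```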

The asymmetry between verification and generation is key to understanding the difference between polynomial time (P) and non-deterministic polynomial time (NP) algorithms. An NP problem can be formulated as guessing a candidate solution and verifying it with a P algorithm. If the LLM is a much better guesser than me, it saves me a huge amount of search. It is also really great at writing short snippets of code from tight text descriptions. I can say what I want, and it can generate it faster than I could write it myself. So it’s a huge win for things like graphing.
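
A toy illustration of that guess-versus-verify asymmetry, using subset sum (a classic NP problem) rather than anything LLM-specific: checking a proposed answer is cheap, while finding one generally requires search.

```python
# Subset sum: verifying a certificate is fast, finding one by brute force is not.
from itertools import combinations

def verify(nums, target, certificate):
    """Cheap check: is the certificate drawn from nums and does it sum to target?"""
    return all(x in nums for x in certificate) and sum(certificate) == target

def search(nums, target):
    """'Guessing' by brute force: try subsets until one verifies."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if verify(nums, target, subset):
                return subset
    return None

nums = [3, 34, 4, 12, 5, 2]
print(verify(nums, 9, (4, 5)))   # fast: sum and compare -> True
print(search(nums, 9))           # slow in general: enumerates subsets -> (4, 5)
```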

Guessing and verifying is also the basis for the LLM “thinking” modes like you see in all the chatbots now (this kind of chain-of-thought came to everyone’s attention with DeepSeek). The drawback is the amount of compute it takes to guess.

One of my main use cases for chatbots is scenario and character generation for roleplaying games. I’m currently playing a short Fast and Furious knock-off campaign set in Paris in May of ’68. Not only is the chatbot amazing at suggesting locations and plot lines, it’s amazing at inventorying the cars that the Italian, French, German, and British teams would drive in 1968. And then it can generate the tokens.

BTW, I asked ChatGPT about the difficulty with tic-tac-toe. Its reply is below the asterisks.

* * * * * 

LLMs handle language very well, but tic-tac-toe apparently is still problematic. The problem isn’t “horsepower.” What is the problem? Inducing the rules from examples? The computational workflow?

That’s an excellent observation — and you’re right: LLMs are strangely good at language but can stumble on something as trivial (formally) as tic-tac-toe. The reason lies exactly where you’re pointing: it’s not about computational horsepower, it’s about the match (or mismatch) between the architecture and the task’s footprint.

Here’s a breakdown:

1. How LLMs learn.

  • They’re trained to predict the next token in text.
  • What they get is a massive soup of examples — human conversations, documents, stories, some of which happen to include tic-tac-toe transcripts.
  • They don’t get the game rules explicitly; they must induce them from inconsistent, sparse, noisy traces.


2. What tic-tac-toe requires.

  • A tiny, crisp rule set.
  • Deterministic reasoning over a small state space.
  • Success depends on systematic state tracking: remembering the board, applying the rules, projecting possible moves.


3. The mismatch.

  • Induction from examples: Tic-tac-toe transcripts in training data are too few and too variable to guarantee clean rule induction. An LLM may see people making “illegal” moves in casual online play, or skipping notation steps.
  • Computational workflow: Language modeling is statistical continuation, not explicit rule execution. To play tic-tac-toe perfectly, you need a loop of: represent board → apply rule → update board → evaluate (a minimal sketch of this loop follows the list). That’s more like symbolic programming than token prediction.
  • Representation: LLMs don’t have an internal “board” structure unless you scaffold one with special prompting or external memory. Without that, they may lose track of placements across turns.
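
A minimal sketch of that loop in Python (the cell numbering and function names are illustrative, not from the reply): the board is represented explicitly, each move is checked against the rules, and the position is evaluated after every update.

```python
# Sketch of the represent-board -> apply-rule -> update-board -> evaluate loop.
# Cells are numbered 0-8, left to right, top to bottom.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Evaluate: return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    """Apply rule: a move is legal only on an empty cell."""
    return [i for i, cell in enumerate(board) if cell == ' ']

def play(moves):
    """Represent the board explicitly and update it move by move."""
    board = [' '] * 9
    player = 'X'
    for cell in moves:
        if cell not in legal_moves(board):
            raise ValueError(f"illegal move: cell {cell}")
        board[cell] = player                    # update board
        if winner(board):                       # evaluate
            return f"{player} wins"
        player = 'O' if player == 'X' else 'X'
    return "no winner yet"

print(play([4, 0, 8, 2, 1, 6, 7]))   # X takes 4, 8, 1, 7 and wins the middle column
```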


4. Why they shine at language.

  • Natural language is inherently statistical and pattern-rich. There’s redundancy, paraphrase, error tolerance.
  • LLMs are designed to surf those regularities, so they capture language structure beautifully.


5. Why they stumble on tic-tac-toe.

  • Tic-tac-toe has no redundancy, no fuzziness. A single slip (“forgetting” a block) = loss.
  • In effect, it’s an unforgiving micro-domain where the “geometric footprint” (8 cells left to fill, precise win conditions) is far narrower than the model’s training and inference process.

So:

👉 The problem isn’t horsepower. It’s that tic-tac-toe demands a rule-execution pipeline, while LLMs supply a probabilistic continuation engine.


This is why hybrid approaches — LLM + symbolic module (e.g. letting the LLM call a solver or maintain an explicit board state) — handle these tasks much better.
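
A sketch of what such a symbolic module might look like (a hypothetical interface, not any particular chatbot's tool API): the LLM hands over the board as text and gets back a guaranteed-legal, optimal move from an exhaustive minimax search.

```python
# Hypothetical solver an LLM could call instead of tracking the board in tokens.
# The board is a 9-character string of 'X', 'O', and ' ', cells 0-8.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def best_move(board, player):
    """Return (score, move) for `player`: +1 forced win, 0 draw, -1 forced loss."""
    w = winner(board)
    if w is not None:
        return (1 if w == player else -1), None
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0, None                          # board full: draw
    opponent = 'O' if player == 'X' else 'X'
    best_score, best_cell = -2, None
    for m in moves:
        child = board[:m] + player + board[m + 1:]
        opp_score, _ = best_move(child, opponent)
        if -opp_score > best_score:             # good for the opponent is bad for us
            best_score, best_cell = -opp_score, m
    return best_score, best_cell

# X to move with two in a row on top: the solver finds the winning cell 2.
print(best_move('XX OO    ', 'X'))              # -> (1, 2)
```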

1 comment:

  1. There's a new game in NYT: Pips. Like the domino kind of pips. The pieces have to be arranged according to the pattern set. This would totally flummox an LLM. (You can find the game by googling it, too. It's my new obsession.)
