LLMs process text from left to right — each token can only look back at what came before it, never forward. This means that when you write a long prompt with context at the beginning and a question at the end, the model answers the question having "seen" the context, but the… pic.twitter.com/qwxt7R7RIG
— BURKOV (@burkov) February 17, 2026
The reason this works tells you everything about the gap between how people think LLMs process information and how they actually process it. Every token can only look backward. So when you write “here’s a list of 50 names” followed by “what’s the 25th name?”, the list tokens are processed with zero awareness that a question is coming. The question tokens can see the list, but the list never saw the question.
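Here’s a minimal sketch of that asymmetry, using a toy causal mask in numpy (the token split and names are illustrative, not from any particular model): rows are query positions, columns are key positions, and position i is only allowed to attend to positions ≤ i.

```python
# Toy causal attention mask: position i can only attend to positions <= i.
# The "list" tokens sit before the "question" tokens, so no list token is
# ever allowed to attend to any question token.
import numpy as np

tokens = ["Alice", "Bob", "Carol", "What", "is", "the", "2nd", "name", "?"]
n = len(tokens)

# True where attention is allowed (query row i, key column j, with j <= i).
causal_mask = np.tril(np.ones((n, n), dtype=bool))

list_positions = range(0, 3)      # the context tokens
question_positions = range(3, n)  # the question tokens

# No context token can see any question token...
print(causal_mask[np.ix_(list_positions, question_positions)].any())  # False

# ...but every question token can look back at the full context.
print(causal_mask[np.ix_(question_positions, list_positions)].all())  # True
```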
Repeating the prompt gives every token a second pass where it can attend to everything else. You’re essentially hacking bidirectional attention into a unidirectional system. And the cost is nearly zero because prefill is parallelized on modern hardware.
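A minimal sketch of the repetition trick, assuming an OpenAI-style chat API; the model name, the bridge sentence, and the exact prompt layout are placeholders, not anything the paper prescribes. The point is simply that the full prompt is sent twice, so every context token in the second copy can attend back to the question in the first.

```python
# Sketch: send the whole prompt twice so the second copy of the context is
# processed with the question already in view. The repeated copy is prefill,
# so it is processed in parallel and adds little latency.
from openai import OpenAI

client = OpenAI()

def repeated_prompt(context: str, question: str) -> str:
    single = f"{context}\n\nQuestion: {question}"
    # Placeholder bridge sentence; any explicit restatement works the same way.
    return f"{single}\n\nI will restate the task before answering:\n\n{single}"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": repeated_prompt(
            context="Names: Alice, Bob, Carol, Dave, Erin",
            question="What is the 3rd name?",
        ),
    }],
)
print(response.choices[0].message.content)
```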
But here’s what makes this actually interesting: reasoning models already do this. When you enable chain-of-thought, the gains from repetition almost entirely disappear (5 wins, 1 loss, 22 ties). That means reasoning models trained with RL independently learned to repeat the user’s prompt back to themselves before answering. The “thinking” that costs you 10x more tokens and 5x more latency is partly just the model giving itself a second look at your input.
Which means a meaningful chunk of what we’re paying for with “reasoning” tokens could be replicated for free at the architecture level. The entire prompt repetition paper is an accidental proof that causal attention is leaving massive performance on the table, and that the industry’s current fix (burn more tokens thinking) is the expensive workaround for a structural limitation nobody’s addressing directly.