Wednesday, February 18, 2026

AGI has NOT been achieved.

In a recent tweet, Valerio Capraro explains why recent claims that AGI has been reached are wrong:

1) They shift the definition of general intelligence, originally based on robustness, generalization, and reliability, to behavioral alignment with benchmarks.

2) They confuse benchmark performance with capability to handle novelty. Spoiler: these are different.

3) They ignore that the same behavioral output can come from totally different epistemic pipelines.
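Points 2 and 3 are easy to make concrete with a toy sketch. Nothing below has anything to do with how LLMs actually work internally; it just shows, with a made-up two-item benchmark, how a pure lookup table can post a perfect benchmark score while failing every novel rephrasing. Two very different epistemic pipelines can produce identical benchmark behavior.

```python
# Toy illustration: a pure lookup table "aces" the benchmark it was
# fit to while failing every novel variant. All data is made up.

benchmark = {
    "What is the capital of France?": "Paris",
    "What is 2 + 2?": "4",
}

# Novel items: same underlying skills, unseen surface forms.
novel = {
    "Name the capital city of France.": "Paris",
    "What do you get when you add 2 and 2?": "4",
}

# The "model" is literally the benchmark answer key, memorized.
memorizer = dict(benchmark)

def score(model: dict, items: dict) -> float:
    """Fraction of items the model answers correctly."""
    hits = sum(model.get(q) == a for q, a in items.items())
    return hits / len(items)

print(f"benchmark accuracy: {score(memorizer, benchmark):.0%}")  # 100%
print(f"novel-item accuracy: {score(memorizer, novel):.0%}")     # 0%
```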

He then links to this long post by Gary Marcus, Walter Quattrociocchi, and himself: Rumors of AGI’s arrival have been greatly exaggerated. Concerning benchmarks, they say:

Much of the argument that artificial general intelligence has already been achieved rests on benchmark performance (e.g., Chen et al., 2026). Benchmarks evaluate specific capabilities under controlled conditions and have been useful for tracking progress. For example, Chen and colleagues, writing in this journal, argue that success on the Turing Test constitutes evidence of AGI.

However, benchmark success is a limited indicator of general intelligence. By design, benchmarks isolate narrow competencies and abstract away real-world context, making it difficult to distinguish genuine generalization from pattern recognition. Strong benchmark performance often provides little evidence of robustness under novelty, uncertainty, or shifting objectives.

Yes! Back in January 2025 I got ChatGPT to produce an argument about the weakness of benchmarks: ChatGPT critiques benchmarks as a measure of LLM performance, in which it elaborates on my whaling analogy for what’s wrong with the AI business.

Marcus, Quattrociocchi, and Capraro conclude:

By the standards articulated in the original definitions of artificial general intelligence—robustness across environments, reliable generalization under novelty, and autonomous goal-directed behavior—current AI systems remain limited. Despite impressive gains in narrow competence and fluency, today’s large language models lack persistent goals, struggle with long-horizon reasoning, and depend extensively on human scaffolding for task formulation, evaluation, and correction. Reports that language models have produced correct proofs for isolated open problems in mathematics, including specific Erdős problems, do not alter this assessment. As noted by mathematicians such as Terence Tao, these results primarily reflect the ability to rapidly search, recombine, and iterate over existing techniques, rather than the emergence of genuinely novel or domain-general problem-solving strategies. Moreover, inclusion in the Erdős list does not by itself imply exceptional conceptual difficulty, as some problems remain unsolved due to relative obscurity rather than depth.

These limitations are central rather than peripheral. They directly concern reliability under uncertainty, resistance to systematic failure, and cross-domain transfer without task-specific tuning. On these dimensions, current systems remain brittle, sensitive to prompt framing, and inconsistent outside curated evaluation settings. Recognizing these constraints does not diminish recent progress; it clarifies its scope.
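That brittleness is easy to probe, at least crudely. Here's a minimal sketch of a prompt-framing sensitivity check; query_model is a hypothetical stand-in for whatever LLM API you'd actually call, so the stub below just echoes a fixed answer to keep the sketch runnable.

```python
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call. Replace with your
    provider's client; this stub returns a fixed answer so the
    sketch runs as-is."""
    return "42"

# The same underlying question, framed four different ways.
framings = [
    "What is 6 times 7?",
    "Compute the product of six and seven.",
    "If a crate holds 6 rows of 7 apples, how many apples is that?",
    "6 * 7 = ?",
]

answers = [query_model(p).strip() for p in framings]

# Crude consistency score: share of responses matching the modal
# answer (1.0 means the model is fully stable across framings).
modal_count = Counter(answers).most_common(1)[0][1]
consistency = modal_count / len(answers)

print(f"answers: {answers}")
print(f"consistency across framings: {consistency:.2f}")
```

The idea is simply that a system whose competence generalizes should return the same answer no matter how the question is framed; a consistency score well below 1.0 across paraphrases is the kind of instability the authors mean by "sensitive to prompt framing."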
