Abstract: AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy, but the rich variety of standardized exams has remained a landmark challenge. Even in 2016, the best AI system achieved merely 59.3% on an 8th Grade science exam challenge.
This paper reports unprecedented success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90% on the exam's non-diagram, multiple choice (NDMC) questions. In addition, our Aristo system, building upon the success of recent language models, exceeded 83% on the corresponding Grade 12 Science Exam NDMC questions. The results, on unseen test questions, are robust across different test years and different variations of this kind of test. They demonstrate that modern NLP methods can result in mastery on this task. While not a full solution to general question-answering (the questions are multiple choice, and the domain is restricted to 8th Grade science), it represents a significant milestone for the field.
From the introduction:
Instead of a binary pass/fail, machine intelligence is more appropriately viewed as a diverse collection of capabilities associated with intelligent behavior. Finding appropriate benchmarks to test such capabilities is challenging; ideally, a benchmark should test a variety of capabilities in a natural and unconstrained way, while additionally being clearly measurable, understandable, accessible, and motivating.
Here's a story about this in the NYTimes.
Standardized tests, in particular science exams, are a rare example of a challenge that meets these requirements. While not a full test of machine intelligence, they do explore several capabilities strongly associated with intelligence, including language understanding, reasoning, and use of common-sense knowledge. One of the most interesting and appealing aspects of science exams is their graduated and multifaceted nature; different questions explore different types of knowledge, varying substantially in difficulty. For this reason, they have been used as a compelling—and challenging—task for the field for many years (Brachman et al., 2005; Clark and Etzioni, 2016).