Monday, April 15, 2024

The AI industry lacks useful ways of measuring performance [the boastful leading the blind]

Kevin Roose, A.I. Has a Measurement Problem, NYTimes, April 15, 2024.

There’s a problem with leading artificial intelligence tools like ChatGPT, Gemini and Claude: We don’t really know how smart they are.

That’s because, unlike companies that make cars or drugs or baby formula, A.I. companies aren’t required to submit their products for testing before releasing them to the public. There’s no Good Housekeeping seal for A.I. chatbots, and few independent groups are putting these tools through their paces in a rigorous way.

Instead, we’re left to rely on the claims of A.I. companies, which often use vague, fuzzy phrases like “improved capabilities” to describe how their models differ from one version to the next. And while there are some standard tests given to A.I. models to assess how good they are at, say, math or logical reasoning, many experts have doubts about how reliable those tests really are.

Safety risk:

Shoddy measurement also creates a safety risk. Without better tests for A.I. models, it’s hard to know which capabilities are improving faster than expected, or which products might pose real threats of harm.

In this year’s A.I. Index — a big annual report put out by Stanford University’s Institute for Human-Centered Artificial Intelligence — the authors describe poor measurement as one of the biggest challenges facing A.I. researchers.

“The lack of standardized evaluation makes it extremely challenging to systematically compare the limitations and risks of various A.I. models,” the report’s editor in chief, Nestor Maslej, told me.

Massive Multitask Language Understanding:

The MMLU, which was released in 2020, consists of a collection of roughly 16,000 multiple-choice questions covering dozens of academic subjects, ranging from abstract algebra to law and medicine. It’s supposed to be a kind of general intelligence test — the more of these questions a chatbot answers correctly, the smarter it is.
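
To make that scoring concrete, here's a minimal Python sketch of how an MMLU-style accuracy number gets computed. The `ask_model` callable, the question dictionaries, and the prompt layout are all illustrative assumptions on my part, not the official evaluation harness:

```python
# Minimal sketch of MMLU-style scoring: accuracy over multiple-choice
# questions. `ask_model` is a hypothetical stand-in for a call to any
# chatbot API; the data format below is assumed, not the official one.

CHOICES = "ABCD"

def format_question(q: dict) -> str:
    """Render one question in a typical multiple-choice prompt layout."""
    options = "\n".join(f"{letter}. {text}"
                        for letter, text in zip(CHOICES, q["options"]))
    return f"{q['question']}\n{options}\nAnswer:"

def mmlu_accuracy(questions: list[dict], ask_model) -> float:
    """Fraction of questions where the model's letter matches the key."""
    correct = 0
    for q in questions:
        reply = ask_model(format_question(q)).strip()
        if reply[:1].upper() == q["answer"]:  # compare first letter only
            correct += 1
    return correct / len(questions)

# Toy usage: one question and a dummy "model" that always answers B.
qs = [{"question": "2 + 2 = ?",
       "options": ["3", "4", "5", "6"],
       "answer": "B"}]
print(mmlu_accuracy(qs, lambda prompt: "B"))  # 1.0
```

The headline number is just that final fraction, which is part of the problem: it compresses thousands of judgment calls (prompt layout, answer parsing) into one score.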

It has become the gold standard for A.I. companies competing for dominance. (When Google released its most advanced A.I. model, Gemini Ultra, earlier this year, it boasted that it had scored 90 percent on the MMLU — the highest score ever recorded.)

Dan Hendrycks, an A.I. safety researcher who helped develop the MMLU while in graduate school at the University of California, Berkeley, told me that the test was never supposed to be used for bragging rights. He was alarmed by how quickly A.I. systems were improving, and wanted to encourage researchers to take it more seriously.

Mr. Hendrycks said that while he thought MMLU “probably has another year or two of shelf life,” it will soon need to be replaced by different, harder tests. A.I. systems are getting too smart for the tests we have now, and it’s getting more difficult to design new ones.

Problems:

There may also be problems with the tests themselves. Several researchers I spoke to warned that the process for administering benchmark tests like MMLU varies slightly from company to company, and that various models’ scores might not be directly comparable.
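
As a hypothetical illustration of that variance, consider two plausible prompt layouts for the same question. Neither is any particular lab's actual setup, but models are sensitive enough to formatting (and to few-shot examples, sampling settings, and answer parsing) that accuracy under one layout need not match accuracy under the other:

```python
# Two plausible ways different labs might present the same benchmark
# question. Both templates are illustrative assumptions, not any lab's
# documented configuration.

question = "What is the time complexity of binary search?"
options = ["O(n)", "O(log n)", "O(n log n)", "O(1)"]

body = question + "\n" + "\n".join(
    f"{letter}. {text}" for letter, text in zip("ABCD", options))

# Lab 1: bare zero-shot layout.
template_a = body + "\nAnswer:"

# Lab 2: instruction-wrapped layout; real runs also differ in few-shot
# examples, temperature, and how a free-text reply is mapped to a letter.
template_b = ("The following is a multiple-choice question. "
              "Answer with a single letter.\n\n" + body + "\nAnswer:")

print(template_a)
print("---")
print(template_b)
```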

There is also a problem known as “data contamination,” in which the questions and answers for benchmark tests are included in an A.I. model’s training data, essentially allowing it to cheat. And there is no independent testing or auditing process for these models, meaning that A.I. companies are essentially grading their own homework.
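
To give a flavor of what a contamination check can look like, here's a toy sketch of one common idea: flag a benchmark question when a long word n-gram from it also shows up in the training text. The 8-word window and the data are my own illustrative choices; real checks run over tokenized corpora at vastly larger scale:

```python
# Toy n-gram overlap contamination check. The window size (8 words) and
# the in-memory corpus handling are illustrative assumptions; production
# checks operate on tokenized corpora far too large to fit in a set.

def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams of `text`, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus: str, n: int = 8) -> bool:
    """True if any n-gram of the question also occurs in the corpus."""
    return not ngrams(question, n).isdisjoint(ngrams(corpus, n))

# Hypothetical usage: the "training shard" happens to contain the question.
corpus = ("... the training shard text, which happens to contain "
          "what is the time complexity of binary search in the worst case ...")
print(is_contaminated(
    "What is the time complexity of binary search in the worst case?",
    corpus))  # True: an 8-gram is shared
```

Note that a check like this can only be run by someone with access to the training data, which is exactly why the absence of independent auditing matters.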

In short, A.I. measurement is a mess — a tangle of sloppy tests, apples-to-oranges comparisons and self-serving hype that has left users, regulators and A.I. developers themselves grasping in the dark.

There's more at the link.

Color me "not at all surprised." Not only does the field lack a sound theoretical basis; as far as I can tell, it doesn't even know that, hey, such a basis might be useful at a time like this. I don't have a theory to hand over, though I have a thought or three about how one might go about developing one. But then, I'm not making (unfounded) performance claims either.

Without a coherent way of measuring performance, how can you guide the development of your products? Are we in Bruegel-land, with the blind leading the blind?
