Mark Liberman at Language Log reports on John Seabrook, "The Next Word: Where will predictive text take us?", The New Yorker 10/14/2019, which is about the marvels of GPT-2. After quoting from the article, taking GPT-2 out for a spin, and saying a bit of this and that, Liberman observes:
This undermines the claim of GPT-2's creators that they've withheld the code because they're afraid it would be used to create malicious textual deep fakes — I suspect that such things would routinely exhibit significant "world modelling failures".
And Seabrook's experience illustrates a crucial lesson that AI researchers learned 40 or 50 years ago: "evaluation by demonstration" is a recipe for what John Pierce called glamor and (self-) deceit ("Whither Speech Recognition", JASA 1969). Why? Because we humans are prone to over-generalizing and anthropomorphizing the behavior of machines; and because someone who wants to show how good a system is will choose successful examples and discard failures. I'd be surprised if Seabrook didn't do a bit of this in creating and selecting his "Read Predicted Text" examples.
In general, anecdotal experiences are not a reliable basis for evaluating scientific or technological progress; and badly-designed experiments are if anything worse.
Caveat lector, yada yada.
Liberman recommends thinking about the Winograd Schema Challenge, which poses pronoun-resolution problems that are trivial for humans but require commonsense world knowledge, making them difficult for computational systems, GPT-2 included.
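To make the challenge concrete, here is a minimal sketch in Python of the classic Winograd schema (the trophy/suitcase pair attributed to Terry Winograd): swapping a single adjective flips which noun the pronoun "it" refers to, so surface statistics alone won't resolve it. The data structure and helper function here are illustrative, not part of any official test harness.

```python
# The classic Winograd schema: "The trophy doesn't fit in the brown
# suitcase because it is too {adjective}." Resolving "it" requires
# knowing that big things don't fit in small containers.
SCHEMA = {
    "template": "The trophy doesn't fit in the brown suitcase because it is too {adj}.",
    # The correct referent of "it" flips with the adjective chosen.
    "answers": {"big": "trophy", "small": "suitcase"},
}

def correct_referent(adj: str) -> str:
    """Return the noun 'it' refers to for the given adjective (gold answer)."""
    return SCHEMA["answers"][adj]

if __name__ == "__main__":
    for adj in ("big", "small"):
        print(SCHEMA["template"].format(adj=adj), "->", correct_referent(adj))
```

A system that answers both variants correctly must encode something about sizes and containment, which is exactly the kind of world modelling Liberman suggests GPT-2 lacks.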