Tim Maudlin reviews Judea Pearl and Dana Mackenzie, The Book of Why, in Boston Review:
Pearl also has one big axe to grind, especially when it comes to the study of human cognition—how we think—and the hype surrounding contemporary artificial intelligence. “Much of this data-centric history still haunts us today,” he writes. It has now been eleven years since Wired magazine announced “the end of theory,” as “the data deluge makes the scientific method obsolete.” Pearl swims strenuously against this tide. “We live in an era that presumes Big Data to be the solution to all our problems,” he says, “but I hope with this book to convince you that data are profoundly dumb.” Data may help us predict what will happen—so well, in fact, that computers can drive cars and beat humans at very sophisticated games of strategy, from chess and Go to Jeopardy!—but even today’s most sophisticated techniques of statistical machine learning can’t make the data tell us why. For Pearl, the missing ingredient is a “model of reality,” which crucially depends on causes. Modern machines, he contends against a chorus of enthusiasts, are nothing like our minds.
To see why mere correlation isn't sufficient, consider this toy example:
To make the stakes clear, consider the following scenario. Suppose there is a robust, statistically significant, and long-term correlation between the color of cars and the annual rate at which they are involved in accidents. To be concrete, assume that red cars, in particular, are involved in accidents year after year at a higher rate than cars of any other color. When you go to buy a new car, should you avoid the color red in your quest to remain safe on the road?
On the other hand, the correlation may have nothing at all to do with the dangerousness of the color itself. It could, for example, be the byproduct of a common cause. People who choose red cars may tend to be more adventurous and thrill-seeking than the average driver, and so be involved in proportionally more accidents. Then again, the correlation may have nothing to do with driving abilities at all. People who buy red cars may just enjoy driving more than other people, and spend more hours a year on the road.
And so forth:
This toy example illustrates the fundamental problem of causal reasoning: How can we find our way through such a thicket of alternative explanations to the causal truth of the matter?
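Maudlin's common-cause story is easy to make concrete. Here is a minimal simulation of my own (not the review's; the "thrill-seeking" variable and all the numbers are invented for illustration) in which car color has, by construction, no causal effect on accidents, yet red cars still crash more often:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden common cause: how thrill-seeking each driver is.
thrill = rng.normal(size=n)

# Thrill-seekers are more likely to choose a red car...
red = rng.random(n) < 1 / (1 + np.exp(-(thrill - 1)))

# ...and more likely to crash. Car color never appears on this line,
# so by construction color has no causal effect on accidents.
accident = rng.random(n) < np.clip(0.05 * np.exp(0.5 * thrill), 0.0, 1.0)

print("accident rate, red cars:    ", accident[red].mean())
print("accident rate, other colors:", accident[~red].mean())
```

The gap in the printed rates is perfectly real, but repainting the cars would change nothing. Nothing in the data alone tells you that; only a hypothesis about how the data were generated does, which is Pearl's point.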
Early in his career, Pearl tried to deal with such problems by throwing more data at them, thereby uncovering more correlations. He decided it didn't work:
In short, you will never get causal information out without beginning by putting causal hypotheses in.
This book is the story of how Pearl came to this realization. In its wake, he developed simple but powerful techniques using what he calls “causal graphs” to answer questions about causation, or to determine when such questions cannot be answered from the data at all. The book should be comprehensible to any reader with sufficient interest to pause over some formulas to digest their conceptual meaning (though the precise details will require some effort even by those with background in probability theory). The good news is that the main innovation that Pearl is advertising—the use of causal hypotheses—gets couched not so much in algebra-laden statistics as in visually intuitive pictures: “directed graphs” that illustrate possible causal structures, with arrows pointing from postulated causes to effects.
And so forth and so on until:
But how do we decide which causal models to test in the first place? For Pearl, they are provided by the theorist on the basis of background information, plausible conjectures, or even blind guesses, rather than being derived from the data. The method of causal graphs allows us to test the hypotheses, both by themselves and against each other, by appeal to the data; it does not tell us which hypotheses to test. (“We collect data only after we posit the causal model,” Pearl insists, “after we state the scientific query we wish to answer. . . . This contrasts with the traditional statistical approach . . . which does not even have a causal model.”)
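To make the "model first, data second" point concrete, here is another toy sketch (again mine, not Pearl's or Maudlin's, with a hypothetical measured "temperament" variable standing in for thrill-seeking). Two rival causal graphs for the red-car data make different, testable predictions: if color causes accidents directly, the red/non-red gap should survive holding temperament fixed; if temperament is a common cause of both, the gap should shrink toward zero within temperament strata:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Rival causal hypotheses (toy setup):
#   H1: color -> accident                                (red paint really is dangerous)
#   H2: temperament -> color, temperament -> accident    (pure confounding)
# Generate data from an H2 world so we can watch the test favor H2.
temperament = rng.normal(size=n)
red = rng.random(n) < 1 / (1 + np.exp(-(temperament - 1)))
accident = rng.random(n) < np.clip(0.05 * np.exp(0.5 * temperament), 0.0, 1.0)

def color_gap(mask):
    """Accident-rate difference, red minus non-red, within a subgroup."""
    return accident[mask & red].mean() - accident[mask & ~red].mean()

# Unconditional association: both hypotheses predict some gap here.
print("overall gap:", color_gap(np.ones(n, dtype=bool)))

# Within temperament strata: H2 predicts the gap shrinks toward zero
# (it won't vanish exactly with coarse strata); H1 predicts it persists.
edges = np.quantile(temperament, np.linspace(0, 1, 6))
for lo, hi in zip(edges[:-1], edges[1:]):
    stratum = (temperament >= lo) & (temperament < hi)
    print(f"gap for temperament in [{lo:+.2f}, {hi:+.2f}):", color_gap(stratum))
```

The data discriminate between the two graphs, but only because the graphs were posited first: nothing in a raw color-by-accident table would have told you that temperament is the variable worth measuring and conditioning on.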
Hmmm... You know, back in my undergraduate years at Johns Hopkins I took a course in social theory taught by Arthur Stinchcombe, whose Constructing Social Theories has become a classic. He taught us that a good social scientist should come up with two or three hypotheses about what is going on and then design experiments that discriminate between them. Makes sense, no? That seems to be what Pearl is arguing.
The whole review is worth reading.