Monday, December 20, 2010

Steak, War and Ngrams

In one of Language Log’s discussions of Google’s Ngram Viewer one commenter, named John, posted a query showing the frequency of pork chops, fried chicken, meat loaf, and steak from 1920 through 2008:


He thought it suspicious that “they all peak at the same place, in the early 1940s and are all on the rise again now.” In a later comment he elaborated:
Part of my point was that it doesn't seem to be a very balanced corpus at all, unless one can come up with a good reason why several food terms rise and fall together in frequency over a period of decades. That strikes me as more an artifact of the data set than a sign of any change in the language.
I took a somewhat different view. While recognizing that the data set surely has artifacts and that those curves might be a symptom of that, I was struck by the rise in the early to middle 1940s. That, of course, is the period of World War II. I pretty much assume, on general principle, that such large-scale events have noticeable effects on the mind-set of a population and that those effects would likely show up in the books published during the period.

Perhaps this odd rise in the food-terms curves was one such effect. I have no particularly good explanation of why it should be such an effect – though I’ve thought about rationing during war time. I just thought the coincidence interesting and suggestive. So I set out to investigate.

Note that, even if I am right about such a connection, that doesn’t explain the rise in the late 1990s and into the new century. For the purpose of getting on with it, I just set that issue aside. Also note that for the purposes of this post, I’m not going to recount just what I did in the order that I did it; you can follow that in the comments at Language Log if it matters to you.

First, John’s query ranged from 1920 through 2008. Let’s push the start date back to 1900 and see if World War I shows up:


It does, but only in the curve for steak (the yellow line at the top). The curves for pork chops, fried chicken, and meat loaf show a low steady rise through that period. The fact that we do see one peak for WWI suggests that we may be looking at something that’s real rather than an artifact of the data, though I regard the suggestion as rather weak.

But, taking it at face value, why should steak be different from the others? I don’t know.

Now let’s try something else. Google Books has provided five different collections of English-language n-grams. So far we’ve been running these searches against the largest and most comprehensive collection. Let’s run the same query against British English:


As in the first two cases, the curve for steak is higher than the curves for pork chops, fried chicken, and meat loaf. But neither WWI nor WWII shows up very strongly, though we do have a rise at the end of the 1990s.

Now let’s check the collection for American English:


WWI and WWII are back. And perhaps we’re seeing the Korean War in the 1950s. Perhaps.

Now, to round things off, let’s try jeep and Jeep. The search is case sensitive, with “jeep” being the generic term for a certain kind of military vehicle while “Jeep” is the trademarked brand name for the same vehicle. Neither existed before WWII, so I’ll start the curve at 1930:


As expected, “jeep” gives us a sharp rise at the beginning of the war while “Jeep” does not. Notice that both rise at the end of the 1990s and then drop off around 2005.

So, what do we make of this?

It’s hard to say. I do think there’s something going on with the two wars and food terms. But just what, that’s hard to say. Why should the British and American collections differ on that? Britain experienced the wars differently from America. But whether that’s the cause or whether we’re dealing with differences in diet, I don’t know.

The rises in the late 1990s and on into the next century would seem to have a different cause, but what that is, I don’t know. Maybe it’s some sort of millennial effect. To be sure, that transition from 1999 to 2000 is just an arbitrary point in time, but it’s an arbitrary point that matters in how people think about the world. But why these terms should go up, I don’t know. And, since these curves represent percentages, not absolute values, the fact that some curves are rising implies that some other curves are going down. If these terms are on the rise over millennial anticipation, what terms are on the way down?
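To make the point about percentages concrete, here is a minimal Python sketch. The counts are invented for illustration and do not come from the Google collections, and the function names are my own. It assumes each curve is a term's share of all the n-grams counted in a given year, so the shares for any one year sum to 1 and the changes from one year to the next must sum to zero.

```python
def shares(counts):
    """Convert raw counts for one year into shares that sum to 1."""
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def share_changes(before, after):
    """Change in each term's share between two years."""
    s0, s1 = shares(before), shares(after)
    terms = set(s0) | set(s1)
    return {t: s1.get(t, 0.0) - s0.get(t, 0.0) for t in terms}


# Invented counts, for illustration only.
COUNTS_1995 = {"steak": 40, "jeep": 10, "everything else": 950}
COUNTS_2000 = {"steak": 60, "jeep": 15, "everything else": 1125}

changes = share_changes(COUNTS_1995, COUNTS_2000)
for term, delta in sorted(changes.items()):
    print(f"{term}: {delta:+.4f}")
```

In this toy case the raw count for "everything else" goes up, yet its share goes down, because "steak" and "jeep" grew faster than the total did. A falling curve need not mean a term is being used less in absolute terms.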

Two issues seem to emerge from this little discussion. If we are seeing the effects of two world wars and a millennial transition, we’d certainly like to investigate that. And we don’t really want to do that by testing n-grams one, two, and three at a time. We want to look at the time functions of every n-gram in the collections for correlations with those three events. That will take some pretty massive computing power.
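As a rough idea of what such a test might look like for a single event, here is a small Python sketch. It is only one possible approach, not something I ran against the real collections: it scores each n-gram by the Pearson correlation between its yearly frequency series and a 0/1 indicator marking the years of the event, then ranks the n-grams by that score. The function names, the toy series, and the choice of 1939 through 1945 as the WWII window are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def event_indicator(years, periods):
    """1.0 for years inside any (start, end) period, inclusive; 0.0 otherwise."""
    return [
        1.0 if any(start <= y <= end for start, end in periods) else 0.0
        for y in years
    ]


def rank_by_event(series_by_ngram, years, periods):
    """Rank n-grams from most to least positively correlated with the event."""
    indicator = event_indicator(years, periods)
    scores = {ng: pearson(freqs, indicator) for ng, freqs in series_by_ngram.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data, invented for illustration; one value per year, 1936 through 1949.
YEARS = list(range(1936, 1950))
WWII = [(1939, 1945)]  # assumed event window
TOY_SERIES = {
    "jeep": [0, 0, 0, 0, 0, 5, 9, 9, 8, 7, 3, 2, 2, 2],
    "teacup": [4] * 14,
    "parasol": [9, 9, 8, 8, 7, 2, 1, 1, 2, 2, 6, 7, 7, 8],
}

ranked = rank_by_event(TOY_SERIES, YEARS, WWII)
for ngram, score in ranked:
    print(f"{ngram}: {score:+.2f}")
```

A step-function indicator is a crude stand-in for an event, and a high score says nothing about why a term tracks the event. Doing this over every n-gram in a collection would also mean many millions of comparisons, so some spurious matches are to be expected.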

The other issue is causality. There is an obvious account for the correlation between the “jeep” curve and WWII. It’s a military vehicle that saw extensive use in that war. But why should “steak” rise and fall with the two wars, and only in the American corpus? The causal question will multiply, of course, in the wake of testing all the n-grams against the two wars.
