Reading Macroanalysis 4: On the matter of “the”

Chapter 7, “Nationality,” is pretty straightforward. I don’t have much to say about it except for a puzzle that Jockers presents at the beginning. He points out that, because British and American writers have different practices concerning the word the, that word is about 5 percent of the word tokens in his corpus of 19th Century British novels, while it is about 6 percent of the tokens in the American novels.

That is a trivial and straightforward matter. Now, when you plot yearly usage throughout the century you find that British and American usage track one another fairly well (p. 106):

“Over a period of one hundred years, it is as if the writers in these two nations–two nations separated by several thousand miles of water in an age before mass communication–made a concerted effort to modulate their usage of this most common of common words. However, the is not the kind of word that authors would consciously agonize over; quite the contrary, the is a trivial word, a function word used automatically and by necessity. Whereas the use of the word beautiful, for example, may come and go with the fads of culture, the word the is a whole different animal. For comparison, consider figure 7.2, which charts the relative frequency of the word beautiful in the same corpus. Whereas the is nearly parallel, beautiful is erratic and unpredictable.”

Which rather seems to me to be the point. The use of the is fixed by grammar and the grammars of British and American English are pretty much the same. And, while word usage can come and go relatively quickly, grammar changes slowly. So I’m not terribly puzzled by the fact that American and British usage of the track one another fairly well during the 19th Century, nor that their usage of beautiful is different.

Jockers goes on to calculate correlation coefficients. The coefficient for year-to-year fluctuations in usage of the is 0.381, which is not very high, but it’s much higher than the coefficient for usage of beautiful: -0.08. If we aggregate the data by decades, the correlation for the becomes 0.92 while that for beautiful becomes 0.36.
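To get a feel for why aggregating by decade can raise a correlation so sharply, here is a minimal Python sketch. The data are entirely synthetic and are not Jockers’s figures: I assume a shared slow upward drift in both series, a fixed one-point gap between them, and independent year-to-year noise. I also don’t know exactly how Jockers computed his coefficients, so the plain Pearson correlation used here is an assumption.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def decade_means(values):
    """Average consecutive blocks of ten yearly values."""
    return [sum(values[i:i + 10]) / 10 for i in range(0, len(values), 10)]

# Synthetic data (an assumption, NOT Jockers's corpus): a slow drift
# shared by both nations, plus independent noise in each year.
rng = random.Random(0)
years = range(1800, 1900)
trend = [0.050 + 0.010 * (year - 1800) / 99 for year in years]
british = [t + rng.gauss(0, 0.003) for t in trend]
american = [t + 0.010 + rng.gauss(0, 0.003) for t in trend]

yearly_r = pearson(british, american)
decade_r = pearson(decade_means(british), decade_means(american))

print(f"yearly correlation: {yearly_r:.2f}")
print(f"decade correlation: {decade_r:.2f}")
```

Averaging over ten years shrinks the independent noise while leaving the shared drift largely intact, so the decade-level coefficient comes out higher than the yearly one. That is at least consistent with the idea that the two series share a slow common change, as a slowly changing grammar would produce.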

As I’ve said, I don’t find this terribly puzzling. But Jockers notes that this is the only word he found that behaves like this. On the face of it, that is puzzling, but I’d like to know more.

In a footnote Jockers tells us he repeated this exercise for “the ten or so most frequent words and then abandoned the search” (p. 110). I’d like to know what those words were and see his charts. Just how different are they from the?

I’m guessing that they are all function words. If we look at the 450-million-word Corpus of Contemporary American English–obviously a different corpus from what Jockers is using (3346 British and American novels)–the top ten most frequent words certainly are (in order): the, be, and, of, a, in, to (infinitive), have, to (prep.), it. What’s interesting is that the is considerably more frequent than be: 22,038,615 occurrences versus 12,548,525. Is that difference enough to give it greater distributional stability? I have no idea.
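For scale, the gap between those two counts can be worked out in a few lines of Python. The two counts are the ones quoted above; the corpus size is taken as a round 450 million words, which is an approximation, so the shares are rough.

```python
# Counts as quoted above from the Corpus of Contemporary American English.
the_count = 22_038_615
be_count = 12_548_525

# Assumption: the corpus size is rounded to 450 million words.
corpus_size = 450_000_000

ratio = the_count / be_count        # how many times more frequent "the" is
the_share = the_count / corpus_size  # approximate share of all tokens
be_share = be_count / corpus_size

print(f"the / be ratio: {ratio:.2f}")
print(f"share of corpus, the: {the_share:.1%}")
print(f"share of corpus, be:  {be_share:.1%}")
```

On these figures the is roughly one and three-quarters times as frequent as be, and its share of the contemporary corpus is a little under 5 percent, in the same neighborhood as the shares Jockers reports for his 19th Century novels.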

Addendum (8.14.14). Mark Liberman at Language Log has posted a comment on the use of the.

