Google has just released an interesting dataset. Geoff Nunberg describes it at Language Log:
Culled from the Google Books collection, it contains more than 5 million books published between 1800 and 2000 — at a rough estimate, 4 percent of all the books ever published — of which two-thirds are in English and the others distributed among French, German, Spanish, Chinese, Russian, and Hebrew. (The English corpus alone contains some 360 billion words, dwarfing better structured data collections like the corpora of historical and contemporary American English at BYU, which top out at a paltry 400 million words each.)
It is, he says, “the largest corpus ever assembled for humanities and social science research.” The New York Times has reported on it and there’s an article in Science based on it.
You can also play around with it online with the Google Books Ngram Viewer. You enter individual words or phrases (up to five words long) and Google graphs their frequency over time. I’ve spent a little time playing around with it.
In particular, I’m interested in the proper noun, “Xanadu.” As you may know, it’s the name of Kubla Khan’s summer capital and is also the second word in Coleridge’s most famous poem, “Kubla Khan.” Several years ago I did a Google search on “Xanadu” and was surprised to come up with over two million hits. How’d that happen? I wondered.
I ended up writing a long post on The Valve, which generated an interesting discussion, and then distilling that down into a tech report. You can download the report here (One Candle, a Thousand Points of Light: The Xanadu Meme). Here’s the abstract:
I treat a single word 'xanadu', as a 'meme' and follow it from a 17th century book, to a 19th century poem (Coleridge's "Kubla Khan"), into the 20th century where it was picked up by a classic movie ("Citizen Kane"), an ongoing software development project (Ted Nelson's Project Xanadu), and another movie and hit song, Olivia Newton-John's Xanadu. The aggregate result can be seen when you google the word, you get 6 million hits. What is interesting about those hits is that, while some of them are directly related to Coleridge's poem, more seem to be related to Nelson's software project, Olivia Newton-John's film and song, and (indirectly) to Welles' movie. Thus one cluster of Xanadu sites is high tech while another is about luxury and excess (and then there's the Manchester Swingers Club Xanadu).
In doing this work I had a problem getting historical data. The web hits are all, of course, indicators of current usage, namely on pages on the web. But how did that current usage evolve from past usage? I checked the Oxford English Dictionary, which gave examples of usage from Coleridge’s own source, a 17th century travelogue, to the present. But those are only isolated examples, without any sense of frequency. I also searched The New York Times archives from 1851 through 1980. Here’s what I found (from the tech report):
I got 443 hits; only 3 of them were in 1900 or earlier. I checked only the first one, an article on child labor from 1870 and was unable to find “xanadu” in the article itself and so have no idea why it turned up in the search. I checked a few later articles, including a review of Livingston Lowes’s The Road to Xanadu (1927) and one of a Coleridge biography. There were a number of articles about yachting and ocean racing in the 1930s and into the 1940s. I checked two of them. One mentions a yacht named “X Anadu” and another mentions one named “Xanadu.” While the first is probably a typo, I would not automatically assume so; yachts are often oddly named. The point is that entries are sparse between 1900 and the 1940s. The New York Times review for Citizen Kane, however, was 77th in the list of 443 hits over the period between 1851 and 1980. Thus over 80% of occurrences of “Xanadu” are dated after the premiere of Citizen Kane.
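For anyone who wants to check the "over 80%" figure, here is a quick sketch of the arithmetic in Python. It assumes the 443 hits are listed in chronological order, so that everything after the 77th hit postdates the Citizen Kane review; the variable names are mine, not anything from the archive search.

```python
# Figures taken from the passage above: 443 hits in total,
# with the Citizen Kane review appearing 77th in the list.
total_hits = 443
kane_review_rank = 77

# Assumption: the list is chronological, so every hit ranked
# after the review is dated after the film's premiere.
hits_after = total_hits - kane_review_rank
share_after = hits_after / total_hits

print(f"{hits_after} of {total_hits} hits ({share_after:.1%}) come after the review")
```

Note that this counts search hits (articles), not individual uses of the word, so it is only a rough proxy for how often “Xanadu” appeared.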
Of course, The New York Times is only one source and can hardly be considered as representative of usage in the English-speaking world. But it’s what I could get at the time.
Google’s new toy affords me a more complete look, though it only tracks occurrences in books, not in periodicals. So I graphed “Xanadu” from 1800 to 2008 (alas, you have to click on the link and open another window). There you see a gradual rise in usage from the late 1830s (“Kubla Khan” was first published in 1816) up through 1920. Then there’s a sharp rise in the late 1920s that peaks around 1934 (see this graph, 1910-1950), well before Citizen Kane. What’s that about? Here’s a listing of books published in 1934, a number of which mention The Road to Xanadu. I’ve not explored the whole list (there seem to be over 300).
So, at least in the world of books there was a fair amount of interest in “Xanadu” before Citizen Kane. What’s equally interesting is that 1934 remains the peak up to the present. Though here we’ve got to be careful. The graph shows percentages, not absolute numbers. So “Xanadu” is relatively more frequent in 1934 than in any year before or since. But it may well have been used more often, in absolute terms, in subsequent years.
We don’t know.
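To see how a word can peak in relative terms and still be used more often later on, here is a minimal sketch in Python. The counts are hypothetical, invented purely for illustration; they are not taken from the Google Books data, and the names `corpus` and `relative_frequency` are my own.

```python
# HYPOTHETICAL numbers, made up for illustration only.
# "xanadu" is the number of occurrences of the word in books from that year;
# "total_words" is the size of that year's slice of the corpus.
corpus = {
    1934: {"xanadu": 300, "total_words": 1_000_000_000},
    1980: {"xanadu": 600, "total_words": 4_000_000_000},
}

def relative_frequency(year):
    """Share of all words printed in `year` that are 'xanadu'."""
    entry = corpus[year]
    return entry["xanadu"] / entry["total_words"]

for year in sorted(corpus):
    count = corpus[year]["xanadu"]
    print(f"{year}: {count} occurrences, relative frequency {relative_frequency(year):.2e}")
```

In this made-up case the word appears twice as often in 1980 as in 1934, yet its relative frequency is only half as large, because the number of words published grew faster than the number of mentions. A graph of percentages alone cannot tell you which situation you are in.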
That is how I feel about this new tool: it’s interesting, but we don’t know. I’m sure it will tell us something, but just what . . . .