Tuesday, June 21, 2016

Limitations of the Google Books Corpus for Drawing Inferences about Cultural and Linguistic Evolution

Pechenick EA, Danforth CM, Dodds PS (2015) Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution. PLoS ONE 10(10): e0137041. doi:10.1371/journal.pone.0137041

Abstract: It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800–2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.


The Google Books data set is captivating both for its availability and its incredible size. The first version of the data set, published in 2009, incorporates over 5 million books [1]. These are, in turn, a subset selected for quality of optical character recognition and metadata—e.g., dates of publication—from 15 million digitized books, largely provided by university libraries. These 5 million books contain over half a trillion words, 361 billion of which are in English. Along with separate data sets for American English, British English, and English Fiction, the first version also includes Spanish, French, German, Russian, Chinese, and Hebrew data sets. The second version, published in 2012 [2], contains 8 million books with half a trillion words in English alone, and also includes books in Italian. The contents of the sampled books are split into case-sensitive n-grams, typically blocks of text containing n = 1, …, 5 pieces separated by whitespace—e.g., “I” is a 1-gram, and “I am” is a 2-gram.
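The whitespace-based splitting described above can be sketched in a few lines. This is an illustrative simplification (the function name is ours, and Google's actual extraction pipeline also handles punctuation, sentence boundaries, and part-of-speech annotation), but it captures the basic idea of how a text becomes a stream of case-sensitive n-grams:

```python
def ngrams(text, n):
    """Return all n-grams of a text, splitting on whitespace only."""
    tokens = text.split()  # case is preserved, so "I" and "i" are distinct
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I am here", 1))  # ['I', 'am', 'here']
print(ngrams("I am here", 2))  # ['I am', 'am here']
```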

A central, if subtle and deceptive, feature of the Google Books corpus, as of any corpus composed in a similar fashion, is that it reflects a library in which only one copy of each book is available. Ideally, we would be able to apply different popularity filters to the corpus. For example, we could ask to have n-gram frequencies adjusted according to book sales in the UK, library usage data in the US, or how often each page in each book is read on Amazon’s Kindle service (all over defined periods of time). Evidently, incorporating popularity in any useful fashion would be an extremely difficult undertaking on the part of Google.

We are left with the fact that the Google Books library has ultimately been furnished by the efforts and choices of authors, editors, and publishing houses, who collectively aim to anticipate or dictate what people will read. This adds a further distancing from “true culture” as the ability to predict cultural success is often rendered fundamentally impossible due to social influence processes [3]—we have one seed for each tree but no view of the real forest that will emerge.

We therefore observe that the Google Books corpus encodes only a small-scale kind of popularity: how often n-grams appear in a library with all books given (in principle) equal importance and tied to their year of publication (new editions and reprints allow some books to appear more than once). The corpus is thus more akin to a lexicon for a collection of texts, rather than the collection itself. But problematically, because Google Books n-grams do have frequency of usage associated with them based on this small-scale popularity, the data set readily conveys an illusion of large-scale cultural popularity. An n-gram which declines in usage frequency over time may in fact become more often read by a particular demographic focused on a specific genre of books. For example, “Frodo” first appears in the second Google Books English Fiction corpus in the mid-1950s and declines thereafter in popularity with a few resurgent spikes [4].
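The "small-scale popularity" in question is simply an n-gram's relative frequency: its count in a given year divided by the total n-gram volume of the corpus for that year. A minimal sketch, using hypothetical counts rather than actual Google Books data, shows why raw counts and relative frequencies can tell different stories:

```python
# Hypothetical yearly counts for a single 1-gram, and hypothetical
# total 1-gram volumes for the corpus in the same years.
yearly_counts = {1955: 120, 1960: 150, 1965: 180}
yearly_totals = {1955: 2_000_000, 1960: 4_000_000, 1965: 8_000_000}

# Relative frequency = count / total volume, computed per year.
relative_frequency = {
    year: yearly_counts[year] / yearly_totals[year] for year in yearly_counts
}
# Here the raw count rises every year, yet the relative frequency falls,
# because the corpus itself grows faster than the n-gram's usage.
print(relative_frequency)
```

This is exactly why the exponential growth of the corpus matters: a word can "decline" in frequency terms while being printed, and read, more often in absolute terms.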

While this limitation to small-scale popularity tempers the kinds of conclusions we can draw, the evolution of n-grams within the Google Books corpus—their relative abundance, their growth and decay—still gives us a valuable lens into how language use and culture have changed over time. Our contribution here will be to show:
  1. A principled approach for exploring word and phrase evolution;
  2. How the Google Books corpus is challenged in other respects orthogonal to its library-like nature, particularly by the inclusion of scientific and medical journals; and
  3. How future analyses of the Google Books corpus should be considered.
For ease of comparison with related work, we focus primarily on 1-grams from selected English data sets between the years 1800 and 2000. In this work, we will use the terms “word” and “1-gram” interchangeably for the sake of convenience. The total volume of (non-unique) English 1-grams grows exponentially between these years, as shown in Fig 1, except during major conflicts—e.g., the American Civil War and both World Wars—when the total volume dips substantially. We also observe a slight increase in volume between the first and second version of the unfiltered English data set. Between the two English Fiction data sets, however, the total volume actually decreases considerably, which indicates insufficient filtering was used in producing the first version, and immediately suggests the initial English Fiction data set may not be appropriate for any kind of analysis.

Because of the Google Books corpus’s library-like nature, authors are not represented equally, or by any measure of popularity, in any given data set, but are instead represented roughly in proportion to their own prolificacy. This leaves room for individual authors to have noteworthy effects on the dynamics of the data sets, as we will demonstrate in Section Results and Discussion.

Lastly, due to copyright laws, the public data sets do not include metadata (see supporting online material [1]), and the data are truncated to avoid inference of authorship, which severely limits any analysis of censorship [1, 12] in the corpus. Under these conditions, we will show that much caution must be used when employing these data sets—with the possible exception of the second version of English Fiction—to draw cultural conclusions from the frequencies of words or phrases in the corpus.

We structure the remainder of the paper as follows. In Sec. Methods, we describe how to use Jensen-Shannon divergence to highlight the dynamics over time of both versions of the English and English Fiction data sets, paying particular attention to key contributing words. In Sec. Results and Discussion, we display and discuss examples of these highlights, exploring the extent of the scientific literature bias and issues with individual authors; we also provide a detailed inspection of some example decade–decade comparisons. We offer concluding remarks in Section Concluding Remarks.
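The core quantity in the methods section, Jensen-Shannon divergence, compares two word-frequency distributions (e.g., one per decade) via the entropy of their mixture. A minimal sketch of the standard unweighted form follows; the function names and the two toy "decade" distributions are ours, and the paper's full analysis additionally ranks the individual words contributing most to the divergence:

```python
from math import log2

def shannon_entropy(p):
    """Shannon entropy (in bits) of a distribution given as {word: probability}."""
    return -sum(v * log2(v) for v in p.values() if v > 0)

def jensen_shannon_divergence(p, q):
    """JSD(P, Q) = H(M) - [H(P) + H(Q)] / 2, where M = (P + Q) / 2.

    Symmetric, and bounded between 0 and 1 when using log base 2.
    """
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))

# Hypothetical word-frequency distributions for two decades.
decade_a = {"figure": 0.2, "time": 0.3, "love": 0.5}
decade_b = {"figure": 0.5, "time": 0.3, "love": 0.2}
print(jensen_shannon_divergence(decade_a, decade_b))
```

Unlike Kullback-Leibler divergence, JSD is well defined even when a word appears in only one of the two decades, which is essential here: new words (scientific terms, citation年-like tokens, character names) enter the corpus constantly.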
