Thursday, July 18, 2013

Corpus Linguistics for the Humanist: Notes of an Old Hand on Encountering New Tech

I've published another working paper, this one on "digital humanities." Specifically, corpus linguistics. Here's the abstract and the introduction:

Abstract: Corpus linguistics offers literary scholars a way of investigating large bodies of texts, but these tools require new modes of thinking. Literary scholars will have to recover a kind of interest in linguistics that was lost when the discipline abandoned philology. Scholars will need to think statistically and will have to start thinking about cultural evolution in all but Darwinian terms. This working paper develops these ideas in the context of a topic analysis of PMLA undertaken by Ted Underwood and Andrew Goldstone.

Introduction: Theory in a Digital Age

In reading about so-called “digital humanities” over the last year or two I kept coming up against the question: What about Theory? In the context of academic humanities over the last three or four decades the term “theory” does not mean quite what it means elsewhere. In particular, in literary studies it does not mean what “literary theory” once meant: a body of theory about literature, its texts, writers, readers, history, and social influence. Rather, Theory (often capitalized) is a body of techniques for interpreting texts, for explicating their meanings in terms of some body of thinking about the mind, society, history, or general comport of the cosmos.

“What about Theory?”, then, is a plea to link these new techniques to older concerns, to the concerns of ethical criticism, as broadly construed by Wayne Booth in The Company We Keep: An Ethics of Fiction. Ethical criticism is a worthy, indeed a necessary enterprise, but it is not the only worthy enterprise one can imagine. It is not the only form of knowing.

And it is not one that follows naturally from these new computer-enabled modes of inquiry. Hence the question: What about Theory? In the short term my answer is: Forget about it! There’s nothing there at the moment. Perhaps later on, but not now.

For now there is only naturalist criticism, the attempt to understand literary texts and processes as entities and processes occurring in the world on a footing with sticks, planets, fungi, monsoons, and lemurs. That undertaking makes theoretical demands of its own, demands we’ve only begun to glimpse.

This series of reflections is a response to work in topic analysis undertaken by Ted Underwood and Andrew Goldstone, which is discussed in the first and third sections (save the appendix). The technique is descriptive and statistical. It is based on sophisticated statistical methods, about which there is an extensive body of statistical theory. One need not have a deep understanding of that theory in order to employ the technique, but one must have fairly sophisticated statistical intuitions and one must be willing to trust those who are expert in the statistics.

The technique is descriptive in the sense that you apply the computational techniques to a body of texts and you get your results. Well, it’s not quite that simple, you get to play around a bit, but, in the end, the results are the results: a list of topics in the texts. The list is derived from a statistical analysis of the texts themselves, and is not imposed or inferred by the analyst. It is a description of something that exists in a body of texts.

But it is a strange sort of description, one not easily and intelligibly characterized here. You must already know what the technique does in order for a short description to make sense. Still, let us say that a topic is a list of words that seem to occur together in texts. Just what such a topic is about, just why those words occur together, that’s up to the analyst to interpret. The analyst can, for convenience, assign a name to each topic. But that’s all the name is, a convenient handle.
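For readers who like to see such things in miniature, here is a rough sketch of the bare idea that a topic is a list of words that tend to occur together. It is emphatically not the topic analysis Underwood and Goldstone carried out, which rests on far more sophisticated statistics; it merely counts which words share a document with a chosen seed word. The four-sentence corpus, the stopword list, and the function names (cooccurrence_counts, words_near) are all invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration only (an assumption, not real data).
DOCS = [
    "the poem uses meter and rhyme",
    "rhyme and meter shape the poem",
    "the novel has plot and character",
    "character and plot drive the novel",
]

# Very common words we ignore; this short list is also an assumption.
STOPWORDS = {"the", "and", "has", "uses"}


def cooccurrence_counts(docs):
    """Count, for each pair of words, how many documents contain both."""
    counts = Counter()
    for doc in docs:
        # Each distinct word counts once per document; sorting makes
        # every pair appear in a single canonical order.
        words = sorted(set(doc.split()) - STOPWORDS)
        counts.update(combinations(words, 2))
    return counts


def words_near(seed, docs, top_n=3):
    """Return the words that most often share a document with `seed`."""
    scores = Counter()
    for (a, b), n in cooccurrence_counts(docs).items():
        if a == seed:
            scores[b] += n
        elif b == seed:
            scores[a] += n
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(words_near("poem", DOCS, top_n=2))
    print(words_near("novel", DOCS, top_n=2))
```

Note that nothing in the code says what either list is "about." Calling one of them "verse" and the other "fiction" would be the analyst's doing, a convenient handle and no more.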

Understanding just what these topics are, that’s going to require some theorizing, but of a new kind. Lacan, Foucault, Derrida and the rest won’t help here. These topics are indicators of currents, of whirls and eddies, in a society’s collective mind as it evolves over decades and longer. But what is that, a collective mind? The notion has been around in one form or another for quite some time. But now we have to explicate it in terms that are commensurate with these statistical techniques. That’s a theoretical enterprise and, I fear, a deep and complex one. It would be a shame to fail in the undertaking because we insist on old priorities first.

The complete working paper is available at my SSRN site and includes the following posts:


  1. You are the coolest neighbor! Did you know there are several lexicographers in JC?

  2. I didn't know that. But I'm not surprised. I know there's a linguist, John McWhorter, who appears frequently on