Tuesday, April 26, 2016

From Telling to Showing, by the Numbers

I've been thinking about some remarks Moretti made about the digital humanities in a recent interview. Among other things he suggested that the results of computational criticism have so far been disappointing. But he also held up Lit Lab Pamphlet #4 as an example of "an intelligence that takes the form of writing a script, but in the writing of the script there is also the beginning of a concept, very often not expressed as a concept, but that you can see that it was there from the results that the coding produces." Here's what I wrote about that pamphlet back in October of 2012.
I’ve just looked at a pamphlet from Stanford’s Literary Lab: Ryan Heuser and Long Le-Khac, A Quantitative Literary History Of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method (68 page PDF), May 2012. I’ve not read it in detail, but only blitzed my way through, looking for the good parts. Well, not even all of those. I was just looking to get a sense of what’s going on.

Which I did. And I like it. THIS is the sort of work I want to see from ‘digital humanities.’ Not the only sort, but it’s one of the things we can do with ‘big data’ and pretty much only do with big data. If traditional humanists can’t see value in this kind of work, well, then forget about them.

First I’ll give you the abstract, then I’ll quote a bunch and make some comments.

Authors’ abstract
The nineteenth century in Britain saw tumultuous changes that reshaped the fabric of society and altered the course of modernization. It also saw the rise of the novel to the height of its cultural power as the most important literary form of the period. This paper reports on a long-term experiment in tracing such macroscopic changes in the novel during this crucial period. Specifically, we present findings on two interrelated transformations in novelistic language that reveal a systemic concretization in language and fundamental change in the social spaces of the novel. We show how these shifts have consequences for setting, characterization, and narration as well as implications for the responsiveness of the novel to the dramatic changes in British society.

This paper has a second strand as well. This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project.
Looking for Patterns

So: Heuser and Le-Khac have a corpus of digital texts consisting of some 3000 British novels spanning the 19th Century. By doing this, that, and the other (let’s just call it correlation analysis, see pp. 6 ff.), they identify two semantic cohorts consisting of words whose usage seems to be inversely related as the century goes along. One set of words is most frequent early in the century while the other set is most frequent near the end.

One cohort is abstract and has to do with character and ethics, moral evaluation. Here’s a bunch of them (p.13):
integrity, modesty, sensibility, reason, talent, conduct, elegant, ostentation, partiality, friendship, accomplishment, character, persevere, vanity, forbear, benevolence, assiduity, understanding, extravagance, zeal, delicacy, firmness, envy, reluctance, excellence, vexation, esteem, virtue, prejudice, unrelenting, accomplish, sincere, nobility, taste, sedulous, admiration, sentiment, rational, brilliancy, falsehood, prudent, excess, superiority, unworthy, malignant, sensible, genius, reflection, pleasure, dignify, artifice, happiness, indolence, principle, discernment, coldness, self-denial, depravity, indulge, infamy, malice, faultless, adherence, perseverance, profligate, aversion, penetration, solicitous, despise, indulgence, ardent, candour, softness, restraint, impatience, insensibility
These words were important early in the century, with many occurrences, but they declined over the course of the century.

Then, after a bit of trial and error, they went looking for words likely to co-occur along with the word “hard”—hence they called them the “hard seed” words (they started one of their algorithms by seeding it with specific words to see what they would correlate with). They found a bunch of them. This set included the following (p. 20):
  • Action verbs: “come,” “go,” “drop,” “stand,” “touch,” “see”…
  • Body parts: “finger,” “face,” “hair,” “chin,” “hand,” “fist” …
  • Colors: “red,” “white,” “blue,” “green,” “brown,” “scarlet” …
  • Numbers: “three,” “five,” “two,” “seven,” “eight,” “four” …
  • Locative and directional adjectives and prepositions: “down,” “out,” “back,” “up,” “over,” “above” …
  • Physical adjectives: “hard,” “rough,” “flat,” “round,” “clear,” “sharp” …
Over the course of the century these words showed the opposite behavior from the abstract words: they started low and finished high.
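
To make the seeding idea concrete, here is a minimal sketch of how one might grow a cohort from a seed word by correlating frequency trajectories. It is not Heuser and Le-Khac's actual procedure, which the pamphlet describes in detail; the function names, the 0.9 threshold, and the per-decade numbers are all assumptions invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a flat trajectory has no defined correlation
    return cov / (sx * sy)


def seed_cohort(freqs, seed, threshold=0.9):
    """Return words whose trajectory correlates with the seed's at or above threshold."""
    base = freqs[seed]
    return sorted(
        word
        for word, series in freqs.items()
        if word != seed and pearson(base, series) >= threshold
    )


# Toy relative frequencies per decade. These numbers are made up,
# NOT taken from the pamphlet; only the direction of the trends
# mirrors what the authors report.
toy_freqs = {
    "hard": [1.0, 1.2, 1.5, 1.9, 2.4],
    "rough": [0.5, 0.6, 0.8, 1.0, 1.3],
    "finger": [0.8, 0.9, 1.2, 1.5, 1.8],
    "virtue": [3.0, 2.5, 2.0, 1.4, 1.0],
    "esteem": [2.0, 1.8, 1.3, 1.0, 0.6],
}

cohort = seed_cohort(toy_freqs, "hard")
print(cohort)
```

On this toy data the rising words cluster with “hard” while the declining abstract words correlate negatively with it and drop out. A real run would need actual frequency counts over the whole corpus and some care about what counts as a word.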

Here’s a general characterization of the differences between the two sets of words (p. 27):
As we did with the abstract values fields, we looked closely at the shared characteristics of the “hard seed” words. The comparison with the abstract values words was particularly revealing. As opposed to abstractions, the “hard seed” words are concrete and physical—“wet,” “stiff,” “crack,” “knock,” “jaw,” “neck,” etc. They are also specific, words used to specify the particular action (“stoop,” “scratch,” “tilt,” “crawl”…), physical orientation (“over,” “under,” “behind”…), physical quality (“heavy,” “wooden,” “crooked”…), color (“yellow,” “purple”, “orange,” “ruddy”…), or quantity (“ten,” “sixty,” “hundred,” “thousand”…) of an object or person. Where the abstract values words were evaluative and highly polarized, these words are non-judgmental, too rooted in the physical to refer in any direct way to abstract norms, values, and standards. And where the abstract values words were long and Latinate, these are short, often monosyllabic, and predominantly Anglo-Saxon in origin. In the context of the novel, the “hard seed” word cohort can be collectively characterized as concrete description words of a direct, everyday kind. It is these kinds of words that are rising significantly in usage over the nineteenth century.
In order to corroborate these results, Heuser and Le-Khac went through their 3000-volume corpus with a different technique (topic modeling) to see what it would come up with. Would the results be consistent with or at variance with their initial correlation analysis?

They were consistent. “What’s important here is that these same dramatic trends were found by entirely independent methods, confirming that our results are not an anomalous product of our methods but a real historical transformation in the nineteenth-century British novel” (p. 29). Reading on:

The abstract values fields at their height account for about 1% of all word usage in nineteenth-century British novels; the “hard seed” fields, almost 6%. These are large-scale, diffuse trends, encompassing the histories of hundreds and hundreds of words. Recognizing the scale of these changes made us all the more eager to probe into the data. What might these changes mean?

From Telling to Showing

Roughly put, what seems to be going on, as the title of my post suggests, is a shift from telling to showing. Early in the century novelists would tell you about a person’s character by using abstract words. Toward the end of the century they’d show their character through concrete description. There’s an interesting argument about a shift in social space (from intimate to public) that I won’t go into, but I’ll give two specific examples from the pamphlet.

The first example is from Dickens’ Great Expectations (serialized in 1860-61). The words in bold face are from the ‘hard seed’ cohort:
Casting my eyes on Mr. Wemmick as we went along, to see what he was like in the light of day, I found him to be a dry man, rather short in stature, with a square wooden face, whose expression seemed to have been imperfectly chipped out with a dull-edged chisel. There were some marks in it that might have been dimples, if the material had been softer and the instrument finer, but which, as it was, were only dints. The chisel had made three or four of these attempts at embellishment over his nose, but had given them up without an effort to smooth them off. I judged him to be a bachelor from the frayed condition of his linen, and he appeared to have sustained a good many bereavements; for, he wore at least four mourning rings, besides a brooch representing a lady and a weeping willow at a tomb with an urn on it. I noticed, too, that several rings and seals hung at his watch-chain, as if he were quite laden with remembrances of departed friends. He had glittering eyes - small, keen, and black - and thin wide mottled lips. He had had them, to the best of my belief, from forty to fifty years.
Heuser and Le-Khac emphasize “the density of physical description” and the way “character emerges from this tableau of physical details” (p. 42).

They contrast Dickens with a passage from Jane Austen, Pride and Prejudice (1813):
Mr. Collins was not a sensible man, and the deficiency of nature had been but little assisted by education or society; the greatest part of his life having been spent under the guidance of an illiterate and miserly father; and though he belonged to one of the universities, he had merely kept the necessary terms, without forming at it any useful acquaintance. The subjection in which his father had brought him up, had given him originally great humility of manner, but it was now a good deal counteracted by the self-conceit of a weak head, living in retirement, and the consequential feelings of early and unexpected prosperity. A fortunate chance had recommended him to Lady Catherine de Bourgh when the living of Hunsford was vacant; and the respect which he felt for her high rank, and his veneration for her as his patroness, mingling with a very good opinion of himself, of his authority as a clergyman, and his rights as a rector, made him altogether a mixture of pride and obsequiousness, self-importance and humility. (Italics added by Heuser and Le-Khac.)
Heuser and Le-Khac: “There are no physical details to speak of and they aren’t necessary because there’s no need for perception or inference” (p. 43).

Comparing the two passages (p. 43):
The contrast between the two descriptions is extreme. From Collins’s description to Wemmick’s, there is a complete disappearance of abstract values words and an increase in the frequency of hard seed words of almost 600%. Setting the two side by side lets us see our data trends more tangibly on the page as a change in the very linguistic texture of these novels. But it also lets us see clearly that this is more than a change in word choice, it’s a change in representation. Where characterization in Pride and Prejudice is definitive, direct, and evaluative, in Great Expectations it is ambiguous and inferential; not only are the character traits implicit behind surfaces of physical detail, but the valuation of those traits is at a further remove from clarity. The passage from Great Expectations, the canonical city novel richest in hard seed words, then exemplifies a mode of characterization that presumes no direct access to character, a mode characteristic of the less knowable community of urban social spaces.
That contrast, more or less, is what Heuser and Le-Khac discovered in their corpus. They have further discussion of a general nature, including a discussion of why that shift might have occurred (here’s where social space comes in, with the population shift from the country to the city), but that’s the gist of their finding.

Given the two passages from Austen and Dickens, anyone can note the differences that they have noted. But no one’s going to read 3000 novels looking for those specific differences. For that you want machine help.
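
As a rough illustration of what that machine help amounts to at the level of a single passage, here is a small sketch that measures what share of a passage's words belong to a cohort and computes the percent change between two such shares. The tokenizer, the function names, and the two-word cohort are my own simplifications (though “wooden” and “face” both appear in the hard seed lists quoted above); the authors' pipeline is surely more careful, for instance about reducing inflected forms to a base form, which this sketch does not attempt.

```python
import re


def cohort_share(text, cohort):
    """Fraction of the tokens in `text` that belong to `cohort`."""
    # Crude tokenizer: lowercase runs of letters, keeping internal hyphens
    # so that "dull-edged" counts as one token. No lemmatization.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in cohort)
    return hits / len(tokens)


def percent_change(old, new):
    """Percent increase (or decrease, if negative) from `old` to `new`."""
    if old == 0:
        raise ValueError("percent change from zero is undefined")
    return (new - old) / old * 100.0


# Hypothetical two-word subset of the hard seed cohort, for illustration only.
hard_seed_subset = {"wooden", "face"}

snippet = "a dry man, rather short in stature, with a square wooden face"
share = cohort_share(snippet, hard_seed_subset)
print(f"{share:.3f}")
```

Run over every passage in 3000 novels with the full cohorts, this kind of count is what lets a figure like “almost 600%” be stated at all; the reading that makes sense of the figure still has to be done by a person.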

Exploring the Corpus

But, of course, Heuser and Le-Khac didn’t start their work with the distinction between showing and telling in hand. Rather, they found it in their corpus of novels using methods of computational exploration. Once they’d found something with computer methods, they then engaged in some good old-fashioned reading in order to make sense of what they’d found computationally.

And I want to emphasize “exploration.” They didn’t start out by following a prepared recipe. They went looking for things without any specific goal in mind. Now that they HAVE discovered something, well, others can do what they’ve done. But, for the foreseeable future, humanists will have to devote plenty of effort to such exploration.

