Thursday, July 21, 2011

Distribution of Paragraph Lengths, What’s Up?

A couple of days ago I put up a post about the lengths of paragraphs in Conrad’s Heart of Darkness. I sent notice of the post to a number of people, including Cosma Shalizi and Mark Liberman. Mark then put up a post at Language Log where he: 1) reported work of his own on HoD, 2) reported work that Shalizi had done, and 3) reported work he did on Nostromo and The Golden Bowl. There was some discussion at Language Log as well.

I don’t know quite what I think of this. It’s been interesting, but . . .

So, in this post I will: 1) restate my original observation, without the rhetorical frills of my original post, and 2) append two longish comments I made at Language Log. In the first comment I suggest that patterns of paragraphing are to prose (fiction) as, say, verse forms are to poetry. The second comment outlines a pilot study, one that, alas, I do not quite have the resources to carry out myself – though, if I can learn a bit of Python, who knows? I finish with a note on a parallel matter.

Paragraphing in Heart of Darkness

The central matter involves four observations about Heart of Darkness, two quantitative and two qualitative. This post is almost entirely about the quantitative observations, but the qualitative observations provide useful context. A long-term research objective would be, of course, to somehow ‘bridge the gap’ between those two sets of observations.

I’ve been working with a text that I downloaded from Project Gutenberg. In that text Heart of Darkness consists of 198 paragraphs. I counted the number of words in each paragraph using the word-count function in Microsoft Word, loaded the results into a spreadsheet, and made two charts. Think of these charts as being an abstract kind of X-ray image of “the text.” We’re now looking at “internal organs” not otherwise visible.
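For anyone who would rather script the counting than do it by hand, here is a minimal Python sketch of the same two orderings. It is not what I actually did (I used Word and a spreadsheet), and it rests on two assumptions: paragraphs are separated by blank lines, as in a Project Gutenberg plain-text file, and a "word" is any run of characters between whitespace. Word's counter may tally some tokens differently, so the numbers need not match mine exactly. The function name and the sample string are made up for illustration.

```python
import re

def paragraph_lengths(text):
    """Return the word count of each paragraph, in text order."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Assumption: a "word" is any whitespace-delimited token.
    return [len(p.split()) for p in paragraphs if p.strip()]

# Tiny made-up sample standing in for the real text file.
sample = "One two three.\n\nFour five.\n\n\nSix seven eight nine ten.\n"

in_text_order = paragraph_lengths(sample)      # data for the first chart
by_size = sorted(in_text_order, reverse=True)  # data for the second chart

print(in_text_order)
print(by_size)
```

The first list corresponds to the chart in text order, the second to the chart sorted from longest to shortest.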

In this chart the paragraphs are ordered as they occur in the text, the first paragraph at the left and the last at the right:

[Chart: Heart of Darkness paragraph lengths, in text order]

It’s rather spiky, as you would expect. There are a few long paragraphs, there are many short paragraphs, and there are paragraphs in-between. But those different length paragraphs are all mixed together in the text. The result is long paragraphs sticking up out of plains and rolling hills of short and mid-size paragraphs.
 
Notice that the longest paragraph is a bit to the right of center, and that it is flanked by two slightly shorter paragraphs (with valleys in-between). That gives the distribution the overall shape of a pyramid. One would like to know whether that pyramid shape is important or not. If those three paragraphs were, say, only 800 or so words long, you’d still have a spiky shape, but the pyramid would be gone.

In this second chart I sorted the paragraphs in order by length, longest to shortest:

[Chart: Heart of Darkness paragraph lengths, sorted longest to shortest]

That surprised me. I didn’t have any particular expectation, but to see such a relatively smooth curve, with the high left end . . . What is it?

And that’s still the question: What is it? Is it anything at all?

Now for the two qualitative observations. First, that longest paragraph that’s just after the mid-point of the story. It’s almost entirely about Kurtz, the enigmatic darkness at the heart of this story, giving us his background, his hopes, and dark hints about what had happened to him in Africa. That’s the structural center of the story, which I argue at some length in The Heart of Heart of Darkness. Given that structural centrality, I don’t think that the extreme length of this paragraph (1500 words) is an accident.

The second qualitative observation concerns the string of short paragraphs at the far right end of the temporal distribution. That’s a conversation between Marlow and the Intended, with each paragraph being a single conversational turn. That’s the longest conversation in the text, and the only one between a man and a woman. All the other conversations are between men and they’re all internal to a single paragraph. This is important because, at the very beginning of that longest central paragraph, Marlow separates the world of men from the world of women:
"I laid the ghost of his gifts at last with a lie," he began suddenly. "Girl! What? Did I mention a girl? Oh, she is out of it—completely. They—the women, I mean—are out of it—should be out of it. We must help them to stay in that beautiful world of their own, lest ours gets worse. Oh, she had to be out of it. You should have heard the disinterred body of Mr. Kurtz saying, 'My Intended.' You would have perceived directly then how completely she was out of it.
But, as I said above, I want to set those two qualitative observations aside. They are specific to this text. This text is about a man who loses his mind on station in the Congo; but not all texts feature such a character. Nor do all texts end with a discussion between a man and a woman.

But all texts that are divided into paragraphs must, by that fact, have both a distribution of paragraph size by order in the text, such as in my first chart, and a distribution ordered by size, as in the second. The question is: what do those distributions look like and are they worth looking into? That’s what the next two sections of this post are about.

Prose Form

Commenting at Language Log, JL said:
First off, please bear in mind that any narrative is more or less organically made, and while it'll be possible to find all sorts of patterns in it, circles and spirals or what have you, that's going to be a critical superimposition upon what, for the author, is almost certainly an unconscious, or at least, less rule-bound process. Poems, to some degree, lend themselves to this sort of quantitative scrutiny. Novels don't — except bad ones.
That comparison with poetry got me thinking.

Poetry is very much about manipulating the physical substance of language, rhyme and meter, and scads of other sound patterns, many of which have Greek names, etc. And we’ve got scads of verse forms, which are listed in handbooks, etc. What’s the parallel phenomenon for prose fiction? Where are the lists of ways and forms of language manipulation in prose fiction?

They don’t exist. We distinguish between novels, novellas and short stories. And we talk about style, and analyze it in various ways, including statistics – statistical stylistics is a fairly well-developed discipline. But we don’t have lists of devices and forms. Maybe, as JL pretty much said, they don’t exist.

And maybe we just haven’t known how to look for them.

What I’m thinking is that those patterns of paragraph-length distribution are to prose fiction what patterns of, say, line length and rhyme are to poetry. It’s the basic physical stuff the writer is manipulating in the course of creating patterns of verbal meaning.

So, in Heart of Darkness we have one pattern, by which I mean BOTH the distribution by size and the distribution by temporal order. Nostromo exhibits a different pattern from HoD; it has a similar size distribution but a different time distribution. The Golden Bowl has still another pattern. How many such patterns are there? What are they like?

Further, it’s clear to me that each chapter needs to be examined individually. The first chapter is the only one that starts from nothing; the last chapter is the only one that ends in nothing. The inner chapters all have to pick up a story in progress, move it forward, and then leave it unfinished. Does that yield different patterns of paragraph length? Don’t know. Have to check.
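A chapter-by-chapter check might be sketched as follows. This is only a rough illustration. It assumes a chapter heading is a line holding nothing but a Roman numeral, which is a guess about how a given plain-text edition is laid out and would need adjusting for each text. The helper names, the heading pattern, and the sample string are all hypothetical.

```python
import re

def paragraph_lengths(text):
    """Word count of each paragraph, assuming blank-line separation."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [len(p.split()) for p in paragraphs if p.strip()]

def lengths_by_chapter(text, heading=r"^[IVXLC]+\.?[ \t]*$"):
    """Return one list of paragraph lengths per chapter.

    Hypothetical heading pattern: a line containing only a Roman
    numeral (optionally followed by a period). Anything before the
    first heading, such as a title, comes back as its own chunk.
    """
    chunks = re.split(heading, text, flags=re.MULTILINE)
    return [paragraph_lengths(c) for c in chunks if c.strip()]

# Made-up two-chapter sample.
sample = "I\n\nFirst para here.\n\nSecond.\n\nII\n\nThird para of words.\n"
per_chapter = lengths_by_chapter(sample)
print(per_chapter)
```

Each inner list could then be charted on its own, in text order and sorted by size, to see whether first, inner, and last chapters differ.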

And so forth.

Pilot Study

If we’re looking at anything at all, I don’t think it would be like Zipf’s Law, which characterizes a distribution you get for every text. I think we’re dealing with a variety of patterns. After all, we’ve already done three texts and we have three patterns. But they’re not wildly different. They’re all recognizably – by Liberman's method of interocular trauma – in the same universe.

So, let’s say we were to do a pilot study of, say, 30 texts. How should we choose our texts?

The idea is to sample the space so we can figure out if we should make a serious commitment and do, say, 1000 or 10,000 texts. We’d also like to learn something that would be useful in doing that larger study.

We’ve already begun work on three texts, Heart of Darkness, Nostromo, and The Golden Bowl. We need 27 more. We could do a random draw from, say, Project Gutenberg’s list. But I think there’s a more useful way to “sample the space” of possible patterns. Here’s what I’d do:

1) Eight more Joseph Conrad texts.

2) Use The Golden Bowl and pick nine more texts from 1900 plus or minus 25 years. Let’s get a mix of American and British, male and female, high art and popular.

3) Five chosen because they’d be interesting and “different” from the above, say: The King James Bible, Tristram Shandy, Pride and Prejudice, Ulysses, Naked Lunch.

4) Five at random.

The first set will give us a feel for the range of patterns in a set of texts that are “constrained” to the capabilities of a single author. The second set gives us a selection of distinctly different authors that are, however, from the same time period as the first. The last two sets go wider still.

Words, Another Discussion

Written texts consist of a string of characters. Spaces divide the string into individual words. Punctuation organizes words into sentences. And paragraphing groups sentences together into larger structural units, a “single thought” as the textbooks say.

Well, back in the old days, they didn’t even divide written texts into separate words, let alone into sentences and paragraphs. John Holbo looks at word breaks in a post at Crooked Timber: Thoughts About Sight Reading, and Inner Listeners.
