Monday, April 27, 2015

On the Direction of Cultural Evolution: Lessons from the 19th Century Anglophone Novel

I've got another working paper available (title above):

Most of the material in this document was in an earlier working paper, Cultural Evolution: Literary History, Popular Music, Cultural Beings, Temporality, and the Mesh, which also has a great deal of material that isn’t in this paper. I’ve created this version so that I can focus on the issue of directionality and so I’ve dropped all the material that didn’t related to that issue. The last section, The Universe and Time, is new, as is this introduction.

* * * * *

Abstract: Matthew Jockers has analyzed a corpus of 19th century American and British novels (Macroanalysis 2013). Using standard techniques from natural language processing (NLP) Jockers created a 600-dimensional design space for a corpus of 3300 novels. There is no temporal information in that space, but when the novels are grouped according to close similarity that grouping generates a diagonal through the space that, upon inspection, is aligned with the direction of time. That implies that the process that created those novels is a directional one. Certain (kinds of) novels are necessarily earlier than others because that is how the causal mechanism (whatever they are) work. This result has implications for our understanding of cultural evolution in general and of the relationship between cultural evolution and biological evolution.

1. Introduction: Direction in Design Space, Telos? 2
2. The Direction of Cultural Evolution: The Child is Father or the Man 6
3. Nineteenth Century English-Language Novels 9
4. Macroanalysis: Styles 10
5. Macroanalysis: Themes 13
6. Influence and Large Scale Direction 15
7. The 19th Century Anglophone Novel 18
8. Why Did Jockers Get That Result? 20
9. What Remains to be Done? 21
10. Literary History, Temporal Orders, and Many Worlds 22
11. The Universe and Time 30

Introduction: Evolving Along a Direction in Design Space

In 2013 Matthew Jockers published Macroanalysis: Digital Methods & Literary History (2013). I devoted considerable blogging effort to it 2014, including most, but not all, of the material in this working paper. In Jockers’ final study he operationalized the idea of influence by calculating the similarity between each pair of texts in his corpus of roughly 3300 19th century English-language novels. The rationale is obvious enough: If novelist K was influenced by novelist F, then you would expect her novels to resemble those of F more than those of C, who K had never even read.

Jockers examined this data by creating a directed graph in which each text was represented by a node and each text (node) was connected only to those texts to which it had a high degree of resemblance. This is the resulting graph:


It is, alas, almost impossible to read this graph as represented here. But Jockers, of course, had interactive access to it and to all the data and calculations behind it. What is particularly interesting, though, is that the graph lays out the novels more or less in chronological order, from left to right (notice the coloring of the graph), though there was no temporal information in the underlying data. Much of the material in the rest of this working paper deals with that most interesting result (in particular, sections 2, 6, 7, 8, and 10).

What I want to do here is, first of all, reframe my treatment of Jockers’ analysis in terms of something we might call a design space (a phrase I take from Dan Dennett, though I believe it is a common one in certain intellectual circles). Then I emphasize the broader metaphysical implications of Jockers’ analysis.

A Design Space for Novels

How did Jockers calculate the similarity between his texts? Similarity with respect to what? What traits did he identify and how did he compare them?

To take a trivial example, he might have been interested in similarity with respect to length. Thus Jockers might have discovered that his shortest text was, say, 55,389 words long and his longest was 259,231 words long (I’m just making the numbers up). Then he’d have to pick some threshold value, say, 4000: Texts that differed by no more than 4000 words in length are considered to be similar; those differing by 4001 words or more are not similar. We can quibble over whether 4000 words is the right threshold, but we’ve got to pick some number, and so forth. Similarity with respect to word count seems trivial – and Jockers didn’t look at it – but I use it as an example just to make that point that we can’t just assert that two texts are similar. We have to have a procedure for making the determination.

What was Jockers’ procedure? What Jockers did, in effect, was to create a design space and then locate each text at a point in that design space. Once that has been done it is a relatively straightforward – if computationally intensive – matter to calculate the distance between each pair of points (that is, texts). The graph above is a two-dimensional projection of that design space.

But what is a design space? The idea is that we are dealing with objects, novels, that have been deliberately designed. Houses are deliberately designed and consist of things such as walls, floors, ceilings, windows, doors, plumbing, and so forth. A design space for houses would somehow characterize all the options available to designers of houses so that each individual house could be located at a point in this design space. Houses that are like one another would be close together in this design space.

You do not have to think about this very long to realize that, Hey! that’s crazy! Yes it is. And yet we want to do something like it for novels. What do novels consist of? Characters and actions, actions and desires, plot elements, words, sentences, paragraphs and chapters, and so forth. How do we represent that in some kind of space? And since we actually want to measure distance in this space, it has to be a real space, albeit an abstract one, not merely a spatial metaphor.

Rather than dive into deep narratology and come up with formal descriptive account of all the elements possible with novels Jockers took an empirical approach and created his design space by using statistical measures that reliably discriminate between novels with respect to various features. Jockers’ design space has roughly 600 dimensions, each representing a textual feature. Roughly 150 of those features were stylistic features ultimately based on a tradition of computational stylistics going back to the 1960s (see section 4: Macroanalysis: Styles). The other 450 features are semantic or thematic features and were derived through a relatively recent computational learning technique called topic modeling (see section 5: Macroanalysis: Themes). For the moment let us not worry about what these features are – you can read more about them later in this document and, ultimately, in Jockers’ book. Let us just take their existence for granted.

I want you to think about the fact that there are roughly 600 features being used to characterize some 3300 texts. We’ve got a design space of 600 dimensions. Is it adequate for those 3300 texts? Who knows? We’ve never before had a conceptual tool like this. We’re going to have to work with it for awhile – I’d think at least an intellectual generation – to get a good feel for what we can do with it. So let’s assume that the characterization of each text on 600 features is not a trivial matter, that the differences between texts with respect to these 600 features reflect meaningful differences in how people read, experience, and use these texts.

Direction in the Design Space for Novels

Look back at that graph. It is a two-dimensional representation of information that exists in a 600 dimensional design space. So lots of available information has been lost. What’s interesting, though, is that the left to right dimension of the graph corresponds to time, though there is no temporal information in any of those 600 features represented in the graph. There’s nothing about the graph itself that says that’s time; visual inspection will not reveal that.

Rather, Jockers discovered it by looking at the texts represented by those points and noticing that as he moved from left to right, he moved from the early to the late 19th century. I argue in sections 6 (Influence and Large Scale Direction) and 8 (Why Did Jockers Get That Result?) that that direction is, in effect, a diagonal through this design space and that it implies that the process that produced these texts is a directional process.

What is that process? It involves millions of people interacting with one another over the course of the 19th century. How do they interact? Through novels. They read them and discuss them, and a small number of them write and publish them. That’s a process we’re just beginning to understand.

That process, the reading and production of novels, IS directional on a scale of decades. What is the direction? I don’t know. It’s just some diagonal in a 600 dimensional design space. There is, as we know, a tradition of though that regards (Western) history as progressing. Is that what that direction is, progress? But that’s just importing a concept into the analytic apparatus from the outside. Maybe it’s progress, maybe it isn’t. But if you want to call it that, you need to provide an explicit argument that links the idea of progress to those 600 features.

No, all I’m asserting is that the process is directional and that the direction is defined with respect to a design space is that is derived from the texts themselves. That design space was developed to represent objects that were created through a complex process than evolved over the course of a century (and that extends before and after that interval). While none of the individual axes (dimensions) of the space are aligned with time, the similarity diagonal is. I think there is something (potentially very) deep there, something about the nature of that historical process, but I don’t know how to talk about that, though section 2 of this paper is a gesture in that direction.

Time and Phenomena

The last two sections of this paper are about time. I wrote section 10 (Literary History, Temporal Orders, and Many Worlds) last year when I was working on Macroanalysis, while section 11 (The Universe and Time) is more recent (a week or two ago). These two section provide a larger metaphysical context for my treatment of cultural evolution (through the example Jockers has provided).

For reasons I sketch out in those sections, I don’t believe we can treat time as an empty framework in which events happen. In some sense, time arises from events themselves. Thus the temporal order of cultural evolution is a different order of phenomena from the temporal order of biological evolution, which is in turn a different order of phenomena from that of the inanimate world. While the phenomena of human culture must be consistent with those of biology and of the inanimate world, they cannot be reduced to those phenomena.

Those two sections give some of my reasons for believing that. Those ideas are, of course, tentative. The real trick is to create a strong conceptual bridge between those metaphysical ideas and the evidence Jockers has provided us in Macroanalysis. When we’ve got that we’ll be on our way to understanding cultural evolution.

* * * * *

Let me remind you that this really is a working paper. There are redundancies in the text and, of course, things that could be spelled our more fully. I say this, not to make excuses, but simply to note that I am aware of the unpolished state of the text. I have been doing this long enough to know that, at this point however, the best thing for me to do is set this project aside for a while. I need to do some more thinking and messing around before it makes sense for me to begin tightening things up.

1 comment: