Showing posts with label read_macroanal. Show all posts

Tuesday, September 10, 2019

Reading Macroanalysis 7.3: Style, Genre, Time, and Influence

This post is from Aug. 31, 2014, but I'm bumping it to the top of the queue as I am thinking about these matters in connection with Moretti and Sobchuk, Hidden in Plain Sight: Data Visualization in the Humanities (New Left Review 118, 2019, 86-119). They don't discuss this visualization, but they should have.
In this post I suggest some studies I’d like to be done. I begin by recalling Moretti’s account of genre succession from Graphs, Maps, Trees in the context of Jockers’ massive graph of literary influence. Then I revisit the “Style” chapter and look at some of the work I passed over when I first posted on that chapter, the work related to Moretti’s generational observation. I then make some suggestions about how we could infer quasi-genres in the data assembled to build the influence graph and thereby extend Jockers’ work on style from his limited corpus of 106 texts to the larger corpus of 3346 texts. I conclude with some vague and tentative remarks about the pattern of reader interest betrayed in the record we’ve been examining, that of book publication.

Influence and Genre Succession

I’ve been thinking a lot about two things: 1) Moretti’s argument in Graphs, Maps, Trees that genres tend to cluster into 30 year cycles, and 2) Jockers’ massive graph in which all 3346 texts in his corpus are linked by relations of similarity, producing a graph that looks like this (which is Figure 9.3, p. 165; color version from the web):

[Figure 9.3: Jockers’ graph of the 3346 texts]

As Jockers points out, what’s remarkable about this graph is that the nodes are ordered in time from left (oldest) to right, but there is no temporal information in the data from which it was derived: “Books are being pulled together (and pushed apart) based on the similarity of their computed stylistic and thematic distances from each other” (p. 164).

That temporal ordering is a side effect of ordering by thematic and stylistic similarity. But, in the abstract, it could have been otherwise, no? Why should positioning texts near similar texts result in temporal ordering? (Would the same thing be true of 20th Century texts?) This ordering implies that the evolution of literary culture IS directional, but Jockers himself hasn’t posited any telos, nor do I see any need to do so. That directionality stems from the internal dynamics of the system. Authors, and I assume audiences as well, want to stick with what they know, and what they know was published in the previous years.

It seemed to me that Moretti’s cycles must somehow be in that graph, for all the texts in a given cycle are close together in time, by definition, as well as similarity. Alas, the whole corpus has not been coded for genre (p. 158). Is there some way we can back into genre since we’ve got this massive graph based on similarity relations among texts along 578 dimensions? Aren’t texts within the same genre more likely to resemble one another than texts in different genres?

The other thing on my mind is the fact that what really interests me is what’s on people’s minds and how that evolves over time. Some books will attract few readers, some books many readers; but the mere fact that a book has been published doesn’t speak to that. Moreover, books can be read long after they’ve been published. In the case of Moby Dick, it would seem that, for the most part, it was read only long after it was published. Publication history is, at best, an indirect proxy measure of that.

And yet that history IS a history. Assuming that publishers are for the most part rational economic actors who want to turn a profit, their decisions on what to publish must take into account their sense of what people are reading and therefore what they’re buying. And the kinds of books that got published changed from one decade to the next. That record of changes must reflect changes of reading taste.

Wednesday, September 3, 2014

Reading Macroanalysis: Notes on the Evolution of Nineteenth Century Anglo-American Literary Culture

Matthew L. Jockers. Macroanalysis: Digital Methods & Literary History. University of Illinois Press, 2013. x + 192 pp. ISBN 978-0252-07907-8

I've compiled all the posts into a working paper. HERE's the SSRN link. Abstract and introduction below.

* * * * *

Abstract: Macroanalysis is a statistical study of a corpus of 3346 19th Century American, British, Irish, and Scottish novels. Jockers investigates metadata; the stylometrics of authorship, gender, genre, and national origin; themes, using a 500 item topic model; and influence, developing a graph model of the entire corpus in a 578 dimensional feature space. I recast his model in terms of cultural evolution where the dynamics are those of blind variation and selective retention. Texts become phenotypical objects, words become genetic objects, and genres become species-like objects. The genetic elements combine and recombine in authors' minds but they are substantially blind to audience preferences. Audiences determine whether or not a text remains alive in society.

* * * * *

Introduction: Get in the Driver’s Seat

I knew it was going to be good. But not THIS good. A better formulation: I didn’t know it would be good in THIS way, that it would put me in the driver’s seat, if only in a limited way.

The driver’s seat, you ask, what do you mean? In this case it means that I could actively work with the data. When, for example, I read Moretti’s Graphs, Maps, Trees, I read it as I do pretty much any book, though this one had a bunch of charts and diagrams, which is unusual for literary criticism. There wasn’t anything for me to do other than just read.

If I didn’t have ready access to the web, reading Macroanalysis would have been the same. But I do have web access and I use it all the time. So, when I got to Chapter 8, “Theme,” I also accessed the topic browser that Jockers had put on the web. Through this browser I could explore the topic model Jockers used in the book and, in particular, I could use it to investigate matters that Jockers hadn’t considered.

So I moved from thinking about Jockers’ work to using his work for my own intellectual ends. I ended up writing four posts (6.1 – 6.4) on that material totaling almost 12,000 words and I don’t know how many charts and graphs, all of which I got from Jockers’ web site. Once I’d worked through an initial curiosity about a spike that looked like Call of the Wild (but wasn’t, because that text isn’t in the database) I settled into some explorations framed by Leslie Fiedler’s Love and Death in the American Novel, Melville’s Moby Dick, and Edward Said’s anxiety on behalf of the autonomous existence of the aesthetic realm.

Data is Independent of Interpretations

You can do that as well, or whatever you wish. While the web browser gives you only limited access to Jockers’ corpus, that access is real and useful. A lot of work in digital criticism, and digital humanities in general, is like that. It produces ‘knowledge utilities’ that are generally useful, not just the private preserves of the original investigator.

There is an important epistemological point here as well. Jockers was led to this work by a certain set of intellectual concerns. Some of those concerns are quite general–about literature and the novel–while others are more specific–he has a particular interest in Irish and Irish-American literature. But I had no trouble putting his results to use in service of my own somewhat different interests.

Reading Macroanalysis 7: Influence, or the evolving dynamic integrity of the aesthetic sphere [REVISED]

Note: I decided that we needed a more explicit account of how Jockers visualized his 3346-node influence graph. I've inserted that account into the middle of the text and added subheadings.
I opened this investigation of Macroanalysis with the following paragraph:
The book arrived midway last week, when I hadn’t even finished reading Tim Morton’s Hyperobjects, much less finished blogging about it. But that didn’t stop me from giving Macroanalysis a look-thru: contents, some of the figures, read a bit here and there. I ended up reading Chapter 9, “Influence”, first; I’d read Matt Wilkins’ review in the LA Review of Books:
It’s a nifty approach that produces a fascinatingly opaque result: Tristram Shandy, Laurence Sterne’s famously odd 18th-century bildungsroman, is judged to be the most influential member of the collection, followed by George Gissing’s unremarkable The Whirlpool (1897) and Benjamin Disraeli’s decidedly minor romance Venetia (1837). If you can make sense of this result, you’re ahead of Jockers himself, who more or less throws up his hands and ends both the chapter and the analytical portion of the book a paragraph later.
Would I be able to make sense of those results? thought I to myself as I read. Nope, I couldn’t. Better luck next time.
I am now prepared to offer a re-interpretation of those results. But before I do that I need to explain more or less what Jockers is doing in this the final analytical chapter of the book. How does he operationalize the concept of influence?

What is influence?

When we say, for example, that J. K. Rowling was influenced by the Narnia novels of C. S. Lewis, what do we mean? We mean that she read them and has incorporated features of those books into her own work. There is a direct relationship between Rowling’s activities and those influential books.

Influence thus understood is something that ‘travels’ along certain paths in the enormous meshwork of reading and writing transactions that constitute literary culture. As there are only a relatively few writers in that network, and only a relatively few of their transactions are writing ones (let’s say that the writing of a book is a single transaction) most of the transactions in the network are readings. Only a few of the transactions in the meshwork carry influence.

But Jockers doesn’t have access to that meshwork. None of us does. To be sure, we can see bits and pieces of it here and there in diaries, letters, and published reviews, but most of the transactions are lost to history. We can only look for the effects of those transactions.

And that’s what Jockers does. He assumes, reasonably enough, that if one author is influenced by another, then we should see indicators of that influence in the work. There should be a noticeable resemblance between those works.

And that is something Jockers can look for. For each of his 3,346 texts he’s got a bunch of features, stylistic and thematic. Once he’s tossed out the uninterpretable thematic features he’s left with 578 features for each text. He then represents this information as a geometric space with 578 dimensions, one for each feature, in which we have 3346 points, one for each text. He can now calculate the distance between any two texts, that is, points, in this high dimensional feature space. That distance is a measure of the similarity between the texts.
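
Here is a minimal sketch of that calculation. It assumes plain Euclidean distance, which is my assumption for illustration; the passage above doesn't say which distance measure Jockers used. The function name and the toy vectors (three features rather than 578) are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two texts' feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy vectors: 3 features each, standing in for 578.
text_a = [0.0, 3.0, 1.0]
text_b = [4.0, 0.0, 1.0]

distance = euclidean_distance(text_a, text_b)
print(distance)
```

The smaller the number, the more alike the two texts are on the features that were scored. Nothing else about the texts enters into it.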

That’s what he does, and he gives us examples of the results. For each of Pride and Prejudice, Tale of Two Cities, and Moby Dick he gives us a table listing the ten novels the shortest distance from them and thus most like them (in terms of these 578 features). Not surprisingly, other books by Austen, Dickens, and Melville respectively occupy the top slots on these similarity lists. For the Austen list, all the other authors but one (Thomas Lister) are female. Similarly, the authors most like Dickens are male, though the author of Life’s Masquerade (10th) is unknown, hence gender unknown. All the authors on these two lists are British. In Melville’s case, the list is also all-male, but not all-American. Two Scots, Robert Ballantyne and Robert Louis Stevenson, also made the list.
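
A "ten most similar" list of this kind can be sketched as a nearest-neighbor lookup. This is only an illustration under my own assumptions: the titles and two-feature vectors are made up, the function name is hypothetical, and I use Python's `math.dist` (Euclidean distance) as a stand-in for whatever measure Jockers actually used.

```python
import math


def nearest_texts(target, features, k=10):
    """Return the k titles closest to `target`, nearest first."""
    origin = features[target]
    scored = [
        (math.dist(origin, vector), title)
        for title, vector in features.items()
        if title != target  # a text is trivially closest to itself
    ]
    scored.sort()
    return [title for _, title in scored[:k]]


# Hypothetical toy corpus: 4 texts, 2 features each.
features = {
    "Novel A": [0.0, 0.0],
    "Novel B": [1.0, 0.0],
    "Novel C": [0.0, 3.0],
    "Novel D": [5.0, 5.0],
}

closest = nearest_texts("Novel A", features, k=2)
print(closest)
```

With 3346 texts the same loop simply runs over more entries; the table for a given novel is the first ten titles that come back.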

As interesting as this is, Jockers points out that it’s a bit small scale. We need something else if we want to gauge influence throughout the century. What to do?

Monday, September 1, 2014

[8] From Macroanalysis to Cultural Evolution

The purpose of this post is to recast the work reported in Macroanalysis: Digital Methods & Literary History in terms appropriate to cultural evolution. The idea is to propose a model of cultural evolution and assign objects from Jockers’ analysis to play roles in that model. I will leave Jockers’ work untouched. All I’m doing is reframing it.

Before doing that, however, I should note that in the last quarter of a century or so there has been quite a lot of work on cultural evolution in a variety of disciplines including linguistics, anthropology, archaeology, and biology. Though it must be done at some time, I have no intention of even attempting to review that work here and so to place the scheme I propose in relation to it. That’s a job for another time and another venue. I note, however, that I have done quite a bit of work on cultural evolution myself and that some of that discussion can be found in documents I list at the end of this post.

Why Evolution?

First of all, why bother to recast the processes of literary history in evolutionary terms at all? Jockers wrote an excellent book without creating an evolutionary model, though he mentioned evolution here and there. What’s to be gained by this recasting?

As far as I can tell, much of the work that has been done on cultural evolution has been undertaken simply to exercise and extend the range of evolutionary discourse. It has not, as yet, resulted in an understanding of cultural process that is deeper than more conventional forms of historical discourse. Much of my own work has been undertaken in this spirit. I believe that, yes, at some point, evolutionary explanation will prove more robust than other forms of explanation, but we’re not there yet.

This work in effect is looking to evolutionary accounts as exhibiting something like formal cause in Aristotle’s sense. Evolutionary accounts are about distribution of traits across populations. In biology such accounts have a characteristic formal appearance so that, e.g. phylogenetic analysis of a population of entities tends to “look” a certain way. So, in the cultural sphere, let’s conduct a similar analysis and see how things look even if we don’t have our entities embedded in the kind of causal framework that genetics and population biology, molecular biology, and developmental biology provide the biologist.

That’s fine, as long as we remind ourselves periodically that that’s what we’re doing. But we must keep looking for the terms in which to construct a causal model.

What I specifically want from an evolutionary approach to culture is
  • a way to think about Said’s autonomous aesthetic realm,
  • a way to prove out Shelley’s assertion that “poets are the unacknowledged legislators of the world,”
  • a way of restoring agency to writers and readers rather than casting them as puppets of various vast and impersonal forces, and
  • a way of thinking about the canon in relation to the whole of literary culture.
That’s what I want. Those requirements imply having a causal model. Whether or not I’ll get it, that’s another matter.

Current critical approaches, however, in which individual humans are but nodal points in the machinations of vast and impersonal hegemonic forces, have trouble on all these points. Individual human beings are deprived of agency thus turning readers into zombies watching the ghosts of dead authors flicker on the remaining walls of Plato’s cave. The canon is captive to those same hegemonic forces, which have promulgated Shelley’s defense as an opiate for the masses, which R’ us.

The critical machine is broken. It’s time to start over. Before we do that, however, I need to dispense with one objection to seeking an evolutionary account of cultural phenomena.

Thursday, August 28, 2014

Reading Macroanalysis 7.2: Hyperobjects and Large Finitude

As a way of transitioning to a concluding post in which I will recast Jockers’ work in evolutionary terms, I want to look at a passage from Tim Morton’s Hyperobjects, a book I worked through several weeks ago. Why? Because the “big data” techniques Jockers is using deal in hyperobjects.

When Jockers built a model of literary influence in the 19th Century Anglo-American novel, he constructed a model of a hyperobject: roughly, a collective mentalité or Geist unfolding through several interlinked populations over the course of a century. And the model he built took the form of a graph with 3,346 nodes (each representing a text) and 165,770 edges (each a similarity relationship between the connected texts) where each node is a point in space with 578 dimensions, with each dimension scoring a single stylistic or thematic feature of the text. That model is big, but it is also finite.

And that’s where Morton comes in, with a passage on large finitude (Hyperobjects, pp. 60-61):
These gigantic timescales are truly humiliating in the sense that they force us to realize how close to Earth we are. Infinity is far easier to cope with. Infinity brings to mind our cognitive powers, which is why for Kant the mathematical sublime is the realization that infinity is an uncountably vast magnitude beyond magnitude. But hyperobjects are not forever. What they offer instead is very large finitude. I can think infinity. But I can’t count up to one hundred thousand. I have written one hundred thousand words, in fits and starts. But one hundred thousand years? It’s unimaginably vast...

The philosophy of vast space was first opened up by Catholicism, which made it a sin not to suppose that God had created an infinite void. Along with the scholastic view of substances, Descartes inherited this void and Pascal wrote that its silence filled him with dread. Vast non-human temporal and spatial magnitudes have been physically near humans since the Romantic period, when Mary Anning discovered the first dinosaur fossil (in 1811) and natural historians reckoned the age of Earth. Yet it was not until Einstein that space and time themselves were seen as emergent properties of objects. The Einsteinian view is what finally gave us the conceptual tools to conceptualize the scope of very large finitude.
I’ve come to believe that is one of the most important passages in Morton’s book.

Wednesday, August 27, 2014

Reading Macroanalysis 6.4: Themes and how they evolve over time

Note: This may be the most important post in the series. But it’s a long way through, 6000 words or so. Fortunately, there are a lot of illustrations, and much of the argument is in those illustrations.
This post concludes my examination of the “Theme” chapter from Matthew Jockers, Macroanalysis: Digital Methods and Literary History. First I say a word or three about topic modeling. Then I review Jockers’ own findings. Then I ride one of my current hobby horses, Leslie Fiedler’s argument in Love and Death in the American Novel. First I look at the waxing and waning of themes through the 19th century and then I return to Moby Dick and move on briefly to The Adventures of Huckleberry Finn. I conclude with some informal remarks about argument, evidence, epistemology, and interface design.

Operationalizing the Idea of Theme

When I was originally thinking about this post I decided that, since I’d already explained topic modeling elsewhere (e.g. HERE), there was no point in doing it again. But I’ve now decided to do it again just to make the point that we’re operating in an “operationalized” intellectual world and we must be aware of that.

For example, while Jockers has identified 500 themes (a term in ordinary language) in his corpus of 3346 novels, it would be somewhere between misleading and outright mistaken to say that he discovered 500 themes. For that would imply that it could have been 386 or 617 or 239 or any other number of themes, but that, no, it turns out that there are 500 of them, no more, no less.

Jockers has 500 themes because he ‘instructed’ his algorithm to prepare that many. He could have instructed it to prepare any number of topics (a term of art in corpus linguistics) he wished, 386, 617, or 239, for example. There’s no discovery involved. What’s involved is more like tuning. Jockers explored various possibilities and 500 seemed like a useful number of topics.

What’s going on?

Topic analysis depends on the fact that the words used to state some theme are going to occur together in any text where that theme shows up. It’s all about context. That fact isn’t of much use if you’re dealing with only a handful of themes in a small body of text. But if you’ve got a large body of texts with an appreciable number of themes, then there’s a way you can get a computer to list the words in each theme, more or less.

If a given text doesn’t contain a given topic then the words associated with that topic won’t appear in that text, obviously. Oh, some of them might, but not all of them. So, the computer does a massive comparison of the words that occur throughout the corpus. Just how it does this is irrelevant at the moment; at least, it’s irrelevant if you’re willing to trust that the researchers who invented the technique know what they’re doing. One interesting thing about the technique is that the results improve as the corpus gets larger–assuming you’ve got the computer power needed to crunch the data. That’s because a larger corpus will have a larger number of topics and each different topic will appear in contrast to a larger universe of topics. It’s the contrast that the computer’s looking for.
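
To make the co-occurrence idea concrete, here is a crude sketch that simply counts how often pairs of words turn up in the same document. To be clear, this is not the topic modeling algorithm, which is considerably more sophisticated; it only illustrates the raw signal such an algorithm has to work with. The function name and the three toy "documents" are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each pair of words, how many documents contain both."""
    counts = Counter()
    for doc in documents:
        vocabulary = sorted(set(doc.lower().split()))
        counts.update(combinations(vocabulary, 2))
    return counts


# Hypothetical toy documents; each string stands in for a whole text.
documents = [
    "whale ship sea",
    "ship sea captain",
    "landlord tenant rent",
]

pairs = cooccurrence_counts(documents)
print(pairs[("sea", "ship")])    # pair appears in the first two documents
print(pairs[("sea", "tenant")])  # pair never appears together
```

Words that belong to the same theme pile up counts with one another and not with words from other themes. That is the contrast, and it gets sharper as the corpus grows.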

But, as I’ve said above, in order to run the algorithm, you’ve got to specify how many topics you’re looking for. If the number is too high the algorithm returns (p. 128)
topics lacking enough contextual markers to provide a clear sense of how the topic is being expressed in the text; setting the number too low may result in topics of such a general nature that they tend to occur throughout the entire corpus.
But you don’t necessarily want to run your algorithm against the entire text as a single analytical unit if the texts are individually large, as is the case with novels. What happens then is that just about any topic can be found in a given text. Jockers determined that he needed to slice his texts into 1000 word chunks. That is, each whole novel is searched, but the algorithm treats each 1000-word segment of a novel as an independent text for the purpose of determining word co-occurrence.
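
The slicing step is easy to sketch. The version below splits on whitespace and lets the final chunk run short; both choices are my assumptions, since the passage doesn't say how Jockers tokenized the texts or handled the remainder. The function name is hypothetical.

```python
def chunk_words(text, size=1000):
    """Split a text into consecutive chunks of `size` words each.

    The last chunk holds whatever words are left over, so it may be shorter.
    """
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


# Hypothetical stand-in for a novel: 2500 words.
novel = "word " * 2500
chunks = chunk_words(novel)

print([len(chunk) for chunk in chunks])
```

Each chunk is then handed to the topic modeling algorithm as if it were a separate document, which is what keeps any one topic from showing up everywhere.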

This has the benefit that it is possible to track the distribution of a given topic in a novel (Jockers gives examples on pp. 142-43). Thus topic modeling returns useful results both at the macro scale of the whole corpus and the meso scale of the individual text (on scales, see Reading Macroanalysis 5: An Interlude on Scale: Micro, Meso, and Macro in this series).

What the algorithm returns for each topic is a weighted list of words for that topic, where the weighting of a word is proportional to its frequency in the topic. Jockers’ preferred representation for a topic is a word cloud, such as this one:

[Word cloud: TENANTS AND LANDLORDS]

But it’s not convenient to use those clouds as devices for referring to a given topic. For that purpose it’s useful to have names. The general practice is to devise a name from the words most prominent in the topic. Thus, Jockers has called that topic TENANTS AND LANDLORDS.
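
As a sketch of that naming practice: given a topic as a weighted word list, pull out the heaviest words and let them suggest a label. The weights below are invented for illustration and are not taken from Jockers' model; the function name is hypothetical. The final name is still a human judgment, and the code only surfaces the candidates.

```python
def top_words(topic_weights, n=3):
    """Return the n most heavily weighted words, heaviest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(topic_weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical weights, NOT from Jockers' topic model.
topic = {
    "tenant": 0.09,
    "landlord": 0.08,
    "rent": 0.05,
    "estate": 0.03,
    "farm": 0.02,
}

label = " ".join(top_words(topic, n=2)).upper()
print(label)
```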

Saturday, August 23, 2014

Reading Macroanalysis 6.3: DOGS and BIRDS, or, the hermeneutics of screwing around

I started exploring Jockers’ 500 themes in earnest after I’d spotted this graph, which depicts the occurrence of the DOGS topic by gender over time:

[Chart: DOGS topic by gender over time]

I saw that spike at the end, for male authors, and thought Call of the Wild. But alas, that can’t be, as the book isn’t in the database.

In any event, I’ve just now been looking through some more of the theme plots and found this one, for BIRDS by gender over year:

[Chart: BIRDS topic by gender over time]

Bingo! And I wasn’t even looking for it!

The overall shape is the same as the one for DOGS, including the spike in the male line. Are we looking at the same book or books in that spike?

Friday, August 22, 2014

Reading Macroanalysis 7.1: Visualizing the Geist of 19th Century Anglo-American Literary Culture

Last year Alan Liu published a remarkable essay, “The Meaning of the Digital Humanities” (PMLA 128, 2013, 409-423), in which he argued that the most recent work in digital humanities has come within hailing distance of operationalizing the vague humanistic notion of the human spirit conceived as a collective entity operating in and through world history. To be sure, that’s not quite what he said and did, but that’s what his discussion of Ryan Heuser and Long Le-Khac, A Quantitative Literary History of 2,958 Nineteenth-Century British Novels (May 2012, 68 page PDF), amounts to. In effect their corpus is a proxy for the 19th Century British Geist, making their analysis of that corpus an analysis of that Geist.

We find this paragraph near the end of his essay (pp. 418-419):
It is not accidental, I can now reveal, that at the beginning of this essay I alluded to Lévi-Strauss and structural anthropology. Structuralism is a midpoint on the long modern path toward understanding the world as system (e.g., as modes of production; Weberian bureaucracy; Saussurean language; mass, media, and corporate society; neoliberalism; and so on) that has forced the progressive side of the humanities to split off from earlier humanities of the human spirit (Geist) and human self to adopt a worldview in which, as Hayles says, “large-scale multicausal events are caused by confluences that include a multitude of forces . . . many of which are nonhuman.” This is the backdrop against which we can see how the meaning problem in the digital humanities registers today’s general crisis of the meaningfulness of the humanities. The general crisis is that humanistic meaning, with its residual yearnings for spirit, humanity, and self—or, as we now say, identity and subjectivity—must compete in the world system with social, economic, science-engineering, workplace, and popular-culture knowledges that do not necessarily value meaning or, even more threatening, value meaning but frame it systemically in ways that alienate or co-opt humanistic meaning.
Those “large-scale multicausal events” are, in effect, the workings of Geist. The fact that we can now visualize proxies for that collective spirit, that Geist, and do so in an explicit and rigorous way means that we are now in a position to recoup that older humanistic learning, albeit in a new and strange register.

In the final analytic chapter of Macroanalysis Jockers presents two visualizations of the Geist of 19th Century Anglo-American literary culture. Here’s one of them:

[Figure 9.3: Jockers’ graph of the 3346 texts]

Though it is difficult to see at this crude resolution, that cloud is in fact a graph with 3,346 nodes and 165,770 edges. Each node represents one of the texts in Jockers’ corpus and the edges represent a measure of similarity between nodes. Each text has been scored on 578 stylistic and thematic features and similarity computed accordingly.
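
A little arithmetic gives a sense of scale. The sketch below assumes the graph is undirected with at most one edge between any pair of texts; that is my assumption for the sake of the calculation, not something stated above.

```python
nodes = 3346
edges = 165770

# Assumption: undirected graph, at most one edge per pair of texts.
possible_edges = nodes * (nodes - 1) // 2  # every pair of texts linked
density = edges / possible_edges           # share of possible links present
mean_degree = 2 * edges / nodes            # average number of links per text

print(possible_edges)
print(round(density, 4))
print(round(mean_degree, 1))
```

On those assumptions only about 3% of the possible links are present, and the average text is linked to roughly 99 others. So the graph is large, but it is far from everything-connected-to-everything.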

Thursday, August 21, 2014

Reading Macroanalysis 6.2: Theme, Moby Dick in the Context of Literary Culture

In this post, which is a long one, I use two books to investigate Jockers’ themes, and vice versa. One of the books is a classic of fairly traditional, at least by now, literary criticism, Leslie Fiedler’s Love and Death in the American Novel. The other book is one that Fiedler examined, Herman Melville’s Moby Dick.

First I open with some of Jockers' charts, then I consider a particular passage from Moby Dick, and Fiedler’s gloss on it. I then return to Jockers, looking for themes and charts that resonate with or respond to the passage. What I’m up to is, in effect, using Fiedler to guide me in a close reading of Jockers’ distant reading, a term, by the way, that he doesn’t use. I’m using it, of course, for rhetorical purposes.

Note: I downloaded all the charts from Matt Jockers’ Macroanalysis website.

An Outlier in the 19th Century

Look at this graph:

[Chart: PACIFC ISLANDS AKA SEAS AND WHALING topic by nation over time]

That spike in the middle is Moby Dick. Not literally of course. But it is reasonable to believe that that spike in the data is caused mostly by Moby Dick–where cause is to be read as Aristotelian formal cause. The timing is right; Moby Dick was published in 1851, which is roughly where that spike peaks. The light grey line shows the use of the topic by Irish authors in the 19th century; the darker grey line shows use by British authors. And the black line, of course, shows American authors.

And the topic is typical of Moby Dick. In his book Jockers calls it Seas and Whaling; on his website it’s PACIFC ISLANDS AKA SEAS AND WHALING. Here’s the corresponding word cloud:

[Word cloud: PACIFC ISLANDS AKA SEAS AND WHALING]

Above island and islands in the middle you can see whale while near the top in the middle you can see Queequeg.
Note: for reasons that Jockers explains in the book, proper names are generally eliminated from topic models because they can recur in many contexts without there being any thematic similarity between those contexts; this one slipped through.
This particular topic constitutes almost 20% of the text of Moby Dick, though it constitutes a very small fraction of the corpus as a whole (p. 130 and Figure 8.4 p. 132). Other topics having to do with ships and the sea are also prominent in the book while being relatively unimportant in the whole corpus of 3,346 British, American, and Irish novels.

Moby Dick is, as Jockers says, an outlier. It is also one of the world’s great novels. Back in 1976 Edward Mendelson included it as one in a very small and exclusive genre he called the encyclopedic narrative (“Encyclopedic Narrative: From Dante to Pynchon,” MLN 91, 1267-1275). The other examples are: Dante’s Divine Comedy, Rabelais’ Gargantua and Pantagruel, Cervantes’ Don Quixote, Goethe’s Faust, Joyce’s Ulysses, and Pynchon’s Gravity’s Rainbow. In Mendelson’s account each of these works is unique in its national canon and each is encyclopedic in scope, encompassing the world as it was known at the time. (Mendelson does have a ‘fix’ for the fact that America gets two such books; he links Gravity’s Rainbow to a newly forming international culture.)

Wednesday, August 20, 2014

Reading Macroanalysis 7: Influence, or the evolving dynamic integrity of the aesthetic sphere

While I do intend to write two more posts (at least) on thematic analysis, yesterday’s effort burned me out temporarily. So today’s post is a somewhat shorter one on Jockers’ last substantive chapter.
I opened this investigation of Macroanalysis with the following paragraph:
The book arrived midway last week, when I hadn’t even finished reading Tim Morton’s Hyperobjects, much less finished blogging about it. But that didn’t stop me from giving Macroanalysis a look-thru: contents, some of the figures, read a bit here and there. I ended up reading Chapter 9, “Influence”, first; I’d read Matt Wilkins’ review in the LA Review of Books:
It’s a nifty approach that produces a fascinatingly opaque result: Tristram Shandy, Laurence Sterne’s famously odd 18th-century bildungsroman, is judged to be the most influential member of the collection, followed by George Gissing’s unremarkable The Whirlpool (1897) and Benjamin Disraeli’s decidedly minor romance Venetia (1837). If you can make sense of this result, you’re ahead of Jockers himself, who more or less throws up his hands and ends both the chapter and the analytical portion of the book a paragraph later.
Would I be able to make sense of those results? thought I to myself as I read. Nope, I couldn’t. Better luck next time.
I am now prepared to offer a re-interpretation of those results. But before I do that I need to explain more or less what Jockers is doing in this the final analytical chapter of the book. How does he operationalize the concept of influence?

When we say, for example, that J. K. Rowling was influenced by the Narnia novels of C. S. Lewis, what do we mean? We mean that she read them and has incorporated features of those books into her own work. There is a direct relationship between Rowling’s activities and those influential books.

Influence thus understood is something that ‘travels’ along certain paths in the enormous meshwork of reading and writing transactions that constitute literary culture. As there are only relatively few writers in that network, and only relatively few of their transactions are writing ones (let’s say that the writing of a book is a single transaction), most of the transactions in the network are readings. Only a few of the transactions in the meshwork carry influence.

But Jockers doesn’t have access to that meshwork. None of us does. To be sure, we can see bits and pieces of it here and there in diaries, letters, and published reviews, but most of the transactions are lost to history. We can only look for the effects of those transactions.

And that’s what Jockers does. He assumes, reasonably enough, that if one author is influenced by another, then we should see indicators of that influence in the work. There should be a noticeable resemblance between those works.
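That assumption can be sketched in a few lines. This is not Jockers’ actual procedure, only a minimal illustration of the idea: represent each book as a vector of measured features, treat resemblance as distance between vectors, and allow candidate influence to run only from earlier books to later ones. The titles, years, feature values, and the choice of Euclidean distance are all invented for the example.

```python
import math

# Hypothetical books: (title, year, feature vector). All values are invented.
books = [
    ("A", 1800, [0.10, 0.30, 0.05]),
    ("B", 1820, [0.12, 0.28, 0.06]),
    ("C", 1850, [0.40, 0.10, 0.20]),
    ("D", 1870, [0.38, 0.12, 0.22]),
]

def distance(u, v):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_predecessor(book, corpus):
    """Return the title of the most similar book published earlier, or None."""
    title, year, vec = book
    earlier = [b for b in corpus if b[1] < year]
    if not earlier:
        return None
    return min(earlier, key=lambda b: distance(vec, b[2]))[0]

# Map each book to the earlier book it most resembles.
links = {b[0]: closest_predecessor(b, books) for b in books}
print(links)
```

Note that a link of this kind records only resemblance plus temporal order; it says nothing about whether the later author ever read the earlier book.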

Tuesday, August 19, 2014

Reading Macroanalysis 6.1: Theme–Dogs, Gold, Slavery, and Awakening

Chapter 8 of Macroanalysis is about “Theme.” Jockers uses topic analysis to investigate the occurrence of 500 ‘themes’ in a corpus of 3,346 19th-century British, American, and Irish books. He opens with a bit of intellectual history, from the Russian Formalists to Google’s Ngrams; then he launches into topic analysis, which emerged at the turn of the millennium. He gives some simple examples, and then he gets serious.

But I’m going to skip over all of that for now. For one thing, I’ve been through the topic analysis drill several times in the past year or so and don’t want to go through it again. If you need an introduction or a review, check out Topic Models: Strange Objects, New Worlds, or, in this series, Reading Macroanalysis 5: An Interlude on Scale: Micro, Meso, and Macro. For another, Jockers has put a topic tool online, 500 Themes from a corpus of 19th-Century Fiction. Those are the topics he discusses in this chapter.

Once I was done reading the chapter I started playing with the tool. I’d pick a topic and then look at the graphics:
  1. a word cloud to display the most frequent words in the topic,
  2. a bar chart indicating usage of the topic by author gender (male, female, and undetermined),
  3. a line graph showing gender usage over time,
  4. a bar chart indicating usage of topic by author nationality (American, British, Irish), and
  5. a line graph showing national usage over time.
At first I was just browsing, moving from one theme to the next. But then I hit one that grabbed my attention. So I spent the next couple of hours looking at themes and thinking about them.

I’m going to devote the rest of this post and the next one to showing what I found. Then I’ll do a third post where I review what Jockers found and recast the enterprise in terms of cultural evolution. Note that in all of this I’m just playing around, but in a serious way. It is all preliminary and provisional. I haven’t reached any firm conclusions on the particular themes I look at. The only thing I’m sure about is that this, and similar techniques, are going to revolutionize the way we do literary history.

Before proceeding, however, two caveats are necessary. While Jockers’ corpus is substantial, it isn’t every British, American, and Irish novel written in the 19th Century. Perhaps more important, it is natural to read these theme charts as reflecting the interests of the 19th Century reading public. And in some sense that is so.

But we have to be careful. For some of these books were more widely read than others and a few of them, the canonical ones, are still being read. But the extent of a book’s readership is not reflected in the data. The fact that a book was published at all implies, of course, that someone thought there was an audience for it. But a publisher’s interest isn’t quite the same as a reader’s interest. We simply don’t know how accurately publisher interest tracks reader interest.

With those reservations in mind, let’s take a look.

Of Dogs and Gold

In the course of browsing through Jockers’ themes menu I saw “DOGS.” Let’s look at that, I thought. Why dogs? you may ask. No deep reason, but some years ago, way back in graduate school in fact, I’d noticed that dogs figured as a significant motif in Wuthering Heights. Major transitions among humans were marked by violence between dogs and humans (e.g. Lockwood arrives and is greeted by a barking dog, Catherine gets bitten by Skulker; see this post). More recently, I’d read a handful of articles about the domestication of dogs during human evolutionary history. I was just curious.

Here’s the word cloud for the DOGS topic:

dog cloud

Friday, August 15, 2014

Reading Macroanalysis 5: An Interlude on Scale: Micro, Meso, and Macro

Before moving on to the last two major chapters, “Theme” (8) and “Influence” (9), I want to pause a bit and think about scale as I discussed it in Toward a Computational Historicism. Part 1: Discourse and Conceptual Topology and a consequent Working Paper on Digital Historicism. While Jockers focuses on the macroscale, large populations of texts, he is also working at the mesoscale of the individual text, and his analytical work implies microscale phenomena.

From Micro to Meso: Paths in a Network

It has become common for cognitive scientists to think of the mind as a cognitive, or associative, network:

network

The nodes represent concepts, or even words, while the arcs or edges represent relations. We can now think of an utterance or a written statement as taking a path through the network:

path

An entire novel is simply a very long path, one that will pass through a large area of the network and that will go through various subnetworks many times, often along different paths and orientations.
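As a toy illustration of a text as a path, assume a tiny associative network stored as an adjacency map. The concepts and links below are invented; the sketch only checks whether a sequence of words follows existing links and counts how often each node is revisited.

```python
# Toy associative network: each concept maps to the concepts it is linked to.
# The nodes and edges are invented for illustration.
network = {
    "dog": {"bark", "bite", "house"},
    "bark": {"dog", "night"},
    "bite": {"dog", "wound"},
    "house": {"dog", "night"},
    "night": {"bark", "house"},
    "wound": {"bite"},
}

def is_path(words, net):
    """True if every consecutive pair of words is joined by an edge."""
    return all(b in net.get(a, set()) for a, b in zip(words, words[1:]))

def visit_counts(words):
    """Count how many times the path passes through each node."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

utterance = ["night", "bark", "dog", "bite", "dog", "house"]
print(is_path(utterance, network))
print(visit_counts(utterance))
```

A novel, on this picture, would be the same kind of object at a vastly larger scale: one long sequence that revisits some regions of the network many times.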

Let us posit that the way an author moves through the net is the author’s style. The function words that are so very useful as features in identifying style can be thought of as regulating movement from one node (content) to another. That’s why they are so very useful in stylistic analysis.

No matter what you write about, you have to use those function words. Your choice of content words varies as your topic varies, but the set of function words is quite limited and you have to draw from that set regardless of topic. That is to say, while an author has to navigate many different subnetworks, that is, many different topics, the way they navigate each subnetwork is pretty much the same.

But why should that signal also be an individual one? Because, as the saying goes, there are many ways to skin a cat. That is, there are many ways to construct sentences and paragraphs about any given topic. Different writers will choose different ways of so doing.
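To make the function-word idea concrete, here is a minimal sketch that turns a text into a profile of function-word rates. The six-word list and the tokenizing rule are assumptions chosen for brevity; real stylometric work uses much longer lists and more careful tokenization.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny function-word list for illustration.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in"]

def function_word_profile(text, words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in words]

sample = "The dog ran to the gate and the cat sat in a tree."
profile = function_word_profile(sample)
print(dict(zip(FUNCTION_WORDS, profile)))
```

Because the profile ignores content words, two texts on quite different topics by the same writer can still yield similar vectors, which is the property the argument above relies on.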

Wednesday, August 13, 2014

Reading Macroanalysis 4: On the matter of “the”

Chapter 7, “Nationality,” is pretty straightforward. I don’t have much to say about it except for a puzzle that Jockers presents at the beginning. He points out that, because British and American writers have different practices concerning the word the, that word is about 5 percent of the word tokens in his corpus of 19th Century British novels, while it is about 6 percent of the tokens in the American novels.

That is a trivial and straightforward matter. Now, when you plot yearly usage throughout the century you find that British and American usage track one another fairly well (p. 106):
Over a period of one hundred years, it is as if the writers in these two nations–two nations separated by several thousand miles of water in an age before mass communication–made a concerted effort to modulate their usage of this most common of common words. However, the is not the kind of word that authors would consciously agonize over; quite the contrary, the is a trivial word, a function word used automatically and by necessity. Whereas the use of the word beautiful, for example, may come and go with the fads of culture, the word the is a whole different animal. For comparison, consider figure 7.2, which charts the relative frequency of the word beautiful in the same corpus. Whereas the is nearly parallel, beautiful is erratic and unpredictable.
Which rather seems to me to be the point. The use of the is fixed by grammar and the grammars of British and American English are pretty much the same. And, while word usage can come and go relatively quickly, grammar changes slowly. So I’m not terribly puzzled by the fact that American and British usage of the track one another fairly well during the 19th Century, nor that their usage of beautiful is different.
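The yearly plot discussed above rests on a simple calculation: pool the tokens for each nation and year, then divide the count of the by the total. A minimal sketch follows, with invented counts that stand in for the corpus data, which I do not have.

```python
from collections import defaultdict

# Hypothetical rows: (nation, year, occurrences of "the", total tokens).
# The numbers are invented; they are not Jockers' data.
texts = [
    ("British", 1850, 5000, 100000),
    ("British", 1850, 2600, 50000),
    ("American", 1850, 6100, 100000),
    ("British", 1851, 4900, 100000),
    ("American", 1851, 3000, 50000),
]

def yearly_share(rows):
    """Relative frequency of "the" per (nation, year), pooling all texts."""
    the_counts = defaultdict(int)
    totals = defaultdict(int)
    for nation, year, the, total in rows:
        the_counts[(nation, year)] += the
        totals[(nation, year)] += total
    return {key: the_counts[key] / totals[key] for key in totals}

shares = yearly_share(texts)
for key in sorted(shares):
    print(key, round(shares[key], 4))
```

Pooling tokens before dividing weights long books more heavily than short ones; averaging per-book rates instead would be an equally defensible choice.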

Tuesday, August 12, 2014

Reading Macroanalysis 3.1: Style, or Measuring the Autonomous Aesthetic Realm

Yesterday I posted on Chapter 6, “Style,” in which Jockers argued, in effect, that insofar as we can measure (or estimate) the factors that affect a text’s style, authorial identity is the strongest of those factors. The key is that phrase, “insofar as we can measure,” because that’s the intellectual world in which we are now functioning.

I now want to take Jockers’ arguments on that score and refit them for use as evidence that an autonomous aesthetic realm does indeed exist, as the late Edward Said believed but couldn’t quite explain.

First I want to take up the topic that was the subject of Moretti’s most recent pamphlet, operationalization. Then I’ll introduce Said’s conundrum about the existence of an autonomous aesthetic realm and discuss how we could operationalize it. I’ll conclude by arguing that Jockers has already, in effect, all but given us an operationalization of it.

Operationalization

If we can’t measure it, or operationalize it, to use a term Moretti adopted from physics (“Operationalizing”: or, the Function of Measurement in Modern Literary Theory, Literary Lab, Pamphlet, December 2013), then we can’t reason about it in this universe of discourse. In some other universe of discourse, sure, but not in this one.

Moretti glosses “operationalize” by a passage from P. W. Bridgman (p. 2):
We may illustrate [the meaning of the term] by considering the concept of length: what do we mean by the length of an object? [...] To find the length of an object we have to perform certain physical operations. The concept of length is therefore fixed when the operations by which length is fixed are fixed: that is, the concept of length involves as much and nothing more than the set of operations by which length is determined. In general, we mean by any concept nothing more than a set of operations; the concept is synonymous with the corresponding set of operations [...] the proper definition of a concept is not in terms of its properties but in terms of actual operations.
If I might push the term a bit, since the middle of the last century, and a bit before, literary criticism has ‘operationalized’ the concept of meaning by the procedure of so-called close reading. That is to say, the meaning of a literary text, whether a sonnet by Shakespeare or a narrative by Murasaki Shikibu, is what the procedure of close reading determines it to be.

Alas!, or if you are so inclined, mirabile dictu!, the result of this process tends to vary from one critic to another. In science, that would be a problem. In literary criticism it is merely a provocation to theory.

Many critics simply ignore it, perhaps covering it over with the anodyne topos that the multiplicity of meanings simply shows the richness of the text. Other critics assume that the concept has been inadequately operationalized and go in search of more adequate methods – the literary Darwinists, led by Joseph Carroll, are the most recent such school. And still other critics accept this as evidence that meaning is indeterminate.

But I digress. It’s not meaning we’re after. It’s style. Traditionally the concept of style has been operationalized – and here I’m again pushing things a bit – by describing texts in rhetorical, philological and, more recently, linguistic terms. Since the middle of the previous century computational humanists have operationalized the concept by counting textual features and undertaking a statistical analysis of the counts.

From the standpoint of traditional humanism that seems odd and terribly impoverished, and no doubt it is. But it is also reliable from one researcher to another and has allowed stylisticians to accomplish at least one task beyond the reach of traditional humanists, with their richer methodology: identifying the authors of otherwise anonymous texts. This is the tradition in which Jockers is working.

Monday, August 11, 2014

Reading Macroanalysis 3.0: Style, or the Author Comes Back from the Dead

I’m going to devote two posts to Chapter 6, “Style.” In this post I’m going to present what I take to be Jockers’ main result, that authorial identity is, in fact, a strong feature of texts. That may not come as much of a surprise to most as it’s something that many of us have “known” for a long time. But we’ve not known it in the context of an investigation of this kind.

In my second post I’m going to offer some general methodological remarks about operationalization and then discuss how Jockers’ results can be brought to bear on Edward Said’s anxiety over the existence of an autonomous aesthetic sphere.

The Statistical Assessment of Style

Jockers begins (p. 63) with this statement:
In statistical or quantitative authorship attribution, a researcher attempts to classify a work of unknown or disputed authorship in order to assign it to a known author based on a training set of works of known authorship.
So we start with texts whose authorship is known and analyze them in some way to identify features thought to be characteristic of the particular author. We’re not interested in the content of the text, but in authorial style.

Such work has a substantial history, going back to the mid-1960s when Mosteller and Wallace used statistical techniques to identify the authors of fifteen (out of 85) Federalist Papers with uncertain authorship. Because this particular set of texts is relatively homogeneous in important respects – genre, time of composition, provenance – it is reasonable to attribute statistical differences in the texts to authorial style.
Such homogeneity is not always the case, as Jockers notes (p. 63):
A consistent problem for authorship researchers, however, is the possibility that other external factors (for example, linguistic register, genre, nationality, gender, ethnicity, and so on) may influence or even overpower the latent authorial signal. Accounting for the influence of external factors on authorial style is an important task for authorship researchers, but the study of influence is also a concern to literary researchers who wish to understand the creative impulse and the degree to which authors are the products of their times and environments.
Jockers will go on to discover that authors are in fact highly constrained, which I don’t think comes as a surprise to anyone except, perhaps, for a few sophomoric Romantics who are totally besotted with the trope of authorial creative genius.
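The attribution setup Jockers describes, a training set of known authorship and a text to be classified, can be sketched with a nearest-centroid rule. To be plain about it, this is a stand-in technique chosen for brevity; it is neither Mosteller and Wallace’s method nor the classifier Jockers uses. The author names and the two-feature vectors are invented.

```python
import math

# Hypothetical training set: author -> feature vectors for works of known
# authorship (e.g. rates of two function words). All values are invented.
training = {
    "Author X": [[0.060, 0.030], [0.062, 0.028]],
    "Author Y": [[0.045, 0.040], [0.047, 0.042]],
}

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def attribute(unknown, training_set):
    """Assign the unknown text to the author with the nearest centroid."""
    centroids = {a: centroid(vs) for a, vs in training_set.items()}
    return min(centroids, key=lambda a: math.dist(unknown, centroids[a]))

guess = attribute([0.046, 0.039], training)
print(guess)
```

The sketch also shows why the external factors in the quoted passage matter: if genre or period shifts these feature values more than authorship does, the nearest centroid will reflect genre or period rather than the author.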

Friday, August 8, 2014

Reading Macroanalysis 2.1: How do we make inferences from patterns in collections of books to patterns in populations of readers?

My previous post, about Jockers’ analysis of metadata from a collection of Irish American texts, got me to thinking about just what kind of inferences we CAN make given such data. In particular, how do we go from data about collections of books to thoughts, attitudes, desires and values in the minds of populations of readers? For, in one way or another, that is what both Jockers and I are doing when we propose explanations for the data.

Texts as Proxies for Minds

As a practical matter we, literary critics interested in literary history, treat texts as evidence about the minds of past populations. And one of the major justifications for computational analysis of large collections of texts is that traditional criticism, by focusing on a small number of canonical texts, is not adequately sampling the textual universe and thus is not getting a full picture of mentalities past. But what kind of conclusions can we draw from the texts themselves? Does it matter that some texts were more widely read than others? I should think it does.

At least proponents of the canon can point out that those texts have survived and are read today precisely because they have been read by many in the past. The same cannot necessarily be said about those many now-forgotten texts. We need not take the traditionalist claim at face value, of course, but we cannot dismiss it either. Readership matters, and canonical status is an index of readership, albeit a problematic one.

What to do?

I want to think this through a bit. Not in detail at all. Just thinking out loud.

Start with Oral Cultures

In the middle of my previous post I presented an account of the social function of literature in terms of shared vs. mutual information, a notion from game theory. Publicly told stories embody values and attitudes that are widely circulated in a population, but everyone knows this. That, knowing that everyone knows, is a kind of meta-knowledge.

The basic story telling situation, the one deepest in our history, is face-to-face story telling in an oral culture. Such societies have relatively small populations, are culturally homogeneous, and you spend most of your time with people you know very well. This is a world with relatively few secrets and surprises.

Wednesday, August 6, 2014

Reading Macroanalysis 2: Metadata and the Emperor’s New Clothes

Once I’d gone through the introductory chapters I decided to skip over Chapter 5, “Metadata”, in favor of the real goods in Chapter 6, “Style”. But I was wrong about the metadata, so here we are.

You may want to get yourself a cup of Irish coffee as this is going to take a while.

The Lost are Found

Most of Jockers’ work in this chapter concerns Irish American fiction in a corpus of 758 texts covering 250 years. Jockers is curious about a gap identified in Charles Fanning, The Irish Voice in America: 250 Years of Irish-American Fiction (p. 38):
Fanning discovered an apparent dearth of writers active in the period from 1900 to 1930, and as an explanation for this literary “recession,” Fanning proposes that 1900 to 1930 represents a “lost generation,” a period he defines as one “of wholesale cultural amnesia”.
When Jockers plots his 758 texts, however, he can’t see Fanning’s gap.

First he separates writers working east of the Mississippi from those working west and plots the two groups separately. The eastern group follows Fanning’s pattern, but the western group does not. They
make a somewhat sudden appearance in about 1900 and then begin a forty-year ascendance that reaches an apex in 1941. Western writers clearly dominated the early part of the twentieth century. (p. 39)
Jockers also plots titles by gender, which adds nuance to the emerging picture.

He then gives us this striking paragraph (p. 42):
Western authors, both male and female, certainly appear to have countered any literary recession of the East. That they succeed in doing so despite (or perhaps directly because of) a significantly smaller Irish ethnic population in the West is fascinating. Figure 5.6 incorporates census figures to explore Irish American literary output in the context of eastern and western demographics. The chart plots Irish American books published per ten thousand Irish-born immigrants in the region. A natural assumption here is that there should be a positive correlation between the size of a population and the number of potential writers within it. What the data reveal, however, is quite the opposite: the more sparsely populated West produced more books per capita.
Jockers goes on to point out that not only were the western writers distant from the publishing industry, they “were also further removed from the primary hubs of Irish culture in cities such as Boston, New York, and Chicago” (42-43). In this situation, Jockers suggests, you’d expect the immigrants to shed their Irishness and assimilate to local cultures. Instead, it seems, they doubled down and “wrote about being Irish in America at a per capita rate exponentially greater than their countrymen in the East” (p. 43).
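The per capita measure in that passage is books published per ten thousand Irish-born immigrants. A small sketch shows how a region with far fewer books can still come out ahead on that measure. The book counts and populations below are invented for illustration and are not taken from Jockers’ Figure 5.6.

```python
def books_per_ten_thousand(books, population):
    """Books published per 10,000 people in the reference population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return books / population * 10_000

# Invented figures for illustration only.
east = books_per_ten_thousand(books=40, population=800_000)
west = books_per_ten_thousand(books=10, population=50_000)
print(f"East: {east:.2f}  West: {west:.2f}")
```

With these made-up numbers the East publishes four times as many books, yet the West’s rate is four times higher, because its population is sixteen times smaller.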

Now we’ve got something to think about. And what I’m going to think is that perhaps Jockers is working from a plausible, but incorrect, assumption. But now I’m getting ahead of myself.

Tuesday, August 5, 2014

Reading Macroanalysis 1: Framing: Hyperobjects, Objectification, and Evolution

Matthew L. Jockers. Macroanalysis: Digital Methods & Literary History. University of Illinois Press, 2013. x + 192 pp. ISBN 978-0252-07907-8
The book arrived midway last week, when I hadn’t even finished reading Tim Morton’s Hyperobjects, much less finished blogging about it. But that didn’t stop me from giving Macroanalysis a look-thru: contents, some of the figures, read a bit here and there. I ended up reading Chapter 9, “Influence”, first; I’d read Matt Wilkens’ review in the LA Review of Books:
It’s a nifty approach that produces a fascinatingly opaque result: Tristram Shandy, Laurence Sterne’s famously odd 18th-century bildungsroman, is judged to be the most influential member of the collection, followed by George Gissing’s unremarkable The Whirlpool (1897) and Benjamin Disraeli’s decidedly minor romance Venetia (1837). If you can make sense of this result, you’re ahead of Jockers himself, who more or less throws up his hands and ends both the chapter and the analytical portion of the book a paragraph later.
Would I be able to make sense of those results? thought I to myself as I read. Nope, I couldn’t. Better luck next time.

I then read through the first four chapters, gathered together as Part I: Foundation (“Influence” ended Part II: Analysis). OK, I’ll go along with most of that, but... I skipped over Chapter 5, “Metadata” and dug into Chapter 6, “Style”. Hmmm, thought I to myself, if you recast the analysis in terms of cultural evolution, you might be able to frame an argument for the autonomous aesthetic realm, though Jockers frames the discussion as constraints on the author. And when I went back to the “Metadata” chapter, wouldn’t you know it, I saw another opening for an evolutionary formulation.

And that’s about where I am now. I’ve read the short coda, “Orphans”, where Jockers expresses ambivalence about cultural evolution, and I’ve got two substantive chapters to go, “Nationality” (ch. 7) and “Theme” (ch. 8). But I really need to get blogging.

As the title suggests, this post is preliminary. I’m not going to say much about Jockers’ specific arguments. Rather, I want to do a bit of framing.

The Scope of the Humanities

One can hardly imagine two such different examples of contemporary humanistic thought as Macroanalysis and last week’s book, Tim Morton’s Hyperobjects. Morton is working within an Anglophone Continental discourse with roots in Hegel, Heidegger, and post-structuralist philosophy and Theory while Jockers’ methodology is grounded in humanistic computing, corpus linguistics, and social science. If you were to cross-match their bibliographies, you wouldn’t find many texts in common. Further, while Morton is trained as a literary critic, and has a lit crit job at Rice, Hyperobjects is not literary criticism. It’s philosophy and cultural criticism. Jockers is all literature.

Such is the contemporary scope of the humanities.