NEW SAVANNA: Toward a Theory of the Corpus [#DH]

Monday, December 31, 2018

I've just posted a new working paper. Title above, abstract, table of contents, introduction and section introductions are below. Download from:

Academia.edu: https://www.academia.edu/38066424/Toward_a_Theory_of_the_Corpus_Toward_a_Theory_of_the_Corpus_-by.
SSRN: https://ssrn.com/abstract=3308601.

Abstract: Recent corpus techniques ask literary analysts to bracket the interpretation of meaning so that we may trace the motions of mind. These techniques allow us to think of the mind as being, in some aspect, a high-dimensional space of verbal meanings. Texts then become paths through such a space. The overarching argument is that by thinking of texts as just ordered collections of physical symbols that are meaningless in themselves we can examine those collections in ways that allow us to recover the motions of mind as it constructs meanings for itself. When we examine a corpus over historical time we can see the evolution of mind. The corpus thus becomes an arena in which we investigate the movements of mind at various scales.

Contents

Meaning, text, and mind: Notes toward a theory of the corpus 2
PART 1: MAPPING A NEW ONTOLOGY OF THE TEXT 4
1. Can you learn anything worthwhile about a text if you treat it, not as a TEXT, but as a string of marks on pages? 6
2. Computational linguistics & NLP: What’s in a corpus? – MT vs. topic analysis 13
3. Why computational critics need to know about constitutive computational semantics 18
PART 2: VIRTUAL READING: PATHS THROUGH THE MIND AND THE MIND OVER HISTORICAL TIME 21
4. Augustine’s Path, A note on virtual reading 23
5. Mapping the pathways of the mind 29
6. Inferring the direction of the historical process underlying a corpus 39

Meaning, text, and mind: Notes toward a theory of the corpus

The set of observations I’ve collected in this working paper has two sources; it was spun out to scratch two conceptual itches. One is my long-standing interest in literary form. The other is the opposition or tension between meaning and, well, computation that has been dogging computational criticism for, I don’t know, a decade. Even computational critics who otherwise refuse to take that opposition as a criticism nonetheless tend to treat their mathematical models as scaffolding to support, or as gadgets for detecting, what really interests them. In the end I spent more time scratching that second itch than the first.

Computational critics have an opportunity to map the human mind that is qualitatively different from what interpretive critics accomplish by uncovering meanings ‘hidden’ in literary texts. But to avail themselves of this opportunity computational critics must understand the broad disciplinary framework in which “meaning” is opposed to “distant reading”. It is not simply that these are two different phenomena, or that “distant reading” is not intended to replace or supplant the explication of “meaning”, but that yoking them together in that opposition makes no more sense than opposing “salt” to “NaCl”.

The first three sections – Part 1: Mapping a new ontology of the text – deal with that kind of conceptual difference. The last three sections – Part 2: Virtual reading: Paths through the mind and the mind over historical time – are about those new conceptual possibilities. I’ve provided some introductory material for both parts that is intended to help stitch these various arguments together. The overarching argument is that by thinking of texts as just ordered collections of physical symbols that are meaningless in themselves we can examine those collections in ways that allow us to recover the motions of mind as it constructs meanings for itself. We bracket the interpretation of meaning so that we may trace the motions of mind.

* * * * *

Part 1: Mapping a new ontology of the text – My overall objective here is to outline a way of thinking about language and texts that is centered on form and mechanism (linguistics) rather than meaning (literary criticism).

1. Can you learn anything worthwhile about a text if you treat it, not as a TEXT, but as a string of marks on pages? – Conventional literary criticism talks a lot about the text, but has no coherent conception of it. That is because it is focused on meaning and meaning doesn’t exist in the marks on pages, the physical text. Corpus techniques, topic modeling for example, have nothing but those marks and yet manage to reconstitute something that looks like meaning (but really isn’t, not quite). How is that possible? Moreover, by focusing on certain kinds of patterns in those marks, we can uncover formal structure in texts, structure that is otherwise invisible to conventional criticism, which also talks a lot about form without offering a coherent account of it.

2. Computational linguistics & NLP: What’s in a corpus? – MT vs. topic analysis – Corpora play very different roles in topic modeling and in machine translation. In topic modeling a corpus is the object of investigation while in machine translation a corpus is used to build a tool which then, in turn, does the translation. In MT the corpus allows us to create what Martin Kay calls an “ignorance model”. We would really like to be able to use a robust account of natural language semantics in MT; alas, we don’t have such a model (ignorance), so we use corpus techniques to construct a very crude approximation of semantics.

3. Why computational critics need to know about constitutive computational semantics – Simple, you need to know the lay of the land. That can be expressed in four contrasts: 1) close reading vs. distant reading, 2) meaning vs. semantics, 3) statistical semantics vs. computational semantics, and 4) corpus as tool vs. corpus as object. More often than not, corpus as tool is a substitute for constitutive computational semantics.

Part 2: Virtual reading: Paths through the mind and the mind over historical time – Assuming that we can think of the mind as, in some aspect, a high-dimensional network of verbal meanings, we can use statistical techniques to reveal the paths different texts trace through the mind and, beyond that, follow the mind as it evolves over historical time.

4. Augustine’s Path, A note on virtual reading – If we think of the mind as a high-dimensional space that can be approximated by statistical techniques, including those used in analyzing texts, then we can see Andrew Piper’s statistical analysis of conversion texts, chiefly Augustine’s Confessions, as an analysis of mental structure. The statistical structure uncovered in the location of the 13 books of the Confessions can thus be reinterpreted as a pathway in the mind, of Augustine, but also of his readers. What are these different mental regions that are traversed in just this way?

5. Mapping the pathways of the mind – Michael Gavin uses vector semantics to examine a passage from Paradise Lost. After arguing that a word-space model is, after all, a model of the mind, I suggest that vector semantics could be used to map paths through the mind. I illustrate this conjecture by drawing a path for the Milton passage by picking words that had been brought to my attention by Gavin’s analysis. There’s no reason why such a path couldn’t be traced computationally.

6. Inferring the direction of the historical process underlying a corpus – Matthew Jockers’ final study in Macroanalysis (2013) attempted to investigate influence in a corpus of 3300 19th century novels. I argue that what he in fact discovered is that the socio-cultural process that created those novels is inherently directional. Without intending to do so, Jockers had in effect operationalized the 19th century idealist notion of Spirit and provided a way of thinking about “an autonomous aesthetic realm” (in a phrase from Edward Said).

* * * * *

PART 1: MAPPING A NEW ONTOLOGY OF THE TEXT

My objective in these three sections is to distinguish between two ways of thinking about language, one centered on meaning and favored by literary critics and one centered on form and mechanism and favored by linguists. Linguists and critics may agree on some level about the phenomenon ‘out there’ in the ‘real’ world that they’re talking about, but their conceptualizations are different. This is a matter of conceptual ontology, as it came to be understood in the study of knowledge representation back in the 70s and 80s. There is the common sense view of language, which we all more or less share (within a broadly given culture). And then there are the various specialized views held by different intellectual disciplines.

My standard example of this phenomenon is salt vs. sodium chloride. At some point in middle school, I believe, we learn that salt is the common name for sodium chloride. Which is true, almost. How is salt defined? I’m not so much interested in the dictionary definition as I am in the sensory characteristics that allow us to identify it. We know it primarily through its taste and secondarily through its appearance, as a white granular substance. Refined sugar is also a white granular substance and looks quite like salt, but tastes rather different – something deeply etched in my memory when, as a child, I decided to make myself a treat by slathering a piece of buttered bread with sugar. But I picked the wrong white granular substance.

Sodium chloride is defined quite differently. Oh, the dictionary may tell you it’s “common salt”, mine does. But I’m interested in the definition saying it “is an ionic compound with the chemical formula NaCl, representing a 1:1 ratio of sodium and chloride ions” (Wikipedia). “Ions”, “ionic compound”? What are they? Those concepts didn’t really exist until the 19th century. NaCl did, but not the concept “NaCl”.

To borrow a term from logic, NaCl/sodium chloride and salt are intensionally different. They also have slightly different extensions. NaCl is by definition a pure substance. But salt is almost never pure NaCl; it always contains impurities of some kind.

So it is with the concepts that different intellectual specialists use to talk about language and texts. That, in part, is what’s behind many of my remarks here. It seems to me that computational literary critics nonetheless come trailing conceptions from literary criticism – & how could they not? – which inhibit their thinking about what these new computational tools allow them to do. And the problem starts with such concepts as “word”, and “text”, and perhaps even “meaning” or perhaps “semantics”. Words, the commonsense notion, have pronunciations, spellings, syntactic affordances (parts of speech), various forms, and so forth. All of these are bundled into that one concept. Linguists unbundle them and have separate terms for various facets of the (commonsense) word.

Thus linguists do not use “word” as a technical concept and instead use various terms such as lexeme, morpheme, lemma, even sign, signifier, or signified, none of which is synonymous with the commonsense notion of a word. I hesitate to say that literary critics use the commonsense notion of word as a technical term, but for the most part the literary critic’s usage is more or less the commonsense term, sometimes decked out as a sign or signifier.

The commonsense notion, word, thus exists in one conceptual ontology while the various linguistic notions, lexeme, etc., exist in a different conceptual ontology. I’m not quite sure what to do about text. Texts are physical objects. For linguists they are strings of, of what? Lexemes, phonemes, characters, what? It doesn’t matter, they’re strings of items for which linguists have terms and concepts. For literary critics texts are, well, they’re individual literary works and they’re physical objects – neither of which is problematic for linguists. But they’re also repositories of whatever must be taken into consideration when ascertaining the meaning of, well, a text. Just what that is depends on one’s critical method. There is in fact considerable discussion and not a lot of agreement about just what a text is.

And then there’s meaning, which became central to literary criticism after World War II and remains so. It’s secondary to linguistics and, again, linguists are more inclined to talk about semantics, which, I believe, is ontologically different from meaning. Meaning is, well, it’s meaning, that’s what it is. And literary critical accounts of it range all over the map, again, without any profession-wide consensus. Semantics is more specific...but, this is getting a bit long and rambling.

You get the point. Literary critics and (computational) linguists inhabit different conceptual universes. My sense is that the digital critics who work with various techniques from corpus linguistics are, in the end, fundamentally literary critics, with a literary critic’s sense of text and meaning. Their corpus work is thus this elaborate machinery bolted onto the side of their literary critical machine. They’re thinking in two different worlds and don’t (quite) realize it.

In the first section – Can you learn anything worthwhile about a text if you treat it, not as a TEXT, but as a string of marks on pages? – I concentrate on the ontological aspect. In this context corpus linguistics, topic modeling in particular, is used as an example of what you can do when thinking about the text simply as a bunch of signifiers disconnected from their signifieds and thus from meaning. My thinking is oriented toward my recurring hobby-horse, form, ring-composition in particular.

In the next section – Computational linguistics & NLP: What’s in a corpus? – MT vs. topic analysis – I introduce some remarks by Martin Kay, one of the grand old men of computational linguistics. He started at the beginning, in the world of symbolic computation for machine translation, and has remained active into the somewhat different world of statistical processing. He points out that the use of statistical methods amounts to a confession of ignorance about the connection between language and the world. When I rework this material in a more formal way I need to make the correlative point that such processing reveals the relational structure of the human mind (absent its connections with the world).

The third section – Why computational critics need to know about constitutive computational semantics – is an argument for learning about old-school computational semantics even if you are not going to be using it. The argument needs to be pushed just a bit farther, but it amounts to this: old-school computational semantics, at its fullest, is about the connection between language and the world. It is about how the mind constructs those relationships through perceptual and cognitive operations. We are far from having these matters sorted out, but they are no longer inherently mysterious. The mystery has been replaced with a tangled web of puzzles.

Now we’re set up for Part 2: Virtual Reading. Here we see how, by exploiting the conceptual possibilities of this new conceptual ontology, we can begin to think about the human mind in a different way, and even think about how the mind changes over historical time.

* * * * *

That is then followed by the three sections in the first part of the paper. Then we have the introduction to the second part:

PART 2: VIRTUAL READING: PATHS THROUGH THE MIND AND THE MIND OVER HISTORICAL TIME

What is implicit in these three sections is the idea that we think of the mind as a high-dimensional space of word meanings. That idea is explicit in each of the pieces, though different mathematical models are used. The researchers who use these models do not, however, think of them that way, at least not so far as I can tell. They think of the models as being about texts, and thus about the meanings of words in texts. But how are these texts created? Where do the words come from? Surely the words are placed in texts by minds somehow engaged with the world, whether the real world or some fiction is secondary. If those high-dimensional spaces are about texts, they are also about the minds that created the texts.

Now, that is a strange thing, very strange, to think of the mind as a high-dimensional space of verbal meanings. If you wish, think of it as a metaphor, one we can operationalize through the use of mathematical models. And we don’t have to think of it as a metaphor that somehow exhausts the mind. I certainly don’t. There is more to the mind than words; there are perceptions, motivations, and feelings. We’re dealing with an idealization, a conception of the mind, not the mind itself.

In sections 4 and 5 – Augustine’s Path, A note on virtual reading and Mapping the pathways of the mind – I develop the idea of virtual reading, of using computational techniques to reveal the paths that texts trace through this high-dimensional space. But what, you might ask, is the meaning of those paths? What are these different regions they traverse, if indeed that is what they do? To answer those questions we need to interpret those paths. For that we’ll need some version of so-called “close reading”. Just what version, that is by no means obvious to me. But the job is one that must be done text by text. And it needs to be in touch with the newer psychologies – cognitive, neuro-, evolutionary – and not just the older categories of interpretive criticism. I rather hope and expect that some version of psychoanalytic ideas will prove fruitful.
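To make the notion of a path a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the four words, their three-dimensional vectors, and the function names are invented, and none of it comes from Piper's or Gavin's models. A real word space would be estimated from a corpus and would have hundreds of dimensions. The sketch treats a passage as a sequence of points and measures how far each step moves through the space.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# A real word-space model would be estimated from a corpus.
TOY_SPACE = {
    "light": (0.9, 0.1, 0.0),
    "heaven": (0.8, 0.3, 0.1),
    "fall": (0.1, 0.9, 0.2),
    "darkness": (0.0, 0.8, 0.6),
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def trace_path(words, space):
    """Map a word sequence to the points it visits, skipping unknown words."""
    return [(w, space[w]) for w in words if w in space]


def step_lengths(path):
    """Distance covered between each pair of successive points on the path."""
    return [cosine_distance(a[1], b[1]) for a, b in zip(path, path[1:])]


# A made-up word sequence standing in for a passage; "the" is not in the
# toy space and is therefore skipped.
passage = ["light", "heaven", "the", "fall", "darkness"]
path = trace_path(passage, TOY_SPACE)
steps = step_lengths(path)

for (w1, _), (w2, _), d in zip(path, path[1:], steps):
    print(f"{w1} -> {w2}: {d:.3f}")
```

In this toy space the step from "heaven" to "fall" is much longer than the steps on either side of it, which is the kind of feature one might take as a signal that the path has crossed from one region to another. What such a region is, and what the crossing means, is exactly the interpretive question raised above.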

Section 6 – Inferring the direction of the historical process underlying a corpus – is about the final study in Matthew Jockers’ 2013 Macroanalysis. Jockers was interested in influence. Whatever he may have discovered about that, since September of 2014 I have been arguing that he had uncovered something very interesting, and potentially quite profound, about cultural evolution. This is yet another version of that argument.

My point is simple. Jockers is working with a corpus of 3300 nineteenth century Anglophone novels. He showed that they are ordered among themselves according to similarity, and that this ordering is also temporal, from the beginning of the century through the end. That ordering must somehow reflect, be a product of, the process(es) which produced these novels. What else is there? Nothing. That process is one in which a large and changing population reflects on its experience. Through that dynamic process minds change in ways revealed in the texts which are the expressive vehicle of that process.

Is that so different from what we learn through conventional literary history, and even the New Historicism? The terms are a bit different, certainly the mode of reasoning is. But aren’t these various versions of literary history about the change of mind over historical time? If not that, then what? Changes in texts? Well, yes, sure. But who creates the texts, who reads them? Don’t those people have minds? And don’t those texts reflect those minds?

From texts to minds, from minds to texts. Isn’t that how we’re reasoning? But now we’ve got new methods. Let’s use them.
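The observation about Jockers's corpus, that an ordering by similarity turns out to be an ordering in time, can be illustrated with a toy check in Python. This is not Jockers's procedure. The four "novels", their dates, their two-dimensional feature vectors, and the projection trick used to order them are all assumptions invented for this sketch. The point is only the logic: derive an ordering from similarity alone, without looking at the dates, and then ask how well it agrees with the dates.

```python
import math
from itertools import combinations

# Hypothetical records: name -> (publication year, feature vector).
# All values are invented for illustration.
TOY_CORPUS = {
    "novel_a": (1805, (0.9, 0.1)),
    "novel_b": (1830, (0.7, 0.3)),
    "novel_c": (1860, (0.4, 0.6)),
    "novel_d": (1895, (0.1, 0.9)),
}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity_ordering(vectors):
    """Order items along the axis joining the two most dissimilar items.

    No dates are used here; only the feature vectors.
    """
    names = sorted(vectors)
    start, end = max(
        combinations(names, 2),
        key=lambda pair: euclidean(vectors[pair[0]], vectors[pair[1]]),
    )
    axis = [e - s for s, e in zip(vectors[start], vectors[end])]

    def position(name):
        # Projection of the item onto the start-to-end axis.
        return sum(
            (x - s) * a
            for x, s, a in zip(vectors[name], vectors[start], axis)
        )

    return sorted(names, key=position)


def spearman(xs, ys):
    """Spearman rank correlation, assuming no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, i in enumerate(order):
            result[i] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


vectors = {name: vec for name, (_, vec) in TOY_CORPUS.items()}
ordering = similarity_ordering(vectors)

names = sorted(TOY_CORPUS)
positions = [ordering.index(name) for name in names]
years = [TOY_CORPUS[name][0] for name in names]
rho = spearman(positions, years)

print(ordering)
print(rho)
```

Which end of the axis counts as the start is arbitrary, so in general it is the absolute value of the correlation that matters. A value near 1 says that the similarity ordering and the temporal ordering largely coincide, which is the pattern at issue here. The toy data were of course built to show that pattern, so the sketch demonstrates the check and nothing about any real corpus.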

* * * * *

That is then followed by the last three sections of the working paper.