Monday, December 9, 2019

Can you learn anything worthwhile about a text if you treat it, not as a TEXT, but as a string of marks on pages? [#DH]

With Matt Jockers' 3300-node graph on my mind, this post deserves a bump to the top of the queue.
The Chronicle of Higher Education just published a drive-by take-down of the digital humanities. It was by one Timothy Brennan, who didn’t know what he was talking about, didn’t know that he didn’t know, and, more likely than not, didn’t care.
Timothy Brennan, The Digital-Humanities Bust, The Chronicle of Higher Education, October 15, 2017, http://www.chronicle.com/article/The-Digital-Humanities-Bust/241424
Subsequently there was a relatively brief tweet storm in the DH twittersphere in which one Michael Gavin observed that Brennan seemed genuinely confused.


“Lexical patterns”, what are they? The purpose of this post is to explicate my response to Gavin.

The Text is not the (physical) text

While literary critics sometimes use “the text” to refer to a physical book, or to alphanumeric markings on the pages in such a book, they generally have something vaguer and more expansive in mind. Here is a passage from a well-known, I won’t say “text”, article by Roland Barthes [1]:
1. The text must not be understood as a computable object. It would be futile to attempt a material separation of works from texts. In particular, we must not permit ourselves to say: the work is classical, the text is avant-garde; there is no question of establishing a trophy in modernity's name and declaring certain literary productions in and out by reason of their chronological situation: there can be “Text” in a very old work, and many products of contemporary literature are not texts at all. The difference is as follows: the work is a fragment of substance, it occupies a portion of the spaces of books (for example, in a library). The Text is a methodological field. The opposition may recall (though not reproduce term for term) a distinction proposed by Lacan: “reality” is shown [se montre], the “real” is proved [se démontre]; in the same way, the work is seen (in bookstores, in card catalogues, on examination syllabuses), the text is demonstrated, is spoken according to certain rules (or against certain rules); the work is held in the hand, the text is held in language: it exists only when caught up in a discourse (or rather it is Text for the very reason that it knows itself to be so); the Text is not the decomposition of the work, it is the work which is the Text's imaginary tail. Or again: the Text is experienced only in an activity, in a production. It follows that the Text cannot stop (for example, at a library shelf); its constitutive moment is traversal (notably, it can traverse the work, several works).  
And that is just the first of seven propositions in that well-known text article, which has attained, shall we say, the status of a classic.

I have no intention of offering extended commentary on this passage. I will note, however, that Barthes obviously knows that there’s an important difference between the physical object and what he’s calling the text. Every critic knows that. We are not dumb, but we do have work to do.

Secondly, perhaps the central concept is in that italicized assertion: “the Text is experienced only in an activity, in a production.”

Finally, I note that that first sentence has also been translated as: “The Text must not be thought of as a defined object” [2]. Not being a reader of French, much less a French speaker, I don’t know which translation is truer to the original. It is quite possible that they are equally true and false at the same time. But “computable object” has more resonance in this particular context.

Now, just to flesh things out a bit, let us consider a more recent passage, one that is more didactic. This is from the introduction Rita Copeland and Frances Ferguson prepared for five essays from the 2012 English Institute devoted to the text [3]:
Yet with the conceptual breadth that has come to characterize notions of text and textuality, literary criticism has found itself at a confluence of disciplines, including linguistics, anthropology, history, politics, and law. Thus, for example, notions of cultural text and social text have placed literary study in productive dialogue with fields in the social sciences. Moreover, text has come to stand for different and often contradictory things: linguistic data for philology; the unfolding “real time” of interaction for sociolinguistics; the problems of copy-text and markup in editorial theory; the objectified written work (“verbal icon”) for New Criticism; in some versions of poststructuralism the horizons of language that overcome the closure of the work; in theater studies the other of performance, ambiguously artifact and event. “Text” has been the subject of venerable traditions of scholarship centered on the establishment and critique of scriptural authority as well as the classical heritage. In the modern world it figures anew in the regulation of intellectual property. Has text become, or was it always, an ideal, immaterial object, a conceptual site for the investigation of knowledge, ownership and propriety, or authority? If so, what then is, or ever was, a “material” text? What institutions, linguistic procedures, commentary forms, and interpretive protocols stabilize text as an object of study? [p. 417]
“Linguistic data” and “copy-text” sound like the physical text itself; the rest, not so much.

If literary critics were to confine themselves to discussing the physical text, what would we say? Those engaged in book studies and editorial projects would have more to say than most, but even they would find such rigor to be intolerably confining. The physical signs on the page, or the vibrations in the air, exist and come alive in a vast and complicated network of ... well, just exactly what? Relationships among people to be sure, but also relationships between sights and sounds and ideas and movements and feelings and a whole bunch of stuff mediated by the nervous systems of all those people interacting with one another.

It’s that vast network of people and neuro-mental stuff that we’re trying to understand when we explicate literary and cultural Texts. As we lack really good accounts of all that stuff, literary critics have felt that we had little choice but to adopt this more capacious conception, albeit at the expense of definition and precision. Anyhow, the people trying to figure out those systems, aren’t they scientists? And aren’t we, as humanists, skeptical about science?

And then along came the computer.

An ontological gulf

The thing about computational criticism is that computers don’t have all that other stuff – a complex fluctuating network of interactions among people both containing and embedded in a vast and turbulent meshwork of flashing neurons (one of Tim Morton’s hyperobjects?) – available to them. The computer deals only with those dumb marks on the page, or rather, with digital representations of them. There is only the physical text, that dull thing that other literary critics pass over in favor of The Text.

Thus there is an ontological gulf between computational criticism and the many varieties of – I hate to say it – conventional literary criticism. Here I mean ontological in the sense it has come to have in the cognitive sciences, a sense where a more conventional humanist might talk of different discourses (Foucault) or paradigms (Kuhn). In this sense an ontology is an inventory of concepts. Salt, as ordinarily understood, and sodium chloride exist in different ontologies in this sense [4]. Humans have known about salt since forever, and apprehend it through its taste, texture and color; even animals know about it. Sodium chloride is quite different conceptually, though physically, yes, it is salt. Conceptually it is defined in terms of bonds between atoms which themselves consist of electrons, protons, and neutrons, a set of concepts developed in Western science in the 18th and 19th centuries.

When it comes to the concept of text, the conventional critic and the computational critic operate in different conceptual worlds, with different ontologies. Yet there are subtleties.

More likely than not the computational critic has also (perhaps first) been trained as a conventional critic. The computational critic thus understands the senses of text I laid out in the previous section. But their work with computational criticism immerses them in a world, an ontology, where the text is just the marks on the page. This is not merely a matter of understanding definitions and concepts, but of the practical work of manipulating bodies of texts and analyzing them.

None of that text manipulation is real to the conventional critic. Sure, they understand that there is this physical text, they may well know that it consists of lexemes, too, and be quite willing to agree that, sure, there are patterns of lexemes there. But this latter knowledge is not very deep and flexible. It’s not supported by extensive computational practice. When the computational critic tells them that the context in which a word is used tells you something about the word’s meaning, and that words occurring in similar contexts must have something in common, they’re likely to reply: “So? We’ve known that for a long time. That’s trivial.” Well, yes, trivial. But it has consequences that are not so trivial, not when you have massive computational power available on your laptop computer. Those consequences are utterly baffling to the conventional critic.
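To make that trivial-but-consequential idea a bit more concrete, here is a minimal Python sketch of the distributional intuition and nothing more: count which words occur within a small window of one another, then compare the resulting context profiles. The three sentences, the window size, and the cosine measure are arbitrary illustrative choices, not anyone's actual research pipeline.

```python
# A toy illustration of the distributional idea: words that occur in
# similar contexts end up with similar co-occurrence profiles.
from collections import Counter, defaultdict
import math

corpus = [
    "the king ruled the old kingdom",
    "the queen ruled the old kingdom",
    "the sailor crossed the cold sea",
]

window = 2  # how many neighboring words count as "context"
contexts = defaultdict(Counter)

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                contexts[w][words[j]] += 1

def cosine(a, b):
    """Cosine similarity between two context profiles."""
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    denom = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / denom if denom else 0.0

# "king" and "queen" share their whole context profile; "king" and "sailor"
# overlap only on the function word "the".
print(cosine(contexts["king"], contexts["queen"]))   # highest
print(cosine(contexts["king"], contexts["sailor"]))  # lower
```

On three invented sentences this is a parlor trick; run the same arithmetic over millions of sentences on that laptop and the consequences stop being trivial.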

And it’s not clear to me how successful computational critics have been at explaining such things. Nor, for that matter, is it at all clear to me how far such explanations can, in principle, go. Simplification, metaphor, and patience will take you only so far.

Topic modeling

Let me offer an illustration of the problem: topic modeling.

I should first say that I don’t really understand the mathematics behind most of what’s done in corpus linguistics (and hence, computational criticism), including topic modeling. I don’t understand the programming techniques, either, but that’s not fundamental. The math is.

I had a standard mid-20th century secondary school math education: geometry, algebra, and trigonometry. I satisfied my college math requirement with a rigorous one-semester course in symbolic logic, which, given my interests, was a perfectly reasonable thing to do. I also had the barest of introductions to statistics in a sociology course.

That’s it.

But over the years I’ve managed to develop fairly sophisticated mathematical intuitions by 1) reading many technical explanations written for the technically naive, 2) reading a great deal of technical literature, mostly in the cognitive sciences, and 3) working closely with people who have technical skills that I lack. That trivial, but key, insight behind much of corpus linguistics (including topic modeling), the one about words in context? I’ve known and thought about it for decades.

So, when I encountered that idea in “topic analysis for dummies” articles I was on board immediately. I read three such articles, each by a good computational critic, but felt that something was missing. I was curious about one thing; just what that was I didn’t know. How could I? It was missing. So I decided to bite the bullet and look at a technical review article by David Blei [5]. There it was, at the top of page 81:
One assumption that LDA [Latent Dirichlet Allocation] makes is the “bag of words” assumption, that the order of the words in the document does not matter. [...] While this assumption is unrealistic, it is reasonable if our only goal is to uncover the coarse semantic structure of the texts.
That’s all I needed, that phrase, “bag of words”. I’d suspected something like that – the descriptions I was reading didn’t make any sense otherwise – but I wasn’t sure. I wanted confirmation.

Just to be sure we’re on the same page about this, here’s how I explained the concept in a working paper [6]:
What does that mean? Imagine that some document, any document—a poem by Denise Levertov, a play by Beaumarchais, a technical article by Richard Feynman, a novel by George Eliot, whatever—is printed out on single sides of paper sheets. Slice the sheets into thin strips each containing a single line of print; cut those strips into individual words like so many pieces of confetti; and gather all the individual snippets together and place them into a bag. THAT’s a bag of words. 
The point is that the bag of words has lost all the structure that made those many words into a coherent text. Whatever it is that LDA is doing, it is not “reading” texts in any meaningful sense of the word. It knows nothing about syntax, nothing about semantics, nothing about discourse, and little about spelling. All it can do at the bag level, that is, at the level of individual texts, is recognize whether or not two snippets of paper contain the same set of characters (that is, a word) and count the number of snippets containing a given word. That’s all that is relevant for basic topic modeling, the list of words in each document and the number of times each word occurs in the document. 
How then, can LDA possibly arrive at even remotely sensible topics for a set of documents? Obviously, it’s not doing it on a document-by-document basis. It doesn’t inspect a document, list the topics in that document, inspect another document, list the topics in it, and so forth. It’s doing something quite different.
That’s about as far as I’m going to go with this. The crucial point, though, is this: Once you’ve lost all the structure in individual documents by turning them into bags of words, you can recover some, only some, of that structure by doing a massive comparison across all the documents. That’s what you need the computing power for, to undertake that comparison. What is it about language that allows some of that lost structure to emerge in the process of such a comparison? The rest, as they say, is an exercise for the reader, though you’re welcome to read what I have to say in that working paper.
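For readers who want to see the two steps side by side – the confetti-and-counting and the massive cross-document comparison – here is a hedged sketch using scikit-learn, one common toolkit among several; I make no claim that this is how any particular critic works. CountVectorizer reduces each of four invented “documents” to a bag of word counts, and LatentDirichletAllocation then looks for count patterns that recur across the whole collection.

```python
# A sketch of topic modeling over a tiny invented corpus with scikit-learn.
# Step 1: each document becomes a bag of words (order discarded, counts kept).
# Step 2: LDA compares those bags across ALL documents at once.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the whale and the sea and the ship and the harpoon",
    "the ship sailed the sea toward the white whale",
    "the garden party and the tea and the quiet drawing room",
    "tea in the drawing room after the garden walk",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)   # documents-by-words count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the most heavily weighted words in each inferred topic.
words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", ", ".join(words[i] for i in top))
```

On a corpus this tiny the topics will be noisy; the point is only the shape of the computation, a count matrix built per document and a model fit across all of them.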

I have no idea how much of that underlying math is understood by most computational critics. Some of them surely know it well, some little or not at all. Even those who don’t know the math have the practical experience of working with topic analysis.

But the conventional critic likely has neither the math nor the practical experience. Thus they have no way of understanding topic analysis (if they’re even willing to attempt such understanding) as anything other than magic. If, however, they’re skeptical about mathematics and computing, then they won’t bother to try. They’ll assume the magic is black magic and so is best shunned.

Recovering the material text: for form, in description

Let us shift our attention from computational criticism. I have spent a lot of time thinking about it, and believe it is of fundamental importance. I certainly want to see the work continue.

But it is not the kind of work that I do myself. I like to work with individual texts, most recently films rather than verbal texts. I’m interested in analyzing them and describing their formal features. That requires that I attend to the material text in a way that is uncharacteristic of conventional literary criticism. Thus I have developed an interest in ring composition [7], a topic that dates back to the 1950s, but that has been all but neglected in recent decades, and even subjected to deconstructive ridicule [8].

What is ring composition? Texts structured roughly like this:

A, B, C, ... X ... C’, B’, A’

The first item in the string is ‘answered’ by the last item, the second item is answered by the penultimate item, and so forth. By string I mean, of course, the text, the text considered as an arrangement of signs. The term, “string”, is common in linguistics and computational linguistics (and, for that matter, in computer science generally). That, a string, is the physical form of the text – films as well.
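Stated as a fact about strings, the ring hypothesis becomes mechanically checkable, at least once the analyst has labeled the episodes. Here is a toy Python sketch, with hypothetical labels and the convention that X' “answers” X; it illustrates the string idea, it is not my working method.

```python
# A toy check for ring composition: does the first episode answer the last,
# the second the penultimate, and so on, around a central turning point?
# The labels are supplied by the analyst; the sequences here are made up.
def is_ring(episodes):
    """True if the sequence reads A B C ... X ... C' B' A'."""
    n = len(episodes)
    for i in range(n // 2):
        if episodes[n - 1 - i] != episodes[i] + "'":  # convention: X' answers X
            return False
    return True

print(is_ring(["A", "B", "C", "X", "C'", "B'", "A'"]))  # True
print(is_ring(["A", "B", "C", "X", "B'", "C'", "A'"]))  # False
```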

Of course, in order to determine whether or not items are appropriately arranged I have to identify those items. But that generally doesn’t require much interpretive ingenuity. Thus, the 1933 film King Kong begins in New York and ends there. That’s simple enough to begin the further examination required to determine whether or not it is, in fact, a ring-form text, which I believe it is [9].

A discipline that, however, has all but discarded the physical text in favor of that more capacious and nebulous entity characterized by Roland Barthes and others, that discipline is not going to pay attention to such descriptive matters. And yet we now have critics talking about “surface reading” and description. But I see little interest in the physical text.

Representations recently devoted a whole issue to description (Summer 2016). There were no papers by linguists, who, after all, know a great deal about describing language – and literature is made of language, no? Nor were there any papers by narratologists or poeticists. Why not something by Haj Ross, who was trained by both Chomsky (in linguistics) and Jakobson (in poetics) and who has done quite a bit of descriptive work on poetry? Why not a paper from a computational critic, someone who’s necessarily neck-deep in descriptions?

I don’t know. My guess is that the organizers are so thoroughly steeped in the discipline’s mythology of The Text that it didn’t occur to them to seek out such papers. Nor, so far as I can tell – I haven’t been able to read any of the papers other than the introduction [10] – do the papers report much about or give many examples of practical experience in the analysis of literary texts. At the moment the project of describing literary texts seems mostly to be a theoretical possibility. So I ask: if you’re not going to make the physical text central to your work, just what are you going to do [11]?

Perhaps as the discipline of literary criticism attempts to understand computational criticism, it will come to think seriously about the mere text, the physical symbols on the page. Perhaps then it will be able to see that form is an actual describable thing, not merely a metaphysical premise. If not, the move to description will be stillborn.

References

[1] Roland Barthes, “From Work to Text”, in The Rustle of Language, trans. Richard Howard, Hill and Wang, 1986. Originally published as “De l’œuvre au texte”, Revue d’esthétique 3, 1971.


[2] Rptd. in Josue V. Harari, ed., Textual Strategies: Perspectives in Poststructuralist Criticism (Ithaca, NY: Cornell UP, 1979), 73-81.

[3] Rita Copeland and Frances Ferguson, “Introduction”, ELH, Volume 81, Number 2, Summer 2014, pp. 417-422.

[4] I address this specific example in “Ontology of Common Sense”, in Hans Burkhardt and Barry Smith, eds., Handbook of Metaphysics and Ontology, Muenchen: Philosophia Verlag GmbH, 1991, pp. 159-161. https://www.academia.edu/28723042/Ontology_of_Common_Sense

For a more general and more technical discussion, see my Ontology in Knowledge Representation for CIM, Center for Manufacturing Productivity and Technology Transfer, Rensselaer Polytechnic Institute.  Report No. CIMNW85TR034, January 1985. https://www.academia.edu/19804747/Ontology_in_Knowledge_Representation_for_CIM

[5] David Blei, Probabilistic topic models (Communications of the ACM, 55(4): 77–84, 2012), http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf

[6] Corpus Linguistics for the Humanist: Notes of an Old Hand on Encountering New Tech, Working Paper, July 2013, 21 pp., https://www.academia.edu/4066521/Corpus_Linguistics_for_the_Humanist_Notes_of_an_Old_Hand_on_Encountering_New_Tech

[7] See, for example, Mary Douglas, Thinking in Circles: An Essay on Ring Composition, Yale University Press, 2007. For my work, see, for example, Ring Composition: Some Notes on a Particular Literary Morphology, Working Paper, September 11, 2017, 71 pp., https://www.academia.edu/8529105/Ring_Composition_Some_Notes_on_a_Particular_Literary_Morphology

[8] James J. Paxson, Revisiting the deconstruction of narratology: master tropes of narrative embedding and symmetry. Style, Vol. 35, No. 1 Spring 2001, 126-150.

[9] I have two recent posts about King Kong: “Beauty and the Beast: King Kong as ring composition, plus myth logic”, New Savanna, blog post, accessed October 18, 2017, https://new-savanna.blogspot.com/2017/10/beauty-and-beast-king-kong-as-ring.html

“Comparative Rings: To the grocer’s, King Kong, Heart of Darkness”, New Savanna, blog post, accessed October 18, 2017, https://new-savanna.blogspot.com/2017/10/comparative-rings-to-grocers-king-kong.html

[10] Sharon Marcus, Heather Love, and Stephen Best, Building a Better Description, Representations 135. Summer 2016. 1-21. DOI: 10.1525/rep.2016.135.1.1.

[11] See my recent post, “Describing structured strings of characters [literary form]”, New Savanna, blog post, accessed October 18, 2017, http://new-savanna.blogspot.com/2017/09/describing-structured-strings-of.html
