Saturday, August 26, 2017

Virtual reading as a path through a multidimensional semantic space [#DH]

In my post, In search of a small world net, I speculated about analyzing Heart of Darkness with some appropriate vector space semantic model so that we could then construct a directed graph of the word types in the text, where the length of each edge would be proportional to the distance between the words in the high-dimensional space constructed by the model. That post was about using this model as a way of investigating the centrality of a critical phrase (“My Intended, my ivory, my station, my river, my—”).
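Here is a minimal sketch of that graph construction, assuming each word type already has a vector from some semantic model. The three-dimensional vectors are made-up toy values, not output from any real model, and `edge_lengths` is a hypothetical helper name. I use Euclidean distance; cosine distance would be another reasonable choice. Since distance is symmetric, the sketch stores one length per unordered pair.

```python
import math
from itertools import combinations

# Toy vectors, made up for illustration; a real model would supply these.
vectors = {
    "intended": [0.9, 0.1, 0.0],
    "ivory":    [0.1, 0.8, 0.2],
    "station":  [0.2, 0.3, 0.9],
    "river":    [0.0, 0.4, 0.7],
}

def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def edge_lengths(vecs):
    """Return {(word1, word2): distance} for every unordered pair of words."""
    return {(u, v): euclidean(vecs[u], vecs[v])
            for u, v in combinations(sorted(vecs), 2)}

edges = edge_lengths(vectors)
# Print the edges from shortest to longest.
for (u, v), d in sorted(edges.items(), key=lambda kv: kv[1]):
    print(f"{u} - {v}: {d:.3f}")
```

With real vectors the interesting question is which words in the emblem phrase turn out to be far apart.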

But I ended the post with a short note on “virtual reading”:
Consider the connected graph for the emblem phrase. It should be easy enough to calculate a central point for the phrase, no? But then, couldn’t we do that for any sentence or phrase? So, start at the beginning of the text and move sequentially through the text with a moving window of suitable length. Calculate the central point for the phrase within the window and trace the movement of successive centers through the text from beginning to end.

Such a “reading” would not, of course, yield the computer anything like an understanding of the text. That’s not why it interests me. I’m interested in the form the trajectory traces through the space. For example, how does it move with respect to the center of the emblem? What about the volume spanned by the subgraph within this moving window? How does it expand and contract? And so forth.
In looking around on my hard drive I came across a 2006 article on more or less just that. In the next section of this post I take a quick look at that paper. Then I return to Heart of Darkness, offer some more general remarks, and conclude with remarks about feasibility and intellectual imagination.
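The moving-window idea in that note can be sketched as follows. This is only an illustration under stated assumptions: the two-dimensional vectors are invented, the names `centroid`, `spread`, and `trajectory` are hypothetical, the "central point" is taken to be the plain mean of the word vectors, and mean distance from that center is used as a crude stand-in for "volume".

```python
import math

# Toy 2-D vectors, made up; a real model would supply one per word type.
vectors = {
    "river":   [0.0, 1.0],
    "ivory":   [1.0, 0.0],
    "station": [1.0, 1.0],
    "dark":    [0.0, 0.0],
}

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def spread(points):
    """Mean distance from the centroid; a crude proxy for 'volume'."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def trajectory(tokens, vecs, window=3, step=1):
    """Return a (centroid, spread) pair for each window of tokens.

    Tokens without a vector are skipped; windows with no known
    tokens are dropped.
    """
    path = []
    for start in range(0, max(len(tokens) - window, 0) + 1, step):
        known = [vecs[t] for t in tokens[start:start + window] if t in vecs]
        if known:
            path.append((centroid(known), spread(known)))
    return path

tokens = "dark river the ivory station river".split()
path = trajectory(tokens, vectors, window=3)
for center, size in path:
    print(center, round(size, 3))
```

The window length and step are free parameters here, which is exactly the open question about how big the window should be.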

It’s gonna’ be another long one.

Hierarchy and dynamical correlations

Here’s that article:
E. Alvarez-Lacalle, B. Dorow, J.-P. Eckmann, and E. Moses. Hierarchical structures induce long-range dynamical correlations in written texts. PNAS Vol. 103 no. 21. May 23, 2006: 7956–7961. doi: 10.1073/pnas.0510673103
I’m sure I hadn’t read it, but I certainly must have skimmed it. Here’s the abstract:
Thoughts and ideas are multidimensional and often concurrent, yet they can be expressed surprisingly well sequentially by the translation into language. This reduction of dimensions occurs naturally but requires memory and necessitates the existence of correlations, e.g., in written text. However, correlations in word appearance decay quickly, while previous observations of long-range correlations using random walk approaches yield little insight on memory or on semantic context. Instead, we study combinations of words that a reader is exposed to within a “window of attention,” spanning about 100 words. We define a vector space of such word combinations by looking at words that co-occur within the window of attention, and analyze its structure. Singular value decomposition of the co-occurrence matrix identifies a basis whose vectors correspond to specific topics, or “concepts” that are relevant to the text. As the reader follows a text, the “vector of attention” traces out a trajectory of directions in this “concept space.” We find that memory of the direction is retained over long times, forming power-law correlations. The appearance of power laws hints at the existence of an underlying hierarchical network. Indeed, imposing a hierarchy similar to that defined by volumes, chapters, paragraphs, etc. succeeds in creating correlations in a surrogate random text that are identical to those of the original text. We conclude that hierarchical structures in text serve to create long-range correlations, and use the reader’s memory in reenacting some of the multidimensionality of the thoughts being expressed.
That “window” they mention is the key. They move the window through a text from beginning to end. That is pretty much what I’ve called virtual reading. The computer doesn’t do anything remotely like understand the text, but it reveals statistical features that the authors of the study attribute jointly to the structure of the text and the attention and memory of a reader.
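To make the first step in the abstract concrete, here is a rough sketch of counting which word types share a sliding window. It is my own toy version, not the authors' procedure: the function name `cooccurrence`, the tiny window, and the choice to count a pair once per window it shares are all assumptions. The singular value decomposition step is not shown; it would need a linear-algebra library applied to a matrix built from these counts.

```python
from collections import Counter

def cooccurrence(tokens, window=100):
    """Count unordered pairs of distinct word types sharing a sliding window.

    A pair is counted once for every window position that contains both
    words, so words that sit close together accumulate higher counts.
    """
    counts = Counter()
    for start in range(max(len(tokens) - window, 0) + 1):
        types = sorted(set(tokens[start:start + window]))
        for i, u in enumerate(types):
            for v in types[i + 1:]:
                counts[(u, v)] += 1
    return counts

# A window of 2 keeps the toy example small enough to check by hand.
tokens = "the river ran dark the river".split()
counts = cooccurrence(tokens, window=2)
for pair, n in counts.most_common():
    print(pair, n)
```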

They used nine fiction texts:
War and Peace by Tolstoi
Don Quixote by Cervantes
The Iliad by Homer
Moby-Dick: or, The Whale by Melville
David Crockett by Abbott
The Adventures of Tom Sawyer by Twain
Naked Lunch by Burroughs
Hamlet by Shakespeare
The Metamorphosis by Kafka
Three non-fiction:
Relativity: The Special and the General Theory, by Albert Einstein
Critique of Pure Reason by Kant
The Republic by Plato
I’ve skimmed the article and they do not report any differences between the fiction and non-fiction texts, which is neither here nor there at this point. Beyond this I should note that the article is quite technical. For me to get much more out of it I would have to talk with someone who understands it.

A virtual reading of Heart of Darkness, and beyond

Still, when I proposed virtual reading in that previous post I had something different in mind. I wanted to examine the trajectory of a virtual reader through our high-dimensional semantic space. One issue, alluded to in those paragraphs from the previous post, is, in effect, to define just what we mean by a trajectory.

That previous post was about a particular phrase, “My Intended, my ivory, my station, my river, my—”. What interested me about that phrase is that the words seem to lie in widely separated regions of semantic space but are conjoined by their relation to Kurtz. What’s the location of that phrase in semantic space? Is it a point within the volume implied by those words, or do we track the volume? Alvarez-Lacalle et al. used a window of 100 words to move through their texts because they were modeling reader attention. We’re not doing that. How big a window should we use? A single word? The Conrad phrase is on the order of 10 words or so. Is that what we want? Perhaps rather than a fixed number of words we use a syntactic unit. But what? A sentence? Sentences vary in length quite a bit. And so on. I don’t have answers to these questions. Perhaps we try different approaches. [1]

Let’s move on.

Assume, for the sake of argument, that we’ve done the virtual reading. How do we now examine this trajectory? After all, that trajectory exists in a high-dimensional space (actually, not so high as these things go, but higher than the 3D space of ordinary vision), and we can only see in three dimensions. We’re going to have to project this trajectory onto a plane, which is common enough.

What might we expect to find? Consider this chart, where each bar represents a paragraph from Heart of Darkness [2]. The bars are arranged in paragraph order, going first to last from left to right, and the length of each bar is proportional to the number of words in the paragraph:
[Figure: HD whole envelope, paragraph lengths in Heart of Darkness in paragraph order]
That longest paragraph (#103 in the Project Gutenberg text), more or less in the center, is where the emblematic phrase (“My Intended, my ivory, my station, my river, my—”) first occurs in the text. That’s also the first time we learn much about Kurtz other than his name. The emblem recurs for a second and last time in paragraph 148. Kurtz dies in paragraph 156 and there are 198 paragraphs in the text.

Since ¶103 gives a précis of Kurtz’s life and it differs from anything we’ve seen previously in the text, we would expect the trajectory through ¶103 to traverse new regions in this high-dimensional semantic space. That paragraph, moreover, is preceded and followed by paragraphs detailing the wounding and death of the helmsman, another distinct region in semantic space. Is the semantic region of the helmsman’s death near or the same as the region of Kurtz’s death? The word “dead”, after all, occurs at both places in the text – though those aren’t the only places it occurs.

Consider this figure, called a periodogram [3]:
[Figure: HoD500, occurrences of “Kurtz” per 500-word bin]
I divided the text into 500-word bins and counted the number of times the word “Kurtz” appeared in each bin. You can see that its appearance is roughly periodic and that the overall number of occurrences increases with the longest paragraph (103, identified as nexus in the chart), which we would expect, as that’s the point Kurtz himself enters the story.
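The binning itself is simple enough to sketch. What follows is an illustration rather than the procedure behind the chart: `bin_counts` is a hypothetical name, the tokenization is deliberately crude, and the sample sentence and tiny bin size are there only so the result can be checked by hand.

```python
import re

def bin_counts(text, target, bin_size=500):
    """Count occurrences of `target` (case-insensitive) in consecutive word bins.

    Tokenization is crude: runs of letters and straight apostrophes.
    A possessive written with a straight apostrophe (kurtz's) is therefore
    a different token and is NOT counted.
    """
    words = re.findall(r"[a-z']+", text.lower())
    target = target.lower()
    return [words[i:i + bin_size].count(target)
            for i in range(0, len(words), bin_size)]

sample = "Kurtz spoke. The river ran on. Mr. Kurtz was ill, and Kurtz knew it."
counts = bin_counts(sample, "Kurtz", bin_size=7)
print(counts)
```

On a full text one would call it with `bin_size=500` and plot the resulting list as a bar chart.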

How does this periodicity show up in semantic regions traversed in the trajectory of (virtual) reading? Is there periodic activity in the early part of the text, where “Kurtz” doesn’t appear at all? If so, what is its content?

What we are doing is investigating the relationship between the text, considered as a one-dimensional sequence of word tokens (a string), and the structure of the high-dimensional semantic space characterizing (features of) word meaning. On the one hand we have features of the text that we have already identified by other means, whether traditional close reading (that longest paragraph, the emblem phrase) or other quantitative methods (the periodogram). And we examine the corresponding regions of the trajectory. We could, as well, move from a feature detected in the trajectory to the corresponding segment or segments of the text. One could, in principle, move through the entire text like this, which would yield textual commentary of unprecedented detail. Would it be worth the effort? How can we tell without trying?

But we would also want to compare trajectories for different texts. For example, do tragedies have a characteristic trajectory in common? Do comedies? One might think so, given that we have already defined them as, in some sense, similar. But who knows?

The structuralists have proposed that texts are constructed over patterns of binary oppositions. When we examine the trajectories for different texts, can we identify opposed regions of semantic space? What happens if we reduce the dimensionality of a space by treating binary oppositions as the poles along a single dimension? Does that clarify the trajectory of a virtual reading? What happens to the space for Heart of Darkness, for example, if we project it onto a plane defined by Africa-Europe as one dimension and male-female as the other?
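One way to picture that last question is to treat each opposition as an axis running from one pole to the other and read off a point's coordinate along each axis. The sketch below assumes we already have vectors for the four poles; the vectors are invented toy values and `axis` and `project` are hypothetical helper names. The two axes need not be perpendicular in a real space, so this gives coordinates along each opposition rather than a strict orthogonal projection.

```python
import math

def sub(a, b):
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def axis(pole_a, pole_b):
    """Unit vector pointing from pole_a toward pole_b (poles must differ)."""
    d = sub(pole_b, pole_a)
    norm = math.sqrt(dot(d, d))
    return [x / norm for x in d]

def project(point, axes, origin):
    """Coordinate of `point` along each axis, measured from `origin`."""
    rel = sub(point, origin)
    return [dot(rel, ax) for ax in axes]

# Toy pole vectors, made up for illustration.
africa, europe = [-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
female, male = [0.0, -1.0, 0.0], [0.0, 1.0, 0.0]

axes = [axis(africa, europe), axis(female, male)]
origin = [0.0, 0.0, 0.0]

# One point of a hypothetical trajectory, reduced to two coordinates.
xy = project([0.5, -0.25, 3.0], axes, origin)
print(xy)
```

Applying `project` to every point of a trajectory would give a path on the plane that can be plotted directly.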

What about one of my current hobby-horses, ring-composition? [4] It has been my experience that identifying ring-composition is not easy. Ring-form texts do not have obvious markers. More often than not I identified a given text as ring-form only after I had been working on it for a while, in some cases quite a while. Perhaps ring-composition would show up in the trajectory of a virtual reading. If not, well, in that case, does it mean that there’s something wrong with the idea of ring-composition, or of virtual reading?

Questions, so many questions. But then that’s the point, isn’t it, to ask questions.

Is this possible?

I don’t really know.

As should be obvious, I don’t have the technical skills to do it myself. But when I consider what has already been done, it seems pretty clear to me that something generally along these lines should be possible. Alvarez-Lacalle et al. did their work over a decade ago. All I’m proposing is a virtual reading of a different kind, to a different end. I don’t think the difference is a trivial one, but I don’t see it as insurmountable either. The tracing of trajectories in high-dimensional spaces, after all, has been routine in complex dynamics for decades.

In the end I suspect the issue is one of will and, above all, intellectual imagination. Those are rather different from matters of technical feasibility.

References

[1] Or perhaps we want to work with thought vectors:
A thought vector is like a word vector, which is typically a vector of 300-500 numbers that represent a word. A word vector represents a word’s meaning as it relates to other words (its context) with a single column of numbers.

That is, the word is embedded in a vector space using a shallow neural network like word2vec, which learns to generate the word’s context through repeated guesses.

A thought vector, therefore, is a vectorized thought, and the vector represents one thought’s relations to others. A thought vector is trained to generate a thought’s context. Just as words are linked by grammar (a sentence is just a path drawn across words), so thoughts are linked by a chain of reasoning, a logical path of sorts.
[2] I’ve examined paragraph length in a short working paper: Paragraph Length in Heart of Darkness: Some Basic Numbers and Charts, July 27, 2011, 6 pp., https://www.academia.edu/7816429/Paragraph_Length_in_Heart_of_Darkness_Some_Basic_Numbers_and_Charts

A somewhat longer working paper considers paragraph length as one among many aspects of the text: Heart of Darkness: Qualitative and Quantitative Analysis on Several Scales, October 2, 2015, 49 pp., https://www.academia.edu/8132174/Heart_of_Darkness_Qualitative_and_Quantitative_Analysis_on_Several_Scales

[3] See Periodicity in Heart of Darkness: A Working Paper, July 28, 2011, 5 pp., https://www.academia.edu/7816423/Periodicity_in_Heart_of_Darkness_A_Working_Paper

[4] For example analysis and discussion, see the working papers in this section of my Academia.edu page: https://independent.academia.edu/BillBenzon/Ring-Composition

This is perhaps the most general single paper: Ring Composition: Some Notes on a Particular Literary Morphology, September 28, 2014, 70 pp., https://www.academia.edu/8529105/Ring_Composition_Some_Notes_on_a_Particular_Literary_Morphology
