In the weeks since writing about the preliminary results Goldstone and Underwood have reported on their topic analysis of PMLA (Publications of the Modern Language Association), I’ve continued to think about Natalia Cecire’s reservations. You may recall that she’s comfortable using topic analysis to point toward interesting texts that she will then examine for herself. But she had doubts about using it as evidence in itself. For that, she’d need “a convincing theory of what the math has to do with the structure of (English) language,” and that, presumably, requires some detailed knowledge of both language and math.
I’m sympathetic to her reservations. What I’ve been thinking about is this: Just WHAT would she want to know?
Ted Underwood offered a response to her in which he referenced a Wikipedia article on distributional semantics, which I’ve read. He summarized that article, which is a short one, thus: “It’s just … things that occur in the same contexts gotta have something in common.” I agree with that summary.
And that’s the problem; that’s what has kept me thinking about this matter. Underwood’s statement is brief and easy to understand. So what’s the problem?
I believe, in fact, that “a convincing theory of what the math has to do with the structure of (English) language” would not be terribly useful to Cecire, not nearly so useful as simply playing around with topic analysis over a period of time, going back and forth between the computer-generated topics and the associated texts. Only by doing this time after time will Cecire be able to verify, for herself, that the computer-identified topics are meaningful entities. Though I don’t know this, I suspect that, whatever they may know about the computational processing behind topic analysis, such play has been important to Goldstone and Underwood, and to anyone else who uses the technique.
Thus I know that this, yet another run at explaining topic modeling, is bound to fail, as it cannot possibly substitute for the requisite experience. All I’m after is a different way of thinking about the technique and the strange conceptual objects it creates.
Bags of Words
I’ve read several accounts of topic modeling, this one by Matt Jockers, this one by Scott Weingart, this one by Ted Underwood and, finally, this technical review by David Blei: Probabilistic topic models [PDF] (Communications of the ACM, 55(4): 77–84, 2012). The first three were written for humanists while the last was written for computer scientists. It contained a most useful phrase, “bag of words.” That’s how the basic topic modeling technique, something called Latent Dirichlet Allocation (LDA), treats individual texts: as bags of words.
What does that mean? Imagine that some document, any document—a poem by Denise Levertov, a play by Beaumarchais, a technical article by Richard Feynman, a novel by George Eliot, whatever—is printed out on single sides of paper sheets. Slice the sheets into thin strips, each containing a single line of print; cut those strips into individual words, like so many pieces of confetti; then gather all the individual snippets together and place them into a bag. THAT’s a bag of words.
The point is that the bag of words has lost all the structure that made those many words into a coherent text. Whatever it is that LDA is doing, it is not “reading” texts in any meaningful sense of the word. It knows nothing about syntax, nothing about semantics, nothing about discourse, and little about spelling. All it can do at the bag level, that is, at the level of individual texts, is recognize whether or not two snippets of paper contain the same string of characters (that is, a word) and count the number of snippets containing a given word. That’s all that is relevant for basic topic modeling: the list of words in each document and the number of times each word occurs in the document.
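To see just how little survives in a bag of words, here is a minimal sketch in Python (the sample sentence is invented for illustration):

```python
from collections import Counter

# A tiny "document" reduced to a bag of words: order, syntax, and
# discourse are all discarded; only word identities and counts remain.
text = "the cat sat on the mat and the cat slept"
bag = Counter(text.split())
print(bag["the"], bag["cat"])  # 3 2
```

Note that the bag is identical for this sentence and for any scrambling of it; that is exactly the point.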
How then, can LDA possibly arrive at even remotely sensible topics for a set of documents? Obviously, it’s not doing it on a document-by-document basis. It doesn’t inspect a document, list the topics in that document, inspect another document, list the topics in it, and so forth. It’s doing something quite different.
Remember what Underwood said: “things that occur in the same contexts gotta have something in common.” Each bag, that is, each document, is treated as a context, a context for words but also for topics. What we’re looking for is groups of words that occur together in the same documents. The more documents the better. The PMLA corpus that Goldstone and Underwood have been working with has almost 6000 documents. Blei’s article mentions techniques that have been tried on millions of documents.
Finding Topics in Haystacks
Let’s think about this a bit. We aren’t necessarily interested in groups of words that recur in all the documents, though that’s likely to be the case with grammatical function words such as articles, pronouns, conjunctions and so forth. But if groups of 10, 20, 30, or 100 words keep showing up in, say, 60, 100, or 150 different articles, chances are that each such group is being used to make more or less the same kinds of assertions. Let’s call such a group a topic.
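One crude way to see this distinction in code is document frequency: for each word, count how many bags it shows up in at all. (The toy documents below are invented, and real topic modeling does far more than this, but the raw material is the same.)

```python
from collections import Counter

docs = [
    ["the", "gothic", "novel", "hero", "the", "plot"],
    ["the", "reader", "meaning", "text", "interpretation"],
    ["the", "gothic", "heroine", "plot", "romance"],
]

# Document frequency: in how many bags does each word occur at least once?
df = Counter(w for d in docs for w in set(d))
print(df["the"], df["gothic"], df["meaning"])  # 3 2 1
```

The function word “the” turns up everywhere and so tells us nothing; content words like “gothic” and “plot” cluster in a subset of the documents, and it is that clustering that topic modeling exploits.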
Here are three real topics, which I’ve simply lifted from Goldstone and Underwood:
topic 38: time experience reality work sense form present point world human process structure concept individual reader meaning order real relationship

topic 46: novels fiction poe gothic cooper characters richardson romance narrator story novelist reader plot novelists character reade hero heroine drf

topic 13: point reader question interpretation meaning make reading view sense argument words word problem makes evidence read clear text readers
The topics are identified by numbers that the software assigns: 38, 46, 13. While it may be difficult to attach a meaningful verbal label to each of those topics, the lists themselves make a kind of loose sense. One can see those collections of words as being more or less about the same thing.
Topic modeling assumes that each document, that is, each bag of words, is composed of several different topics. Goldstone and Underwood developed models using 100 and 150 topics for the PMLA corpus of roughly 6000 documents. The number of topics is thus considerably smaller than the number of documents. So, if each document is composed of, say, half a dozen topics, and we have 100 possible topics, then there are (100 × 99 × 98 × 97 × 96 × 95) / 6!, roughly 1.2 billion, possible topic combinations, which is quite a lot.
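Since the order of the topics within a document doesn’t matter, the number of distinct six-topic combinations is a binomial coefficient, which Python’s standard library computes directly:

```python
import math

# Distinct ways to choose 6 topics from a pool of 100, ignoring order:
# 100! / (6! * 94!)
combos = math.comb(100, 6)
print(combos)  # 1192052400
```

About 1.2 billion possibilities, far too many to enumerate by hand, which is one reason the bookkeeping is left to the computer.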
What we do, in effect, is to use the computer to do a vast comparison and contrast of all the documents in a collection looking for groups of words that co-occur across multiple documents. We’re looking for a suitable selection of topics such that each given bag of words, that is, each document, can be filled by words from a suitable combination of topics. The process requires quite a bit of bookkeeping, and some fancy mathematics to deal with the fact that a given word can occur in different topics—“fly” in a zoology topic and a transportation topic—and the fact that topics aren’t going to be sharply defined by word lists.
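The bookkeeping can be made concrete. One standard way to fit an LDA model is collapsed Gibbs sampling: repeatedly reassign each word snippet to a topic, favoring topics already common in that word’s document and topics in which that word is already common. What follows is a deliberately minimal pure-Python sketch of that procedure, with an invented toy corpus; real toolkits are far more careful about convergence, hyperparameters, and scale.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA. Illustration only."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    # z[d][i] = topic currently assigned to the i-th word of document d
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]                # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]   # topic -> word counts
    nk = [0] * n_topics                                 # words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove this word's current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # resample: prefer topics common in this doc AND for this word
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta)
                    / (nk[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return nkw  # each entry is one "topic": a bag of word counts

docs = [
    "gothic novel hero heroine gothic plot".split(),
    "reader meaning interpretation text reader".split(),
    "gothic romance heroine novel plot".split(),
    "meaning argument evidence text reader".split(),
]
topics = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(topics):
    print(k, sorted(words, key=words.get, reverse=True)[:4])
```

On a toy corpus like this the sampler will usually pull the “gothic fiction” words into one topic and the “interpretation” words into the other, which is the co-occurrence logic of the paragraph above in miniature.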
Topic modeling also assumes that the process that produced the documents is a highly structured one, one that has left traces of that structure in the documents themselves. The technique depends on those traces. If the documents were just random collections of words, then topic modeling would have nothing to work with and would produce no intelligible results.
Think of it as a perceptual process, like seeing. A newborn’s eyes know nothing of the world, but can nonetheless learn to see the objects that are there. And so it is with topic modeling. It knows nothing about the meanings of documents, but it can learn to see crude shapes and figures in them. It can learn to see things which, for convenience, we call topics.
Let’s return to Cecire’s desire for “a convincing theory of what the math has to do with the structure of (English) language.” The math, it turns out, has little or nothing to do with the structure of English or any other language, at least not as such structure is understood by linguists. In fact, we could replace each word with a colored pebble, where the exact color corresponds to word identity. Each bag would then be a collection of pebbles, and the modeling process would be identifying components (topics) of such collections.
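The pebble substitution is easy to make literal in code. Here is a sketch (with invented words) that recodes documents as arbitrary integer ids; nothing the model uses, word identity and counts per bag, is lost in the substitution, because everything linguistic was already gone:

```python
# Replace each word with an arbitrary token id (a "pebble color").
docs = [["time", "experience", "reality"], ["reader", "meaning", "time"]]

# Assign ids by alphabetical order; any one-to-one mapping would do.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
coded = [[vocab[w] for w in d] for d in docs]
print(coded)  # [[4, 0, 3], [2, 1, 4]]
```

An LDA model fitted to `coded` is exactly the same model as one fitted to `docs`; only the labels on the topics’ word lists differ.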
Thus we find that, near the end of his review article, Blei generalizes the technique:
As a class of models, LDA can be thought of as a mixed-membership model of grouped data—rather than associating each group of observations (document) with one component (topic), each group exhibits multiple components in different proportions. LDA-like models have been adapted to many kinds of data, including survey data, user preferences, audio and music, computer code, network logs, and social networks.
He then goes on to present two other examples, one from population genetics and another from computer vision:
In document analysis, we assume that documents exhibit multiple topics and the collection of documents exhibits the same set of topics. In image analysis, we assume that each image exhibits a combination of visual patterns and that the same visual patterns recur throughout a collection of images. (In a preprocessing step, the images are analyzed to form collections of “visual words.”) Topic modeling for computer vision has been used to classify images, connect images and captions, build image hierarchies, and other applications.
The technique is a rather general one for discovering structure in various classes of objects.
New Objects, New Concepts, New Worlds
“That’s all well and good,” you say, “but isn’t topic modeling just a kludge? Isn’t it just a poor substitute for a careful analysis that’s too time consuming to be undertaken?”
I suppose such an argument can be made, but I’m not terribly inclined to believe it. Let’s take Goldstone and Underwood’s work on the PMLA corpus. Imagine that some eccentric billionaire wanted to pay for a manual examination and analysis of that corpus. How would you go about it and what problems would you have?
Well, maybe we can do something like this: Let’s hire 30 scholars to undertake the work. Each of them will read and analyze 200 different articles. They’ll write up their individual results, share the work around, and write up a final report. They should be done in a year or so.
That makes sense as long as you don’t think about what’s actually required and how people work together, or don’t. If the final result is to make any sense at all, the investigators have to agree on terms and concepts. Where is that agreement going to come from? Will it be imposed at the beginning? Could that possibly work? Or do we let each investigator come to terms with their particular set of articles and then negotiate as they go along? And, by the way, how will we assign the articles to our investigators? The most obvious scheme would be to arrange them in chronological groups. But we could also assign them randomly, which is likely to give each individual a look at the whole century-long sweep.
No, the more I think about THAT process, the less attractive it becomes. My guess is that, in the end, we’re likely to end up with half a dozen alternative reports, each of them long and difficult. Which is probably not what our eccentric billionaire had in mind. In fact, at this point he’s likely to suggest that they start with Goldstone and Underwood’s topic analysis and take it from there.
And he might wonder whether or not the job requires all 30 investigators and whether or not it is in fact necessary or even useful to have each and every article read and analyzed.
No, as far as I can tell, topic modeling allows us to see things we could not see before. These things are new, and so a bit strange. The only way to get around the strangeness is to investigate, to work with the models, never taking them at face value, but not treating them as mere stepping-stones to something else. Those 6000 PMLA articles were created by some thousands of scholars over the course of a century of work. Topic modeling allows us to explore that collective effort in new ways. The challenge it presents to us is to come up with new ways of thinking about these new objects of knowledge, these topic models.