Monday, September 3, 2018

Computational linguistics & NLP: What’s in a corpus? – MT vs. topic analysis [#DH]

What’s in a corpus? Words, words organized into texts. Of course.

But that obvious answer is not quite what I’m after. I’m interested in how we think about corpora and their role in our work. “We”, who’s that? I’m not sure it much matters, not exactly. It will emerge.

How are these corpora connected to the world? What do we hope to understand about the world by analyzing these corpora? What are our intuitions in these matters? To some extent I’m trying to track down something I don’t know how to conceptualize. This post from October 20, 2017 is a good example of that:
Borges redux: Computing Babel – Is that what’s going on with these abstract spaces of high dimensionality? [#DH], http://new-savanna.blogspot.com/2017/10/borges-redux-computing-babel-is-that.html
But when stalking such an abstract beast, it helps to have specific examples in mind. So I’m thinking about the role of corpora in statistical machine translation (MT) vs. their role in topic analysis. Roughly speaking, in MT statistical analysis of corpora is a means to an end. In topic analysis statistical analysis of a corpus is the end. That difference entails a somewhat different way of thinking about corpora.

Martin Kay, an “ignorance model”

I’m basing this post on some observations by Martin Kay, one of the grand old men of MT. Kay apprenticed with Margaret Masterman at the Cambridge Language Research Unit in the 1950s. In 1961 David Hays hired him to work with the RAND group in MT. He went on to a distinguished career in computational linguistics at the University of California, Irvine, the Xerox Palo Alto Research Center, and Stanford.

Research in MT was undertaken to achieve a practical end, the translation of texts from one language to another. The United States government was particularly interested in obtaining translations of Russian texts. The researchers who undertook this work had various motivations, but some of them were interested in linguistic science and were happy enough to have their work funded by a government agency, the Department of Defense, with a practical goal.

In 2005 the Association for Computational Linguistics gave Kay a Lifetime Achievement Award. On that occasion he looked back over his career and made some observations about the relative merits of statistical and symbolic approaches to MT [1]. He speaks as a man fundamentally interested in basic knowledge who has, however, at times undertaken work with practical ends.

At the beginning of the following passage Kay distinguishes between computational linguistics and natural language processing (NLP). The distinction is a common one, albeit a bit problematic as well [2]. But the distinction Kay makes is clear enough (p. 5):
Computational linguistics is not natural language processing. Computational linguistics is trying to do what linguists do in a computational manner, not trying to process texts, by whatever methods, for practical purposes. Natural Language Processing, on the other hand, is motivated by engineering concerns. I suspect that nobody would care about building probabilistic models of language unless it was thought that they would serve some practical end. There is nothing unworthy in such an enterprise. But ALPAC’s conclusions are as true today as they were in the 1960’s—good engineering requires good science. If one’s view of language is that it is a probability distribution over strings of letters or sounds, one turns one’s back on the scientific achievements of the ages and foreswears the opportunity that computers offer to carry that enterprise forward.
I agree with Kay’s fundamental point, though I note that humanists using NLP techniques are often pursuing basic knowledge rather than a practical end.

Kay wrote that passage in 2005. I don’t know just when literary critics first began exploring NLP techniques, but I first became aware of digital humanities work in topic modeling sometime in 2012 [3]. That’s well after Kay wrote those words.

Let’s return to his remarks. Nearing the end of his talk, Kay remarks (p. 12):
My professional life almost encompasses the history of computational linguistics. But I was only fourteen when Warren Weaver wrote his celebrated memorandum drawing a parallel between machine translation and code breaking. He said that, when he saw a Russian article, he imagined it to be basically in English, but encrypted in some way. To translate it, what we would have to do is break the code and the statistical techniques that he and others had developed during the second world war would be a major step in that direction. However, neither the computer power nor large bilingual corpora were at hand, and so the suggestions were not taken up vigorously at the time. But the wheel has turned, and now statistical approaches are pursued with great confidence and disdain for what went before. In a recent meeting, I heard a well known researcher claim that the field had finally come to realize that quantity was more important than quality.

The young Turks blame their predecessors, the advocates of so-called symbolic systems, for many things. Here are just four of them. First, symbolic systems are not robust in the sense that there are many inputs for which they are not able to produce any output at all. Second, each new language is a new challenge and the work that is done on it can profit little, if at all, from what was done previously on other languages. Third, symbolic systems are driven by the highly idiosyncratic concerns of linguists rather than real needs of the technology. Fourth, linguists delight in uncovering ambiguities but do nothing to resolve them. This is actually a variant of the third point.
Kay mounts a quick defense on the first three points, but says a bit more about the fourth, ambiguity (pp. 12-13):
This, I take it, is where statistics really come into their own. Symbolic language processing is highly nondeterministic and often delivers large numbers of alternative results because it has no means of resolving the ambiguities that characterize ordinary language. This is for the clear and obvious reason that the resolution of ambiguities is not a linguistic matter. After a responsible job has been done of linguistic analysis, what remain are questions about the world. They are questions of what would be a reasonable thing to say under the given circumstances, what it would be reasonable to believe, suspect, fear or desire in the given situation. If these questions are in the purview of any academic discipline, it is presumably artificial intelligence. But artificial intelligence has a lot on its plate and to attempt to fill the void that it leaves open, in whatever way comes to hand, is entirely reasonable and proper. But it is important to understand what we are doing when we do this and to calibrate our expectations accordingly. What we are doing is to allow statistics over words that occur very close to one another in a string to stand in for the world construed widely, so as to include myths, and beliefs, and cultures, and truths and lies and so forth. As a stop-gap for the time being, this may be as good as we can do, but we should clearly have only the most limited expectations of it because, for the purpose it is intended to serve, it is clearly pathetically inadequate. The statistics are standing in for a vast number of things for which we have no computer model. They are therefore what I call an “ignorance model”.
That last point is very important.

By the mid-to-late 1960s computational linguists began to realize that they would have to tackle semantics, which covers the relationship between language and the world, in contrast to syntax, morphology, and phonology, which are all internal to the language system. And so computational linguists did that, along with cognitive psychologists and researchers in artificial intelligence. The work then, and now, was interesting and fruitful, but not terribly useful for practical tasks, such as MT. By the 1990s the conjunction of large amounts of cheap computing power and large bodies of digital texts gave statistical approaches a definitive edge in practical applications, an edge which remains.

The use of corpora in MT and topic modeling

Returning to Kay, remember that MT is his paradigm case. In that case the practical end is, to date, much better served by statistical techniques, though those techniques remain far from human quality.

Let’s think a bit about how corpora are used in statistical MT. The translation system is created offline. You start with two parallel natural language corpora, one of which is the translation of the other. You then conduct a statistical analysis of the corpora to produce a translation model and a language model. Those models are the heart of your (practical) system. When one uses such a system, one presents a string in one language and the system provides a translation.

As Kay remarks in a more recent paper ([4] p. 21):
The translation model proposes sets of words and phrases that might, with certain probabilities, translate individual words and phrases in an original sentence. The language model selects one candidate from each set and arranges them into a string that looks as nearly as possible like a sentence of the target language. What makes a string of words look like a sentence in a given language? The answer is that as many of its substrings as possible occur frequently as substrings of sentences in a large corpus of text belonging to that language.
The corpora on which the model was built no longer play a role in the process. They have done their job.
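To make Kay’s description concrete, here is a toy sketch of the fluency half of that picture: a bigram language model that scores a candidate string by how often its substrings occur in a monolingual corpus. The three-sentence corpus and the scoring function are invented for illustration; a real system would use millions of sentences and smoothed probabilities rather than raw counts.

```python
from collections import Counter

# Tiny invented monolingual corpus (illustration only, not a real MT system).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]

# Count every adjacent word pair (bigram) in the corpus.
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))

def fluency(candidate):
    """Sum of corpus bigram counts; higher = more sentence-like."""
    words = candidate.split()
    return sum(bigrams[pair] for pair in zip(words, words[1:]))

print(fluency("the cat sat on the rug"))  # → 7: every bigram attested
print(fluency("rug the on sat cat the"))  # → 0: same words, scrambled order
```

This is Kay’s point in miniature: the scrambled string contains exactly the same words, but none of its substrings occur in the corpus, so the model judges it less sentence-like.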

Just what was that role? Recall Kay’s remark, “The statistics are standing in for a vast number of things for which we have no computer model.” That is to say, they are a stand-in for a semantics that we have not been able to create, a semantics that links words to the world. Let’s repeat that: they are a stand-in for a semantics that we have not been able to create.

How is it that we are able to create this pseudo-semantics through a statistical analysis of linguistic corpora? The texts in a corpus were created by authors engaged with the world. These authors know what the words mean, how they summon the world. That “cat” and “paw” often occur in the same contexts is not a fact about phonology, morphology, or syntax; it represents a fact about the world and so is in the province of semantics. The same is true of “human” and “race” in one context and “horse” and “race” in another. It is humans who know such things, not computers.

The more examples the creators of translation systems have at their disposal, the more contexts they have for each word and consequently the more fine-grained the statistical analysis on which their translation system is based. But no matter how vast the corpora, such procedures will not produce a true semantics. Why not? Because the necessary information simply isn’t there in the corpora. That information exists only in the relationship between the texts and the world.
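The kind of statistic in question can be sketched in a few lines. This toy counter, run over an invented four-sentence mini-corpus, records which word pairs keep company within the same context. The counts pick up the “cat”/“paw” and “horse”/“race” associations mentioned above, but the reason those words co-occur lives in the world, not in the counts.

```python
from collections import Counter
from itertools import combinations

# Invented mini-corpus for illustration; each string is one "context".
contexts = [
    "the cat licked its paw",
    "the cat raised a paw",
    "the horse won the race",
    "the human race spread across the globe",
]

# Count, for each alphabetically ordered word pair, the number of
# contexts in which both words appear.
cooccur = Counter()
for ctx in contexts:
    for pair in combinations(sorted(set(ctx.split())), 2):
        cooccur[pair] += 1

print(cooccur[("cat", "paw")])     # → 2: together in two contexts
print(cooccur[("horse", "race")])  # → 1
print(cooccur[("cat", "race")])    # → 0: never together here
```

The table knows that “cat” goes with “paw” and not with “race”, but it has no idea why; that asymmetry is exactly what Kay means by an “ignorance model”.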

Someone using an MT system is interested in what a foreign language text says about the world. They’re asking the system to provide a translation of that text into a language they understand so that they can, in turn, understand something more about the world. The orientation of someone performing a topic analysis is somewhat different. They may or may not be interested in what a body of texts has to say about the world, but that is not the immediate object of the topic analysis. The point of topic analysis is to arrive at some estimate of what topics are covered in some body of texts. The texts in the corpus are themselves the object of analysis. And that analysis may not have a practical end, to return to one pole of Kay’s opposition between practical ends and scientific ends.
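For readers curious what “estimating what topics are covered in some body of texts” amounts to mechanically, here is a minimal sketch of the collapsed Gibbs sampling used in LDA-style topic models, the kind of model behind most digital-humanities topic analysis. Everything here is a toy assumption: the four tiny “documents”, the hyperparameters, the topic count, and the number of sweeps. A real analysis would use a dedicated library and thousands of texts.

```python
import random
from collections import defaultdict

random.seed(0)

# Four invented "documents", two about cats, two about ships.
docs = [
    "cat paw fur whisker cat".split(),
    "whisker fur cat paw".split(),
    "ship sail sea voyage sail".split(),
    "sea voyage ship sail".split(),
]
K = 2                     # number of topics to infer
alpha, beta = 0.1, 0.01   # Dirichlet hyperparameters
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# z[d][i] is the topic assigned to word i of document d, plus count tables.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]               # document-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

# Gibbs sweeps: remove each word from the counts, resample its topic
# in proportion to (doc-topic affinity) x (topic-word affinity), restore.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights)[0]
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

# Report each topic's most frequent words.
for k in range(K):
    top = sorted(vocab, key=lambda w: -nkw[k][w])[:3]
    print(f"topic {k}: {top}")
```

The point of the sketch is the orientation, not the output: nothing in the procedure asks what the documents say about the world. It estimates how words cluster within the corpus itself, which is exactly the sense in which the corpus is the object of analysis.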

Now, let us imagine that we are in possession of a rich and various computational semantics, one that allows us not only to dispense with statistical translation models but that produces better translations. We need not imagine that it produces translations as good as the best human translations, but only that they are better than current statistically mediated translation. Surely that would be a good thing, though I have no idea when such a creature will be available.

What would such a system be able to do if presented with the task performed by current topic analysis models? It is not at all obvious to me that such a system would be able to replace those topic analysis models. Faced with a corpus of 3300 19th century Anglophone novels, such as Matthew Jockers examined in Macroanalysis, what would it do? Well, it could “read” them one by one until it had read them all. What then?

Imagine that you’ve read, say, 20 novels in that corpus. How would you go about 1) identifying the topics in them, and 2) estimating the prevalence of each of those topics in each of your 20 novels? Remember, you’re better at reading than that computer is. That’s a difficult task and it’s not at all obvious to me how to go about it, not for the human with only 20 novels and not for the computer with 3300.

What I’m suggesting then is that topic analysis, with its Bayesian statistics, is not, like statistical MT, just a stop-gap until something better comes along. It may well be the something better; that is, improvement is likely to take the form of refinements of current techniques rather than their wholesale replacement.

References

[1] Kay, M.: A Life of Language. Computational Linguistics 31(4), 425-438 (2005). http://web.stanford.edu/~mjkay/LifeOfLanguage.pdf

[2] The problematic nature of this distinction is in full view in the Wikipedia entries for “Computational linguistics” and “Natural language processing”, respectively:
https://en.wikipedia.org/wiki/Computational_linguistics
https://en.wikipedia.org/wiki/Natural_language_processing
You need to consult the Talk pages for each entry, where you will find discussions about the merits of merging the two entries into one. Those discussions reference both intellectual history and the objectives and conceptual nature of the two disciplines.

[3] My oldest post labeled “digital humanities” is from July 2011, and is about a little study I undertook of paragraph length in Heart of Darkness, HD7: Digital Humanities Sandbox Goes to the Congo, http://new-savanna.blogspot.com/2011/07/hd7-digital-humanities-sandbox-goes-to.html
I don’t recall just what digital humanities work I was thinking about at the time–though it certainly wasn’t about paragraph length–but I note that I’ve been aware of humanities computing since my undergraduate years at Johns Hopkins in the late 1960s. I first posted about topic modeling on January 13, 2013, Topic Models: Strange Objects, New Worlds, http://new-savanna.blogspot.com/2013/01/topic-models-strange-objects-new-worlds.html

[4] Martin Kay, Zipf’s Law and L’Arbitraire du Signe, Linguistic Issues in Language Technology – LiLT, Vol. 6, issue 8, October 2011, 1-25. https://journals.linguisticsociety.org/elanguage/lilt/article/download/2584/2584-5332-1-PB.pdf
