Friday, June 3, 2011

Statistics and Symbols in Mimicking the Mind

MIT recently held a symposium on the current status of AI, which apparently has seen precious little progress in recent decades. The discussion, it seems, ground down to a squabble over the prevalence of statistical techniques in AI and a call for a revival of work on the sorts of rule-governed models of symbolic processing that once dominated much of AI and its sibling, computational linguistics.

Briefly, from the early days in the 1950s up through the 1970s both disciplines used models built on carefully hand-crafted symbolic knowledge. The computational linguists built parsers and sentence generators and the AI folks modeled specific domains of knowledge (e.g. diagnosis in selected medical domains, naval ships, toy blocks). Initially these efforts worked like gang-busters. Not that they did much by Star Trek standards, but they actually did something and they did things never before done with computers. That’s exciting, and fun.

In time, alas, the excitement wore off and there was no more fun. Just systems that got too big and failed too often and they still didn’t do a whole heck of a lot.

Then, starting, I believe, in the 1980s, statistical models were developed that, yes, worked like gang-busters. And these models actually did practical tasks, like speech recognition and then machine translation. That was a blow to the symbolic methodology because these programs were “dumb.” They had no knowledge crafted into them, no rules of grammar, no semantics. Just routines that learned while gobbling up terabytes of example data. Thus, as Google's Peter Norvig points out, machine translation is now dominated by statistical methods. No grammars and parsers carefully hand-crafted by linguists. No linguists needed.

What a bummer. For machine translation is THE prototype problem for computational linguistics. It’s the problem that set the field in motion and has been a constant arena for research and practical development. That’s where much of the handcrafted art was first tried, tested, and, in a measure, proved. For it to now be dominated by statistics . . . bummer.

So that’s where we are. And that’s what the symposium was chewing over.

* * * * *

All that’s just a set-up for some slightly older observations by Martin Kay. Martin Kay is one of the grand old men of computational linguistics. He was on the machine translation team that David Hays assembled at RAND in the late 1950s and has done seminal work in the field. In 2005 the Association for Computational Linguistics gave him a lifetime achievement award. And he gave them an acceptance speech. Here’s a passage near the end of that speech, which is worth reading from start to end (PDF); Kay is talking about the statistical vs. the symbolic approach:
Now I come to the fourth point, which is ambiguity. This, I take it, is where statistics really come into their own. Symbolic language processing is highly nondeterministic and often delivers large numbers of alternative results because it has no means of resolving the ambiguities that characterize ordinary language. This is for the clear and obvious reason that the resolution of ambiguities is not a linguistic matter. After a responsible job has been done of linguistic analysis, what remain are questions about the world. They are questions of what would be a reasonable thing to say under the given circumstances, what it would be reasonable to believe, suspect, fear or desire in the given situation.
This, BTW, has come to be known as the common sense problem. Once AI started trying to model how we reason about the world in general, it discovered that we had thousands and tens of thousands of little bits of knowledge we relied on all the time. Like, you know: rain is wet, being wet is often unpleasant, people don’t like to get wet, umbrellas keep the rain off you, so you don’t get wet, which you don’t like, and that’s why you took an umbrella with you when you went out because you looked out the door and saw dark clouds in the sky and clouds are a sign of rain meaning that you might get wet while walking to the grocery store so better have an umbrella with you. Like that. Just endless piles and piles of such utterly trivial stuff. All of which had to be carefully hand-coded into computerese. And, while the knowledge itself is trivial, the hand-coding is not. And so, as I indicated above, that particular enterprise ground to a halt.

Kay continues:
If these questions are in the purview of any academic discipline, it is presumably artificial intelligence. But artificial intelligence has a lot on its plate and to attempt to fill the void that it leaves open, in whatever way comes to hand, is entirely reasonable and proper. But it is important to understand what we are doing when we do this and to calibrate our expectations accordingly. What we are doing is to allow statistics over words that occur very close to one another in a string to stand in for the world construed widely, so as to include myths, and beliefs, and cultures, and truths and lies and so forth.
That, I believe, is a very important point. We’re using statistics about actual language use, based on crunching billions of words of text, as a proxy for detailed and systematic knowledge of the world. How do people get such knowledge? First, through living in the world, perceiving it, moving in it, doing things. And then there’s book learning, which builds on a necessary foundation of direct physical experience.
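
To make "statistics over words that occur very close to one another" concrete, here is a minimal sketch of my own, not anything from Kay's speech. It counts adjacent word pairs in a tiny invented corpus and uses those counts to choose between two similar-sounding candidates, the way a speech recognizer might. The corpus and the names bigram_counts and pick are assumptions made up for this illustration; real systems use vastly more data and smoothed probabilities rather than raw counts.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = [
    "she took an umbrella because of the rain",
    "dark clouds bring rain",
    "he forgot his umbrella and got wet in the rain",
    "the train was late again",
    "heavy rain fell all night",
]

def bigram_counts(sentences):
    """Count how often each pair of adjacent words occurs."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def pick(previous_word, candidates, counts):
    """Choose the candidate seen most often right after previous_word."""
    return max(candidates, key=lambda word: counts[(previous_word, word)])

counts = bigram_counts(corpus)

# The program "knows" nothing about weather or railways; it only knows
# which word it has seen more often after "heavy".
print(pick("heavy", ["rain", "train"], counts))
```

The choice comes out right on this corpus, but only because the right pair happened to show up in the text. Nothing in the counts says that rain is wet or that trains run on tracks, which is exactly the stand-in relationship Kay is describing.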

Kay concludes his thought:
As a stop-gap for the time being, this may be as good as we can do, but we should clearly have only the most limited expectations of it because, for the purpose it is intended to serve, it is clearly pathetically inadequate. The statistics are standing in for a vast number of things for which we have no computer model. They are therefore what I call an “ignorance model”.
And that’s where we are. The question is: How can we do better? As far as I can tell, there’s no obvious answer to that question. As far as I can tell, we somehow need to get all that commonsense knowledge into computerese. I don’t think we know how to do it.

I don’t think we can hand-code it. For reasons I couldn’t explain very well even if I tried, I don’t think hand-coding can, even in principle, be very effective, no matter how clever the formalism, nor how diligent the coders. The machine is going to have to acquire that knowledge by learning it. We can hand-code the learning device, but then we’re going to have to set it free in the world and let it learn for itself.

We can certainly think about doing such things. Indeed, we are doing them in limited yet often interesting ways. But full-scale, all-out, multi-modal (seeing, hearing, touching, handling, smelling) common sense knowledge of the world? Nope, we’re not there yet.

But we can dream, and we can scheme. And the obvious thing to scheme about is linking the symbolic and statistical approaches into a single system. Ideas, anyone?


  1. I'm not opposed to the hand-coding approach. After all, that's what nature did, in a sense.

    Hoping to build a creature that can learn it all in its own lifetime is too big a burden for any real animal, and so too big a burden on AI. We need, I think, to pack our AI with algorithms for solving each function it must do (from, ahem, the teleome).

  2. Basically, I agree. If the (artificial or real) creature's going to learn, it needs to come pre-equipped with the right processes and structures.

  3. Actually, statistical MT has moved more and more into what used to be the knowledge-based domains. You see factored methods (POS-tagging), tree-to-tree alignment (CFGs), and lots of research on example-based methods.

    See also where the top systems are

    - Systran (rule-based) + statistical post-editing


    - SMT where the phrase-pairs are computed with the help of Apertium transfer rules

    or where the top system was a

    - Combination of phrase-based SMT, factored model, Apertium RBMT and Marclator (marker-based EBMT system).

    Of course the majority of the systems on the rankings are 'all statistical', but the best ones are considerably more knowledgeable than the plain N-gram models of the early nineties.

    One of the very good things that have come out of these SMT years is the acknowledgement that if you make a system that's perfect for a small test set of "linguists' examples", then that test set is all that system can ever be good for. RBMT systems nowadays start from the corpus, going frequency first, whether it's adding words, transfer rules or disambiguation rules. Zipf's Law is no longer a swearword. A knowledge-rich system based on real theoretical insight doesn't have to be -- and shouldn't be -- constrained to theoretical examples.