Wednesday, April 10, 2024

AI, Chess, and Language 3.2: Search! The case of language

In the previous post in this series I considered the issue of search in the context of chess: AI, Chess, and Language 3.1: Search! What enables us to entertain the idea that chess is a paradigmatic case of cultural evolution? Now I want to look at search in the case of language and texts, where the computational problems are quite different from those of chess.

In particular, as I pointed out in the first post in this series, chess and language have very different geometric footprints in the world. The geometric footprint of chess is simple and sharply specified, being limited to the board and the pieces. The geometric footprint of natural language is quite different: since language must deal with the world in an open-ended way, its footprint must encompass the world. The upshot of this difference is that, while it is possible to list all possible chess games and organize them into a tree (at least in principle, if not in actuality), there doesn’t seem to be any way to list all possible intelligible texts, nor any principled way to organize them. Note that I’m not talking about listing all possible combinations of words, as most of those combinations would be nonsense. I’m talking about intelligible texts, which is quite a different matter and is furthermore subject to change as language changes.

Nonetheless, I think it is instructive to consider two very different cases: parsing syntax and document retrieval.

Parsing Syntax

For several decades now the study of language has focused on syntax: What are the rules that govern sentences? How can we characterize an infinite set of sentences using a finite set of syntactic rules? One aspect of the rules is a finite list of word types: nouns, verbs, adjectives, and the like. Then we have a finite set of rules that govern how word types can be organized into sentences.

Why is it important to have a finite set of word types? While the vocabulary of a language is finite at any given moment in time, that vocabulary is subject to change, and may well grow over time. If, however, we can classify every word as an example of some particular word type, allowing that some words may function as two or more types, then the actual composition of the vocabulary at any one time doesn’t matter. We can define syntax in terms of word types alone.
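To make that concrete, here is a minimal sketch in Python (the lexicon, rules, and names are my own toy construction, purely for illustration). The rules mention only word types; the lexicon maps words to types; the vocabulary can grow without touching the rules:

```python
# Illustrative sketch: syntax stated over word types, not over words.
# The lexicon maps each word to its type.
LEXICON = {
    "the": "Det", "a": "Det",
    "cat": "N", "dog": "N", "idea": "N",
    "sleeps": "V", "barks": "V",
    "green": "Adj", "furry": "Adj",
}

# The rules mention only types: S -> NP VP, NP -> Det (Adj) N, VP -> V.
RULES = [
    ("S",  ["NP", "VP"]),
    ("NP", ["Det", "N"]),
    ("NP", ["Det", "Adj", "N"]),
    ("VP", ["V"]),
]

# The vocabulary grows; the rules stay fixed.
LEXICON["blockchain"] = "N"  # a new noun, grammar unchanged
```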

In computational linguistics, parsing is the process of searching through the universe of possible sentence descriptions for one that matches a given sentence. A wide variety of strategies have been developed for parsing, with varying degrees of success. They tend to be subject to the problem of combinatorial explosion: as sentences become longer and more complex, the number of alternatives that must be tried increases, and tends to do so exponentially. Nonetheless, parsers have been developed that are useful and usable in various contexts.
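Here is a minimal sketch of parsing as search, again in Python with a toy grammar of my own devising. A recursive-descent parser tries each rule alternative in turn; each alternative is a branch in the search, and the branches multiply as sentences grow longer:

```python
# A toy grammar: rules over word types, plus a small lexicon.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["Det", "Adj", "N"], ["Adj", "NP"]],
    "VP": [["V"], ["V", "Adv"]],
}
LEXICON = {
    "the": "Det", "old": "Adj", "man": "N", "boat": "N",
    "sails": "V", "slowly": "Adv",
}

def spans(symbol, words, i):
    """Return every position at which `symbol`, starting at position
    `i`, could end. Trying every rule alternative is the search; the
    alternatives multiply as sentences grow longer."""
    if symbol not in RULES:  # a word type: match a single word
        ok = i < len(words) and LEXICON.get(words[i]) == symbol
        return [i + 1] if ok else []
    ends = []
    for alternative in RULES[symbol]:      # branch on each alternative
        positions = [i]
        for part in alternative:           # each part must follow the last
            positions = [e for p in positions for e in spans(part, words, p)]
        ends.extend(positions)
    return ends

sentence = "the old man sails slowly".split()
print(len(sentence) in spans("S", sentence, 0))  # True: a parse exists
```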

However, all we get from this are accounts of why this or that sentence is permissible. The following sentence is grammatically correct, but meaningless:

Colorless green ideas sleep furiously.

That is, of course, Chomsky’s famous example, which he introduced in Syntactic Structures to make the point that syntax is independent of semantics.
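The point can be checked mechanically. Here’s a small demonstration using the NLTK toolkit and a toy grammar of my own; the parser finds a perfectly good structure for the sentence, meaning notwithstanding:

```python
import nltk  # pip install nltk

# A toy grammar just big enough for Chomsky's sentence. The rules
# know nothing of meaning, only word types and their arrangement.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Adj NP | N
    VP -> V Adv
    Adj -> 'colorless' | 'green'
    N -> 'ideas'
    V -> 'sleep'
    Adv -> 'furiously'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("colorless green ideas sleep furiously".split()):
    tree.pretty_print()  # a legal parse for a meaningless sentence
```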

The poet John Hollander made that sentence tractable by adding two lines before it so that the three lines together constitute a (meaningful) poem:

Curiously deep, the slumber of crimson thoughts:
While breathless, in stodgy viridian
Colorless green ideas sleep furiously.

That is a matter of semantics, and semantics seems to be ill-defined and unbounded.

Document Retrieval

Now let’s think about going to a library to find a book. Each book is assigned a number and the books are placed on shelves according to number. To find a book you need to know its number.

How do you find that number? In the old days libraries had card catalogs, which still exist. One catalog was organized by authors’ names; another by book titles; and a third by subject category. If you knew the title or the author’s name, you would look it up in the appropriate catalog to find the book’s number. If you simply wanted to search by topic, you would go to the subject catalog and browse.

Electronic library catalogs have been available for decades. I can go online and search the catalog of my local library. Author, title, and subject searches are available, and I can also search by keyword.

Starting back in the 1970s, however, Gerard Salton and his colleagues developed more sophisticated methods for searching collections of electronic documents. Such documents typically had an abstract associated with them; the abstract gave a short description of the document. By representing the abstract as a vector in a high-dimensional space of words, it became possible to search for documents simply by presenting a query in natural language. The query would be transformed into a vector, and that vector would then be matched against the vectors for the document collection. A list of the documents with the closest matches would be returned to the user.
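Here is a minimal sketch of the idea in Python, using raw word counts and cosine similarity (Salton’s actual systems used more refined weighting schemes; the documents and names here are made up for illustration):

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as word counts: one dimension per word."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Made-up abstracts standing in for a document collection.
abstracts = {
    "doc1": "retrieval of documents by vector space methods",
    "doc2": "parsing sentences with context free grammars",
    "doc3": "indexing and retrieval in digital libraries",
}

query = vectorize("vector methods for document retrieval")
ranked = sorted(abstracts,
                key=lambda d: cosine(query, vectorize(abstracts[d])),
                reverse=True)
print(ranked)  # documents ordered by closeness to the query
```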

In this case the documents in a collection were being directly queried by their contents. That’s quite different from searching a card catalog for the identifying numbers of books, which could then be used to locate the books on shelves. The book location system is fundamentally external to the contents of the books. Such a separation between location and content doesn’t exist in the kind of system Salton developed. This is the origin of the vector-based semantics that underlies current work in machine learning.

Interesting, but it’s not like chess

There’s a lot more I could have said about language and computing, but this is enough to make my basic point, which is that language and chess are very different computationally. And that difference can be traced back in large part to their geometric footprints in the world. That takes me back to the end of the first post in this series, where I mentioned the work of Miriam Yevick. In her 1975 paper Yevick was specifically interested in the relationship between the nature of the objects over which a computation is performed and the procedures used to accomplish that computation:

Miriam Yevick, “Holographic or Fourier Logic,” Pattern Recognition 7 (1975): 187–213, https://doi.org/10.1016/0031-3203(75)90005-9.

She considered geometrically complex objects, such as Chinese characters or any number of natural objects, on the one hand, and geometrically simple objects, such as those studied in plane geometry, on the other. The former required holographic logic, in her terminology, while the latter could use the ordinary propositional calculus.

Can her argument be extended to the difference between chess and language, where the former is considered to be geometrically simple and the latter geometrically complex?
