Friday, April 19, 2024

Friday Fotos: Hoboken and the Hudson

Will Team LLM ever catch up to Team Atlas? [+ escape from a maze as a facilitating analogy]

I don't know when I first saw a video of Atlas. But whenever it was, I'm sure I was astounded. As astounded as I was with ChatGPT? I don't remember. And of course, I couldn't play around with Atlas. In any case, that would have been much more difficult than playing around with ChatGPT.

The thing is, the researchers who built Atlas had to develop a profound understanding of the dynamics of humanoid motion. In contrast, the researchers who work on LLMs don't need to know much of anything about language. That's worth thinking about.

Consider, for example, these remarks that a pioneering computational linguist, Martin Kay, made a while back:

Symbolic language processing is highly nondeterministic and often delivers large numbers of alternative results because it has no means of resolving the ambiguities that characterize ordinary language. This is for the clear and obvious reason that the resolution of ambiguities is not a linguistic matter. After a responsible job has been done of linguistic analysis, what remain are questions about the world. They are questions of what would be a reasonable thing to say under the given circumstances, what it would be reasonable to believe, suspect, fear, or desire in the given situation. [...] What we are doing is to allow statistics over words that occur very close to one another in a string to stand in for the world construed widely, so as to include myths, and beliefs, and cultures, and truths and lies and so forth. As a stop-gap for the time being, this may be as good as we can do, but we should clearly have only the most limited expectations of it because, for the purpose it is intended to serve, it is clearly pathetically inadequate. The statistics are standing in for a vast number of things for which we have no computer model. They are therefore what I call an “ignorance model.”

LLMs did not exist in 2005, when Kay made those remarks. As he died in 2021, before the release of ChatGPT, I don't know how he would have reacted to it. I see little reason to alter those remarks. Perhaps we should remove the word “clearly,” and perhaps “pathetically” as well. But LLMs are still inadequate models of human linguistic behavior, and the industry’s current infatuation with them is perhaps an ironic testament to the cliché that ignorance is bliss.

Still, I do think that LLMs have an important role to play in developing a detailed understanding of how language works. Language is grounded in the operations of the human brain. Our ability to probe and manipulate the brain is quite limited. That is not the case with LLMs. Here’s a facilitating analogy I am working on for a report I am preparing on my work with ChatGPT over the last year. I’m talking about using ChatGPT to generate stories:

The model is structured such that, when it starts generating a text from a certain location in its activation space, it will have created a coherent text – a story in this case – word-by-word, by the time it exits that region of the space.

As a crude analogy, consider what is called a simply connected maze, one without any loops. If you are lost somewhere in such a maze, no matter how large and convoluted it may be, there is a simple procedure you can follow that will take you out of the maze. You don’t need to have a map of the maze; that is, you don’t need to know its structure. Simply place either your left or your right hand in contact with a wall and then start walking. As long as you maintain contact with the wall, you will find an exit. The structure of the maze is such that that local rule will take you out.
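Here’s a minimal sketch of that rule in code, the right-hand version of the wall-follower on a small grid maze. The maze itself, its encoding, and all the names are my own illustrative assumptions; the point is only that the walker never consults a map of the whole maze, it just applies one local rule at every step.

```python
# A toy, simply connected maze: '#' is wall, '.' is open corridor, 'E' is the exit.
MAZE = [
    "#########",
    "#...#...#",
    "#.###.#.#",
    "#.#...#.#",
    "#.#.###.#",
    "#...#...E",
    "#########",
]

DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # N, E, S, W, in clockwise order

def open_cell(r, c):
    return 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] != "#"

def escape(r, c, heading=1, max_steps=10_000):
    """Keep your right hand on the wall until you reach the exit."""
    for _ in range(max_steps):
        if MAZE[r][c] == "E":
            return (r, c)
        # The local rule: prefer turning right, then straight, then left, then back.
        for turn in (1, 0, 3, 2):
            d = (heading + turn) % 4
            dr, dc = DIRS[d]
            if open_cell(r + dr, c + dc):
                heading = d
                r, c = r + dr, c + dc
                break
    return None  # no exit found within the step budget

print(escape(1, 1))  # wanders the corridors and ends up at the exit, (5, 8)
```

No global knowledge of the maze enters anywhere; the guarantee of reaching the exit comes entirely from the structure of the maze itself.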

“Produce the next word” is certainly a local rule. The structure of LLMs is such that, given the appropriate context – a prompt asking for a story – following that rule will produce a coherent story. Given a different context, that is to say, a different prompt, that simple rule will produce a different kind of text.
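Here’s a sketch of that local rule in code, assuming the Hugging Face transformers library and GPT-2 as a small stand-in for a model like ChatGPT (its stories will be far cruder, but the mechanics are the same). The loop does nothing but apply “produce the next token” over and over; the prompt is what supplies the context.

```python
# A sketch of "produce the next word" as a local rule, using greedy decoding.
# GPT-2 here is just an illustrative stand-in for a much larger model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continue_text(prompt, max_new_tokens=60):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids).logits[:, -1, :]            # scores for the next token only
            next_id = logits.argmax(dim=-1, keepdim=True)   # the local choice
            ids = torch.cat([ids, next_id], dim=-1)         # append and repeat
    return tokenizer.decode(ids[0])

# A story prompt is one context; a different prompt is a different context.
print(continue_text("Tell me a story about a princess and a dragon.\n\nOnce upon a time"))
```

Nothing in the loop changes when the prompt changes; only the region of the model’s space that the rule is operating in.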

Now, let’s push the analogy to the breaking point: We may not know the structure of LLMs, but we do know a lot about the structure of texts, from phrases and sentences to extended texts of various kinds. In particular, the structure of stories has been investigated by students of several disciplines, including folklore, anthropology, literary criticism, linguistics, and symbolic artificial intelligence. Think of the structures proposed by those disciplines as something like a map of the maze in our analogy.

Unfortunately, students of those various disciplines have not reached a consensus on how to characterize those structures. Linguists are entertaining a variety of proposals about the nature of sentence-level syntax and students of those other disciplines haven’t converged on a way to describe story structure. Still, we have a starting point for constructing our story maps, even if it is somewhat confused and ambiguous.

If we are to exploit LLMs in the ways that analogy suggests, then we are going to have to use symbolic models to do it. First, we propose a symbolic model for some aspect of structure that we know how to probe or manipulate. Then we see whether an appropriately prepared LLM behaves in the way our model predicts it should. If it doesn’t, then we revise and repeat (there’s a sketch of the loop below).

Iteratively.

Again,

and again,

and again....
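For what it’s worth, here is a purely hypothetical sketch of that loop in code. None of these functions exist anywhere; the names (story_model, probe_llm, fits, revise) simply label the steps: propose a symbolic model, probe the LLM, compare, revise.

```python
# Hypothetical propose-test-revise loop; every name here is an assumption.
def refine(story_model, prompts, probe_llm, fits, revise, max_rounds=10):
    """Iterate: predict with the symbolic model, probe the LLM, revise on mismatch."""
    for _ in range(max_rounds):
        mismatches = []
        for prompt in prompts:
            predicted = story_model.predict_structure(prompt)  # e.g., an expected sequence of story segments
            observed = probe_llm(prompt)                       # what the LLM actually produces
            if not fits(predicted, observed):
                mismatches.append((prompt, predicted, observed))
        if not mismatches:
            return story_model                                 # the model survives this round of probes
        story_model = revise(story_model, mismatches)          # again, and again...
    return story_model
```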

Thursday, April 18, 2024

More pansies

Another Crazy Interview: Mark Zuckerberg

YouTube copy:

8,847 views Apr 18, 2024 Dwarkesh Podcast
Zuck on:

- Llama 3
- open sourcing towards AGI
- custom silicon, synthetic data, & energy constraints on scaling
- Caesar Augustus, intelligence explosion, bioweapons, $10b models, & much more

Enjoy!

Timestamps

00:00:00 Llama 3
00:09:15 Coding on path to AGI
00:26:07 Energy bottlenecks
00:34:03 Is AI the most important technology ever?
00:38:04 Dangers of open source
00:54:40 Caesar Augustus and metaverse
01:05:36 Open sourcing the $10b model & custom silicon
01:16:02 Zuck as CEO of Google+

I don’t know what to make of this. Zuckerberg’s a smart guy. As founder and CEO of Meta (formerly Facebook) he’s also rich and powerful. I take it as self-evident that there’s some kind of connection between being a smart guy and whatever/however he became rich and powerful. I also take it as self-evident that the path that led to being rich and powerful was touched by more than a little luck.

When he talks about energy bottlenecks on the way to more and more compute for whatever, I figure he more or less knows what he’s talking about. That’s a thinkable problem and he’s got smart staff who can dig into all the details and advise him.

And when he talks about AI being the most important technology ever, now things get tricky. He obviously thinks it is. Lots of people think that, or something close to it. I’m one of them. But beyond that, just why that’s the case and what it means for the future, who knows? Some people, though, have thought about it more deeply than others, much more deeply.

How deeply has Zuckerberg thought about it? How deeply could he have possibly thought about it? He dropped out of Harvard in his sophomore year to run his company. He’s been running it ever since. That doesn’t leave him much time to read deeply in a wide range of subjects: philosophy, cultural evolution, cognitive science, the history of science and technology, anthropology, and so forth and so on. I believe at one time he set out to visit all 50 states in America, so he spent a lot of time traveling. He probably had some time to read. Did he read up on everything that’s relevant to thinking about the history of humankind? But how much could he have possibly read?

Nor is it a matter of just reading. You have to think about it. And to really think you need to write and discuss. How much of that has he done on those kinds of subjects?

And yet now he’s having a conversation with Dwarkesh Patel on really Big Picture Issues. It sounds to me like he’s mostly just making stuff up. If he were an A.I. we’d say he’s hallucinating, confabulating. But what else can he do?

Note that I say this not in a spirit of criticizing Zuckerberg, or, for that matter, Patel. I’m writing in a spirit of observation. THAT’s what they’re doing.

Do they HAVE to do it? Well, Dwarkesh has more leeway than Zuckerberg. Dwarkesh is just a podcaster. He’s got to get clicks, and he’s in a position to attract interviews that bring him clicks. Nothing much depends on his interviews in any direct way. 

 But a great deal depends on the decisions Zuckerberg makes about Meta, over which he seems to have extraordinary control. And the nature of Meta’s business is such that those Big Picture Issues bear on how Meta utilizes its resources. He may not have had time to think about those issues very deeply, but he has no choice but to make decisions of that kind.

That’s crazy.

Twigs and crane with flag

Peter Thiel on the Western canon and on A.I.

Tyler Cowen interviews Peter Thiel on various topics. Here’s the introduction:

In this conversation recorded live in Miami, Tyler and Peter Thiel dive deep into the complexities of political theology, including why it’s a concept we still need today, why Peter’s against Calvinism (and rationalism), whether the Old Testament should lead us to be woke, why Carl Schmitt is enjoying a resurgence, whether we’re entering a new age of millenarian thought, the one existential risk Peter thinks we’re overlooking, why everyone just muddling through leads to disaster, the role of the katechon, the political vision in Shakespeare, how AI will affect the influence of wordcels, Straussian messages in the Bible, what worries Peter about Miami, and more.

On the Western canon

COWEN: Are there other holy books besides the Bible that you draw ideas and inspiration from? And what would those be?

THIEL: I think in some sense, it’s all the great books. They’re not quite at the scale of these holy books, but there was a way that we treated Shakespeare or Cervantes or Goethe as these almost semi-divine writers, and I think that’s the attitude one has to have to read any of these books appropriately and seriously.

COWEN: So, the Western canon would be your answer, so to speak?

THIEL: Something like the Western canon. I don’t think the great books are quite as holy as the Bible, and as a result, I probably don’t read enough of them, but yes, that’s the closest approximation.

COWEN: And it includes science fiction — yes or no?

THIEL: I read a lot as a kid. I read so little of that nowadays. It’s all too depressing.

That’s certainly how Harold Bloom thought about the canonical authors, as being “almost semi-divine.” We need to rethink that. Though it goes beyond rethinking. We have to rework the mechanisms of culture. How do we use these books most effectively? That’s tricky, since we don’t know what those mechanisms are or how they work. There’s no engineering solution.

Silicon Valley’s not asking the right questions about A.I.

COWEN: For our last segment, let’s turn to artificial intelligence. As you know, large language models are already quite powerful. They’re only going to get better. In this world to come, will the wordcels just lose their influence? People who write, people who play around with ideas, pundits — are they just toast? What’s this going to look like? Are they going to give up power peacefully? Are they going to go down with the ship? Are they going to set off nuclear bombs?

THIEL: I’ll say the AI thing broadly, the LLMs — it’s a big breakthrough. It’s very important, and it’s striking to me how bad Silicon Valley is at talking about these sorts of things. The questions are either way too narrow, where it’s something like, is the next transformer model going to improve by 20 percent on the last one or something like this? Or they’re maybe too cosmic, where it’s like from there we go straight to the simulation theory of the universe. Surely there are a lot of in-between questions one could ask. Let me try to answer yours.

My intuition would be it’s going to be quite the opposite, where it seems much worse for the math people than the word people. What people have told me is that they think within three to five years, the AI models will be able to solve all the US Math Olympiad problems. That would shift things quite a bit.

There’s a longer history I always have on the math versus verbal riff. If you ask, “When did our society bias to testing people more for math ability?” I believe it was during the French Revolution because it was believed that verbal ability ran in families. Math ability was distributed in this idiot savant way throughout the population.

If we prioritized math ability, it had this meritocratic but also egalitarian effect on society. Then, I think, by the time you get to the Soviet Union, Soviet Communism in the 20th century, where you give a number theorist or chess grandmaster a medal — which was always a part I was somewhat sympathetic to in the Soviet Union — maybe it’s actually just a control mechanism, where the math people are singularly clueless. They don’t understand anything, but if we put them on a pedestal, and we tell everyone else you need to be like the math person, then it’s actually a way to control. Or the chess grandmaster doesn’t understand anything about the world. That’s a way to really control things.

If I fast-forwarded to, let’s say, Silicon Valley in the early 21st century, it’s way too biased toward the math people. I don’t know if it’s a French Revolution thing or a Russian-Straussian, secret-cabal, control thing where you have to prioritize it. That’s the thing that seems deeply unstable, and that’s what I would bet on getting reversed, where it’s like the place where math ability — it’s the thing that’s the test for everything.

It’s like if you want to go to medical school, okay, we weed people out through physics and calculus, and I’m not sure that’s really correlated with your dexterity as a neurosurgeon. I don’t really want someone operating on my brain to be doing prime number factorizations in their head while they’re operating on my brain, or something like that.

In the late ’80s, early ’90s, I had a chess bias because I was a pretty good chess player. And so my chess bias was, you should just test everyone on chess ability, and that should be the gating factor. Why even do math? Why not just chess? That got undermined by the computers in 1997. Isn’t that what’s going to happen to math? And isn’t that a long-overdue rebalancing of our society?

Thiel’s right about Silicon Valley’s questions about A.I.: they are either too narrowly technical or too “cosmic.” But it’s not at all clear to me how we’re going to get the rebalancing he mentions.

There’s much more in the interview.

Wednesday, April 17, 2024

Rich people live there, yet the sky is angry for them as well

The problem with the AGI concept

Pink sakura

The Seven Year Itch, Mad Men, Sex and the City [Media Notes 117]

I was, say, half a season into re-watching Sex and the City, which originally aired from 1998 to 2004, when I watched The Seven Year Itch, a Billy Wilder romantic comedy from 1955. Remember that iconic shot of Marilyn Monroe standing on a subway grate wearing a body-hugging white dress with the skirt billowing up around her? That’s from this film. Anyhow, “what an interesting juxtaposition,” thought I to myself, “both about sex in New York City, but separated by forty years.” Perhaps I should write something about that.

As I was thinking about that, it occurred to me that it would be interesting to throw Mad Men into the mix. It aired between 2007 and 2015, but is set mostly in New York City in the 1960s. That gives us a series set in a time not long after The Seven Year Itch, but produced from a sensibility much closer to Sex and the City.

That would be a very interesting and, alas, complicated three-way comparison, certainly much more than I can mount in a single blog post. But then, merely suggesting the comparison is, say, a third of the job. And some broad comparisons can be quickly sketched out.

We have to set aside the fact that Itch is a 105-minute movie depicting events that take place over two days while the other titles are multi-part series running for several years each and depicting several years of elapsed fictional time. But all three involve sex and romance, though Mad Men is more broadly about American life and (upper) middle-class culture and business while Sex and the City is pretty much focused on the sexual lives of four single professional women.

The obvious contrasts: First, sexual mores were more open and freer in turn-of-the-millennium New York City than they were in the mid-20th century, though that was beginning to change in the world depicted in Mad Men. Second, financially independent single professional women like the quartet in Sex and the City were rare in mid-century America. Those stories couldn’t have happened back then. One of the major story lines in Mad Men centered on how Peggy Olson moved from the secretarial pool in the first season into copywriting and eventually into management. That is to say, she moves into the kind of positions all of the Sex and the City women have, though without their sexual latitude. They pretty much assumed they belonged in such jobs.

Third: Contemporary mores allow for a much franker depiction of sexuality than was possible in 1955. It’s worth noting that there was a one-night stand in the play that The Seven Year Itch was derived from, but that was eliminated in the movie version. In both versions Tom Ewell plays Richard Sherman, a publishing executive in New York City. His wife and young son have left to spend the summer in Maine, where it’s much cooler. An unnamed young woman, played by Marilyn Monroe, sublets the apartment above his. He fantasizes about having an adulterous affair with her; they even go to see a movie (that’s when Monroe stands over a subway grate). Nothing comes of those fantasies (which are accompanied by Rachmaninoff’s Second Piano Concerto, a very 1950s touch) in the movie. I assume that the difference between the play and the movie on this matter is the difference between a (somewhat sophisticated) New York audience and a national audience. By contrast, adultery was a central motif in the world of Mad Men, which was not very far removed from the world of The Seven Year Itch. But the 21st-century storytelling conventions of Mad Men permitted, even required, a sexual frankness that had to be suppressed for national audiences in the mid-1950s. Adultery also shows up in the sexually freer world depicted in Sex and the City, though not so centrally as in Mad Men.

While there’s much more that could be said, I’ll content myself with two observations. The first is about how Marilyn Monroe was presented, much as Christina Hendricks was presented as Joan Holloway in Mad Men. Just as Monroe was the sex goddess of mid-century America, so Holloway was the sex goddess of Mad Men.

Second, and lastly, John Slattery, who played Roger Sterling, one of the central characters in Mad Men, shows up in a minor role in season three of Sex and the City. He’s one of the lovers of Carrie Bradshaw (played by Sarah Jessica Parker). Episode 2, “Politically Erect,” of season three begins with this line in voice-over:

I had been dating a politician, Bill Kelley, for three weeks now. Since most of my time with him was spent on the campaign trail, I decided to dress the part. I found some vintage Halston and did a spin on Jackie Kennedy. The early years. [...] I figured we made a good match. I was adept at fashion. He was adept at politics.

Perfect. Fashion certainly existed in the 1950s, but I can’t imagine playing it front-and-center back then in the way that it is in Sex and the City.

Tuesday, April 16, 2024

To the Moon! [She's becoming a young woman]

The song was written in 1954 as "In Other Words." It was recorded by various artists and in 1963 Peggy Lee convinced the composer, Bart Howard, to change the name to "Fly Me to the Moon." An instrumental version by Joe Harnell became a hit in 1963 and won a Grammy, but it's the 1964 recording by Frank Sinatra that I'm most familiar with. Sinatra was backed by Count Basie, with an arrangement by Quincy Jones. That version subsequently became associated with the Apollo Space program.

The only thing I can tell you about this version is that the trumpeter is young Kwak DaKyung, whom I've been following for a number of years. The male vocalist owes a debt to Sinatra, though I don't think Sinatra ever performed in shades. The female vocalist is an excellent scat singer.

From the Wikipedia article:

By 1995, the song had been recorded more than 300 times. According to a poll conducted by Japanese music magazine CD&DL Data in 2016 about the most representative songs associated with the Moon, the cover versions by Claire Littley and Yoko Takahashi ranked 7th by 6,203 respondents. The Claire cover version won the Planning Award of Heisei Anisong Grand Prize among the anime theme songs from 1989 to 1999, following its appearance in the end credits of Neon Genesis Evangelion.

From Sinatra in 1964, to the moon a couple years later, to Japanese anime at the end of the century.

Now, sit back for a moment and ponder the fact that that very American song from the mid-20th century is being performed in 2024 by a Korean jazz combo featuring a 14-year-old trumpet player. I'm waiting for her to cut loose. I wonder when?

Spots of color on the Hudson

AI, Chess, and Language 3.3: Chess and language as models, a philosophical interlude

I now want to take a look at AI, chess, and language from a Piagetian point of view. While he is best known for his work in developmental psychology, Piaget was also interested in the development of concepts over historical time, which he called genetic epistemology, and more generally, in the construction of mental mechanisms. He was particularly interested in abstraction and in something he called reflective abstraction. The concept is a slippery one. Ernst von Glasersfeld has a useful account (Abstraction, Re-Presentation, and Reflection: An Interpretation of Experience and of Piaget’s Approach) from which I abstract, if you will, perhaps an over-simplified idea, which is that major steps in cognitive evolution involve taking a mechanism that is operative at one level and making it an object which is operated on by mechanisms at a new and higher level.

Let us take chess as an example. We know that all possible chess games can be arranged as a tree – something we examined earlier in “Search! What enables us to entertain the idea that chess is a paradigmatic case of cultural evolution?” But we have only known that since the mathematician Ernst Zermelo published “Über eine Anwendung der Mengenlehre auf die Theorie des Schachspiels” (“On an application of set theory to the theory of chess”) in 1913. Ever since the game emerged, players have been exploring that tree, but without explicitly knowing it. It was only when Zermelo published the idea that the chess tree became an object that could be explicitly examined and explored as such.
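As a small illustration of the tree as an explicit object, here is a sketch that simply walks the opening portion of it. It assumes the python-chess package, which serves here only as a convenient legal-move generator; any other would do.

```python
# Walk Zermelo's tree, made concrete, to a shallow depth and count its leaves.
import chess

def count_leaves(board, depth):
    if depth == 0 or board.is_game_over():
        return 1
    total = 0
    for move in board.legal_moves:
        board.push(move)                      # descend one branch of the tree
        total += count_leaves(board, depth - 1)
        board.pop()                           # climb back to the parent node
    return total

board = chess.Board()
for d in range(1, 4):
    print(d, count_leaves(board, d))          # 20, 400, 8902: the tree fans out fast
```

Players have always been moving through this tree; the code just makes the tree itself the thing being handled.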

I don’t know when that idea crossed into the chess world. In a quick search I found that Alexander Kotov used it in a book, Think Like a Grandmaster, which was translated into English in 1971. Kotov wrote of building an “analysis tree.” I assume that chess players became aware of the idea sooner than that, perhaps not long after Zermelo’s paper was published. In any event, for my present purposes, the date is irrelevant. What is important is simply that it happened. The tree structure has been central to all work in computer chess.

The tree structure is central to the activity of search. But there is more to chess than searching for possible moves. The moves must be evaluated and a strategy has to be executed. Various means have been developed to do those things with the result that computers can now play chess better than any human. Chess is a “solved” problem. And the components of various solutions are objects for explicit examination and design.

Almost.

Unlike earlier chess programs, which were based on symbolic technology, some of the most recent ones, such as AlphaZero, use neural nets for the evaluation function. What those nets are doing is opaque. We know how to build them, but we don’t know how they do what they do.
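To locate where the opacity enters, here is an illustrative sketch (not any engine’s actual code) of search plus evaluation in the classic style, again assuming python-chess. The tree-walking part is completely transparent; evaluate() is the slot that a symbolic program fills with hand-written rules and that an AlphaZero-style program fills with a neural network whose workings we cannot read off.

```python
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board):
    """Crude hand-written material count from the side to move's point of view.
    In AlphaZero-style programs this slot is filled by an opaque neural net."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, board.turn))
        score -= value * len(board.pieces(piece_type, not board.turn))
    return score

def negamax(board, depth):
    """Search the explicit game tree to a fixed depth."""
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    best = float("-inf")
    for move in board.legal_moves:
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def best_move(board, depth=3):
    best, best_score = None, float("-inf")
    for move in board.legal_moves:
        board.push(move)
        score = -negamax(board, depth - 1)
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best

print(best_move(chess.Board()))  # some unremarkable but legal opening move
```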

And that brings us to language.

Language has been investigated for centuries. More specifically, it has been subject to formal analysis for the last three-quarters of a century, but cognitive scientists have come to little agreement about syntax, much less semantics. Nonetheless, large language models are now capable of very impressive language performance. Like all neural network models, however, these models are opaque. But what if we could figure out how they worked internally?

Consider the following diagram, which I commented on in my paper, GPT-3: Waterloo or Rubicon? Here be Dragons:

Texts are one product of the interaction of the human mind and the world. LLMs are trained on large bodies of these texts. It follows that the internal structure of these models must somehow, we don’t know how, reflect the nature of that interaction. If we could understand the internal structure of these models, wouldn’t that be a reflective abstraction over the processes of the human mind in the same way that the chess tree is a reflective abstraction over the human mind as it is engaged in the game of chess?

Yes, the chess tree is not all of chess, but only a part. And we know how to augment that part with evaluation functions. Figuring out how LLMs work would not be equivalent to knowing how the mind works, but might be a start. To know and be able to manipulate the conceptual structure that is latent in LLMs, that would be a major intellectual accomplishment.

Mimi and Eunice ride again

First iris of the year

I've been photographing irises since 2011. Having any show up in mid-April is unusual. I have an iris photo from April 30, 2021, and three photos from April 29, 2022. That's it. Otherwise, the earliest ones are in May.

Monday, April 15, 2024

The AI industry lacks useful ways of measuring performance [the boastful leading the blind]

Kevin Roose, A.I. Has a Measurement Problem, NYTimes, April 15, 2024.

There’s a problem with leading artificial intelligence tools like ChatGPT, Gemini and Claude: We don’t really know how smart they are.

That’s because, unlike companies that make cars or drugs or baby formula, A.I. companies aren’t required to submit their products for testing before releasing them to the public. There’s no Good Housekeeping seal for A.I. chatbots, and few independent groups are putting these tools through their paces in a rigorous way.

Instead, we’re left to rely on the claims of A.I. companies, which often use vague, fuzzy phrases like “improved capabilities” to describe how their models differ from one version to the next. And while there are some standard tests given to A.I. models to assess how good they are at, say, math or logical reasoning, many experts have doubts about how reliable those tests really are.

Safety risk:

Shoddy measurement also creates a safety risk. Without better tests for A.I. models, it’s hard to know which capabilities are improving faster than expected, or which products might pose real threats of harm.

In this year’s A.I. Index — a big annual report put out by Stanford University’s Institute for Human-Centered Artificial Intelligence — the authors describe poor measurement as one of the biggest challenges facing A.I. researchers.

“The lack of standardized evaluation makes it extremely challenging to systematically compare the limitations and risks of various A.I. models,” the report’s editor in chief, Nestor Maslej, told me.

Massive Multitask Language Understanding:

The MMLU, which was released in 2020, consists of a collection of roughly 16,000 multiple-choice questions covering dozens of academic subjects, ranging from abstract algebra to law and medicine. It’s supposed to be a kind of general intelligence test — the more of these questions a chatbot answers correctly, the smarter it is.

It has become the gold standard for A.I. companies competing for dominance. (When Google released its most advanced A.I. model, Gemini Ultra, earlier this year, it boasted that it had scored 90 percent on the MMLU — the highest score ever recorded.)

Dan Hendrycks, an A.I. safety researcher who helped develop the MMLU while in graduate school at the University of California, Berkeley, told me that the test was never supposed to be used for bragging rights. He was alarmed by how quickly A.I. systems were improving, and wanted to encourage researchers to take it more seriously.

Mr. Hendrycks said that while he thought MMLU “probably has another year or two of shelf life,” it will soon need to be replaced by different, harder tests. A.I. systems are getting too smart for the tests we have now, and it’s getting more difficult to design new ones.

Problems:

There may also be problems with the tests themselves. Several researchers I spoke to warned that the process for administering benchmark tests like MMLU varies slightly from company to company, and that various models’ scores might not be directly comparable.

There is a problem known as “data contamination,” when the questions and answers for benchmark tests are included in an A.I. model’s training data, essentially allowing it to cheat. And there is no independent testing or auditing process for these models, meaning that A.I. companies are essentially grading their own homework.

In short, A.I. measurement is a mess — a tangle of sloppy tests, apples-to-oranges comparisons and self-serving hype that has left users, regulators and A.I. developers themselves grasping in the dark.

There's more at the link.

Color me "not at all surprised." Not only does the field lack a sound theoretical basis, as far as I can tell it doesn't even know that, hey, one might be useful at a time like this. I don't have a theory to hand over, though I have a thought or three about how one might go about developing one, but then I'm not making (unfounded) performance claims either.

Without a coherent way of measuring performance, how can you guide the development of your products? Are we in Bruegel-land, with the blind leading the blind?