Saturday, October 5, 2024

Do LLMs memorize? 25% of "memorized" tokens are actually predicted using general language modeling features

Abstract of the linked article:

Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications. Much prior work has studied such verbatim memorization using observational data. To complement such work, we develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences. We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to verbatim memorize sequences, even for out-of-distribution sequences; (3) the generation of memorized sequences is triggered by distributed model states that encode high-level features and makes important use of general language modeling capabilities. Guided by these insights, we develop stress tests to evaluate unlearning methods and find they often fail to remove the verbatim memorized information, while also degrading the LM. Overall, these findings challenge the hypothesis that verbatim memorization stems from specific model weights or mechanisms. Rather, verbatim memorization is intertwined with the LM's general capabilities and thus will be very difficult to isolate and suppress without degrading model quality.
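For concreteness, here is a minimal sketch (my own, not the paper's code) of how verbatim memorization is typically tested: prompt a model with a prefix of a candidate sequence and check whether greedy decoding reproduces the continuation token for token. The Pythia checkpoint name and prefix length below are illustrative placeholders.

```python
# A minimal sketch (not the paper's framework) of a verbatim-memorization test:
# give the model a prefix of a candidate sequence and check whether greedy
# decoding reproduces the rest exactly. Model name and prefix length are
# placeholders chosen for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # placeholder Pythia checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def is_verbatim_memorized(text: str, prefix_tokens: int = 32) -> bool:
    ids = tok(text, return_tensors="pt").input_ids[0]
    prefix, target = ids[:prefix_tokens], ids[prefix_tokens:]
    with torch.no_grad():
        out = model.generate(
            prefix.unsqueeze(0),
            max_new_tokens=len(target),
            do_sample=False,  # greedy decoding
        )
    continuation = out[0, len(prefix):]  # strip the prompt from the output
    return torch.equal(continuation, target)
```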

I have a paper that deals with so-called memorization: Discursive Competence in ChatGPT, Part 2: Memory for Texts, Version 3.

Music and the Origins of Language: Neil deGrasse Tyson talks with Daniel Levitin

From the YouTube page:

Did early humans sing before they could talk? Neil deGrasse Tyson and Chuck Nice discover how music helps us recall memories, the singing Neanderthal theory, the default mode network, and how music can be used as medicine with neuroscientist and bestselling author, Daniel Levitin.

Would we have been able to communicate with aliens using music like in Close Encounters of the Third Kind? We explore Levitin’s new book, I Heard There Was A Secret Chord, which explores how music not only enriches our lives but also impacts our brains, behavior, and health.

We discuss how music can be a source of pleasure and how it captivates us—ever wonder why certain songs get stuck in your head? We explore how music has been a critical form of communication for thousands of years, predating written language, and how it helps encode knowledge and transmit information across generations. From ancient bone flutes to modern-day symphonies, why does music hold such a powerful place in human history?

We also dig into music's therapeutic powers—how it can boost cognitive reserves, help Parkinson's patients walk, relieve pain, and even enhance memory. Did you know that music has the power to activate every part of your brain? Whether you're soothing a baby with a lullaby or summoning old memories through a favorite song, the impact of music is profound. Levitin explains how music therapy is being explored as a potential solution to alleviate neurological afflictions like multiple sclerosis and Tourette syndrome.

Learn about the relationship between music and the brain’s "default mode network"—the state your brain enters when it’s at rest or wandering. We explore memory retrieval and how it’s tied to music’s ability to trigger unique, specific memories.

Discover why certain songs can transport us back to vivid moments in our past, acting as powerful cues for recalling experiences. We discuss how music persists beyond memory-related conditions like Alzheimer's, as seen in the case of Tony Bennett, who, despite the progression of the disease, retained the ability to perform his beloved songs. This connection between music, memory, and neural activation offers exciting possibilities for therapeutic applications in the future.

Timestamps:

00:00 - Introduction: Daniel Levitin
2:55 - Communicating to Aliens Using Music
6:12 - The Evolution of Music & Singing Neanderthal Theory
11:55 - Music v. Communication
15:45 - Neuroscience of Music & Memory Retrieval
24:34 - The Default Mode Network
28:24 - Music as Medicine
42:13 - How Does Memory Work?

Friday, October 4, 2024

How might LLMs store facts? [Multilayer Perceptrons, MLP]

Timestamps:

0:00 - Where facts in LLMs live
2:15 - Quick refresher on transformers
4:39 - Assumptions for our toy example
6:07 - Inside a multilayer perceptron
15:38 - Counting parameters
17:04 - Superposition
21:37 - Up next
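
Since the video is about where facts might live, here is a minimal PyTorch sketch of the MLP block it walks through: project the residual-stream vector up to a wider space, apply a nonlinearity, project back down, and add the result back in. The dimensions and GELU activation are illustrative GPT-style choices, not taken from any particular model.

```python
# A minimal sketch of a transformer MLP block: up-project, nonlinearity,
# down-project, residual connection. Sizes are illustrative (GPT-style 4x
# expansion), not from a specific model.
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)    # each row can probe for a feature
        self.act = nn.GELU()                                 # gates which probes fired
        self.down = nn.Linear(expansion * d_model, d_model)  # maps firings back to the residual stream

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.down(self.act(self.up(x)))           # add back to the residual stream

x = torch.randn(1, 10, 768)   # (batch, tokens, d_model)
print(MLPBlock()(x).shape)    # torch.Size([1, 10, 768])
```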


Thursday, October 3, 2024

On the dockworkers strike, labor on the rise

Sohrab Ahmari, In Praise of the Dockworkers Shutting Down Our Ports, The Free Press, October 2, 2024.

The International Longshoremen’s Association, whose strike is crippling U.S. ports from the Gulf Coast to New England, may not seem like the wretched of the Earth. They’re asking for a 77 percent pay increase on top of the $39 per hour those on the top tiers already make. The union’s president, Harold Daggett, earns $728,000 a year and once owned a 76-foot boat. With major disruptions looming, no wonder even some of those Americans ordinarily sympathetic to organized labor might be thinking, Okay, this is going too far. The less sympathetic are already calling for the Marines to suppress the strike.

But here’s the hard truth: The militancy showcased by the ILA is exactly what is needed to restore a fairer, more balanced economy—the kind that created the middle class in the postwar decades and allowed your grandparents to access reliable healthcare, take vacations, and enjoy disposable incomes. Those who complain that today’s left has come to privilege boutique identity politics over bread-and-butter concerns should cheer the longshoremen. There is nothing “woke” about their exercise of economic power to win material gains for themselves and their industrial brethren.

The longshoremen are striking for familiar reasons: better wages and benefits, and to prevent automation from decimating their livelihoods. [...]

Some critics argue that the ILA’s demand that no automation take place at the ports is unreasonably rigid. It’s certainly audacious, but it’s called an opening gambit for a reason. I suspect we will see concessions on both sides leading to a reasonable settlement, as in the case of SAG. The rest—gripes about how much the ILA president earns or how longshoremen are already well-compensated—is the tired propaganda of the C-suite class. [...]

The ILA strike is a rare reminder of working people’s power to shut it all down. [...] Real progress in market societies results from precisely this dynamic tension between labor and capital. For too long, however, one side of the equation—labor—has been torpid, not to say dormant. The asset-rich had it so good over the past few decades—capturing the lion’s share of the upside from de-unionization, financialization, and offshoring, as wages stagnated for the bottom half—that they all but forgot what labor militancy can look and sound like. How much it can sting.

Now, the labor movement is on the move. Since the pandemic, workers across a wide range of industries have joined arms to form new unions or to secure better wages and working conditions under existing collective-bargaining agreements. Last year, some 539,000 workers were involved in 470 strikes and walkouts, according to Cornell researchers, up from 140,000 workers mounting 279 strikes in 2021. This ferment—what one labor scholar has called a “strike wave”—comes after the union share of the private-economy workforce has declined from its peak of one-third in 1945 to 6 percent today.

There’s more at the link.

Blown-out flicks of flowers

Problems with so-called AI scaling laws

Arvind Narayanan and Sayash Kapoor, AI Scaling Myths, AI Snake Oil, June 27, 2024. The introduction:

So far, bigger and bigger language models have proven more and more capable. But does the past predict the future?

One popular view is that we should expect the trends that have held so far to continue for many more orders of magnitude, and that it will potentially get us to artificial general intelligence, or AGI.

This view rests on a series of myths and misconceptions. The seeming predictability of scaling is a misunderstanding of what research has shown. Besides, there are signs that LLM developers are already at the limit of high-quality training data. And the industry is seeing strong downward pressure on model size. While we can't predict exactly how far AI will advance through scaling, we think there’s virtually no chance that scaling alone will lead to AGI.

Under the heading, "Scaling “laws” are often misunderstood", they note:

Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence. Of course, perplexity is more or less irrelevant to end users — what matters is “emergent abilities”, that is, models’ tendency to acquire new capabilities as size increases.

Emergence is not governed by any law-like behavior. It is true that so far, increases in scale have brought new capabilities. But there is no empirical regularity that gives us confidence that this will continue indefinitely.

Why might emergence not continue indefinitely? This gets at one of the core debates about LLM capabilities — are they capable of extrapolation or do they only learn tasks represented in the training data? The evidence is incomplete and there is a wide range of reasonable ways to interpret it. But we lean toward the skeptical view.
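
For reference, the scaling “laws” in question are parametric fits of loss (log-perplexity) against model size N and training tokens D. A sketch using the Chinchilla form from Hoffmann et al. (2022), with their approximate published coefficients, makes the authors' point concrete: loss falls smoothly and predictably with scale, but the formula says nothing about which capabilities appear at a given loss.

```python
# The Chinchilla parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta
# (Hoffmann et al., 2022). Coefficients are the approximate published fits.
# Note the law only predicts loss, not "emergent abilities".
def chinchilla_loss(N: float, D: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / N**alpha + B / D**beta

print(chinchilla_loss(70e9, 1.4e12))   # ~1.94 at Chinchilla scale
print(chinchilla_loss(700e9, 14e12))   # lower, but the curve flattens toward E
```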

There is much more under the following headings:

• Trend extrapolation is baseless speculation
• Synthetic data is not magic
• Models have been getting smaller but are being trained for longer
• The ladder of generality

These remarks are from the section on models getting smaller:

In other words, there are many applications that are possible to build with current LLM capabilities but aren’t being built or adopted due to cost, among other reasons. This is especially true for “agentic” workflows which might invoke LLMs tens or hundreds of times to complete a task, such as code generation.

In the past year, much of the development effort has gone into producing smaller models at a given capability level. Frontier model developers no longer reveal model sizes, so we can’t be sure of this, but we can make educated guesses by using API pricing as a rough proxy for size. GPT-4o costs only 25% as much as GPT-4 does, while being similar or better in capabilities. We see the same pattern with Anthropic and Google. Claude 3 Opus is the most expensive (and presumably biggest) model in the Claude family, but the more recent Claude 3.5 Sonnet is both 5x cheaper and more capable. Similarly, Gemini 1.5 Pro is both cheaper and more capable than Gemini 1.0 Ultra. So with all three developers, the biggest model isn’t the most capable!

Training compute, on the other hand, will probably continue to scale for the time being. Paradoxically, smaller models require more training to reach the same level of performance. So the downward pressure on model size is putting upward pressure on training compute.

Check out the newsletter, AI Snake Oil, and the book of the same title.

OpenAI’s $6.6B raise: What were they thinking?

Cory Weinberg, The Briefing: The Cynic’s Guide to OpenAI’s Megaround, The Information, Oct. 2, 2024:

The biggest question is: Will OpenAI ever be a good business? It’s debatable right now. At least on a sales multiple basis (13 to 14 times next year’s forecasted $11.6 billion revenue), some investors can justify it without embarrassment.

But investors in the latest round probably need OpenAI to eventually become a roughly $1 trillion company to get a strong return. That means at some point the startup will have to become a cash flow machine rather than a cash incinerator.

Of the seven companies with over $1 trillion in market cap currently, the median free cash flow from the past year was $57 billion. In that regard, OpenAI, which is chasing growth and spending heavily on computing capacity, has quite a way to go. (For what it’s worth, Fidelity investing in the latest round should mean we get a regular check-in on how OpenAI’s valuation is shifting, at least in the opinion of Fidelity, which needs to make its startup valuations public.)

To be sure, even if OpenAI’s latest benefactors don’t believe it can get to $1 trillion, many of them have all sorts of ulterior, strategic reasons to back the startup.
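
Some back-of-the-envelope arithmetic on the figures quoted above, illustrative only:

```python
# Illustrative arithmetic using the figures quoted above: the valuation
# implied by a 13-14x multiple of forecast revenue, and the further multiple
# needed to reach $1 trillion.
revenue_forecast = 11.6e9                      # next year's forecast revenue
for multiple in (13, 14):
    valuation = multiple * revenue_forecast    # implied by the sales multiple
    print(f"{multiple}x revenue -> ${valuation / 1e9:.0f}B valuation; "
          f"{1e12 / valuation:.1f}x more to reach $1T")
# 13x revenue -> $151B valuation; 6.6x more to reach $1T
# 14x revenue -> $162B valuation; 6.2x more to reach $1T
```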

Monday, September 30, 2024

The formal structure of “And Your Bird Can Sing”

Around the corner at Crooked Timber, Belle Waring has an interesting post, Final Choruses and Outros Apparently. Her first example is a Beatles tune from Revolver (1966), “And Your Bird Can Sing.” If you like to listen to music analytically – which I do, though not all the time, not at all! – it can throw you for a loop. You think you know what’s going on, then it goes sideways and you don’t know where you are. Just when you’re about to give up, it catches you and you are THERE.

OK, so I listened to "And Your Bird Can Sing" and took notes. I think it goes like this:

1) 4 bar instrumental (parallel guitar lines)
2) A-strain, 8 bars
3) A-strain, 8 bars
4) B-strain, 8 bars
5) 8 bar instrumental
6) B-strain, 8 bars
7) A-strain, 8 bars
8) instrumental outro, 12 bars

We start with parallel guitar lines (played by George and Paul), which are used in various ways. Up through #4 it could be a standard AABA tune, like “I Got Rhythm”. Now, if that's what was going on, we'd go back to the A-strain.

But that's not what happens, not at all. Instead we get those parallel guitars, and not for 4 bars, but for 8. Then we get a repetition of the B-strain. And that, in turn, is followed by (a return to) the A-strain, with added harmony line. It ends with an extended version of the parallel guitars line.

I suppose we can think of it as a variation on the AABA tune where the B section (often called the bridge) is extended. What makes this extended bridge (sections 4, 5, and 6) particularly interesting is the inclusion of that purely instrumental line in the middle (section 5). That’s a bit disorienting. Where are we? Are we going way back to the intro, even before the beginning of the song proper? Not really. But it really isn’t until we return to the final repetition of the A-strain (with added harmony) that our equilibrium is restored: Now I know where we are.

Those parallel guitar lines are quite striking and stand in contrast to the A and B strains, which carry the lyrics. The Wikipedia entry for the song, which is interesting and worth a read, quite properly notes that it anticipates a “type of pop-rock arrangement [that] would later be popularised by Southern rock bands such as the Allman Brothers Band and Lynyrd Skynyrd, as well as hard rock and metal acts such as Thin Lizzy, Boston and Iron Maiden.”

* * * * *

Here's a recent cover version by musicians you’ve probably never heard of. Notice that one guitarist (Josh Turner) plays the parallel lines originally played by George and Paul.

* * * * *

For extra credit. Here’s a different, and I believe earlier, version by the Beatles. The structure is somewhat different. Setting aside the laughter and whistling, what are the formal differences?

FutureWorld on the Hudson?

Wolfram on Machine Learning

Wolfram has a post in which he reflects on the work he’s done in the last five years: Five Most Productive Years: What Happened and What’s Next. On ChatGPT:

So at the beginning of February 2023 I decided it’d be better for me just to write down once and for all what I knew. It took a little over a week [...]—and then I had an “explainer” (that ran altogether to 76 pages) of ChatGPT.

Partly it talked in general about how machine learning and neural nets work, and how ChatGPT in particular works. But what a lot of people wanted to know was not “how” but “why” ChatGPT works. Why was something like that possible? Well, in effect ChatGPT was showing us a new science discovery—about language. Everyone knows that there’s a certain syntactic grammar of language—like that, in English, sentences typically have the form noun-verb-noun. But what ChatGPT was showing us is that there’s also a semantic grammar—some pattern of rules for what words can be put together and make sense.

My version of “semantic grammar” is the so-called “great chain of being,” which is about conceptual ontology, roughly: “rules for what words can be put together and make sense.” Here’s a post where I discuss it in the context of Wolfram’s work: Stephen Wolfram is looking for “semantic grammar” and “semantic laws of motion” [Great Chain of Being].
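
Here’s a toy illustration (my own construction, not Wolfram’s) of what a semantic grammar grounded in a great-chain-style ontology might look like: each noun sits on a rung of the chain, and each verb constrains which rungs its arguments may occupy. Syntax alone happily admits “the rock ate the grass”; the semantic grammar rejects it.

```python
# A toy "semantic grammar" as selectional restrictions over a great-chain
# ontology. Purely illustrative: the chain, lexicon, and verb constraints
# are invented for this example.
CHAIN = ["inanimate", "plant", "animal", "human"]  # low rung to high rung
NOUNS = {"rock": "inanimate", "grass": "plant", "cow": "animal", "farmer": "human"}
VERBS = {  # verb: (minimum rung for subject, allowed rungs for object)
    "eat":   ("animal", {"plant", "animal"}),
    "plant": ("human",  {"plant"}),
}

def makes_sense(subj: str, verb: str, obj: str) -> bool:
    min_subj, obj_ok = VERBS[verb]
    subj_high_enough = CHAIN.index(NOUNS[subj]) >= CHAIN.index(min_subj)
    return subj_high_enough and NOUNS[obj] in obj_ok

print(makes_sense("cow", "eat", "grass"))      # True
print(makes_sense("rock", "eat", "grass"))     # False: rocks don't eat
print(makes_sense("farmer", "plant", "rock"))  # False: you plant plants
```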

A bit later Wolfram says a bit more about what he’s recently discovered about the “essence of machine learning”:

So just a few weeks ago, starting with ideas from the biological evolution project, and mixing in some things I tried back in 1985, I decided to embark on exploring minimal models of machine learning. I just posted the results last week. And, yes, one seems to be able to see the essence of machine learning in systems vastly simpler than neural nets. In these systems one can visualize what’s going on—and it’s basically a story of finding ways to put together lumps of irreducible computation to do the tasks we want. Like stones one might pick up off the ground to put together into a stone wall, one gets something that works, but there’s no reason for there to be any understandable structure to it.

And the future? Among other things: “symbolic discourse language”:

But finally there was blockchain, and with it, smart contracts. And around 2015 I started thinking about how one might represent contracts in general not in legalese but in some precise computational way. And the result was that I began to crispen my ideas about what I called “symbolic discourse language”. I thought about how this might relate to questions like a “constitution for AIs” and so on. But I never quite got around to actually starting to design the specifics of the symbolic discourse language.

But then along came LLMs, together with my theory that their success had to do with a “semantic grammar” of language. And finally now we’ve launched a serious project to build a symbolic discourse language. And, yes, it’s a difficult language design problem, deeply entangled with a whole range of foundational issues in philosophy. But as, by now at least, the world’s most experienced language designer (for better or worse), I feel a responsibility to try to do it.

In addition to language design, there’s also the question of making all the various “symbolic calculi” that describe in appropriately coarse terms the operation of the world. Calculi of motion. Calculi of life (eating, dying, etc.). Calculi of human desires. Etc. As well as calculi that are directly supported by the computation and knowledge in the Wolfram Language.

And just as LLMs can provide a kind of conversational linguistic interface to the Wolfram Language, one can expect them also to do this to our symbolic discourse language. So the pattern will be similar to what it is for Wolfram Language: the symbolic discourse language will provide a formal and (at least within its purview) correct underpinning for the LLM. It may lose the poetry of language that the LLM handles. But from the outset it’ll get its reasoning straight.

The symbolic discourse language is a broad project. But in some sense breadth is what I have specialized in. Because that’s what’s needed to build out the Wolfram Language, and that’s what’s needed in my efforts to pull together the foundations of so many fields.

Thursday, September 19, 2024

Aaron Sorkin: As a fictional president, Trump would be "simply implausible"

Marc Tracy, Aaron Sorkin Thinks Life Still Imitates ‘The West Wing’, NYTimes, Sept. 19, 2024.

We are speaking to each other the day after the only scheduled debate between the two presidential candidates this year.

If I had scripted last night’s debate, you would have said that I made Kamala Harris fight a straw man. A lot [of shows and movies are] going to be written about this time that we’re living in now. But my prediction is that you’ll never see Donald Trump as anything but an offscreen character. You’ll see him on a television set on the news. Because he is simply implausible.

There is a movie coming out about Trump, but to your point, it is set 40 years ago.

Sebastian Stan is playing Trump in the ’70s and ’80s. I mean President Trump. Even saying it doesn’t really sound right.

It has been a pretty dramatic summer politically. What have you made of it?

Over the years, cable newscasters have used the phrase “‘West Wing’ moment,” as in: “There’s a clash over the debt ceiling. There’s not going to be a ‘West Wing’ moment.” They’ve used that to mean: an unrealistically high expectation of character triumphing over selfishness, and in the real world, there are not “‘West Wing’ moments.” I believe that the morning Biden stepped out of the race, that was a “West Wing” moment. That’s the kind of thing we write stories about.

Boats lined up on a pier

Wednesday, September 18, 2024

Emergence

Sunday, September 15, 2024

LLMs are not fundamentally about language [Karpathy]

Note that some time ago I pointed out that transformers would operate in the same way on strings of colored beads as they do on strings of word tokens.
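
A minimal sketch of the point: a transformer never sees “words”, only integer token ids, so a string of colored beads trains and runs exactly like a string of word tokens. Swap the bead vocabulary for a word vocabulary and nothing in the model changes. The bead vocabulary below is invented for illustration.

```python
# A transformer over "colored beads": the model only consumes integer ids,
# so the vocabulary could just as well be words. Sizes are illustrative.
import torch
import torch.nn as nn

BEADS = ["red", "green", "blue", "yellow"]  # the whole "vocabulary"
ids = torch.tensor([[0, 2, 1, 1, 3, 0]])    # one "sentence" of beads

embed = nn.Embedding(len(BEADS), 32)        # bead id -> vector
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

out = encoder(embed(ids))                   # same machinery as for word tokens
print(out.shape)                            # torch.Size([1, 6, 32])
```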