I'm having fun listening to this conversation between Bob Wright and Timothy Lee:
I'm finding the stuff on how transformers work particularly useful.
TOC:
0:05 Tim’s article explaining large language models
2:55 Do LLMs reverse engineer the human mind?
10:56 GPT’s mesmerizing emergent properties
17:15 The ‘T’ in Chat-GPT
30:05 How AI models evolve during training
41:31 Human intelligence compared to artificial intelligence
Check out Tim's longish paper written with Sean Trott (looks good): Large language models, explained with a minimum of math and jargon. From the article:
Research suggests that the first few layers focus on understanding the syntax of the sentence and resolving ambiguities like we’ve shown above. Later layers (which we’re not showing to keep the diagram a manageable size) work to develop a high-level understanding of the passage as a whole.
For example, as an LLM “reads through” a short story, it appears to keep track of a variety of information about the story’s characters: sex and age, relationships with other characters, past and current location, personalities and goals, and so forth.
Researchers don’t understand exactly how LLMs keep track of this information, but logically speaking the model must be doing it by modifying the hidden state vectors as they get passed from one layer to the next. It helps that in modern LLMs, these vectors are extremely large.
For example, the most powerful version of GPT-3 uses word vectors with 12,288 dimensions—that is, each word is represented by a list of 12,288 numbers. That’s 20 times larger than Google’s 2013 word2vec scheme. You can think of all those extra dimensions as a kind of “scratch space” that GPT-3 can use to write notes to itself about the context of each word. Notes made by earlier layers can be read and modified by later layers, allowing the model to gradually sharpen its understanding of the passage as a whole.
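To make that "scratch space" picture concrete, here is a minimal Python sketch (NumPy only, and very much a toy) of hidden state vectors flowing through a stack of layers, with each layer adding an update to the vectors it receives. The toy_layer function is a made-up placeholder for the real attention and feed-forward computations; only the dimensions, 12,288 numbers per word and 96 layers, come from the passage above.

```python
import numpy as np

# Toy sketch, not GPT-3's actual code: each word in the passage is carried
# through the network as a hidden state vector. Every layer reads the current
# vectors and writes an update back into them, which is the "scratch space" idea.
D_MODEL = 12288   # numbers per word in the largest GPT-3 (from the passage above)
N_LAYERS = 96     # layers in that model (from the passage above)

def toy_layer(hidden_states, rng):
    """Stand-in for one layer: compute an update from the current hidden
    states and add it back in, so notes written by earlier layers stay
    readable (and modifiable) by later ones."""
    update = rng.standard_normal(hidden_states.shape) * 0.01  # placeholder for attention + MLP
    return hidden_states + update

rng = np.random.default_rng(0)
n_words = 5                                    # a 5-word passage, for illustration
hidden = rng.standard_normal((n_words, D_MODEL))

for _ in range(N_LAYERS):
    hidden = toy_layer(hidden, rng)            # each layer refines every word's vector

print(hidden.shape)                            # (5, 12288): one 12,288-number list per word
```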
So suppose we changed our diagram above to depict a 96-layer language model interpreting a 1,000-word story. The 60th layer might include a vector for John with a parenthetical comment like “(main character, male, married to Cheryl, cousin of Donald, from Minnesota, currently in Boise, trying to find his missing wallet).” Again, all of these facts (and probably a lot more) would somehow be encoded as a list of 12,288 numbers corresponding to the word John. Or perhaps some of this information might be encoded in the 12,288-dimensional vectors for Cheryl, Donald, Boise, wallet, or other words in the story.
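As an aside, it may help to contrast how a programmer would store those facts with how the model has to hold them. The sketch below does nothing but make that contrast visible: the dictionary entries are just the facts listed in the parenthetical above, and the vector is an empty placeholder, since nobody knows how the real encoding works.

```python
import numpy as np

# Schematic contrast, not the model's mechanism: the dictionary is the kind of
# explicit record a programmer might keep about John; the model instead has
# only the 12,288 numbers attached to the word "John" (and perhaps to nearby
# words), in which this information is somehow encoded.
john_facts = {
    "role": "main character",
    "sex": "male",
    "married_to": "Cheryl",
    "cousin_of": "Donald",
    "from": "Minnesota",
    "currently_in": "Boise",
    "goal": "find his missing wallet",
}

john_vector = np.zeros(12288)   # placeholder for John's layer-60 hidden state
print(len(john_facts), "explicit facts vs.", john_vector.size, "opaque numbers")
```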
The goal is for the 96th and final layer of the network to output a hidden state for the final word that includes all of the information necessary to predict the next word.
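And here is an equally minimal sketch of that final step: the final word's last-layer vector is multiplied by an output (or "unembedding") matrix to get one score per vocabulary entry, and a softmax turns those scores into next-word probabilities. The matrices here are random placeholders and the sizes are shrunk for readability; the real model uses 12,288-dimensional vectors and a vocabulary of roughly 50,000 tokens.

```python
import numpy as np

# Toy sketch of the last step: only the final word's final-layer vector is
# needed to score candidate next words. Sizes are shrunk; GPT-3 uses
# 12,288-dimensional vectors and a vocabulary of roughly 50,000 tokens.
D_MODEL = 64          # stands in for 12,288
VOCAB_SIZE = 1000     # stands in for the full token vocabulary

rng = np.random.default_rng(1)
final_hidden = rng.standard_normal(D_MODEL)               # last word, last layer
unembedding = rng.standard_normal((VOCAB_SIZE, D_MODEL))  # maps vector -> word scores

logits = unembedding @ final_hidden        # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax: a probability for each next word
print(int(probs.argmax()))                 # index of the most likely next word
```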
It makes sense to me that the first few layers of the model work on syntax while the later layers work over larger stretches of text. Why does that make sense to me?
First, sentence-level syntax is a subject and domain of its own, quite different from larger discourse structures, and the two are studied differently. Linguists have focused on sentence-level syntax and have many well-developed, though not mutually coherent, accounts of how it works. Discourse is a different kind of topic requiring different methods, though back in the 1970s, I believe it was, there were attempts to extend the structures of sentence-level syntax to whole discourses. The effort was called, naturally enough, text grammar. AFAIK it never really went anywhere.
Second, that distinction between sentence-level syntax and overall discourse structure pops out rather dramatically in my work on ChatGPT and stories. In particular, consult my working paper, ChatGPT tells stories, and a note about reverse engineering: A Working Paper, Version 3.
More later.