Showing posts with label Sutskever.

Saturday, December 14, 2024

Ilya Sutskever: "Sequence to sequence learning with neural networks: what a decade" [No mas ?]

From the YouTube page:

Ilya Sutskever full talk "Sequence to sequence learning with neural networks: what a decade" at NeurIPS 2024 in Vancouver, Canada.

"Pre-training as we know it will end" and what comes next is superintelligence: agentic, reasons, understands and is self aware.

NeurIPS 2024 — 2024 Conference on Neural Information Processing Systems.

On super-intelligence, check out yesterday's long chat with Claude 3.5 Sonnet about super-intelligence. We started with chess and went on from there. No details on how to build it – it wasn't that kind of chat – but some speculation about the relationship between human intelligence and putative super-intelligence.

As for the AI of the future, the one Sutskever claims will be agentic, capable of reasoning (which has the side-effect of making it unpredictable), and capable of understanding, I can offer some thoughts about the brain. Some years ago David Hays and I reviewed a wide range of materials in neuroscience, developmental psychology, the phylogeny of brain development and so forth, and published a speculative paper: Principles and Development of Natural Intelligence. More recently I've extended those speculations with some new ones: Relational Nets Over Attractors, A Primer.

Wednesday, November 20, 2024

How far can next-token prediction take us? Sutskever vs. Claude

One of my main complaints about the current regime in machine learning is that researchers don’t seem to have given much thought to the nature of language and cognition independent from the more or less immediate requirements of crafting their models. There is a large, rich, and diverse literature on language, semantics, and cognition going back over a half century. It’s often conflicting and thus far from consensus, but it’s not empty. The ML research community seems uninterested in it. I’ve likened this to a whaling voyage captained by a man who knows all about ships and little about whales.

As a symptom of this, I offer this video clip from a 2023 conversation between Ilya Sutskever and Dwarkesh Patel, in which Sutskever argues that next-token prediction will be able to surpass human performance:

Here's a transcription:

I challenge the claim that next-token prediction cannot surpass human performance. On the surface, it looks like it cannot. It looks like if you just learn to imitate, to predict what people do, it means that you can only copy people. But here is a counter argument for why it might not be quite so. If your base neural net is smart enough, you just ask it — What would a person with great insight, wisdom, and capability do? Maybe such a person doesn't exist, but there's a pretty good chance that the neural net will be able to extrapolate how such a person would behave. Do you see what I mean?

Dwarkesh Patel

Yes, although where would it get that sort of insight about what that person would do? If not from…

Ilya Sutskever

From the data of regular people. Because if you think about it, what does it mean to predict the next token well enough? It's actually a much deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token. It's not statistics. Like it is statistics but what is statistics? In order to understand those statistics to compress them, you need to understand what is it about the world that creates this set of statistics? And so then you say — Well, I have all those people. What is it about people that creates their behaviors? Well they have thoughts and their feelings, and they have ideas, and they do things in certain ways. All of those could be deduced from next-token prediction. And I'd argue that this should make it possible, not indefinitely but to a pretty decent degree to say — Well, can you guess what you'd do if you took a person with this characteristic and that characteristic? Like such a person doesn't exist but because you're so good at predicting the next token, you should still be able to guess what that person who would do. This hypothetical, imaginary person with far greater mental ability than the rest of us.

The argument is not clear. One thing Sutskever seems to be doing is aggregating the texts of ordinary people into the text of an imaginary “super” person that is the sum and synthesis of what all those ordinary people have said. But all those ordinary individuals do not necessarily speak from the same point of view. There will be tensions and contradictions between them. The views of flat-earthers cannot be reconciled with those of standard astronomy. But this is not my main objection. We can set it aside.

My problem comes with Sutskever’s second paragraph, where he says, “Predicting the next token well means that you understand the underlying reality that led to the creation of that token.” From there he works his way through statistics to the thoughts and feelings of the people producing the tokens. But Sutskever doesn’t distinguish between those thoughts and feelings and the world toward which those thoughts and feelings are directed. Those people are aware of the world, of the “underlying reality,” but that reality is not itself directly present in the language tokens they use to express their thoughts and feelings. The token string is the product of the interaction between language and cognition, on the one hand, and the world, on the other:

Sutskever seems to be conflating the cognitive and semantic structures inhering in the minds of the people who produce texts with the structure of the world itself. They are not at all the same thing. A statistical model produced through next-token prediction may well approximate the cognitive and semantic models of humans, but that’s all it can do. It has no access to the world in the way that humans do. That underlying reality is not available to it.
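Here’s a toy sketch, in Python, that makes the point concrete. The setup is entirely my own illustration, not anything Sutskever said: a biased coin stands in for the world, an imperfect observer stands in for the speaker, and the names world, belief, and utter are invented. A model fit only to the token stream can match the stream’s statistics exactly, yet nothing in the stream tells it how much of those statistics comes from the world and how much from the speaker’s cognition.

import random

# Toy generative chain: world state -> speaker's belief -> emitted token.
# Everything here is an invented illustration.

def world():
    # The underlying reality: a coin biased 70/30 toward heads.
    return "heads" if random.random() < 0.7 else "tails"

def belief(state):
    # The speaker perceives the world imperfectly (10% misperception).
    return state if random.random() < 0.9 else ("tails" if state == "heads" else "heads")

def utter(b):
    # The speaker reports the belief as a token.
    return b

# A next-token model sees only the token stream, nothing else.
corpus = [utter(belief(world())) for _ in range(10_000)]
freq = {t: corpus.count(t) / len(corpus) for t in set(corpus)}
print(freq)  # roughly {'heads': 0.66, 'tails': 0.34}

# Matching the stream means learning the 0.66, which conflates the coin's true
# bias (0.7) with the speaker's misperception rate (0.1). Different worlds paired
# with different perceptual errors would yield exactly the same stream.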

What does Claude have to say about this?

I gave Claude 3.5 Sonnet Sutskever’s second paragraph and had a conversation about it. I wanted to see if it could spot the problem. Claude saw various problems, but couldn’t quite find its way to what I regard as the crucial point. In the end I had to tell it that Sutskever fails to distinguish between the structure of the world and the semantic and cognitive structures expressed in the text.

My text is set in bold Courier while Claude's is plain Courier.

Saturday, November 16, 2024

In the beginning: OpenAI Email Archives (from Musk v. Altman)

Habryka over at LessWrong:

As part of the court case between Elon Musk and Sam Altman, a substantial number of emails between Elon, Sam Altman, Ilya Sutskever, and Greg Brockman have been released.

I have found reading through these really valuable, and I haven't found an online source that compiles all of them in an easy to read format. So I made one.

Here are the first four emails:

Sam Altman to Elon Musk - May 25, 2015 9:10 PM

Been thinking a lot about whether it's possible to stop humanity from developing AI.

I think the answer is almost definitely not.

If it's going to happen anyway, it seems like it would be good for someone other than Google to do it first.

Any thoughts on whether it would be good for YC to start a Manhattan Project for AI? My sense is we could get many of the top ~50 to work on it, and we could structure it so that the tech belongs to the world via some sort of nonprofit but the people working on it get startup-like compensation if it works. Obviously we'd comply with/aggressively support all regulation.

Sam

Elon Musk to Sam Altman - May 25, 2015 11:09 PM

Probably worth a conversation

Sam Altman to Elon Musk - Jun 24, 2015 10:24 AM

The mission would be to create the first general AI and use it for individual empowerment—ie, the distributed version of the future that seems the safest. More generally, safety should be a first-class requirement.

I think we’d ideally start with a group of 7-10 people, and plan to expand from there. We have a nice extra building in Mountain View they can have.

I think for a governance structure, we should start with 5 people and I’d propose you, Bill Gates, Pierre Omidyar, Dustin Moskovitz, and me. The technology would be owned by the foundation and used “for the good of the world”, and in cases where it’s not obvious how that should be applied the 5 of us would decide. The researchers would have significant financial upside but it would be uncorrelated to what they build, which should eliminate some of the conflict (we’ll pay them a competitive salary and give them YC equity for the upside). We’d have an ongoing conversation about what work should be open-sourced and what shouldn’t. At some point we’d get someone to run the team, but he/she probably shouldn’t be on the governance board.

Will you be involved somehow in addition to just governance? I think that would be really helpful for getting work pointed in the right direction getting the best people to be part of it. Ideally you’d come by and talk to them about progress once a month or whatever. We generically call people involved in some limited way in YC “part-time partners” (we do that with Peter Thiel for example, though at this point he’s very involved) but we could call it whatever you want. Even if you can’t really spend time on it but can be publicly supportive, that would still probably be really helpful for recruiting.

I think the right plan with the regulation letter is to wait for this to get going and then I can just release it with a message like “now that we are doing this, I’ve been thinking a lot about what sort of constraints the world needs for safety.” I’m happy to leave you off as a signatory. I also suspect that after it’s out more people will be willing to get behind it.

Sam

Elon Musk to Sam Altman - Jun 24, 2015 11:05 PM

Agree on all

There's much more at the link. Amazing stuff. 

Reading through the rest I'm reminded of a line from The Blues Brothers: "We're on a mission from God."

Monday, February 5, 2024

OpenAI Co-Founder Ilya Sutskever on the mystical powers of artificial neural nets

Transcription (which I found here):

Ilya Sutskever: I challenge the claim that next-token prediction cannot surpass human performance. On the surface, it looks like it cannot. It looks like if you just learn to imitate, to predict what people do, it means that you can only copy people. But here is a counter argument for why it might not be quite so. If your base neural net is smart enough, you just ask it — What would a person with great insight, wisdom, and capability do? Maybe such a person doesn’t exist, but there’s a pretty good chance that the neural net will be able to extrapolate how such a person would behave. Do you see what I mean?

Dwarkesh Patel: Yes, although where would it get that sort of insight about what that person would do? If not from…

Ilya Sutskever: From the data of regular people. Because if you think about it, what does it mean to predict the next token well enough? It’s actually a much deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token. It’s not statistics. Like it is statistics but what is statistics? In order to understand those statistics to compress them, you need to understand what is it about the world that creates this set of statistics? And so then you say — Well, I have all those people. What is it about people that creates their behaviors? Well they have thoughts and their feelings, and they have ideas, and they do things in certain ways. All of those could be deduced from next-token prediction. And I’d argue that this should make it possible, not indefinitely but to a pretty decent degree to say — Well, can you guess what you’d do if you took a person with this characteristic and that characteristic? Like such a person doesn’t exist but because you’re so good at predicting the next token, you should still be able to guess what that person who would do. This hypothetical, imaginary person with far greater mental ability than the rest of us

Yikes! If a stream of tokens is the only thing the machine has access to, then just how is it to divine the underlying reality? It's basing its predictions on its experience of the token stream, nothing else, N O T H I N G. These folks seem deeply enmeshed in what I've been calling the word illusion in a number of posts. 

This is the A.I. equivalent of believing the earth is flat.

Saturday, November 25, 2023

On possible cross-fertilization between AI and neuroscience [Creativity]

MIT Center for Brains, Minds, and Machines (CBMM), a panel discussion: CBMM10 - A Symposium on Intelligence: Brains, Minds, and Machines.

On which critical problems should Neuroscience, Cognitive Science, and Computer Science focus now? Do we need to understand fundamental principles of learning -- in the sense of theoretical understanding like in physics -- and apply this understanding to real natural and artificial systems? Similar questions concern neuroscience and human intelligence from the society, industry and science point of view.

Panel Chair: T. Poggio
Panelists: D. Hassabis, G. Hinton, P. Perona, D. Siegel, I. Sutskever

Quick Comments

1.) I’m a bit annoyed that Hassabis is giving neuroscience credit for the idea of episodic memory. As far as I know, the term was coined by a cognitive psychologist named Endel Tulving in the early 1970s, who stood it in opposition to semantic memory. That distinction was all over the place in the cognitive sciences in the 1970s and it’s second nature to me. When ChatGPT places a number of events in order to make a story, that’s episodic memory.

2.) Rather than theory, I like to think of what I call speculative engineering. I coined the phrase in the preface to my book about music (Beethoven’s Anvil), where I said:

Engineering is about design and construction: How does the nervous system design and construct music? It is speculative because it must be. The purpose of speculation is to clarify thought. If the speculation itself is clear and well-founded, it will achieve its end even when it is wrong, and many of my speculations must surely be wrong. If I then ask you to consider them, not knowing how to separate the prescient speculations from the mistaken ones, it is because I am confident that we have the means to sort these matters out empirically. My aim is to produce ideas interesting, significant, and clear enough to justify the hard work of investigation, both through empirical studies and through computer simulation.

3.) On Chomsky (Hinton & Hassabis): Yes, Chomsky is fundamentally wrong about language. Language is primarily a tool for conveying meaning from one person to another and only derivatively a tool for thinking. And he’s wrong to conclude that, because LLMs can learn any language, they are useless for the scientific study of language. Another problem with Chomsky’s thinking is that he has no interest in process, which is in the realm of performance, not competence.

Let us assume for the sake of argument that the introduction of a single token into the output stream requires one primitive operation of the virtual system being emulated by an LLM. By that I mean that there is no logical operation within the process, no AND or OR, no shift of control; all that’s happening is one gigantic calculation involving all the parameters in the system. That means that the number of primitive operations required to produce a given output is equal to the number of tokens in that output. I suggest that that places severe constraints on the organization of the LLM’s associative memory.

Contrast that with what happens in a classical symbolic system. Let us posit that each time a word (not quite the same as a token in an LLM, but the difference is of no consequence) is emitted, that itself requires a single primitive operation in the classical system. Beyond that, however, a classical system has to execute numerous symbolic operations in order to arrive at each word. Regardless of just how those operations resolve into primitive symbolic operations, the number has to be larger, perhaps considerably larger, than the number of primitive operations an LLM requires. I suggest that this process places fewer constraints on the organization of a symbolic memory system.
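To make the contrast concrete, here is a toy tally in Python. The assumption that an LLM spends exactly one primitive operation per emitted token comes from the argument above; the figure of a dozen symbolic steps per word is invented purely for illustration.

# Toy operation counts for the argument above. The per-word rule count for the
# symbolic system is an invented placeholder, not a measured number.

def llm_ops(tokens):
    # One giant forward pass per emitted token: ops == number of tokens.
    return len(tokens)

def symbolic_ops(words, rules_per_word=12):
    # Each word requires some number of symbolic steps (lexical lookup,
    # agreement checks, control decisions, ...) before it can be emitted.
    return len(words) * (rules_per_word + 1)  # +1 for emitting the word itself

sentence = "the cat sat on the mat".split()
print(llm_ops(sentence))       # 6
print(symbolic_ops(sentence))  # 78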

At this point I’ve reached 45:11 in the video, but I have to stop and think. Perhaps I’ll offer some more comments later.

LATER: Creativity

4.) Near the end (01:20:00 or so) the question of creativity comes up. Hassabis says AIs aren't there yet. Hinton brings up analogy, pointing out that, with all the vast knowledge LLMs have ingested, they've got opportunities for coming up with analogy after analogy after analogy. I've got experience with ChatGPT that's directly relevant to those issues, analogy and creativity.

One of the first things I did once I started playing with ChatGPT was have it undertake a Girardian interpretation of Steven Spielberg's Jaws. To do that it had to determine whether or not there is an analogy between events in the film and the phenomena that Girard theorizes about. It did that fairly well. So I wrote that up and published it in 3 Quarks Daily, Conversing with ChatGPT about Jaws, Mimetic Desire, and Sacrifice. Near the end I remarked:

I was impressed with ChatGPT’s capabilities. Interacting with it was fun, so much fun that at times I was giggling and laughing out loud. But whether or not this is a harbinger of the much-touted Artificial General Intelligence (AGI), much less a warning of impending doom at the hands of an All-Knowing, All-Powerful Superintelligence – are you kidding? Nothing like that, nothing at all. A useful assistant for a variety of tasks, I can see that, and relatively soon. Maybe even a bit more than an assistant. But that’s as far as I can see.

We can compare what ChatGPT did in response to my prompting with what I did unprompted, freely and of my own volition. There’s nothing in its replies that approaches my article, Shark City Sacrifice, nor the various blog posts I wrote about the film. That’s important. I was neither expecting, much less hoping, that ChatGPT would act like a full-on AGI. No, I have something else in mind.

What’s got my attention is what I had to do to write the article. In the first place I had to watch the film and make sense of it. As I’ve already indicated, we have no artificial system with the required capabilities, visual, auditory, and cognitive. I watched the film several times in order to be sure of the details. I also consulted scripts I found on the internet. I also watched Jaws 2 more than once. Why did I do that? There’s curiosity and general principle. But there’s also the fact that the Wikipedia article for Jaws asserted that none of the three sequels were as good as the original. I had to watch the others to see for myself – though I was unable to finish watching either of the last two.

At this point I was on the prowl, though I hadn’t yet decided to write anything.

I now asked myself why the original was so much better than the first sequel, which was at least watchable. I came up with two things: 1) the original film was well-organized and tight while the sequel sprawled, and 2) Quint, there was no character in the sequel comparable to Quint.

Why did Quint die? Oh, I know what happened in the film; that’s not what I was asking. The question was an aesthetic one. As long as the shark was killed the town would be saved. That necessity did not entail Quint’s death, nor anyone else’s. If Quint hadn’t died, how would the ending have felt? What if it had been Brody or Hooper?

It was while thinking about such questions that it hit me: sacrifice! Girard! How is it that Girard’s ideas came to me? I wasn’t looking for them, not in any direct sense. I was just asking counter-factual questions about the film.

Whatever.

Once Girard was on my mind I smelled blood, that is, the possibility of writing an interesting article. I started reading, making notes, and corresponding with my friend, David Porush, who knows Girard’s thinking much better than I do. Can I make a nice tight article? That’s what I was trying to figure out. It was only after I’d made some preliminary posts, drafted some text, and run it by David that I decided to go for it. The article turned out well enough that I decided to publish it. And so I did.

It’s one thing to figure out whether or not such and such a text/film exhibits such and such pattern when you are given the text and the pattern. That’s what ChatGPT did. Since I had already made the connection between Girard and Jaws it didn’t have to do that. I was just prompting ChatGPT to verify the connection, which it did (albeit in a weak way). That’s the kind of task we set for high school students and lower division college students. […]

I don’t really think that ChatGPT is operating at a high school level in this context. Nor do I NOT think that. I don’t know quite what to think. And I’m happy with that.

The deeper point is that there is a world of difference between what ChatGPT was doing when I piloted it into Jaws and Girard and what I eventually did when I watched Jaws and decided to look around to see what I could see. How is it that, in that process, Girard came to me? I wasn’t looking for Girard. I wasn’t looking for anything in particular. How do we teach a computer to look around for nothing in particular and come up with something interesting?

These observations are informal and are only about a single example. Given those limitations it's difficult to imagine a generalization. But I didn't hear anything from those experts that was comparably rich.

Hinton gave an example of an analogy that he posed to GPT-4 (01:18:30): “What has a compost heap got in common with an atom bomb?” It got the answer he was looking for, chain reaction, albeit at different energy levels and different rates. That's interesting. Why wasn't the panel ready with 20 such examples among them? Perhaps more to the point, doesn't Hinton see that it is one thing for GPT-4 to explain an analogy he presents to it, but that coming up with the analogy in the first place is a different kind of mental process?

Do they not have more such examples from their own work? Don't they think about their own work process, all the starts and stops, the wandering around, the dead ends and false starts, the open-ended exploration, that came before final success? And even then, no success is final, but only provisional pending further investigation. Can they not see the difference between what they do and what their machines do? Do they think all the need for exploration will just vanish in the face of machine superintelligence? Do they really believe that the universe is that small?

STILL LATER: Hinton and Hassabis on analogies

Hinton continues with analogies and Hassabis weighs in:

1:18:28 – GEOFFREY HINTON: We know that being able to see analogies, especially remote analogies, is a very important aspect of intelligence. So I asked GPT-4, what has a compost heap got in common with an atom bomb? And GPT-4 nailed it, most people just say nothing.

DEMIS HASSABIS: What did it say ...

GEOFFREY HINTON: It started off by saying they're very different energy scales, so on the face of it, they look to be very different. But then it got into chain reactions and how the rate at which they're generating energy increases-- their energy increases the rate at which they generate energy. So it got the idea of a chain reaction. And the thing is, it knows about 10,000 times as much as a person, so it's going to be able to see all sorts of analogies that we can't see.

DEMIS HASSABIS: Yeah. So my feeling is on this, and starting with things like AlphaGo and obviously today's systems like Bard and GPT, they're clearly creative in ...

1:20:18 – New pieces of music, new pieces of poetry, and spotting analogies between things you couldn't spot as a human. And I think these systems can definitely do that. But then there's the third level which I call like invention or out-of-the-box thinking, and that would be the equivalent of AlphaGo inventing Go.

Well, yeah, sure, GPT-4 has all this stuff in its model, way more topics than any one human. But where’s GPT-4 going to “stand” so it can “look over” all that stuff and spot the analogies? That requires some kind of procedure. What is it?

For example, it might partition all that knowledge into discrete bits and then set up a 2D matrix with a column and a row for each discrete chunk of knowledge. Then it can move systematically through the matrix, checking each cell to see whether or not the pair in that cell is a useful analogy. What kind of tests does it apply to make that determination? I can imagine there might be a test or tests that allow a quick and dirty rejection of many candidates. But for those that remain, what can you do but see whether any useful knowledge follows from trying out the analogy? How long will that determination take? And so forth.
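Written down literally, that procedure looks something like the Python sketch below; passes_quick_filter and is_useful_analogy are hypothetical placeholders for tests nobody knows how to write, and the chunk list is just for show.

from itertools import combinations

# A literal rendering of the matrix procedure imagined above, mainly to show
# how quickly it blows up. Both test functions are hypothetical placeholders.

def passes_quick_filter(a, b):
    return a[0] != b[0]  # stand-in for some cheap rejection test

def is_useful_analogy(a, b):
    return False  # stand-in for the expensive "try it out and see" step

def search_analogies(chunks):
    hits = []
    for a, b in combinations(chunks, 2):  # every cell above the matrix diagonal
        if passes_quick_filter(a, b) and is_useful_analogy(a, b):
            hits.append((a, b))
    return hits

chunks = ["compost heap", "atom bomb", "Jaws", "Girard's mimetic theory"]
print(search_analogies(chunks))

With n chunks there are n*(n-1)/2 cells to check; a million chunks gives roughly half a trillion candidate pairs before the expensive test even starts.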

That’s absurd on the face of it. What else is there? I just explained what I went through to come up with an analogy between Jaws and Girard. But that’s just my behavior, not the mental process that’s behind the behavior. I have no trouble imagining that, in principle, having these machines will help speed up the process, but in the end I think we’re going to end up with a community of human investigators communicating with one another while they make sense of the world. The idea, which Hinton seems to hold, judging from remarks he’s made elsewhere, that one of these days we’ll have a machine that takes humans out of the process altogether – that’s an idle fantasy.

Monday, October 16, 2023

Highlights of the Fireside Chat with Ilya Sutskever & Jensen Huang: AI Today & Vision of the Future

This is the condensed version of the "Fireside Chat: With Ilya Sutskever and Jensen Huang: AI Today and Vision of the Future (March 2023)". In this video, I've carefully selected the top 10 questions from the original hour-long talk and condensed them into just over 30 minutes. Additionally, I've created a timeline of these questions asked by Jensen to Ilya, accompanied by relevant research papers. As machine transcription is often inaccurate, I manually transcribed the entire video (laborious), recognizing the importance of accurate captions for those with hearing difficulties. Anyways, you may find this information useful.

If you're interested in more AI-related content, consider subscribing to my channel. Stay tuned for future uploads!

Time-codes:

00:00:00 Q1. Intuition behind deep learning?
00:02:20 Q2. Initial motivations behind OpenAI?
00:07:54 Q3. Intuition about scaling laws (and RLHF)?
00:11:07 Q4. Aspects of ChatGPT and their abilities?
00:14:57 Q5. Major differences between GPT-4 and previous versions?
00:18:37 Q6. Reasoning capability of GPT-4 and limiting factors?
00:23:47 Q7. Importance of multi-modality?
00:28:53 Q8. How multi-modality helped improve GPT-4 over GPT-3?
00:30:46 Q9. Predictions for the next 2 years?
00:32:57 Q10. Surprising results?

Relevant research papers:

Learning to Generate Reviews and Discovering Sentiment https://arxiv.org/abs/1704.01444

Scaling Laws for Autoregressive Generative Modeling https://arxiv.org/abs/2010.14701

Training language models to follow instructions with human feedback https://arxiv.org/abs/2203.02155

Thursday, June 18, 2020

Transformer goes to work generating images

Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, David Luan, Ilya Sutskever, Generative Pretraining from Pixels, Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020.
Abstract
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features.

1. Introduction

Unsupervised pre-training played a central role in the resurgence of deep learning. Starting in the mid 2000’s, approaches such as the Deep Belief Network (Hinton et al., 2006) and Denoising Autoencoder (Vincent et al., 2008) were commonly used in neural networks for computer vision (Lee et al., 2009) and speech recognition (Mohamed et al., 2009). It was believed that a model which learned the data distribution P(X) would also learn beneficial features for the subsequent supervised modeling of P(Y|X) (Lasserre et al., 2006; Erhan et al., 2010). However, advancements such as piecewise linear activation functions (Nair & Hinton, 2010), improved initializations (Glorot & Bengio, 2010), and normalization strategies (Ioffe & Szegedy, 2015; Ba et al., 2016) removed the need for pre-training in order to achieve strong results. Other research cast doubt on the benefits of deep unsupervised representations and reported strong results using a single layer of learned features (Coates et al., 2011), or even random features (Huang et al., 2014; May et al., 2017). The approach fell out of favor as the state of the art increasingly relied on directly encoding prior structure into the model and utilizing abundant supervised data to directly learn representations (Krizhevsky et al., 2012; Graves & Jaitly, 2014). Retrospective study of unsupervised pre-training demonstrated that it could even hurt performance in modern settings (Paine et al., 2014). Instead, unsupervised pre-training flourished in a different domain. After initial strong results for word vectors (Mikolov et al., 2013), it has pushed the state of the art forward in Natural Language Processing on most tasks (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). Interestingly, the training objective of a dominant approach like BERT, the prediction of corrupted inputs, closely resembles that of the Denoising Autoencoder, which was originally developed for images.

As a higher dimensional, noisier, and more redundant modality than text, images are believed to be difficult for generative modeling. Here, self-supervised approaches designed to encourage the modeling of more global structure (Doersch et al., 2015) have shown significant promise. A combination of new training objectives (Oord et al., 2018), more recent architectures (Gomez et al., 2017), and increased model capacity (Kolesnikov et al., 2019) has allowed these methods to achieve state of the art performance in low data settings (Hénaff et al., 2019) and sometimes even outperform supervised representations in transfer learning settings (He et al., 2019; Misra & van der Maaten, 2019).

Given that it has been a decade since the original wave of generative pre-training methods for images and considering their substantial impact in NLP, this class of methods is due for a modern re-examination and comparison with the recent progress of self-supervised methods. We re-evaluate generative pre-training on images and demonstrate that when using a flexible architecture (Vaswani et al., 2017), a tractable and efficient likelihood based training objective (Larochelle & Murray, 2011; Oord et al., 2016), and significant compute resources (1024 TPU cores), generative pre-training is competitive with other self-supervised approaches and learns representations that significantly improve the state of the art in low-resolution unsupervised representation learning settings.

This is especially promising as our architecture uses a dense connectivity pattern which does not encode the 2D spatial structure of images yet is able to match and even outperform approaches which do. We report a set of experiments characterizing the performance of our approach on many datasets and in several different evaluation settings (low data, linear evaluation, full fine-tuning). We also conduct several experiments designed to better understand the achieved performance of these models. We investigate how representations are computed inside our model via the performance of linear probes as a function of model depth as well as studying how scaling the resolution and parameter count of the approach affects performance.
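For readers who want a concrete picture of the recipe, here is a minimal data-side sketch in Python. It only illustrates the flatten-the-image-and-predict-the-next-pixel setup and the idea of a linear probe; it builds no actual Transformer, and the 16-level grayscale quantization is my simplification of the paper's color-palette clustering.

import numpy as np

def to_sequence(image, n_bins=16):
    # Quantize a (H, W) grayscale image into n_bins levels and flatten it
    # in raster order, so the image becomes a 1-D token sequence.
    quantized = (image.astype(np.float32) / 256.0 * n_bins).astype(np.int64)
    return quantized.reshape(-1)  # shape (H*W,), values in [0, n_bins)

def next_pixel_pairs(seq):
    # Yield (context, target) pairs for autoregressive training:
    # predict pixel t from pixels 0..t-1.
    for t in range(1, len(seq)):
        yield seq[:t], seq[t]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8))   # a fake 8x8 image
seq = to_sequence(img)
ctx, tgt = next(next_pixel_pairs(seq))
print(len(seq), ctx, tgt)                 # 64 pixels; predict pixel 2 from pixel 1

# Linear probing, in outline: freeze the trained model, take the features it
# assigns to each image, and fit a single linear classifier on top of them;
# the classifier's accuracy is then used as a measure of representation quality.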