
Tuesday, May 14, 2019

Geoffrey Hinton (neural network pioneer) interview


From the interview:
Geoffrey Hinton: One of the big disappointments in the ’80s was, if you made networks with lots of hidden layers, you couldn't train them. That's not quite true, because you could train them for fairly simple tasks like recognizing handwriting. But most of the deep neural nets, we didn't know how to train them. And in about 2005, I came up with a way of doing unsupervised training of deep nets. So you take your input, say your pixels, and you'd learn a bunch of feature detectors that were just good at explaining why the pixels were even like that. And then you treat those feature detectors as the data, and you learn another bunch of feature detectors that could explain why those feature detectors have those correlations. And you keep learning layers and layers. But what was interesting was, you could do some math and prove that each time you learned another layer, you didn't necessarily have a better model of the data, but you had a bound on how good your model was. And you could get a better bound each time you added another layer.
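
Here's a minimal sketch of the greedy layer-wise scheme he's describing, using restricted Boltzmann machines trained with one-step contrastive divergence (the standard recipe from the 2006 deep-belief-net work). The layer sizes, learning rate, epoch count, and random stand-in data are all illustrative choices, not Hinton's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """One layer of feature detectors trained to explain its input."""
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)  # visible biases
        self.b_h = np.zeros(n_hidden)   # hidden (feature detector) biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def cd1_update(self, v0):
        # One-step contrastive divergence: compare data statistics
        # with statistics of a one-step reconstruction.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_sample @ self.W.T + self.b_v)  # reconstruction
        h1 = self.hidden_probs(v1)
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)

# Greedy stacking: train a layer, then treat its feature detectors
# as the data for the next layer, "layers and layers".
data = (rng.random((500, 784)) < 0.1).astype(float)  # stand-in for pixels
sizes = [784, 256, 64]
x = data
for n_vis, n_hid in zip(sizes, sizes[1:]):
    rbm = RBM(n_vis, n_hid)
    for _ in range(5):
        rbm.cd1_update(x)
    x = rbm.hidden_probs(x)  # learned features become the next layer's data
```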

Nicholas Thompson: What do you mean, you had a bound on how good your model was?

GH: Once you've got a model, you can say, “How surprising does the model find this data?” You show it some data and you say, “Is that the kind of thing you believe in, or is that surprising?” And you can sort of measure something that says that. And what you'd like is a good model, one that looks at the data and says, “Yeah, yeah, I knew that. It's unsurprising.” It's often very hard to compute exactly how surprising the model finds the data. But you can compute a bound on that. You can say that this model finds the data less surprising than that one. And you could show that as you add extra layers of feature detectors, you get a model, and each time you add a layer, the bound on how surprising it finds the data gets better. [...]
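
In symbols: the "surprise" is the negative log-probability the model assigns to the data, and the 2006 result gives a computable variational lower bound on that log-probability (equivalently, an upper bound on the surprise). A sketch in the standard notation:

```latex
% "Surprise" at data v is its negative log-probability under the model:
%   \text{surprise}(v) = -\log p(v)
% The exact value is intractable for a deep net, but for any distribution
% Q(h|v) over the hidden feature detectors h there is a computable bound:
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\bigl[\log p(v \mid h) + \log p(h)\bigr]
          \;+\; \mathcal{H}\bigl(Q(h \mid v)\bigr)
% Greedy layer-wise training (Hinton, Osindero & Teh, 2006) initializes
% each new layer so this bound cannot get worse, then improves it.
```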

NT: Well, what distinguishes the areas where it works the most quickly and the areas where it will take more time? It seems like visual processing, speech recognition, sort of core human things that we do with our sensory perception are deemed to be the first barriers to clear, is that correct?

GH: Yes and no, because there are other things we do, like motor control. We're very good at motor control. Our brains are clearly designed for that. And only just now are neural nets beginning to compete with the best other technologies out there. They will win in the end, but they're only just winning now.

I think things like reasoning, abstract reasoning, are among the last things we learn to do, and I think they'll be among the last things these neural nets learn to do. [...]

NT: And then there's a separate problem, which is, we don't know entirely how these things work, right?

GH: No, we really don't know how they work.

NT: We don't understand how top-down feedback works in neural networks. That's a core element of how neural networks work that we don't understand. Explain that, and then let me ask the obvious follow-up, which is: if we don't know how these things work, how can those things work?

GH: If you look at current computer vision systems, most of them are basically feed-forward; they don't use feedback connections. There's something else about current computer vision systems, which is that they're very prone to adversarial errors. You can change a few pixels slightly, and something that was a picture of a panda, and still looks exactly like a panda to you, suddenly gets labeled an ostrich. Obviously, the way you change the pixels is cleverly designed to fool it into thinking it's an ostrich. But the point is, it still looks like a panda to you.
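
The panda-to-ostrich example corresponds to gradient-based attacks like the fast gradient sign method (Goodfellow et al., 2015). A toy numpy sketch with a stand-in linear classifier; real attacks target deep nets, but the logic is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Stand-in linear classifier; the attack simply follows the gradient
# of the loss with respect to the input pixels.
n_pixels, n_classes = 64, 3
W = rng.normal(0, 0.1, (n_pixels, n_classes))

def grad_loss_wrt_input(x, true_label):
    # Gradient of cross-entropy loss with respect to the input pixels.
    p = softmax(W.T @ x)
    p[true_label] -= 1.0
    return W @ p

x = rng.random(n_pixels)                  # the "panda" image
label = int(np.argmax(softmax(W.T @ x)))  # what the model currently says

# Fast gradient sign method: a tiny pixel change, invisible to a human,
# aimed exactly where it hurts the classifier most.
eps = 0.05
x_adv = x + eps * np.sign(grad_loss_wrt_input(x, label))
print(label, int(np.argmax(softmax(W.T @ x_adv))))  # the label can flip
```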

Initially we thought these things worked really well. But then, when confronted with the fact that they're looking at a panda and are confident it's an ostrich, you get a bit worried. I think part of the problem there is that they're not trying to reconstruct from the high-level representations. They're trying to do discriminative learning, where you just learn layers of feature detectors, and the whole objective is just to change the weights so you get better at getting the right answer. And recently in Toronto, we've been discovering, or Nick Frosst has been discovering, that if you introduce reconstruction, then it helps you be more resistant to adversarial attack. So I think in human vision, to do the learning, we're doing reconstruction. And because we're doing a lot of learning by doing reconstruction, we are much more resistant to adversarial attacks. [...]
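
A sketch of the reconstruction idea in the spirit of the Toronto work Hinton mentions (Frosst and colleagues' detection-by-reconstruction approach): reconstruct the input from the high-level representation and treat an unusually large reconstruction error as a warning sign. The tiny linear autoencoder, the untrained random weights, and the 95th-percentile threshold below are illustrative assumptions to show the mechanics, not their actual model:

```python
import numpy as np

rng = np.random.default_rng(1)

# In a real system these weights would be trained to reconstruct well on
# clean data; here they just make the detection logic concrete.
n_pixels, n_hidden = 64, 16
W_enc = rng.normal(0, 0.1, (n_pixels, n_hidden))
W_dec = rng.normal(0, 0.1, (n_hidden, n_pixels))

def reconstruction_error(x):
    h = np.tanh(W_enc.T @ x)   # high-level representation
    x_hat = W_dec.T @ h        # reconstruction from that representation
    return float(np.mean((x - x_hat) ** 2))

# Calibrate a threshold on clean inputs, then flag anything far above it:
# an adversarial input fools the classifier but tends to reconstruct badly.
clean = rng.random((100, n_pixels))
threshold = np.percentile([reconstruction_error(x) for x in clean], 95)

def looks_adversarial(x):
    return reconstruction_error(x) > threshold
```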

NT: True, fair enough. So what are we learning about the brain from our work in computers?

GH: So I think what we've learned in the last 10 years is that if you take a system with billions of parameters, and an objective function—like to fill in the gap in a string of words—it works much better than it has any right to. It works much better than you would expect. You would have thought, and most people in conventional AI thought, take a system with a billion parameters, start them off with random values, measure the gradient of the objective function—that is, for each parameter, figure out how the objective function would change if you change that parameter a little bit—and then change it in the direction that improves the objective function. You would have thought that would be a kind of hopeless algorithm that gets stuck. But it turns out, it's a really good algorithm. And the bigger you scale things, the better it works. And that's just an empirical discovery, really. There's some theory coming along, but it's basically an empirical discovery. Now, because we've discovered that, it makes it far more plausible that the brain is computing the gradient of some objective function, and updating the strengths of synapses to follow that gradient. We just have to figure out how it gets the gradient and what the objective function is.
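
Hinton's verbal recipe, taken literally, is just gradient ascent on an objective. The toy sketch below uses finite differences to "figure out how the objective function would change if you change that parameter a little bit"; real systems use backpropagation to get all the gradients at once, and the quadratic objective is a stand-in for something like fill-in-the-gap log-probability:

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(params):
    # Stand-in objective; in the language-model case it would be the
    # log-probability of the word that fills the gap in a string of words.
    return -np.sum((params - 3.0) ** 2)

params = rng.normal(size=5)          # "start them off with random values"
lr, eps = 0.01, 1e-5
for step in range(2000):
    grad = np.empty_like(params)
    for i in range(len(params)):     # nudge one parameter at a time
        bumped = params.copy()
        bumped[i] += eps
        grad[i] = (objective(bumped) - objective(params)) / eps
    params += lr * grad              # move in the improving direction
print(params)                        # converges near the optimum at 3.0
```
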
One of his four ideas about dreaming:
GH: So a long time ago, there were things called Hopfield networks, and they would learn memories as local attractors. And Hopfield discovered that if you try and put too many memories in, they get confused. They'll take two local attractors and merge them into an attractor sort of halfway in between.

Then Francis Crick and Graeme Mitchison came along and said, we can get rid of these false minima by doing unlearning. So we turn off the input, we put the neural network into a random state, we let it settle down, and we say that's bad, change the connection so you don't settle to that state, and if you do a bit of that, it will be able to store more memories.
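
A toy sketch covering both of these paragraphs: Hebbian storage and attractor recall in a Hopfield net, followed by Crick-and-Mitchison-style unlearning of whatever state the net settles into from a random start. The network size, learning rates, round counts, and the synchronous update rule are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hopfield network: store +/-1 patterns with a Hebbian rule, recall by
# settling into a local attractor.
n = 100
memories = np.sign(rng.normal(size=(3, n)))
W = sum(np.outer(m, m) for m in memories) / n
np.fill_diagonal(W, 0.0)

def settle(state, steps=20):
    for _ in range(steps):
        state = np.sign(W @ state)   # synchronous update, for simplicity
        state[state == 0] = 1.0
    return state

noisy = memories[0].copy()
noisy[rng.choice(n, size=10, replace=False)] *= -1  # corrupt 10 bits
print(np.mean(settle(noisy) == memories[0]))        # recall: close to 1.0
# Storing many more patterns (beyond roughly 0.14 * n) makes attractors
# merge into spurious blends -- the overloading Hinton describes.

# Crick-Mitchison unlearning: turn off the input, start from a random
# state, let it settle, and weaken whatever attractor it fell into.
for _ in range(50):
    fantasy = settle(np.sign(rng.normal(size=n)))
    W -= 0.01 * np.outer(fantasy, fantasy) / n      # "that's bad"
    np.fill_diagonal(W, 0.0)
# A little of this suppresses the spurious minima; too much would also
# erode the real memories.
```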

And then Terry Sejnowski and I came along and said, “Look, if we have not just the neurons where you're storing the memories, but lots of other neurons too, can we find an algorithm that will use all these other neurons to help restore memories?” And it turned out in the end, we came up with the Boltzmann machine learning algorithm, which had a very interesting property: I show you data, and it sort of rattles around the other units until it's got a fairly happy state, and once it's done that, it increases the strengths of the connections between pairs of units that are both active.

You also have to have a phase where you cut it off from the input, you let it rattle around and settle into a state it’s happy with, so now it's having a fantasy, and once it’s had the fantasy you say, “Take all pairs of neurons that are active and decrease the strength of the connection.”

So I'm explaining the algorithm to you just as a procedure. But actually, that algorithm is the result of doing some math and saying, “How should you change these connection strengths, so that this neural network with all these hidden units finds the data unsurprising?” And it has to have this other phase, what we call the negative phase, when it's running with no input, and it's unlearning whatever state it settles into.
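
A toy sketch of the two-phase procedure as described: a positive phase with the visible units clamped to data (rattle around, then strengthen connections between co-active pairs) and a negative phase run free (the "fantasy", then weaken). Single samples stand in for the equilibrium averages the real algorithm needs, and the sizes, sweep counts, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy Boltzmann machine: n_v visible + n_h hidden binary units with
# symmetric weights between all pairs.
n_v, n_h = 6, 4
n = n_v + n_h
W = np.zeros((n, n))

def gibbs_sweep(state, clamp_visible):
    # "Rattle around": resample each unit given the rest; optionally keep
    # the visible units clamped to the data.
    start = n_v if clamp_visible else 0
    for i in range(start, n):
        state[i] = float(rng.random() < sigmoid(W[i] @ state))
    return state

data = (rng.random((200, n_v)) < 0.3).astype(float)  # stand-in data
lr = 0.01
for v in data:
    # Positive phase: clamp visibles, settle, strengthen co-active pairs.
    s = np.concatenate([v, (rng.random(n_h) < 0.5).astype(float)])
    for _ in range(10):
        s = gibbs_sweep(s, clamp_visible=True)
    pos = np.outer(s, s)

    # Negative phase: cut off the input, settle into a "fantasy",
    # weaken co-active pairs.
    s = (rng.random(n) < 0.5).astype(float)
    for _ in range(10):
        s = gibbs_sweep(s, clamp_visible=False)
    neg = np.outer(s, s)

    W += lr * (pos - neg)  # outer products are symmetric, so W stays symmetric
    np.fill_diagonal(W, 0.0)
```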

We dream for many hours every night. And if I wake you up at random, you can tell me what you were just dreaming about because it’s in your short-term memory. So we know you dream for many hours, but when you wake up in the morning, you can remember the last dream but you can't remember all the others—which is lucky, because you might mistake them for reality. So why is it we don't remember our dreams at all? And Crick’s view was, the whole point of dreaming is to unlearn those things. So you put the learning all in reverse.

And Terry Sejnowski and I showed that, actually, that is a maximum-likelihood learning procedure for Boltzmann machines. So that's one theory of dreaming.
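
The maximum-likelihood connection can be written in one line; this is the classic Boltzmann machine learning rule (Ackley, Hinton & Sejnowski, 1985):

```latex
% Maximum-likelihood gradient for Boltzmann machine weights w_{ij},
% with s_i the binary state of unit i:
\frac{\partial \log p(\text{data})}{\partial w_{ij}}
  \;=\; \langle s_i s_j \rangle_{\text{data}} \;-\; \langle s_i s_j \rangle_{\text{free}}
% The clamped (positive-phase) term strengthens connections; the
% free-running (negative-phase) term is the "dream" statistics, so
% subtracting it is exactly the unlearning Crick proposed.
```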
