
Monday, April 24, 2023

How Not To Destroy the World With AI - Stuart Russell

From the YouTube page:

About Talk:

It is reasonable to expect that artificial intelligence (AI) capabilities will eventually exceed those of humans across a range of real-world decision-making scenarios. Should this be a cause for concern, as Alan Turing and others have suggested? Will we lose control over our future? Or will AI complement and augment human intelligence in beneficial ways? It turns out that both views are correct, but they are talking about completely different forms of AI. To achieve the positive outcome, a fundamental reorientation of the field is required. Instead of building systems that optimize arbitrary objectives, we need to learn how to build systems that will, in fact, be beneficial for us. Russell will argue that this is possible as well as necessary. The new approach to AI opens up many avenues for research and brings into sharp focus several questions at the foundations of moral philosophy.

About Speaker:

Stuart Russell, OBE, is a professor of computer science at the University of California, Berkeley, and an honorary fellow of Wadham College at the University of Oxford. He is a leading researcher in artificial intelligence and the author, with Peter Norvig, of “Artificial Intelligence: A Modern Approach,” the standard text in the field. He has been active in arms control for nuclear and autonomous weapons. His latest book, “Human Compatible,” addresses the long-term impact of AI on humanity.

How do we get the machine to assist humans? (c. 36:26):

So we actually need to get rid of the standard model. We need a different model, right? This is the standard model: machines are intelligent to the extent that their actions can be expected to achieve their objectives.

Instead, we need the machines to be beneficial to us, right? We don't want this sort of pure intelligence that once it has the objective is off doing its thing, right? We want the systems to be beneficial, meaning that their actions can be expected to achieve our objectives.

And how do we do that? [...] You do not build in a fixed, known objective up front. Instead, the machine knows that it doesn't know what the objective is, but it still needs a way of grounding its choices over the long run.

And the evidence about human preferences, we'll say, flows from human behavior. [...] So we call this an assistance game. It involves at least one person, at least one machine, and the machine is designed to be of assistance to the human. [...] The key point is there's a priori uncertainty about what those utility functions are. So it's got to optimize something, but it doesn't know what it is.

And, you know, in principle you can just solve these games offline and then look at the solution and how it behaves. As the solution unfolds, effectively, information about the human utilities is flowing at runtime based on the human actions. And the humans can take deliberate actions to try to convey information, and that's part of the solution of the game. They can give commands, they can prohibit you from doing things, they can reward you for doing the right thing. [...]
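To make the "information flowing at runtime" point concrete, here is a minimal sketch (my own toy example, not from the talk) of a one-human, one-robot assistance game: the robot starts uncertain about which drink the human prefers, watches one noisily rational human choice, updates its belief, and then acts to maximize expected human utility under that belief. All names and numbers are hypothetical.

```python
import math

# Hypothetical one-human, one-robot assistance game sketch.
# Hidden parameter theta: which drink the human actually prefers.
# The robot has a prior over theta and never observes it directly;
# it only observes the human's behavior.
THETAS = ["prefers_coffee", "prefers_tea"]
prior = {t: 0.5 for t in THETAS}

def human_utility(theta, drink):
    """Human utility for each drink under each preference parameter (made-up numbers)."""
    table = {
        ("prefers_coffee", "coffee"): 1.0,
        ("prefers_coffee", "tea"): 0.2,
        ("prefers_tea", "coffee"): 0.2,
        ("prefers_tea", "tea"): 1.0,
    }
    return table[(theta, drink)]

def human_choice_prob(theta, drink, beta=4.0):
    """Boltzmann-rational human: more likely to choose higher-utility drinks."""
    scores = {d: math.exp(beta * human_utility(theta, d)) for d in ["coffee", "tea"]}
    return scores[drink] / sum(scores.values())

# Runtime: the robot observes the human pour themselves a coffee,
# and does a Bayesian update over the hidden preference parameter.
observed = "coffee"
posterior = {t: prior[t] * human_choice_prob(t, observed) for t in THETAS}
z = sum(posterior.values())
posterior = {t: p / z for t, p in posterior.items()}

# The robot then picks the drink maximizing *expected human* utility under
# its posterior -- it is optimizing something it does not know exactly.
def expected_utility(drink):
    return sum(posterior[t] * human_utility(t, drink) for t in THETAS)

best = max(["coffee", "tea"], key=expected_utility)
print("posterior:", posterior)   # belief shifts toward prefers_coffee
print("robot prepares:", best)   # coffee
```

In the full game, the preference parameter would be a hidden state variable of the partially observable MDP Russell mentions below, and every human action an observation; the single Bayesian update above is just the smallest version of that flow.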

So in some sense, you know, the entire written record of humanity is a record of humans doing things and other people being upset about it, right? All of that information is useful for understanding, algorithmically, what human preference structures really are.

Yeah, you know, we can solve these, and in fact the one-machine, one-human game can be reduced to a partially observable MDP.

And for small versions of that we can solve it exactly, and actually look at the equilibrium of the game and how the agents behave. But an important point here: the word alignment is often used in discussing these kinds of things.

And as Ken mentioned, it's related to inverse reinforcement learning, the learning of human preference structures by observing behavior. But alignment gives you this idea that we're gonna align the machine and the human and then off they go, right? That's never going to happen in practice.

The machines are always going to have considerable uncertainty about human preference structures, right? Partly because there are just whole areas of the universe where there's no experience and no evidence from human behavior about how we would behave or how we would choose in those circumstances. And of course, you know, we don't know our own preferences in those areas. [...]

So when you look at these solutions, how does the robot behave? If it's playing this game, it actually defers to human requests and commands. It behaves cautiously because it doesn't wanna mess with parts of the world where it's not sure about your preferences. In the extreme case, it's willing to be switched off.

So in the interest of time, I'm gonna have to skip over the proof of that, which is proved with a little game. But basically we can show very straightforwardly that as long as the robot is uncertain about how the human is going to choose, then it has a positive incentive to allow itself to be switched off, right? It gains information by leaving that choice available for the human. And it only closes off that choice when it has, or at least when it believes it has, perfect knowledge of human preferences.
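A back-of-the-envelope version of that incentive argument, with made-up numbers rather than Russell's own example: the robot's candidate action has uncertain utility u to the human; acting immediately is worth E[u], while deferring and leaving the off switch available is worth E[max(u, 0)], which is never worse and strictly better whenever the robot thinks u might be negative.

```python
# Minimal numeric sketch of the off-switch incentive, with made-up beliefs.
# The robot believes its proposed action has utility u to the human,
# with the (hypothetical) distribution below.
belief = {  # u -> probability
    +2.0: 0.4,
    +0.5: 0.3,
    -1.0: 0.3,   # some chance the human would hate it
}

# Option A: just act. Value = E[u].
act_directly = sum(u * p for u, p in belief.items())

# Option B: defer -- propose the action and leave the off switch available.
# A human who knows their own preferences allows the action if u >= 0
# (value u) and switches the robot off if u < 0 (value 0).
defer = sum(max(u, 0.0) * p for u, p in belief.items())

print(f"act directly:   {act_directly:.2f}")  # 0.65
print(f"defer to human: {defer:.2f}")         # 0.95
# defer >= act_directly always, and strictly greater whenever P(u < 0) > 0,
# so an uncertain robot gains by leaving the switch-off option open.
```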

Indexical goals (51:33):

One might initially think, well, you know what they're doing: if they're learning to imitate humans, then maybe, almost coincidentally, that will end up with them being aligned with what humans want. All right? So perhaps we are accidentally solving the alignment problem here by the way we're training these systems. And the answer to that is: it depends. It depends on the type of goal that gets learned.

And I'll distinguish two types of goals. There's what we call common goals, things like painting the wall or mitigating climate change, where if you do it, I'm happy; if I do it, you're happy; we're all happy, right? These are goals where any agent doing these things would make all the agents happy.

Then there are indexical goals, meaning goals that are indexical to the individual who has the goal. So drinking coffee, right? I'm not happy if the robot drinks the coffee, right? What I want to have happen is, if I'm drinking coffee and the robot does some inverse reinforcement learning: hey, Stuart likes coffee, I'll make Stuart a cup of coffee in the morning. The robot drinking a coffee is not the same, right?

So this is what we mean by an indexical goal. And becoming ruler of the universe, right, is not the same if it's me versus the robot. Okay? And obviously if systems are learning indexical goals, that's arbitrarily bad as they get more and more capable, okay? And unfortunately, humans have a lot of indexical goals. We do not want AI systems to learn from humans in this way.

Imitation learning is not alignment.
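One toy way to see the distinction (my construction, not the speaker's): score outcomes from the human's point of view. A common goal such as the wall being painted does not mention who achieved it, so an imitator that copies it verbatim still helps; an indexical goal such as "the actor has coffee" changes meaning when the learner substitutes itself for the demonstrator.

```python
# Toy illustration (not from the talk) of common vs. indexical goals.
# An outcome is scored from the human's point of view.

def human_value(outcome):
    """What the human actually cares about in this toy world."""
    score = 0.0
    if outcome["wall_painted"]:           # common goal: anyone doing it helps
        score += 1.0
    if "human" in outcome["has_coffee"]:  # indexical: only the human's coffee counts
        score += 1.0
    return score

# Demonstration the learner observes: the human paints the wall and drinks coffee.
# A naive imitator copies the actor-relative description of that behavior,
# substituting itself for the demonstrator.
def imitator_acts():
    return {"wall_painted": True, "has_coffee": {"robot"}}

# An assistance-style learner instead infers the human-indexed goal
# ("Stuart likes coffee") and acts so the *human* ends up with coffee.
def assistant_acts():
    return {"wall_painted": True, "has_coffee": {"human"}}

print("imitator outcome value for the human: ", human_value(imitator_acts()))   # 1.0
print("assistant outcome value for the human:", human_value(assistant_acts()))  # 2.0
```

The imitator satisfies the literal, actor-relative description of the demonstration yet delivers less of what the human actually values, which is the sense in which imitation learning is not alignment.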
