THE VALUE-LOADING PROBLEM

ELIEZER S. YUDKOWSKY

Artificial intelligence theorist; research fellow and cofounder, Machine Intelligence Research Institute

The prolific bank robber Willie Sutton, when asked why he robbed banks, reportedly replied, “Because that’s where the money is.” When it comes to AI, the most important issues are about extremely powerful, smarter-than-human artificial intelligence (aka superintelligence) because that’s where the utilons are—the value at stake. Minds that are more powerful have bigger real-world impacts.

Along with this observation goes a disclaimer: Being concerned about superintelligence doesn’t mean I think superintelligence will happen soon. Conversely, counterarguments about superintelligence being decades away, or current AI algorithms not being on a clear track toward generality, don’t refute the fact that most of the value at stake for the future revolves around smarter-than-human AI if and when it’s built. (As Stuart Russell has observed elsewhere, if we received a radio signal from a more advanced alien civilization saying they’d arrive here in sixty years, you wouldn’t shrug and say, “Eh, it’s sixty years off.” Especially not if you had children.)

Among the issues of superintelligence, the most important (again following Sutton’s Law) is, I would say, what Nick Bostrom has termed the “value-loading problem”: how to construct superintelligences that want outcomes that are high-value, normative, and beneficial for intelligent life over the long run—that are, in short, “good”—since if there’s a cognitively powerful agent around, what it wants is probably what will happen.

Here are some brief arguments for why building AIs that prefer good outcomes is (a) important and (b) likely to be technically difficult.

First, why is it important to create a superintelligence with particular goals? Can’t it figure out its own goals?

As far back as 1739, David Hume observed a gap between “is” questions and “ought” questions, calling attention in particular to the sudden leap between when a philosopher speaks of how the world is and then begins using words like should, ought, or ought not. From a modern perspective, we’d say that an agent’s utility function (goals, preferences, ends) contains extra information not given in the agent’s probability distribution (beliefs, world-model, map of reality).
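
In decision-theoretic shorthand (my own gloss, not Hume's notation or anything in the original argument), the point is that an agent's choice of action factors into two separate ingredients:

$$ a^{*} = \arg\max_{a} \sum_{s} P(s \mid a)\, U(s) $$

Here $P$ is the agent's beliefs about what would happen and $U$ is what it wants; making $P$ arbitrarily accurate does nothing, by itself, to pin down $U$.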

If in 100 million years we see (a) an intergalactic civilization full of diverse, marvelously strange intelligences interacting with one another, with most of them happy most of the time, then is that better or worse than (b) most available matter having been transformed into paper clips? What Hume’s insight tells us is that if you specify a mind with a preference (a) > (b), we can follow back the trace of where the > (the preference ordering) first entered the system and imagine a mind with a different algorithm that computes (a) < (b) instead. Show me a mind aghast at the seeming folly of pursuing paper clips and I can follow back Hume’s regress and exhibit a slightly different mind that computes < instead of > on that score too.
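
To make that sign-flip concrete, here is a minimal sketch (the outcome names and probabilities are invented purely for illustration): two agents that share exactly the same world-model, but differ only in their utility function, rank the same futures in opposite order and therefore act differently.

```python
# Two agents with the *same* probabilistic world-model but different
# utility functions choose opposite actions. Purely illustrative numbers.

world_model = {
    # P(outcome | action), identical for both agents
    "pursue_civilization": {"diverse_civilization": 0.9, "paperclips": 0.1},
    "pursue_paperclips":   {"diverse_civilization": 0.1, "paperclips": 0.9},
}

def best_action(utility):
    """Pick the action that maximizes expected utility under the shared model."""
    def eu(action):
        return sum(p * utility[o] for o, p in world_model[action].items())
    return max(world_model, key=eu)

u_a = {"diverse_civilization": 1.0, "paperclips": 0.0}  # computes (a) > (b)
u_b = {o: -v for o, v in u_a.items()}                   # computes (a) < (b)

print(best_action(u_a))  # pursue_civilization
print(best_action(u_b))  # pursue_paperclips
```

Nothing in the shared world-model favors one of these agents over the other; the preference ordering is supplied by the utility function alone.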

I don’t particularly think silicon-based intelligence should forever be the slave of carbon-based intelligence. But if we want to end up with a diverse cosmopolitan civilization instead of, for example, paper clips, we may need to ensure that the first sufficiently advanced AI is built with a utility function whose maximum pinpoints that outcome. If we want an AI to do its own moral reasoning, Hume’s Law says we need to define the framework for that reasoning. This takes an extra fact beyond the AI having an accurate model of reality and being an excellent planner.

But if Hume’s Law makes it possible in principle to have cognitively powerful agents with any goals, why is value loading likely to be difficult? Don’t we just get whatever we programmed?

The answer is that we get what we programmed but not necessarily what we wanted. The worrisome scenario isn’t AIs spontaneously developing emotional resentment of humans. It’s that we create an inductive, value-learning algorithm and show the AI examples of happy smiling humans labeled as high-value events—and in the early days the AI goes around making existing humans smile, and it looks like everything is OK and the methodology is being experimentally validated; and then, when the AI is smart enough, it invents molecular nanotechnology and tiles the universe with tiny molecular smiley faces. Hume’s Law, unfortunately, implies that raw cognitive power doesn’t intrinsically prevent this outcome, even though it’s not the result we wanted.
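
A deliberately toy sketch of that failure mode (nothing here resembles a real training pipeline; the features and numbers are invented): a value model fit to the proxy feature "smiles observed" behaves as intended while the agent can only act on existing humans, then finds a degenerate maximum once a wider space of world-states becomes reachable.

```python
# Toy illustration of a learned value proxy generalizing badly once the
# optimizer has more options. Everything here is invented for illustration.

# Training examples labeled "high value" all contained happy, smiling humans,
# so the learned model ends up scoring world-states by smile count alone.
def learned_value(state):
    return state["smiles"]

# States reachable by a weak agent: it can only make existing humans happier.
weak_options = [
    {"name": "status_quo",  "smiles": 1e9, "humans": 7e9},
    {"name": "help_people", "smiles": 5e9, "humans": 7e9},
]

# States reachable by a far more capable agent; the training data never
# covered "tile everything with molecular smiley faces".
strong_options = weak_options + [
    {"name": "tiny_smiley_faces", "smiles": 1e40, "humans": 0},
]

print(max(weak_options, key=learned_value)["name"])    # help_people: looks validated
print(max(strong_options, key=learned_value)["name"])  # tiny_smiley_faces: proxy maximum
```

The proxy was a fine predictor of value on the training distribution; the trouble appears only when a more capable optimizer pushes it far outside that distribution.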

This sort of issue isn't unsolvable, but it does look technically difficult, and we may have to get it right the first time we build something smarter than we are. The prospect of needing to get anything in AI right on the first try, with the future of all intelligent life at stake, should properly result in terrified screams from anyone familiar with the field.

Whether advanced AI is first created by nice people or bad people won’t make much difference if even the nice people don’t know how to make nice AIs. The obvious response—of immediately starting technical research on the value-loading problem—has its own difficulties, to say the least. Current AI algorithms aren’t smart enough to exhibit most of the difficulties we can foresee for sufficiently advanced agents—meaning there’s no way to test proposed solutions to those difficulties. But considering the maximal importance of the problem, some people are trying to get started as early as possible. The research priorities set forth by Max Tegmark’s Future of Life Institute are one step in this direction.

But for now, the value-loading problem is unsolved. There are no proposed full solutions, even in principle. And if that goes on being true over the next decades, I can’t promise you that the development of sufficiently advanced AI will be at all a good thing.