Panasonic Professor of Robotics, emeritus, MIT; founder, chair, and CTO, Rethink Robotics; author, Flesh and Machines

“Think” and “intelligence” are both what Marvin Minsky has called suitcase words—words into which we pack many meanings so we can talk about complex issues in shorthand. When we look inside these words, we find many different aspects, mechanisms, and levels of understanding. This makes answering the perennial questions of “Can machines think?” or “When will machines reach human-level intelligence?” difficult. The suitcase words are used to cover both specific performance demonstrations by machines and the more general competence that humans might have. We generalize from performance to competence and grossly overestimate the capabilities of machines—those of today and of the next few decades.

In 1997, a supercomputer beat world chess champion Garry Kasparov. Today dozens of programs running on laptop computers have higher chess rankings than those ever achieved by humans. Computers can definitely perform better than humans at playing chess. But they have nowhere near human-level competence at chess.

All chess-playing programs use Turing’s brute-force tree search method with heuristic evaluation. By the 1970s, computers were so fast that this approach overwhelmed AI programs that tried to play chess with processes emulating how people reported they thought about their next move, so those approaches were largely abandoned. Today’s chess programs have no way of determining why a particular move is “better” than another move, save that it takes the game to a part of the tree where the opponent has fewer good options. A human player can make generalizations about why certain types of moves are good and use those generalizations to teach another player. Brute-force programs cannot teach a human player except by being a sparring partner; it’s up to the humans to make the inferences and analogies and to do any learning on their own. The chess program doesn’t know it’s outsmarting the person, doesn’t know it’s a teaching aid, doesn’t know it’s playing something called chess, nor even what “playing” is. Making brute-force chess playing perform better than any human gets us no closer to competence in chess.
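The brute-force method described here (depth-limited tree search, with a heuristic evaluation standing in for positions too deep to explore) can be sketched in a few lines. This is a minimal illustration on a toy take-away game; the game, the heuristic, and all names are invented for the example and come from no real chess engine:

```python
# Negamax sketch of brute-force game-tree search with a heuristic cutoff,
# on a toy game: players alternately remove 1-3 stones from a pile, and
# whoever takes the last stone wins. Everything here is illustrative.

def heuristic(stones):
    # Stand-in for a chess engine's evaluation function (material count,
    # piece activity, ...). For this toy game we claim ignorance: 0.
    return 0

def negamax(stones, depth):
    # Value of the position for the player about to move: +1 win, -1 loss.
    if stones == 0:
        return -1  # the opponent took the last stone; we have lost
    if depth == 0:
        return heuristic(stones)  # search cutoff: fall back on the heuristic
    return max(-negamax(stones - take, depth - 1)
               for take in (1, 2, 3) if take <= stones)

def best_move(stones, depth=10):
    # Pick the move leading to the subtree where the opponent is worst off.
    moves = [take for take in (1, 2, 3) if take <= stones]
    return max(moves, key=lambda take: -negamax(stones - take, depth - 1))

print(best_move(5))  # 1: leaves 4 stones, a losing position for the opponent
```

The essay’s point survives even in the sketch: nothing in this code knows why leaving four stones is bad. It merely finds the branch where the opponent has no good options.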

Now consider deep learning, which has caught people’s imaginations over the last year or so. It’s an update of backpropagation, a thirty-year-old learning algorithm loosely based on abstracted models of neurons. Layers of neurons map from a signal, such as the amplitude of a sound wave or the brightness of pixels in an image, to increasingly higher-level descriptions of the full meaning of the signal, such as words for sounds or objects in images. Originally, backpropagation could work practically with only two or three layers of neurons, so preprocessing steps were needed to turn the raw signals into more structured data before the learning algorithms were applied. The new versions work with more layers of neurons, making the networks deeper—hence the name deep learning. Now the early processing steps are also learned, and, without the misguided biases of human design, the new algorithms are spectacularly better than the algorithms of just three years ago, which is why they’ve caught people’s imaginations. They rely on massive amounts of computing power in server farms and on very large data sets that didn’t formerly exist. But, critically, they also rely on new scientific innovations.
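Backpropagation itself fits on a page. The sketch below trains a tiny two-layer sigmoid network on XOR in pure Python; the network size, learning rate, and epoch count are arbitrary choices for the illustration, not details of any system the essay discusses:

```python
# Minimal backpropagation sketch: a 2-4-1 sigmoid network learning XOR.
# All hyperparameters are arbitrary choices for the illustration.
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
H = 4  # hidden units
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def forward(x):
    h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2)) + b1[j])
         for j in range(H)]
    y = sigmoid(sum(w2[j] * h[j] for j in range(H)) + b2)
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

def train_step(lr=0.5):
    global b2
    for x, t in data:
        h, y = forward(x)
        # Output-layer error, pushed backward through the sigmoid.
        dy = (y - t) * y * (1 - y)
        # Hidden-layer errors: the "backpropagation" step proper.
        dh = [dy * w2[j] * h[j] * (1 - h[j]) for j in range(H)]
        for j in range(H):
            w2[j] -= lr * dy * h[j]
            for i in range(2):
                w1[j][i] -= lr * dh[j] * x[i]
            b1[j] -= lr * dh[j]
        b2 -= lr * dy

before = loss()
for _ in range(2000):
    train_step()
after = loss()
print(after < before)  # training reduces the error
```

A “deep” version is the same gradient-chaining idea repeated through many more layers; making that work at scale is where the server farms, the large data sets, and the new scientific innovations come in.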

A well-known example of their performance is labeling an image (in English) as “a baby with a stuffed toy.” When you look at the image, that’s what you see. The algorithm has performed very well at labeling the image, much better than AI practitioners would have predicted. But it doesn’t have the full competence that a person who could label that same image would have.

The learning algorithm knows there’s a baby in the image, but it doesn’t know the structure of a baby, and it doesn’t know where the baby is in the image. A current deep-learning algorithm can only assign a probability to each pixel—the probability that that particular pixel is part of the baby. Whereas a person can see that the baby occupies the middle quarter of the image, today’s algorithm has only a probabilistic idea of the baby’s spatial extent. It cannot apply an exclusionary rule and say that non-zero-probability pixels at opposite extremes of the image cannot both be parts of the baby. If we look inside the neuron layers, it might be that one of the higher-level learned features is an eyelike patch of image and another is a footlike patch of image, but the current algorithm cannot discern the constraints on what spatial relationships could possibly be valid between eyes and feet in an image, and thus it could be fooled by a grotesque collage of baby body parts, labeling it a baby. No person would be fooled; anyone would immediately know exactly what it was: a grotesque collage of baby body parts. Furthermore, the current algorithm is useless for telling a robot where to go in space to pick up that baby, or where to hold a bottle and feed the baby, or where to reach to change its diaper. Today’s algorithm has nothing like human-level competence in understanding images.
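The missing exclusionary rule is easy to illustrate. The toy classifier below (every value and name here is invented for the example) labels an image “baby” whenever the per-pixel probabilities are high on average, so a coherent central blob and a scattered collage of baby-like pixels are indistinguishable to it:

```python
# Toy illustration: a classifier that only aggregates per-pixel
# probabilities cannot enforce spatial coherence. All values invented.

def label(prob_map, threshold=0.3):
    # "baby" if the average per-pixel baby-probability is high enough;
    # note there is no check that the high-probability pixels form one
    # spatially coherent region.
    pixels = [p for row in prob_map for p in row]
    return "baby" if sum(pixels) / len(pixels) > threshold else "no baby"

# High probabilities in one coherent central blob ...
coherent = [[0.1, 0.2, 0.2, 0.1],
            [0.2, 0.9, 0.9, 0.2],
            [0.2, 0.9, 0.9, 0.2],
            [0.1, 0.2, 0.2, 0.1]]

# ... versus the same probability mass scattered to opposite corners,
# the collage case a person would reject instantly.
scattered = [[0.9, 0.2, 0.2, 0.9],
             [0.2, 0.1, 0.1, 0.2],
             [0.2, 0.1, 0.1, 0.2],
             [0.9, 0.2, 0.2, 0.9]]

print(label(coherent), label(scattered))  # both come out "baby"
```

A person applies the exclusionary rule without effort; an aggregate score like this one cannot, which is one concrete sense in which performance at labeling falls short of competence at seeing.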

Work is under way to add focus of attention and handling of consistent spatial structure to deep learning. That’s the hard work of science and research, and we have no idea how hard it will be, nor how long it will take, nor whether the whole approach will reach a dead end. It took some thirty years to go from backpropagation to deep learning, but along the way many researchers were sure there was no future in backpropagation. They were wrong, but it wouldn’t have been surprising if they had been right, as we knew all along that the backpropagation algorithm is not what happens inside people’s heads.

The fears of runaway AI systems either conquering humans or making them irrelevant aren’t even remotely well grounded. Misled by suitcase words, people are making category errors in fungibility of capabilities—category errors comparable to seeing the rise of more efficient internal combustion engines and jumping to the conclusion that warp drives are just around the corner.