I recently saw a person walking down a street described as a “penistrian.” You caught that, right? A “penistrian” instead of a “pedestrian.” I saw it in a large dataset of typos people make. A person sees someone walking and writes the word “penis.” Has to mean something, right?

I recently learned of a man who dreamed of eating a banana while walking to the altar to marry his wife. I saw it in a large dataset of dreams people record on an app. A man imagines marrying a woman while eating a phallic-shaped food. That also has to mean something, right?

Was Sigmund Freud right? Since his theories first came to public attention, the most honest answer to this question would be a shrug. It was Karl Popper, the Austrian-British philosopher, who made this point clearest. Popper famously claimed that Freud’s theories were not falsifiable. There was no way to test whether they were true or false.

Freud could say the person writing of a “penistrian” was revealing a possibly repressed sexual desire. The person could respond that she wasn’t revealing anything; that she could have just as easily made an innocent typo, such as “pedaltrian.” It would be a he-said, she-said situation. Freud could say the gentleman dreaming of eating a banana on his wedding day was secretly thinking of a penis, revealing his desire to really marry a man rather than a woman. The gentleman could say he just happened to be dreaming of a banana. He could have just as easily been dreaming of eating an apple as he walked to the altar. It would be he-said, he-said. There was no way to put Freud’s theory to a real test.

Until now, that is.

Data science makes many parts of Freud falsifiable—it puts many of his famous theories to the test. Let’s start with phallic symbols in dreams. Using a huge dataset of recorded dreams, we can readily note how frequently phallic-shaped objects appear. Food is a good place to focus this study. It shows up in many dreams, and many foods are shaped like phalluses—bananas, cucumbers, hot dogs, etc. We can then measure the factors that might make us dream more about certain foods than others—how frequently they are eaten, how tasty most people find them, and, yes, whether they are phallic in nature.

We can test whether two foods, both of which are equally popular, but one of which is shaped like a phallus, appear in dreams in different amounts. If phallus-shaped foods are no more likely to be dreamed about than other foods, then phallic symbols are not a significant factor in our dreams. Thanks to Big Data, this part of Freud’s theory may indeed be falsifiable.

I received data from Shadow, an app that asks users to record their dreams. I coded the foods included in tens of thousands of dreams.

Overall, what makes us dream of foods? The main predictor is how frequently we consume them. The substance that is most dreamed about is water. The top twenty foods include chicken, bread, sandwiches, and rice—all notably un-Freudian.

The second predictor of how frequently a food appears in dreams is how tasty people find it. The two foods we dream about most often are the notably un-Freudian but famously tasty chocolate and pizza.

So what about phallic-shaped foods? Do they sneak into our dreams with unexpected frequency? Nope.

Bananas are the second most common fruit to appear in dreams. But they are also the second most commonly consumed fruit. So we don’t need Freud to explain how often we dream about bananas. Cucumbers are the seventh most common vegetable to appear in dreams. They are the seventh most consumed vegetable. So again their shape isn’t necessary to explain their presence in our minds as we sleep. Hot dogs are dreamed of far less frequently than hamburgers. This is true even controlling for the fact that people eat more burgers than dogs.

Overall, using a regression analysis (a method that allows social scientists to tease apart the impact of multiple factors) across all fruits and vegetables, I found that being shaped like a phallus did not make a food any more likely to appear in dreams than its popularity alone would predict. This theory of Freud’s is falsifiable and, at least according to my look at the data, false.
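For readers who want to see the shape of such a test, here is a minimal sketch in Python. It is not the analysis I ran: the foods, the scores, and the dream counts below are all invented for illustration, and `solve` and `ols` are hypothetical helper names. The idea is simply to regress dream mentions on consumption, tastiness, and a phallic-shape indicator, and then look at the coefficient on the indicator.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Ordinary least squares via the normal equations.

    Returns [intercept, coefficient for each predictor column in rows].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# HYPOTHETICAL data, invented for illustration; this is NOT the Shadow dataset.
# (food, consumption score, tastiness score, phallic-shaped? 1/0, dream mentions)
foods = [
    ("banana",    9,  6, 1, 50),
    ("apple",    10,  6, 0, 55),
    ("cucumber",  4,  3, 1, 18),
    ("tomato",    7,  5, 0, 36),
    ("hot dog",   5,  6, 1, 24),
    ("hamburger", 8,  8, 0, 52),
    ("pizza",     8, 10, 0, 70),
    ("chocolate", 7, 10, 0, 72),
]

rows = [(c, t, p) for _, c, t, p, _ in foods]
dreams = [d for *_, d in foods]
coefs = ols(rows, dreams)

for name, value in zip(["intercept", "consumption", "tastiness", "phallic"], coefs):
    print(f"{name:12s} {value: .3f}")
```

If shape mattered in the way Freud suggested, the coefficient on the phallic indicator would be clearly positive after accounting for consumption and tastiness. A real analysis would also need standard errors and far more foods than this toy table holds.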

Next, consider Freudian slips. The psychologist hypothesized that we use our errors—the ways we misspeak or miswrite—to reveal our subconscious desires, frequently sexual. Can we use Big Data to test this? Here’s one way: see if our errors—our slips—lean in the direction of the naughty. If our buried sexual desires sneak out in our slips, there should be a disproportionate number of errors that include words like “penis,” “cock,” and “sex.”

This is why I studied a dataset of more than 40,000 typing errors collected by Microsoft researchers. The dataset included mistakes that people make but then immediately correct. In these tens of thousands of errors, there were plenty of individuals committing errors of a sexual sort. There was the aforementioned “penistrian.” There was also someone who typed “sexurity” instead of “security” and “cocks” instead of “rocks.” But there were also plenty of innocent slips. People wrote of “pindows” and “fegetables,” “aftermoons” and “refriderators.”

So was the number of sexual slips unusual?

To test this, I first used the Microsoft dataset to model how frequently people mistakenly switch particular letters. I calculated how often they replace a t with an s, a g with an h. I then created a computer program that made mistakes in the way that people do. We might call it Error Bot. Error Bot replaced a t with an s with the same frequency that humans in the Microsoft study did. It replaced a g with an h as often as they did. And so on. I ran the program on the same words people had gotten wrong in the Microsoft study. In other words, the bot tried to spell “pedestrian” and “rocks,” “windows” and “refrigerator.” But it switched an r with a t as often as people do and wrote, for example, “tocks.” It switched an r with a c as often as humans do and wrote “cocks.”

So what do we learn from comparing Error Bot with normally careless humans? After making a few million errors, just from misplacing letters in the ways that humans do, Error Bot had made numerous mistakes of a Freudian nature. It misspelled “seashell” as “sexshell,” “lipstick” as “lipsdick,” and “luckiest” as “fuckiest,” along with many other similar mistakes. And here is the key point: Error Bot, which of course does not have a subconscious, was just as likely to make errors that could be perceived as sexual as real people were. With the caveat, as we social scientists like to say, that there needs to be more research, this means that humans are no more likely to make sexually oriented errors than chance alone would predict.
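The logic of Error Bot can be sketched in a few lines of Python. This is a simplified illustration and not the program I actually ran: it handles only single-letter substitutions, the handful of word pairs stands in for the Microsoft dataset, and the function names (`learn_substitutions`, `make_error`, `sexual_share`) are hypothetical.

```python
import random
from collections import Counter


def learn_substitutions(pairs):
    """Count letter substitutions (intended -> typed) in same-length word pairs."""
    counts = Counter()
    for intended, typed in pairs:
        if len(intended) != len(typed):
            continue  # this sketch ignores insertions and deletions
        for a, b in zip(intended, typed):
            if a != b:
                counts[(a, b)] += 1
    return counts


def make_error(word, counts, rng):
    """Apply one learned substitution to word, weighted by how often it was observed.

    Returns None if no learned substitution applies to any letter in the word.
    """
    options, weights = [], []
    for i, ch in enumerate(word):
        for (a, b), n in counts.items():
            if a == ch:
                options.append(word[:i] + b + word[i + 1:])
                weights.append(n)
    if not options:
        return None
    return rng.choices(options, weights=weights, k=1)[0]


def sexual_share(typos, flagged):
    """Fraction of typos that contain any flagged substring."""
    typos = [t for t in typos if t is not None]
    if not typos:
        return 0.0
    return sum(any(f in t for f in flagged) for t in typos) / len(typos)


# Toy stand-in for the real dataset: examples quoted in the text, nothing more.
human_pairs = [
    ("pedestrian", "penistrian"),
    ("security", "sexurity"),
    ("rocks", "cocks"),
    ("rocks", "tocks"),
    ("windows", "pindows"),
    ("vegetables", "fegetables"),
    ("afternoons", "aftermoons"),
    ("refrigerators", "refriderators"),
]
flagged = ["penis", "cock", "sex"]  # the words named earlier in the text

counts = learn_substitutions(human_pairs)
rng = random.Random(0)  # fixed seed so the run is repeatable
intended_words = sorted({intended for intended, _ in human_pairs})
bot_typos = [make_error(w, counts, rng) for w in intended_words for _ in range(1000)]

print("human share:", sexual_share([typed for _, typed in human_pairs], flagged))
print("bot share:  ", sexual_share(bot_typos, flagged))
```

The comparison of interest is between the two printed shares. With a sample this tiny the numbers mean nothing; the point is only to show the structure of the test, in which a program with no subconscious sets the baseline that human slips have to beat.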

In other words, for people to make errors such as “penistrian,” “sexurity,” and “cocks,” it is not necessary to have some connection between mistakes and the forbidden, some theory of the mind where people reveal their secret desires via their errors. These slips of the fingers can be explained entirely by the typical frequency of typos. People make lots of mistakes. And if you make enough mistakes, eventually you start saying things like “lipsdick,” “fuckiest,” and “penistrian.” If a monkey types long enough, he will eventually write “to be or not to be.” If a person types long enough, she will eventually write “penistrian.”

Freud’s theory that errors reveal our subconscious wants is indeed falsifiable—and, according to my analysis of the data, false.

Big Data tells us a banana is always just a banana and a “penistrian” just a misspelled “pedestrian.”

So was Freud totally off-target in all his theories? Not quite. When I first got access to PornHub data, I found a revelation there that struck me as at least somewhat Freudian. In fact, this is among the most surprising things I have found yet during my data investigations: a shocking number of people visiting mainstream porn sites are looking for portrayals of incest.

Of the top hundred searches by men on PornHub, one of the most popular porn sites, sixteen are looking for incest-themed videos. Fair warning—this is going to get a little graphic: they include “brother and sister,” “step mom fucks son,” “mom and son,” “mom fucks son,” and “real brother and sister.” The plurality of male incestuous searches are for scenes featuring mothers and sons. And women? Nine of the top hundred searches by women on PornHub are for incest-themed videos, and they feature similar imagery—though with the gender of any parent and child who is mentioned usually reversed. Thus the plurality of incestuous searches made by women are for scenes featuring fathers and daughters.

It’s not hard to locate in this data at least a faint echo of Freud’s Oedipal complex. He hypothesized a near-universal desire in childhood, which is later repressed, for sexual involvement with opposite-sex parents. If only the Viennese psychologist had lived long enough to turn his analytic skills to PornHub data, where interest in opposite-sex parents seems to be borne out by adults—with great explicitness—and little is repressed.

Of course, PornHub data can’t tell us for certain who people are fantasizing about when watching such videos. Are they actually imagining having sex with their own parents? Google searches can give some more clues that there are plenty of people with such desires.

Consider all searches of the form “I want to have sex with my . . .” The number one way to complete this search is “mom.” Overall, more than three-fourths of searches of this form are incestuous. And this is not due to the particular phrasing. Searches of the form “I am attracted to . . . ,” for example, are even more dominated by admissions of incestuous desires. Now I concede—at the risk of disappointing Herr Freud—that these are not particularly common searches: a few thousand people every year in the United States admitting an attraction to their mother. Someone would also have to break the news to Freud that Google searches, as will be discussed later in this book, sometimes skew toward the forbidden.

But still. There are plenty of inappropriate attractions that people have that I would have expected to have been mentioned more frequently in searches. Boss? Employee? Student? Therapist? Patient? Wife’s best friend? Daughter’s best friend? Wife’s sister? Best friend’s wife? None of these confessed desires can compete with mom. Maybe, combined with the PornHub data, that really does mean something.

And Freud’s general assertion that sexuality can be shaped by childhood experiences is supported elsewhere in Google and PornHub data, which reveals that men, at least, retain an inordinate number of fantasies related to childhood. According to searches from wives about their husbands, some of the top fetishes of adult men are the desire to wear diapers and wanting to be breastfed, particularly, as discussed earlier, in India. Moreover, cartoon porn—animated explicit sex scenes featuring characters from shows popular among adolescent boys—has achieved a high degree of popularity. Or consider the occupations of women most frequently searched for in porn by men. Men who are 18–24 years old search most frequently for women who are babysitters. As do 25–64-year-old men. And men 65 years and older. And for men in every age group, teacher and cheerleader are both in the top four. Clearly, the early years of life seem to play an outsize role in men’s adult fantasies.

I have not yet been able to use all this unprecedented data on adult sexuality to figure out precisely how sexual preferences form. Over the next few decades, other social scientists and I will be able to create new, falsifiable theories on adult sexuality and test them with actual data.

Already I can predict some basic themes that will undoubtedly be part of a data-based theory of adult sexuality. It is clearly not going to be the identical story to the one Freud told, with his particular, well-defined, universal stages of childhood and repression. But, based on my first look at PornHub data, I am absolutely certain the final verdict on adult sexuality will feature some key themes that Freud emphasized. Childhood will play a major role. So will mothers.

It likely would have been impossible to analyze Freud in this way ten years ago. It certainly would have been impossible eighty years ago, when Freud was still alive. So let’s think through why these data sources helped. This exercise can help us understand why Big Data is so powerful.

Remember, we have said that just having mounds and mounds of data by itself doesn’t automatically generate insights. Data size, by itself, is overrated. Why, then, is Big Data so powerful? Why will it create a revolution in how we see ourselves? There are, I claim, four unique powers of Big Data. This analysis of Freud provides a good illustration of them.

You may have noticed, to begin with, that we’re taking pornography seriously in this discussion of Freud. And we are going to utilize data from pornography frequently in this book. Somewhat surprisingly, porn data is rarely utilized by sociologists, most of whom are comfortable relying on the traditional survey datasets they have built their careers on. But a moment’s reflection shows that the widespread use of porn—and the search and views data that comes with it—is the most important development in our ability to understand human sexuality in, well . . . Actually, it’s probably the most important ever. It is data that Schopenhauer, Nietzsche, Freud, and Foucault would have drooled over. This data did not exist when they were alive. It did not exist a couple decades ago. It exists now. There are many unique data sources, on a range of topics, that give us windows into areas about which we could previously just guess. Offering up new types of data is the first power of Big Data.

The porn data and the Google search data are not just new; they are honest. In the pre-digital age, people hid their embarrassing thoughts from other people. In the digital age, they still hide them from other people, but not from the internet and in particular sites such as Google and PornHub, which protect their anonymity. These sites function as a sort of digital truth serum—hence our ability to uncover a widespread fascination with incest. Big Data allows us to finally see what people really want and really do, not what they say they want and say they do. Providing honest data is the second power of Big Data.

Because there is now so much data, there is meaningful information on even tiny slices of a population. We can compare, say, the number of people who dream of cucumbers versus those who dream of tomatoes. Allowing us to zoom in on small subsets of people is the third power of Big Data.

Big Data has one more impressive power—one that was not utilized in my quick study of Freud but could be in a future one: it allows us to undertake rapid, controlled experiments. This allows us to test for causality, not merely correlations. These kinds of tests are mostly used by businesses now, but they will prove a powerful tool for social scientists. Allowing us to do many causal experiments is the fourth power of Big Data.

Now it is time to unpack each of these powers and explore exactly why Big Data matters.