If you’re thirty-three years old and have attended a few Thanksgivings in a row without a date, the topic of mate choice is likely to arise. And just about everybody will have an opinion.
“Seth needs a crazy girl, like him,” my sister says.
“You’re crazy! He needs a normal girl, to balance him out,” my brother says.
“Seth’s not crazy,” my mother says.
“You’re crazy! Of course, Seth is crazy,” my father says.
All of a sudden, my shy, soft-spoken grandmother, quiet through the dinner, speaks. The loud, aggressive New York voices go silent, and all eyes focus on the small old lady with short yellow hair and still a trace of an Eastern European accent. “Seth, you need a nice girl. Not too pretty. Very smart. Good with people. Social, so you will do things. Sense of humor, because you have a good sense of humor.”
Why does this old woman’s advice command such attention and respect in my family? Well, my eighty-eight-year-old grandmother has seen more than everybody else at the table. She’s observed more marriages, many that worked and many that didn’t. And over the decades, she has cataloged the qualities that make for successful relationships. At that Thanksgiving table, for that question, my grandmother has access to the largest number of data points. My grandmother is Big Data.
In this book, I want to demystify data science. Like it or not, data is playing an increasingly important role in all of our lives—and its role is going to get larger. Newspapers now have full sections devoted to data. Companies have teams with the exclusive task of analyzing their data. Investors give start-ups tens of millions of dollars if they can store more data. Even if you never learn how to run a regression or calculate a confidence interval, you are going to encounter a lot of data—in the pages you read, the business meetings you attend, the gossip you hear next to the watercoolers you drink from.
Many people are anxious over this development. They are intimidated by data, easily lost and confused in a world of numbers. They think that a quantitative understanding of the world is for a select few left-brained prodigies, not for them. As soon as they encounter numbers, they are ready to turn the page, end the meeting, or change the conversation.
But I have spent ten years in the data analysis business and have been fortunate to work with many of the top people in the field. And one of the most important lessons I have learned is this: Good data science is less complicated than people think. The best data science, in fact, is surprisingly intuitive.
What makes data science intuitive? At its core, data science is about spotting patterns and predicting how one variable will affect another. People do this all the time.
Just think how my grandmother gave me relationship advice. She utilized the large database of relationships that her brain has uploaded over a near century of life—in the stories she has heard from her family, her friends, her acquaintances. She limited her analysis to a sample of relationships in which the man had many qualities that I have—a sensitive temperament, a tendency to isolate himself, a sense of humor. She zeroed in on key qualities of the woman—how kind she was, how smart she was, how pretty she was. She correlated these key qualities of the woman with a key quality of the relationship—whether it was a good one. Finally, she reported her results. In other words, she spotted patterns and predicted how one variable will affect another. Grandma is a data scientist.
You are a data scientist, too. When you were a kid, you noticed that when you cried, your mom gave you attention. That is data science. When you reached adulthood, you noticed that if you complain too much, people want to hang out with you less. That is data science, too. When people hang out with you less, you noticed, you are less happy. When you are less happy, you are less friendly. When you are less friendly, people want to hang out with you even less. Data science. Data science. Data science.
Because data science is so natural, the best Big Data studies, I have found, can be understood by just about any smart person. If you can’t understand a study, the problem is probably with the study, not with you.
Want proof that great data science tends to be intuitive? I recently came across a study that may be one of the most important conducted in the past few years. It is also one of the most intuitive studies I’ve ever seen. I want you to think not just about the importance of the study—but how natural and grandma-like it is.
The study was by a team of researchers from Columbia University and Microsoft. The team wanted to find what symptoms predict pancreatic cancer. This disease has a low five-year survival rate—only about 3 percent—but early detection can double a patient’s chances.
The researchers’ method? They utilized data from tens of thousands of anonymous users of Bing, Microsoft’s search engine. They coded a user as having recently been given a diagnosis of pancreatic cancer based on unmistakable searches, such as “just diagnosed with pancreatic cancer” or “I was told I have pancreatic cancer, what to expect.”
Next, the researchers looked at searches for health symptoms. They compared that small number of users who later reported a pancreatic cancer diagnosis with those who didn’t. What symptoms, in other words, predicted that, in a few weeks or months, a user would be reporting a diagnosis?
The results were striking. Searching for back pain and then yellowing skin turned out to be a sign of pancreatic cancer; searching for just back pain alone made it unlikely someone had pancreatic cancer. Similarly, searching for indigestion and then abdominal pain was evidence of pancreatic cancer, while searching for just indigestion without abdominal pain meant a person was unlikely to have it. The researchers could identify 5 to 15 percent of cases with almost no false positives. Now, this may not sound like a great rate; but if you have pancreatic cancer, even a 10 percent chance of possibly doubling your chances of survival would feel like a windfall.
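The key idea here, that one symptom search followed later by another carries more signal than either search alone, can be sketched in a few lines of Python. This is only an illustration of the idea, not the researchers' actual method: the search logs, the symptom labels, and the function name `searched_in_order` are all invented for the example.

```python
from typing import List, Tuple

# Hypothetical search log for one user: (day number, symptom label).
# These entries are invented for illustration; they are not from the study.
SearchLog = List[Tuple[int, str]]

def searched_in_order(log: SearchLog, first: str, second: str) -> bool:
    """Return True if the user searched `first` and, on a later day, `second`."""
    first_days = [day for day, symptom in log if symptom == first]
    if not first_days:
        return False
    earliest = min(first_days)
    return any(symptom == second and day > earliest for day, symptom in log)

user_a = [(1, "back pain"), (9, "yellowing skin")]  # back pain, then yellowing skin
user_b = [(1, "back pain"), (4, "back pain")]       # back pain alone

print(searched_in_order(user_a, "back pain", "yellowing skin"))  # True
print(searched_in_order(user_b, "back pain", "yellowing skin"))  # False
```

A real analysis would then compare how often each flagged pattern shows up among users who later report a diagnosis and among those who do not.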
The paper detailing this study would be difficult for non-experts to fully make sense of. It includes a lot of technical jargon, such as the Kolmogorov-Smirnov test, the meaning of which, I have to admit, I had forgotten. (It’s a way to determine whether a model correctly fits data.)
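For the curious, the core of the test is less forbidding than its name. The sketch below computes the statistic behind one common form of it, the one-sample version, which measures the largest gap between the share of data points at or below each value and the share the model says should be there. It is a minimal illustration under stated assumptions: the sample is made up, the uniform model is a stand-in, the function names are hypothetical, and the paper may well use a different variant of the test.

```python
from typing import Callable, Sequence

def ks_statistic(sample: Sequence[float], model_cdf: Callable[[float], float]) -> float:
    """Largest gap between the sample's empirical CDF and the model's CDF."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    largest_gap = 0.0
    for i, x in enumerate(xs):
        model = model_cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        gap_above = (i + 1) / n - model
        gap_below = model - i / n
        largest_gap = max(largest_gap, gap_above, gap_below)
    return largest_gap

def uniform_cdf(x: float) -> float:
    """CDF of a uniform model on [0, 1] (a stand-in model for the example)."""
    return min(1.0, max(0.0, x))

made_up_sample = [0.1, 0.4, 0.7]  # invented numbers, not real data
print(round(ks_statistic(made_up_sample, uniform_cdf), 3))  # 0.3
```

A small gap suggests the model tracks the data closely; a large one suggests it does not. Turning the gap into a formal verdict takes an extra step that this sketch leaves out.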
However, note how natural and intuitive this remarkable study is at its most fundamental level. The researchers looked at a wide array of medical cases and tried to connect symptoms to a particular illness. You know who else uses this methodology in trying to figure out whether someone has a disease? Husbands and wives, mothers and fathers, and nurses and doctors. Based on experience and knowledge, they try to connect fevers, headaches, runny noses, and stomach pains to various diseases. In other words, the Columbia and Microsoft researchers wrote a groundbreaking study by utilizing the natural, obvious methodology that everybody uses to make health diagnoses.
But wait. Let’s slow down here. If the methodology of the best data science is frequently natural and intuitive, as I claim, this raises a fundamental question about the value of Big Data. If humans are naturally data scientists, if data science is intuitive, why do we need computers and statistical software? Why do we need the Kolmogorov-Smirnov test? Can’t we just use our gut? Can’t we do it like Grandma does, like nurses and doctors do?
This gets to an argument intensified after the release of Malcolm Gladwell’s bestselling book Blink, which extols the magic of people’s gut instincts. Gladwell tells the stories of people who, relying solely on their guts, can tell whether a statue is fake; whether a tennis player will fault before he hits the ball; how much a customer is willing to pay. The heroes in Blink do not run regressions; they do not calculate confidence intervals; they do not run Kolmogorov-Smirnov tests. But they generally make remarkable predictions. Many people have intuitively supported Gladwell’s defense of intuition: they trust their guts and feelings. Fans of Blink might celebrate the wisdom of my grandmother giving relationship advice without the aid of computers. Fans of Blink may be less apt to celebrate my studies or the other studies profiled in this book, which use computers. If Big Data—of the computer type, rather than the grandma type—is a revolution, it has to prove that it’s more powerful than our unaided intuition, which, as Gladwell has pointed out, can often be remarkable.
The Columbia and Microsoft study offers a clear example of rigorous data science and computers teaching us things our gut alone could never find. This is also one case where the size of the dataset matters. Sometimes there is insufficient experience for our unaided gut to draw upon. It is unlikely that you, or your close friends or family members, have seen enough cases of pancreatic cancer to tease out the difference between indigestion followed by abdominal pain and indigestion alone. Indeed, it is inevitable, as the Bing dataset gets bigger, that the researchers will pick up many more subtle patterns in the timing of symptoms, for this and other illnesses, that even doctors might miss.
Moreover, while our gut may usually give us a good general sense of how the world works, it is frequently not precise. We need data to sharpen the picture. Consider, for example, the effects of weather on mood. You would probably guess that people are more likely to feel gloomy on a 10-degree day than on a 70-degree day. Indeed, this is correct. But you might not guess how big an impact this temperature difference can make. I looked for correlations between an area’s Google searches for depression and a wide range of factors, including economic conditions, education levels, and church attendance. Winter climate swamped all the rest. In winter months, warm climates, such as that of Honolulu, Hawaii, have 40 percent fewer depression searches than cold climates, such as that of Chicago, Illinois. Just how significant is this effect? An optimistic read of the effectiveness of antidepressants would find that the most effective drugs decrease the incidence of depression by only about 20 percent. To judge from the Google numbers, a Chicago-to-Honolulu move would be at least twice as effective as medication for your winter blues.*
Sometimes our gut, when not guided by careful computer analysis, can be dead wrong. We can get blinded by our own experiences and prejudices. Indeed, even though my grandmother is able to utilize her decades of experience to give better relationship advice than the rest of my family, she still has some dubious views on what makes a relationship last. For example, she has frequently emphasized to me the importance of having common friends. She believes that this was a key factor in her marriage’s success: she spent most warm evenings with her husband, my grandfather, in their small backyard in Queens, New York, sitting on lawn chairs and gossiping with their tight group of neighbors.
However, at the risk of throwing my own grandmother under the bus, data science suggests that Grandma’s theory is wrong. A team of computer scientists recently analyzed the biggest dataset ever assembled on human relationships—Facebook. They looked at a large number of couples who were, at some point, “in a relationship.” Some of these couples stayed “in a relationship.” Others switched their status to “single.” Having a common core group of friends, the researchers found, is a strong predictor that a relationship will not last. Perhaps hanging out every night with your partner and the same small group of people is not such a good thing; separate social circles may help make relationships stronger.
As you can see, our intuition alone, when we stay away from the computers and go with our gut, can sometimes amaze. But it can make big mistakes. Grandma may have fallen into one cognitive trap: we tend to exaggerate the relevance of our own experience. In the parlance of data scientists, we weight our data, and we give far too much weight to one particular data point: ourselves.
Grandma was so focused on her evening schmoozes with Grandpa and their friends that she did not think enough about other couples. She forgot to fully consider her brother-in-law and his wife, who chitchatted most nights with a small, consistent group of friends but fought frequently and divorced. She forgot to fully consider my parents, her daughter and son-in-law. My parents go their separate ways many nights—my dad to a jazz club or ball game with his friends, my mom to a restaurant or the theater with her friends; yet they remain happily married.
When relying on our gut, we can also be thrown off by the basic human fascination with the dramatic. We tend to overestimate the prevalence of anything that makes for a memorable story. For example, when asked in a survey, people consistently rank tornadoes as a more common cause of death than asthma. In fact, asthma causes about seventy times more deaths. Deaths by asthma don’t stand out—and don’t make the news. Deaths by tornadoes do.
We are often wrong, in other words, about how the world works when we rely just on what we hear or personally experience. While the methodology of good data science is often intuitive, the results are frequently counterintuitive. Data science takes a natural and intuitive human process—spotting patterns and making sense of them—and injects it with steroids, potentially showing us that the world works in a completely different way from how we thought it did. That’s what happened when I studied the predictors of basketball success.
When I was a little boy, I had one dream and one dream only: I wanted to grow up to be an economist and data scientist. No. I’m just kidding. I wanted desperately to be a professional basketball player, to follow in the footsteps of my hero, Patrick Ewing, all-star center for the New York Knicks.
I sometimes suspect that inside every data scientist is a kid trying to figure out why his childhood dreams didn’t come true. So it is not surprising that I recently investigated what it takes to make the NBA. The results of the investigation were surprising. In fact, they demonstrate once again how good data science can change your view of the world, and how counterintuitive the numbers can be.
The particular question I looked at is this: are you more likely to make it in the NBA if you grow up poor or middle-class?
Most people would guess the former. Conventional wisdom says that growing up in difficult circumstances, perhaps in the projects with a single, teenage mom, helps foster the drive necessary to reach the top levels of this intensely competitive sport.
This view was expressed by William Ellerbee, a high school basketball coach in Philadelphia, in an interview with Sports Illustrated. “Suburban kids tend to play for the fun of it,” Ellerbee said. “Inner-city kids look at basketball as a matter of life or death.” I, alas, was raised by married parents in the New Jersey suburbs. LeBron James, the best player of my generation, was born poor to a sixteen-year-old single mother in Akron, Ohio.
Indeed, an internet survey I conducted suggested that the majority of Americans think the same thing Coach Ellerbee and I thought: that most NBA players grow up in poverty.
Is this conventional wisdom correct?
Let’s look at the data. There is no comprehensive data source on the socioeconomics of NBA players. But by being data detectives, by utilizing data from a whole bunch of sources—basketball-reference.com, ancestry.com, the U.S. Census, and others—we can figure out what family background is actually most conducive to making the NBA. This study, you will note, uses a variety of data sources, some of them bigger, some of them smaller, some of them online, and some of them offline. As exciting as some of the new digital sources are, a good data scientist is not above consulting old-fashioned sources if they can help. The best way to get the right answer to a question is to combine all available data.
The first relevant data is the birthplace of every player. For every county in the United States, I recorded how many black and white men were born in the 1980s. I then recorded how many of them reached the NBA. I compared this to a county’s average household income. I also controlled for the racial demographics of a county, since—and this is a subject for a whole other book—black men are about forty times more likely than white men to reach the NBA.
The data tells us that a man has a substantially better chance of reaching the NBA if he was born in a wealthy county. A black kid born in one of the wealthiest counties in the United States, for example, is more than twice as likely to make the NBA as a black kid born in one of the poorest counties. For a white kid, the advantage of being born in one of the wealthiest counties compared to being born in one of the poorest is 60 percent.
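A comparison like this boils down to a ratio of two rates: NBA players per birth in wealthy counties, divided by NBA players per birth in poor ones. Here is a minimal Python sketch of that arithmetic. The counts are invented purely to show the calculation and are not the figures behind the results above; the function names are hypothetical.

```python
def reach_rate(nba_players: int, births: int) -> float:
    """Share of men born in a group of counties who reached the NBA."""
    if births <= 0:
        raise ValueError("births must be positive")
    return nba_players / births

def rate_ratio(rich_rate: float, poor_rate: float) -> float:
    """How many times more likely the wealthy-county group is to reach the NBA."""
    return rich_rate / poor_rate

# Invented counts, for illustration only.
rich = reach_rate(nba_players=10, births=200_000)
poor = reach_rate(nba_players=4, births=200_000)

ratio = rate_ratio(rich, poor)
print(f"{ratio:.1f} times as likely")              # 2.5 times as likely
print(f"{(ratio - 1) * 100:.0f} percent advantage")  # 150 percent advantage
```

Read this way, a 60 percent advantage corresponds to a ratio of 1.6, and "more than twice as likely" to a ratio above 2.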
This suggests, contrary to conventional wisdom, that poor men are actually underrepresented in the NBA. However, this data is not perfect, since many wealthy counties in the United States, such as New York County (Manhattan), also include poor neighborhoods, such as Harlem. So it’s still possible that a difficult childhood helps you make the NBA. We still need more clues, more data.
So I investigated the family backgrounds of NBA players. This information was found in news stories and on social networks. This methodology was quite time-consuming, so I limited the analysis to the one hundred African-American NBA players born in the 1980s who scored the most points. Compared to the average black man in the United States, NBA superstars were about 30 percent less likely to have been born to a teenage mother or an unwed mother. In other words, the family backgrounds of the best black NBA players also suggest that a comfortable background is a big advantage for achieving success.
That said, neither the county-level birth data nor the family background of a limited sample of players gives perfect information on the childhoods of all NBA players. So I was still not entirely convinced that two-parent, middle-class families produce more NBA stars than single-parent, poor families. The more data we can throw at this question, the better.
Then I remembered one more data point that can provide telling clues to a man’s background. It was suggested in a paper by two economists, Roland Fryer and Steven Levitt, that a black person’s first name is an indication of his socioeconomic background. Fryer and Levitt studied birth certificates in California in the 1980s and found that, among African-Americans, poor, uneducated, and single moms tend to give their kids different names than do middle-class, educated, and married parents.
Kids from better-off backgrounds are more likely to be given common names, such as Kevin, Chris, and John. Kids from difficult homes in the projects are more likely to be given unique names, such as Knowshon, Uneek, and Breionshay. African-American kids born into poverty are nearly twice as likely to have a name that is given to no other child born in that same year.
So what about the first names of black NBA players? Do they sound more like middle-class or poor blacks? Looking at the same time period, California-born NBA players were half as likely to have unique names as the average black male, a statistically significant difference.
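The measure used here, a name given to no other child born in the same year, is easy to compute once you have a list of births. The Python sketch below shows one way to do it. The tiny name list is invented for illustration (borrowing names mentioned above), and the function name `unique_name_share` is hypothetical; it is not the economists' code.

```python
from collections import Counter
from typing import Iterable

def unique_name_share(group: Iterable[str], all_births: Iterable[str]) -> float:
    """Share of `group` whose first name appears exactly once in `all_births`.

    Assumes every member of `group` is also listed in `all_births`.
    """
    counts = Counter(all_births)
    members = list(group)
    if not members:
        return 0.0
    unique = sum(1 for name in members if counts[name] == 1)
    return unique / len(members)

# Invented birth list for a single year, for illustration only.
births = ["Kevin", "Kevin", "Chris", "Chris", "John", "John", "Uneek", "Knowshon"]

print(unique_name_share(["Kevin", "Uneek"], births))  # 0.5
print(unique_name_share(["Kevin", "Chris"], births))  # 0.0
```

Comparing this share for NBA players with the share for all black males born in the same place and period is the kind of comparison described above.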
Know someone who thinks the NBA is a league for kids from the ghetto? Tell him to just listen closely to the next game on the radio. Tell him to note how frequently Russell dribbles past Dwight and then tries to slip the ball past the outstretched arms of Josh and into the waiting hands of Kevin. If the NBA really were a league filled with poor black men, it would sound quite different. There would be a lot more men with names like LeBron.
Now, we have gathered three different pieces of evidence—the county of birth, the marital status of the mothers of the top scorers, and the first names of players. No source is perfect. But all three support the same story. Better socioeconomic status means a higher chance of making the NBA. The conventional wisdom, in other words, is wrong.
Among all African-Americans born in the 1980s, about 60 percent had unmarried parents. But I estimate that among African-Americans born in that decade who reached the NBA, a significant majority had married parents. In other words, the NBA is not composed primarily of men with backgrounds like that of LeBron James. There are more men like Chris Bosh, raised by two parents in Texas who cultivated his interest in electronic gadgets, or Chris Paul, the second son of middle-class parents in Lewisville, North Carolina, whose family joined him on an episode of Family Feud in 2011.
The goal of a data scientist is to understand the world. Once we find the counterintuitive result, we can use more data science to help us explain why the world is not as it seems. Why, for example, do middle-class men have an edge in basketball relative to poor men? There are at least two explanations.
First, poor men tend to end up shorter. Scholars have long known that childhood health care and nutrition play a large role in adult health. This is why the average man in the developed world is now four inches taller than a century and a half ago. Data suggests that Americans from poor backgrounds, due to weaker early-life health care and nutrition, are shorter.
Data can also tell us the effect of height on reaching the NBA. You undoubtedly intuited that being tall can be of assistance to an aspiring basketball player. Just contrast the height of the typical ballplayer on the court to the typical fan in the stands. (The average NBA player is 6’7”; the average American man is 5’9”.)
How much does height matter? NBA players sometimes fib a little about their height, and there is no listing of the complete height distribution of American males. But working with a rough mathematical estimate of what this distribution might look like and the NBA’s own numbers, it is easy to confirm that the effects of height are enormous—maybe even more than we might have suspected. I estimate that each additional inch roughly doubles your odds of making it to the NBA. And this is true throughout the height distribution. A 5’11” man has twice the odds of reaching the NBA as a 5’10” man. A 6’11” man has twice the odds of reaching the NBA as a 6’10” man. It appears that, among men less than six feet tall, only about one in two million reach the NBA. Among those over seven feet tall, I and others have estimated, something like one in five reach the NBA.
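Doubling compounds quickly, which is why the gap between short and tall men is so vast. The Python sketch below works through the arithmetic. It takes the rough estimate above (odds double with each inch, at every height) at face value, so the outputs are only as good as that approximation; the function names are hypothetical.

```python
def to_inches(feet: int, inches: int) -> int:
    """Convert a height such as 5'10" to total inches."""
    return feet * 12 + inches

def odds_multiplier(taller: int, shorter: int, factor: float = 2.0) -> float:
    """How many times greater the taller man's odds are.

    Assumes the odds multiply by `factor` for each extra inch.
    """
    return factor ** (taller - shorter)

print(odds_multiplier(to_inches(5, 11), to_inches(5, 10)))  # 2.0
print(odds_multiplier(to_inches(6, 11), to_inches(5, 10)))  # 8192.0
```

Under that assumption, thirteen extra inches multiply a man's odds by 2 to the 13th power, or more than eight thousand.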
Data, you will note, clarifies why my dream of basketball stardom was derailed. It was not because I was brought up in the suburbs. It was because I am 5’9” and white (not to mention slow). Also, I am lazy. And I have poor stamina, awful shooting form, and occasionally a panic attack when the ball gets in my hand.
A second reason that boys from tough backgrounds may struggle to make the NBA is that they sometimes lack certain social skills. Using data on thousands of schoolchildren, economists have found that middle-class, two-parent families are on average substantially better at raising kids who are trusting, disciplined, persistent, focused, and organized.
So how do poor social skills derail an otherwise promising basketball career?
Let’s look at the story of Doug Wrenn, one of the most talented basketball prospects in the 1990s. His college coach, Jim Calhoun at the University of Connecticut, who has trained future NBA all-stars, claimed Wrenn jumped the highest of any man he had ever worked with. But Wrenn had a challenging upbringing. He was raised by a single mother in Blood Alley, one of the roughest neighborhoods in Seattle. In Connecticut, he consistently clashed with those around him. He would taunt players, question coaches, and wear loose-fitting clothes in violation of team rules. He also had legal troubles—he stole shoes from a store and snapped at police officers. Calhoun finally had enough and kicked him off the team.
Wrenn got a second chance at the University of Washington. But there, too, an inability to get along with people derailed him. He fought with his coach over playing time and shot selection and was kicked off this team as well. Wrenn went undrafted by the NBA, bounced around lower leagues, moved in with his mother, and was eventually imprisoned for assault. “My career is over,” Wrenn told the Seattle Times in 2009. “My dreams, my aspirations are over. Doug Wrenn is dead. That basketball player, that dude is dead. It’s over.” Wrenn had the talent not just to be an NBA player, but to be a great, even a legendary player. But he never developed the temperament to even stay on a college team. Perhaps if he’d had a stable early life, he could have been the next Michael Jordan.
Michael Jordan, of course, also had an impressive vertical leap. Plus a large ego and intense competitiveness, a personality that at times was not unlike Wrenn’s. Jordan could be a difficult kid. At the age of twelve, he was kicked out of school for fighting. But he had at least one thing that Wrenn lacked: a stable, middle-class upbringing. His father was an equipment supervisor for General Electric, his mother a banker. And they helped him navigate his career.
In fact, Jordan’s life is filled with stories of his family guiding him away from the traps that a great, competitive talent can fall into. After Jordan was kicked out of school, his mother responded by taking him with her to work. He was not allowed to leave the car and instead had to sit there in the parking lot reading books. After he was drafted by the Chicago Bulls, his parents and siblings took turns visiting him to make sure he avoided the temptations that come with fame and money.
Jordan’s career did not end like Wrenn’s, with a little-read quote in the Seattle Times. It ended with a speech upon induction into the Basketball Hall of Fame that was watched by millions of people. In his speech, Jordan said he tried to stay “focused on the good things about life—you know how people perceive you, how you respect them . . . how you are perceived publicly. Take a pause and think about the things that you do. And that all came from my parents.”
The data tells us Jordan is absolutely right to thank his middle-class, married parents. The data tells us that in worse-off families, in worse-off communities, there are NBA-level talents who are not in the NBA. These men had the genes, had the ambition, but never developed the temperament to become basketball superstars.
And no—whatever we might intuit—being in circumstances so desperate that basketball seems “a matter of life or death” does not help. Stories like that of Doug Wrenn can help illustrate this. And data proves it.
In June 2013, LeBron James was interviewed on television after winning his second NBA championship. (He has since won a third.) “I’m LeBron James,” he announced. “From Akron, Ohio. From the inner city. I am not even supposed to be here.” Twitter and other social networks erupted with criticism. How could such a supremely gifted person, identified from an absurdly young age as the future of basketball, claim to be an underdog? In fact, anyone from a difficult environment, no matter his athletic prowess, has the odds stacked against him. James’s accomplishments, in other words, are even more exceptional than they appear to be at first. Data proves that, too.