My brother, Noah, is four years younger than I. Most people, upon first meeting us, find us eerily similar. We both talk too loudly, are balding in the same way, and have great difficulty keeping our apartments tidy.
But there are differences: I count pennies. Noah buys the best. I love Leonard Cohen and Bob Dylan. For Noah, it’s Cake and Beck.
Perhaps the most notable difference between us is our attitude toward baseball. I am obsessed with baseball and, in particular, my love of the New York Mets has always been a core part of my identity. Noah finds baseball impossibly boring, and his hatred of the sport has long been a core part of his identity.
How can two guys with such similar genes, raised by the same parents, in the same town, have such opposite feelings about baseball? What determines the adults we become? More fundamentally, what’s wrong with Noah? There’s a growing field within developmental psychology that mines massive adult databases and correlates them with key childhood events. It can help us tackle this and related questions. We might call this increasing use of Big Data to answer psychological questions Big Psych.
To see how this works, let’s consider a study I conducted on how childhood experiences influence which baseball team you support—or whether you support any team at all. For this study, I used Facebook data on “likes” of baseball teams. (In the previous chapter I noted that Facebook data can be deeply misleading on sensitive topics. With this study, I am assuming that nobody, not even a Phillies fan, is embarrassed to acknowledge a rooting interest in a particular team on Facebook.)
To begin with, I downloaded the number of males of every age who “like” each of New York’s two baseball teams. Here is the percentage who are Mets fans, by year of birth.
The higher the point, the more Mets fans. The popularity of the team rises and falls, then rises and falls again, with the Mets being very popular among those born in 1962 and 1978. I’m guessing baseball fans might have an idea as to what’s going on here. The Mets have won just two World Series: in 1969 and 1986. These men were roughly seven to eight years old when the Mets won. Thus a huge predictor of Mets fandom, for boys at least, is whether the Mets won a World Series when they were around the age of seven or eight.
In fact, we can extend this analysis. I downloaded information on Facebook showing how many fans of every age “like” every one of a comprehensive selection of Major League Baseball teams.
I found that there is also an unusually high number of male Baltimore Orioles fans born in 1962 and of male Pittsburgh Pirates fans born in 1963. Those men were eight-year-old boys when these teams were champions. Indeed, calculating the age of peak fandom for all the teams I studied, then figuring out how old these fans would have been, gave me this chart:
Once again we see that the most important year in a man’s life, for the purposes of cementing his favorite baseball team as an adult, is when he is more or less eight years old. Overall, five to fifteen is the key period to win over a boy. Winning when a man is nineteen or twenty is about one-eighth as important in determining who he will root for as winning when he is eight. By then, he will already either love a team for life or he won’t.
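The core of this calculation is simple enough to sketch in a few lines of Python. The fan shares below are made-up numbers (the real inputs were Facebook “like” counts by birth year), and `peak_fandom_age` is a hypothetical helper, but the logic is the same: find the birth year where fandom peaks, then subtract it from the championship year.

```python
# Sketch of the peak-fandom calculation, with made-up numbers.
# Real input would be Facebook "like" counts by birth year.

def peak_fandom_age(fan_share_by_birth_year, championship_year):
    """Return the age fans were in the championship year,
    based on the birth year where fandom peaks."""
    peak_birth_year = max(fan_share_by_birth_year,
                          key=fan_share_by_birth_year.get)
    return championship_year - peak_birth_year

# Hypothetical Mets fan shares among males, by birth year
mets_share = {1960: 0.38, 1962: 0.52, 1970: 0.35, 1978: 0.55, 1985: 0.33}

print(peak_fandom_age(mets_share, 1986))  # peak birth year 1978 -> prints 8
```

Repeating this for every team and championship year, then averaging, yields the peak age the chart shows.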
You might be asking, what about women baseball fans? The patterns are much less sharp, but the peak age appears to be twenty-two years old.
This is my favorite study. It relates to two of my most beloved topics: baseball and the sources of my adult discontent. I was firmly hooked in 1986 and have been suffering along—rooting for the Mets—ever since. Noah had the good sense to be born four years later and was spared this pain.
Now, baseball is not the most important topic in the world, or so my Ph.D. advisors repeatedly told me. But this methodology might help us tackle similar questions, including how people develop their political preferences, sexual proclivities, musical taste, and financial habits. (I would be particularly interested in the origins of my brother’s wacky ideas on the latter two subjects.) My prediction is that we will find that many of our adult behaviors and interests, even those that we consider fundamental to who we are, can be explained by the arbitrary facts of when we were born and what was going on in certain key years while we were young.
Indeed, some work has already been done on the origin of political preferences. Yair Ghitza, chief scientist at Catalist, a data analysis company, and Andrew Gelman, a political scientist and statistician at Columbia University, tried to test the conventional idea that most people start out liberal and become increasingly conservative as they age. This is the view expressed in a famous quote often attributed to Winston Churchill: “Any man who is under 30, and is not a liberal, has no heart; and any man who is over 30, and is not a conservative, has no brains.”
Ghitza and Gelman pored over sixty years of survey data, taking advantage of more than 300,000 observations on voting preferences. They found, contrary to Churchill’s claim, that teenagers sometimes tilt liberal and sometimes tilt conservative. As do the middle-aged and the elderly.
These researchers discovered that political views actually form in a way not dissimilar to the way our sports team preferences do. There is a crucial period that imprints on people for life. Between the key ages of fourteen and twenty-four, many Americans form their views based on the popularity of the current president. A popular Republican or unpopular Democrat will influence many young adults to become Republicans. An unpopular Republican or popular Democrat puts this impressionable group in the Democratic column.
And these views, in these key years, will, on average, last a lifetime.
To see how this works, compare Americans born in 1941 and those born a decade later.
Those in the first group came of age during the presidency of Dwight D. Eisenhower, a popular Republican. In the early 1960s, despite being under thirty, this generation strongly tilted toward the Republican Party. And members of this generation have consistently tilted Republican as they have aged.
Americans born ten years later—baby boomers—came of age during the presidencies of John F. Kennedy, an extremely popular Democrat; Lyndon B. Johnson, an initially popular Democrat; and Richard M. Nixon, a Republican who eventually resigned in disgrace. Members of this generation have tilted liberal their entire lives.
With all this data, the researchers were able to determine the single most important year for developing political views: age eighteen.
And they found that these imprint effects are substantial. Their model estimates that the Eisenhower experience resulted in about a 10 percentage point lifetime boost for Republicans among Americans born in 1941. The Kennedy, Johnson, and Nixon experience gave Democrats a 7 percentage point advantage among Americans born in 1952.
I’ve made it clear that I am skeptical of survey data, but I am impressed with the large number of responses examined here. In fact, this study could not have been done with one small survey. The researchers needed the hundreds of thousands of observations, aggregated from many surveys, to see how preferences change as people age.
Data size was also crucial for my baseball study. I needed to zoom in not only on fans of each team but on people of every age. Millions of observations are required to do this and Facebook and other digital sources routinely offer such numbers.
This is where the bigness of Big Data really comes into play. You need a lot of pixels in a photo in order to be able to zoom in with clarity on one small portion of it. Similarly, you need a lot of observations in a dataset in order to be able to zoom in with clarity on one small subset of that data—for example, how popular the Mets are among men born in 1978. A small survey of a couple of thousand people won’t have a large enough sample of such men.
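A rough back-of-the-envelope calculation shows why. Assuming about eighty birth-year cohorts alive and a roughly even sex split (both loose assumptions), a 2,000-person survey would contain only about a dozen men born in 1978, far too few to estimate their team loyalties with any confidence.

```python
# Back-of-the-envelope: how many men born in 1978 would a 2,000-person
# survey contain? Assumes ~80 birth-year cohorts alive and a 50/50 sex
# split -- rough assumptions, for illustration only.
survey_size = 2000
share_men_born_1978 = 0.5 * (1 / 80)
expected = survey_size * share_men_born_1978
print(expected)  # prints 12.5
```

With Facebook-scale data, that same subgroup contains tens of thousands of people, which is what makes the zoom-in possible.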
This is the third power of Big Data: Big Data allows us to meaningfully zoom in on small segments of a dataset to gain new insights on who we are. And we can zoom in on other dimensions besides age. If we have enough data, we can see how people in particular towns and cities behave. And we can see how people carry on hour-by-hour or even minute-by-minute.
In this chapter, human behavior gets its close-up.
In hindsight it’s surprising. But when Raj Chetty, then a professor at Harvard, and a small research team first got a hold of a rather large dataset—all Americans’ tax records since 1996—they were not certain anything would come of it. The IRS had handed over the data because they thought the researchers might be able to use it to help clarify the effects of tax policy.
The initial attempts Chetty and his team made to use this Big Data led, in fact, to numerous dead ends. Their investigations of the consequences of state and federal tax policies reached mostly the same conclusions everybody else had just by using surveys. Perhaps Chetty’s answers, using the hundreds of millions of IRS data points, were a bit more precise. But getting the same answers as everybody else, with a little more precision, is not a major social science accomplishment. It is not the type of work that top journals are eager to publish.
Plus, organizing and analyzing all the IRS data was time-consuming. Chetty and his team—drowning in data—were taking more time than everybody else to find the same answers as everybody else.
It was beginning to look like the Big Data skeptics were right. You didn’t need data for hundreds of millions of Americans to understand tax policy; a survey of ten thousand people was plenty. Chetty and his team were understandably discouraged.
And then, finally, the researchers realized their mistake. “Big Data is not just about doing the same thing you would have done with surveys except with more data,” Chetty explains. They were asking little data questions of the massive collection of data they had been handed. “Big Data really should allow you to use completely different designs than what you would have with a survey,” Chetty adds. “You can, for example, zoom in on geographies.”
In other words, with data on hundreds of millions of people, Chetty and his team could spot patterns among cities, towns, and neighborhoods, large and small.
As a graduate student at Harvard, I was in a seminar room when Chetty presented his initial results using the tax records of every American. Social scientists refer in their work to observations—how many data points they have. If a social scientist is working with a survey of eight hundred people, he would say, “We have eight hundred observations.” If he is working with a laboratory experiment with seventy people, he would say, “We have seventy observations.”
“We have one-point-two billion observations,” Chetty said, straight-faced. The audience giggled nervously.
Chetty and his coauthors began, in that seminar room and then in a series of papers, to give us important new insights into how America works.
Consider this question: is America a land of opportunity? Do you have a shot, if your parents are not rich, to become rich yourself?
The traditional way to answer this question is to look at a representative sample of Americans and compare this to similar data from other countries.
Here is the data for a variety of countries on equality of opportunity. The question asked: what is the chance that a person with parents in the bottom 20 percent of the income distribution reaches the top 20 percent of the income distribution?
CHANCES A PERSON WITH POOR PARENTS WILL BECOME RICH (SELECTED COUNTRIES)
As you can see, America does not score well.
But this simple analysis misses the real story. Chetty’s team zoomed in on geography. They found the odds differ a huge amount depending on where in the United States you were born.
CHANCES A PERSON WITH POOR PARENTS WILL BECOME RICH (SELECTED PARTS OF THE UNITED STATES)
In some parts of the United States, the chance of a poor kid succeeding is as high as in any developed country in the world. In other parts of the United States, the chance of a poor kid succeeding is lower than in any developed country in the world.
These patterns would never be seen in a small survey, which might only include a few people in Charlotte and San Jose, and which therefore would prevent you from zooming in like this.
In fact, Chetty’s team could zoom in even further. Because they had so much data—data on every single American—they could even zoom in on the small groups of people who moved from city to city to see how that might have affected their prospects: those who moved from New York City to Los Angeles, Milwaukee to Atlanta, San Jose to Charlotte. This allowed them to test for causation, not just correlation (a distinction I’ll discuss in the next chapter). And, yes, moving to the right city in one’s formative years made a significant difference.
So is America a “land of opportunity”?
The answer is neither yes nor no. The answer is: some parts are, and some parts aren’t.
As the authors write, “The U.S. is better described as a collection of societies, some of which are ‘lands of opportunity’ with high rates of mobility across generations, and others in which few children escape poverty.”
So what is it about parts of the United States where there is high income mobility? What makes some places better at leveling the playing field, at allowing a poor kid to have a pretty good life? Areas that spend more on education provide a better chance to poor kids. Places with more religious people and lower crime do better. Places with more black people do worse. Interestingly, this effect holds not just for the black kids but for the white kids living there as well. Places with lots of single mothers do worse. This effect too holds not just for kids of single mothers but for kids of married parents living in places with lots of single mothers. Some of these results suggest that a poor kid’s peers matter. If his friends have a difficult background and little opportunity, he may struggle more to escape poverty.
The data tells us that some parts of America are better at giving kids a chance to escape poverty. So what places are best at giving people a chance to escape the grim reaper?
We like to think of death as the great equalizer. Nobody, after all, can avoid it. Not the pauper nor the king, the homeless man nor Mark Zuckerberg. Everybody dies.
But if the wealthy can’t avoid death, data tells us that they can now delay it. American women in the top 1 percent of income live, on average, ten years longer than American women in the bottom 1 percent of income. For men, the gap is fifteen years.
How do these patterns vary in different parts of the United States? Does your life expectancy vary based on where you live? Is this variation different for rich and poor people? Again, by zooming in on geography, Raj Chetty’s team found the answers.
Interestingly, for the wealthiest Americans, life expectancy is hardly affected by where they live. If you have plenty of money, you can expect to make it to roughly age eighty-nine as a woman and about eighty-seven as a man. Rich people everywhere tend to develop healthier habits—on average, they exercise more, eat better, smoke less, and are less likely to suffer from obesity. Rich people can afford the treadmill, the organic avocados, the yoga classes. And they can buy these things in any corner of the United States.
For the poor, the story is different. For the poorest Americans, life expectancy varies tremendously depending on where they live. In fact, living in the right place can add five years to a poor person’s life expectancy.
So why do some places seem to allow the impoverished to live so much longer? What attributes do cities where poor people live the longest share?
Here are four attributes of a city—three of them do not correlate with poor people’s life expectancy, and one of them does. See if you can guess which one matters.
WHAT MAKES POOR PEOPLE IN A CITY LIVE MUCH LONGER?
The city has a high level of religiosity.
The city has low levels of pollution.
The city has a higher percentage of residents covered by health insurance.
A lot of rich people live in the city.
The first three—religion, environment, and health insurance—do not correlate with longer life spans for the poor. The variable that does matter, according to Chetty and the others who worked on this study? How many rich people live in a city. More rich people in a city means the poor there live longer. Poor people in New York City, for example, live a lot longer than poor people in Detroit.
Why is the presence of rich people such a powerful predictor of poor people’s life expectancy? One hypothesis—and this is speculative—was put forth by David Cutler, one of the authors of the study and one of my advisors. Contagious behavior may be driving some of this.
There is a large amount of research showing that habits are contagious. So poor people living near rich people may pick up a lot of their habits. Some of these habits—say, pretentious vocabulary—aren’t likely to affect one’s health. Others—working out—will definitely have a positive impact. Indeed, poor people living near rich people exercise more, smoke less, and are less likely to suffer from obesity.
My personal favorite study by Raj Chetty’s team, which had access to that massive collection of IRS data, was their inquiry into why some people cheat on their taxes while others do not. Explaining this study is a bit more complicated.
The key is knowing that there is an easy way for self-employed people with one child to maximize the money they receive from the government. If you report that you had taxable income of exactly $9,000 in a given year, the government will write you a check for $1,377—that amount represents the Earned Income Tax Credit, a grant to supplement the earnings of the working poor, minus your payroll taxes. Report any more than that, and your payroll taxes will go up. Report any less than that, and the Earned Income Tax Credit drops. A taxable income of $9,000 is the sweet spot.
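The shape of this incentive can be sketched with a toy model. The phase-in rate and payroll-tax rate below are illustrative stand-ins, not the actual statutory parameters, but they capture the structure just described: the credit grows with reported income up to a threshold, then flattens, while payroll taxes keep rising, so the net check peaks exactly at the end of the phase-in region.

```python
# Toy model of the EITC incentive. The rates here are made up for
# illustration; the real schedule differs, but the shape is the same.

PHASE_IN_RATE = 0.34      # credit grows 34 cents per dollar (hypothetical)
PHASE_IN_END = 9000       # income where the credit stops growing (per the text)
PLATEAU_CREDIT = PHASE_IN_RATE * PHASE_IN_END
PAYROLL_TAX_RATE = 0.153  # approximate self-employment tax rate

def net_check(income):
    """Credit received minus payroll taxes owed, as a function of
    reported taxable income."""
    credit = min(income * PHASE_IN_RATE, PLATEAU_CREDIT)
    return credit - income * PAYROLL_TAX_RATE

# The net payment rises up to the phase-in end, then falls as payroll
# taxes keep growing while the credit stays flat.
best = max(range(0, 20001, 500), key=net_check)
print(best)  # prints 9000
```

Below $9,000 every extra reported dollar nets 34 cents of credit against 15 cents of tax; above it, the credit is flat and each dollar only adds tax. Hence the sweet spot.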
And, wouldn’t you know it, $9,000 is the most common taxable income reported by self-employed people with one child.
Did these Americans adjust their work schedules to make sure they earned the perfect income? Nope. When these workers were randomly audited—a very rare occurrence—it was almost always found that they made nowhere near $9,000—they earned either substantially less or substantially more.
In other words, they cheated on their taxes by pretending they made the amount that would give them the fattest check from the government.
So how typical was this type of tax fraud and who among the self-employed with one child was most likely to commit it? It turns out, Chetty and colleagues reported, that there were huge differences across the United States in how common this type of cheating was. In Miami, among people in this category, an astonishing 30 percent reported they made $9,000. In Philadelphia, just 2 percent did.
What predicts who is going to cheat? What distinguishes places with many cheaters from places with few? Correlating rates of cheating with other city-level demographics turns up two strong predictors: a high concentration of people in the area qualifying for the Earned Income Tax Credit and a high concentration of tax professionals in the neighborhood.
What do these factors indicate? Chetty and the authors had an explanation. The key motivator for cheating on your taxes in this manner was information.
Most self-employed one-kid taxpayers simply did not know that the magic number for getting a big fat check from the government was $9,000. But living near others who might know—either their neighbors or tax professionals—dramatically increased the odds that they would learn about it.
In fact, Chetty’s team found even more evidence that knowledge drove this kind of cheating. When Americans moved from an area where this variety of tax fraud was low to an area where it was high, they learned and adopted the trick. Through time, cheating spread from region to region throughout the United States. Like a virus, cheating on taxes is contagious.
Now stop for a moment and think about how revealing this study is. It demonstrated that, when it comes to figuring out who will cheat on their taxes, the key isn’t determining who is honest and who is dishonest. It is determining who knows how to cheat and who doesn’t.
So when someone tells you they would never cheat on their taxes, there’s a pretty good chance that they are—you guessed it—lying. Chetty’s research suggests that many would if they knew how.
If you want to cheat on your taxes (and I am not recommending this), you should live near tax professionals or live near tax cheaters who can show you the way. If you want to have kids who are world-famous, where should you live? This ability to zoom in on data and get really granular can help answer this question, too.
I was curious where the most successful Americans come from, so one day I decided to download Wikipedia. (You can do that sort of thing nowadays.)
With a little coding, I had a dataset of more than 150,000 Americans deemed by Wikipedia’s editors to be notable enough to warrant an entry. The dataset included county of birth, date of birth, occupation, and gender. I merged it with county-level birth data gathered by the National Center for Health Statistics. For every county in the United States, I calculated the odds of making it into Wikipedia if you were born there.
Is being profiled in Wikipedia a meaningful marker of notable achievement? There are certainly limitations. Wikipedia’s editors skew young and male, which may bias the sample. And some types of notability are not particularly worthy. Ted Bundy, for example, rates a Wikipedia entry because he killed dozens of young women. That said, I was able to remove criminals without affecting the results much.
I limited the study to baby boomers (those born between 1946 and 1964) because they have had nearly a full lifetime to become notable. Roughly one in 2,058 American-born baby boomers were deemed notable enough to warrant a Wikipedia entry. About 30 percent made it through achievements in art or entertainment, 29 percent through sports, 9 percent via politics, and 3 percent in academia or science.
The first striking fact I noticed in the data was the enormous geographic variation in the likelihood of becoming a big success, at least on Wikipedia’s terms. Your chances of achieving notability were highly dependent on where you were born.
Roughly one in 1,209 baby boomers born in California reached Wikipedia. Only one in 4,496 baby boomers born in West Virginia did. Zoom in by county and the results become more telling. Roughly one in 748 baby boomers born in Suffolk County, Massachusetts, where Boston is located, made it to Wikipedia. In some other counties, the success rate was twenty times lower.
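The county-level odds are just the ratio of births to Wikipedia entries. Here is a minimal sketch; the notable counts and birth totals below are hypothetical placeholders (the real inputs were Wikipedia birthplace data merged with NCHS county birth records), chosen only so the ratios match the figures reported in the text.

```python
# Sketch of the per-county odds calculation. Counts are hypothetical,
# picked so the ratios reproduce the "one in N" figures in the text.

notable_counts = {"Suffolk, MA": 420, "Macon, AL": 15, "Roseau, MN": 9}
births = {"Suffolk, MA": 314160, "Macon, AL": 12780, "Roseau, MN": 6660}

def one_in_n(county):
    """Odds of a Wikipedia entry for people born in a county,
    expressed as 'one in N'."""
    return round(births[county] / notable_counts[county])

for county in notable_counts:
    print(county, "-> one in", one_in_n(county))
```

The same division, run over every county with the real merged dataset, produces the rankings discussed below.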
Why do some parts of the country appear to be so much better at churning out America’s movers and shakers? I closely examined the top counties. It turns out that nearly all of them fit into one of two categories.
First, and this surprised me, many of these counties contained a sizable college town. Just about every time I saw the name of a county that I had not heard of near the top of the list, like Washtenaw, Michigan, I found out that it was dominated by a classic college town, in this case Ann Arbor. The counties graced by Madison, Wisconsin; Athens, Georgia; Columbia, Missouri; Berkeley, California; Chapel Hill, North Carolina; Gainesville, Florida; Lexington, Kentucky; and Ithaca, New York, are all in the top 3 percent.
Why is this? Some of it may well be due to the gene pool: sons and daughters of professors and graduate students tend to be smart (a trait that, in the game of big success, can be mighty useful). And, indeed, having more college graduates in an area is a strong predictor of the success of the people born there.
But there is most likely something more going on: early exposure to innovation. One of the fields where college towns are most successful in producing top dogs is music. A kid in a college town will be exposed to unique concerts, unusual radio stations, and even independent record stores. And this isn’t limited to the arts. College towns also incubate more than their expected share of notable businesspeople. Maybe early exposure to cutting-edge art and ideas helps them, too.
The success of college towns does not just cross regions. It crosses race. African-Americans were noticeably underrepresented on Wikipedia in nonathletic fields, especially business and science. This undoubtedly has a lot to do with discrimination. But one small county, where the 1950 population was 84 percent black, produced notable baby boomers at a rate near those of the highest counties.
Of fewer than 13,000 boomers born in Macon County, Alabama, fifteen made it to Wikipedia—or one in 852. Every single one of them is black. Fourteen of them were from the town of Tuskegee, home of Tuskegee University, a historically black college founded by Booker T. Washington. The list included judges, writers, and scientists. In fact, a black child born in Tuskegee had the same probability of becoming a notable in a field outside of sports as a white child born in some of the highest-scoring, majority-white college towns.
The second attribute most likely to make a county’s natives successful was the presence in that county of a big city. Being born in San Francisco County, Los Angeles County, or New York City all offered among the highest probabilities of making it to Wikipedia. (I grouped New York City’s five counties together because many Wikipedia entries did not specify a borough of birth.)
Urban areas tend to be well supplied with models of success. To see the value of being near successful practitioners of a craft when young, compare New York City, Boston, and Los Angeles. Among the three, New York City produces notable journalists at the highest rate; Boston produces notable scientists at the highest rate; and Los Angeles produces notable actors at the highest rate. Remember, we are talking about people who were born there, not people who moved there. And this holds true even after subtracting people with notable parents in that field.
Suburban counties, unless they contained major college towns, performed far worse than their urban counterparts. My parents, like many boomers, moved away from crowded sidewalks to tree-shaded streets—in this case from Manhattan to Bergen County, New Jersey—to raise their three children. This was potentially a mistake, at least from the perspective of having notable children. A child born in New York City is 80 percent more likely to make it into Wikipedia than a kid born in Bergen County. These are just correlations, but they do suggest that growing up near big ideas is better than growing up with a big backyard.
The stark effects identified here might be even stronger if I had better data on places lived throughout childhood, since many people grow up in different counties than the one where they were born.
The success of college towns and big cities is striking when you just look at the data. But I also delved more deeply to undertake a more sophisticated empirical analysis.
Doing so showed that there was another variable that was a strong predictor of a person’s securing an entry in Wikipedia: the proportion of immigrants in your county of birth. The greater the percentage of foreign-born residents in an area, the higher the proportion of children born there who go on to notable success. (Take that, Donald Trump!) If two places have similar urban and college populations, the one with more immigrants will produce more prominent Americans. What explains this?
A lot of it seems to be directly attributable to the children of immigrants. I did an exhaustive search of the biographies of the hundred most famous white baby boomers, according to the Massachusetts Institute of Technology’s Pantheon project, which is also working with Wikipedia data. Most of these were entertainers. At least thirteen had foreign-born mothers, including Oliver Stone, Sandra Bullock, and Julianne Moore. This rate is more than three times higher than the national average during this period. (Many had fathers who were immigrants, including Steve Jobs and John Belushi, but this data was more difficult to compare to national averages, since information on fathers is not always included on birth certificates.)
What about variables that don’t impact success? One that I found more than a little surprising was how much money a state spends on education. In states with similar percentages of their residents living in urban areas, education spending did not correlate with rates of producing notable writers, artists, or business leaders.
It is interesting to compare my Wikipedia study to one of Chetty’s team’s studies discussed earlier. Recall that Chetty’s team was trying to figure out what areas are good at allowing people to reach the upper middle class. My study was trying to figure out what areas are good at allowing people to reach fame. The results are strikingly different.
Spending a lot on education helps kids reach the upper middle class. It does little to help them become a notable writer, artist, or business leader. Many of these huge successes hated school. Some dropped out.
New York City, Chetty’s team found, is not a particularly good place to raise a child if you want to ensure he reaches the upper middle class. It is a great place, my study found, if you want to give him a chance at fame.
When you look at the factors that drive success, the large variation between counties begins to make sense. Many counties combine all the main ingredients for success. Return, again, to Boston. With numerous universities, it is stewing in innovative ideas. It is an urban area with many extremely accomplished people offering youngsters examples of how to make it. And it draws plenty of immigrants, whose children are driven to apply these lessons.
What if an area has none of these qualities? Is it destined to produce fewer superstars? Not necessarily. There is another path: extreme specialization. Roseau County, Minnesota, a small rural county with few foreigners and no major universities, is a good example. Roughly one in 740 people born there made it into Wikipedia. Their secret? All nine were professional hockey players, no doubt helped by the county’s world-class youth and high school hockey programs.
So is the point here—assuming you’re not so interested in raising a hockey star—to move to Boston or Tuskegee if you want to give your future children the utmost advantage? It can’t hurt. But there are larger lessons here. Usually, economists and sociologists focus on how to avoid bad outcomes, such as poverty and crime. Yet the goal of a great society is not only to leave fewer people behind; it is to help as many people as possible to really stand out. Perhaps this effort to zoom in on the places where hundreds of thousands of the most famous Americans were born can give us some initial strategies: encouraging immigration, subsidizing universities, and supporting the arts, among them.
Usually, I study the United States. So when I think of zooming in by geography, I think of zooming in on our cities and towns—of looking at places like Macon County, Alabama, and Roseau County, Minnesota. But another huge—and still growing—advantage of data from the internet is that it is easy to collect data from around the world. We can then see how countries differ. And data scientists get an opportunity to tiptoe into anthropology.
One somewhat random topic I recently explored: how does pregnancy play out in different countries around the world? I examined Google searches by pregnant women. The first thing I found was a striking similarity in the physical symptoms about which women complain.
I tested how often various symptoms were searched in combination with the word “pregnant.” For example, how often is “pregnant” searched in conjunction with “nausea,” “back pain,” or “constipation”? Canada’s symptoms were very close to those in the United States. Symptoms in countries like Britain, Australia, and India were all roughly similar, too.
Pregnant women around the world apparently also crave the same things. In the United States, the top Google search in this category is “craving ice during pregnancy.” The next four are salt, sweets, fruit, and spicy food. In Australia, those cravings don’t differ all that much: the list features salt, sweets, chocolate, ice, and fruit. What about India? A similar story: spicy food, sweets, chocolate, salt, and ice cream. In fact, the top five are very similar in all of the countries I looked at.
Preliminary evidence suggests that no part of the world has stumbled upon a diet or environment that drastically changes the physical experience of pregnancy.
But the thoughts that surround pregnancy most definitely do differ.
Start with questions about what pregnant women can safely do. The top questions in the United States: can pregnant women “eat shrimp,” “drink wine,” “drink coffee,” or “take Tylenol”?
When it comes to such concerns, other countries don’t have much in common with the United States or one another. Whether pregnant women can “drink wine” is not among the top ten questions in Canada, Australia, or Britain. Australia’s concerns are mostly related to eating dairy products while pregnant, particularly cream cheese. In Nigeria, where 30 percent of the population uses the internet, the top question is whether pregnant women can drink cold water.
Are these worries legitimate? It depends. There is strong evidence that pregnant women are at an increased risk of listeria from unpasteurized cheese. Links have been established between drinking too much alcohol and negative outcomes for the child. In some parts of the world, it is believed that drinking cold water can give your baby pneumonia; I don’t know of any medical support for this.
The huge differences in questions posed around the world are most likely caused by the overwhelming flood of information coming from disparate sources in each country: legitimate scientific studies, so-so scientific studies, old wives’ tales, and neighborhood chatter. It is difficult for women to know what to focus on—or what to Google.
We can see another clear difference when we look at the top searches for “how to ___ during pregnancy?” In the United States, Australia, and Canada, the top search is “how to prevent stretch marks during pregnancy.” But in Ghana, India, and Nigeria, preventing stretch marks is not even in the top five. These countries tend to be more concerned with how to have sex or how to sleep.
There is undoubtedly more to learn from zooming in on aspects of health and culture in different corners of the world. But my preliminary analysis suggests that Big Data will tell us that humans are even less powerful than we realized when it comes to transcending our biology. Yet we come up with remarkably different interpretations of what it all means.
“The adventures of a young man whose principal interests are rape, ultra-violence, and Beethoven.”
That was how Stanley Kubrick’s controversial A Clockwork Orange was advertised. In the movie, the fictional young protagonist, Alex DeLarge, committed shocking acts of violence with chilling detachment. In one of the film’s most notorious scenes, he raped a woman while belting out “Singin’ in the Rain.”
Almost immediately, there were reports of copycat incidents. Indeed, a group of men raped a seventeen-year-old girl while singing the same song. The movie was pulled from theaters in many European countries, and some of the more shocking scenes were removed for the version shown in America.
There are, in fact, many examples of real life imitating art, with men seemingly hypnotized by what they had just seen on-screen. A showing of the gang movie Colors was followed by a violent shooting. A showing of the gang movie New Jack City was followed by riots.
Perhaps most disturbing, four days after the release of Money Train, men used lighter fluid to ignite a subway token booth, almost perfectly mimicking a scene in the film. The only difference between the fictional and real-world arson: In the movie, the operator escaped. In real life, he burned to death.
There is also some evidence from psychological experiments that subjects exposed to a violent film will report more anger and hostility, even if they don’t precisely imitate one of the scenes.
In other words, anecdotes and experiments suggest violent movies can incite violent behavior. But how big an effect do they really have? Are we talking about one or two murders every decade or hundreds of murders every year? Anecdotes and experiments can’t answer this.
To see if Big Data could, two economists, Gordon Dahl and Stefano DellaVigna, merged together three Big Datasets for the years 1995 to 2004: FBI hourly crime data, box-office numbers, and a measure of the violence in every movie from kids-in-mind.com.
The information they were using was complete—every movie and every crime committed in every hour in cities throughout the United States. This would prove important.
Key to their study was the fact that on some weekends, the most popular movie was a violent one—Hannibal or Dawn of the Dead, for example—while on other weekends, the most popular movie was nonviolent, such as Runaway Bride or Toy Story.
The economists could see exactly how many murders, rapes, and assaults were committed on weekends when a prominent violent movie was released and compare that to the number of murders, rapes, and assaults there were on weekends when a prominent peaceful movie was released.
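The core of that comparison can be sketched in a few lines. The weekend counts below are invented for illustration; the actual study used every hour of FBI crime data from 1995 to 2004.

```python
# Hypothetical weekend records: (assaults that weekend, was the top
# box-office movie violent?). All numbers here are made up.
weekends = [
    (980, True), (1010, True), (995, True),
    (1050, False), (1065, False), (1040, False),
]

# Split the weekends into the two groups being compared.
violent = [crime for crime, was_violent in weekends if was_violent]
peaceful = [crime for crime, was_violent in weekends if not was_violent]

avg_violent = sum(violent) / len(violent)
avg_peaceful = sum(peaceful) / len(peaceful)

print(f"avg assaults, violent-movie weekends:  {avg_violent:.0f}")
print(f"avg assaults, peaceful-movie weekends: {avg_peaceful:.0f}")
```

The real analysis also had to rule out confounders such as season and weather before attributing any gap between the two averages to the movies themselves.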
So what did they find? When a violent movie was shown, did crime rise, as some experiments suggest? Or did it stay the same?
On weekends with a popular violent movie, the economists found, crime dropped.
You read that right. On weekends with a popular violent movie, when millions of Americans were exposed to images of men killing other men, crime dropped—significantly.
When you get a result this strange and unexpected, your first thought is that you've done something wrong. Dahl and DellaVigna each carefully went over the coding. No mistakes. Your second thought is that some other variable explains the results. They checked whether the time of year affected the results. It didn't. They collected data on weather, thinking perhaps that was somehow driving the relationship. It wasn't.
“We checked all our assumptions, everything we were doing,” Dahl told me. “We couldn’t find anything wrong.”
Despite the anecdotes, despite the lab evidence, and as bizarre as it seemed, showing a violent movie somehow caused a big drop in crime. How could this possibly be?
The key to figuring it out for Dahl and DellaVigna was utilizing their Big Data to zoom in closer. Survey data traditionally provided information that was annual or at best perhaps monthly. If we are really lucky, we might get data for a weekend. By comparison, as we’ve increasingly been using comprehensive datasets, rather than small-sample surveys, we have been able to home in by the hour and even the minute. This has allowed us to learn a lot more about human behavior.
Sometimes fluctuations over time are amusing, if not earth-shattering. EPCOR, a utility company in Edmonton, Canada, reported minute-by-minute water consumption data during the 2010 Olympic gold medal hockey match between the United States and Canada, which an estimated 80 percent of Canadians watched. The data tells us that shortly after each period ended, water consumption shot up. Toilets across Edmonton were clearly flushing.
Google searches can also be broken down by the minute, revealing some interesting patterns in the process. For example, searches for "unblocked games" soar at 8 A.M. on weekdays and stay high through 3 P.M., no doubt from students hunting for games that slip past the filters schools place on their networks.
Search rates for “weather,” “prayer,” and “news” peak before 5:30 A.M., evidence that most people wake up far earlier than I do. Search rates for “suicide” peak at 12:36 A.M. and are at the lowest levels around 9 A.M., evidence that most people are far less miserable in the morning than I am.
The data shows that the hours between 2 and 4 A.M. are prime time for big questions: What is the meaning of consciousness? Does free will exist? Is there life on other planets? The popularity of these questions late at night may be a result, in part, of cannabis use. Search rates for “how to roll a joint” peak between 1 and 2 A.M.
And in their large dataset, Dahl and DellaVigna could look at how crime changed by the hour on those movie weekends. They found that the drop in crime when popular violent movies were shown—relative to other weekends—began in the early evening. Crime was lower, in other words, before the violent scenes even started, when theatergoers may have just been walking in.
Can you guess why? Think, first, about who is likely to choose to attend a violent movie. It’s young men—particularly young, aggressive men.
Think, next, about where crimes tend to be committed. Rarely in a movie theater. There have been exceptions, most notably a 2012 premeditated shooting in a Colorado theater. But, by and large, men go to theaters unarmed and sit silently.
Offer young, aggressive men the chance to see Hannibal, and they will go to the movies. Offer young, aggressive men Runaway Bride as their option, and they will take a pass and instead go out, perhaps to a bar, club, or pool hall, where the incidence of violent crime is higher.
Violent movies keep potentially violent people off the streets.
Puzzle solved. Right? Not quite yet. There was one more strange thing in the data. The effects started right when the movies started showing; however, they did not stop after the movie ended and the theater closed. On evenings when violent movies were showing, crime was lower well into the night, from midnight to 6 A.M.
Even if crime was lower while the young men were in the movie theater, shouldn’t it rise after they left and were no longer preoccupied? They had just watched a violent movie, which experiments say makes people more angry and aggressive.
Can you think of any explanations for why crime still dropped after the movie ended? After much thought, the authors, who were crime experts, had another “Aha” moment. They knew that alcohol is a major contributor to crime. The authors had sat in enough movie theaters to know that virtually no theaters in the United States serve liquor. Indeed, the authors found that alcohol-related crimes plummeted in late-night hours after violent movies.
Of course, Dahl and DellaVigna's results were limited. They could not, for instance, test the lasting effects—to see how long, months out, the drop in crime might persist. And it's still possible that consistent exposure to violent movies ultimately leads to more violence. However, their study does put the immediate impact of violent movies, which is all the lab experiments measure, into perspective. Perhaps a violent movie does influence some people and make them unusually angry and aggressive. However, do you know what undeniably influences people in a violent direction? Hanging out with other potentially violent men and drinking.*
This makes sense now. But it didn’t make sense before Dahl and DellaVigna began analyzing piles of data.
One more important point that becomes clear when we zoom in: the world is complicated. Actions we take today can have distant effects, most of them unintended. Ideas spread—sometimes slowly; other times exponentially, like viruses. People respond in unpredictable ways to incentives.
These connections and relationships, these surges and swells, cannot be traced with tiny surveys or traditional data methods. The world, quite simply, is too complex and too rich for little data.
In June 2009, David “Big Papi” Ortiz looked like he was done. During the previous half decade, Boston had fallen in love with their Dominican-born slugger with the friendly smile and gapped teeth.
He had made five consecutive All-Star games, won an MVP Award, and helped end Boston’s eighty-six-year championship drought. But in the 2008 season, at the age of thirty-two, his numbers fell off. His batting average had dropped 68 points, his on-base percentage 76 points, his slugging percentage 114 points. And at the start of the 2009 season, Ortiz’s numbers were dropping further.
Here’s how Bill Simmons, a sportswriter and passionate Boston Red Sox fan, described what was happening in the early months of the 2009 season: “It’s clear that David Ortiz no longer excels at baseball. . . . Beefy sluggers are like porn stars, wrestlers, NBA centers and trophy wives: When it goes, it goes.” Great sports fans trust their eyes, and Simmons’s eyes told him Ortiz was finished. In fact, Simmons predicted he would be benched or released shortly.
Was Ortiz really finished? If you’re the Boston general manager, in 2009, do you cut him? More generally, how can we predict how a baseball player will perform in the future? Even more generally, how can we use Big Data to predict what people will do in the future?
A theory that will get you far in data science is this: look at what sabermetricians (those who have used data to study baseball) have done and expect it to spread out to other areas of data science. Baseball was among the first fields with comprehensive datasets on just about everything, and an army of smart people willing to devote their lives to making sense of that data. Now, just about every field is there or getting there. Baseball comes first; every other field follows. Sabermetrics eats the world.
The simplest way to predict a baseball player’s future is to assume he will continue performing as he currently is. If a player has struggled for the past 1.5 years, you might guess that he will struggle for the next 1.5 years.
By this methodology, Boston should have cut David Ortiz.
However, there might be more relevant information. In the 1980s, Bill James, whom most consider the founder of sabermetrics, emphasized the importance of age. Baseball players, James found, peaked early—at around the age of twenty-seven. Teams tended to ignore just how much players decline as they age. They overpaid for aging players.
By this more advanced methodology, Boston should definitely have cut David Ortiz.
But this age adjustment might miss something. Not all players follow the same path through life. Some players might peak at twenty-three, others at thirty-two. Short players may age differently from tall players, fat players from skinny players. Baseball statisticians found that there were types of players, each following a different aging path. This story was even worse for Ortiz: “beefy sluggers” indeed do, on average, peak early and collapse shortly past thirty.
If Boston considered his recent past, his age, and his size, they should, without a doubt, have cut David Ortiz.
Then, in 2003, statistician Nate Silver introduced a new model, which he called PECOTA, to predict player performance. It proved to be the best—and, also, the coolest. Silver searched for players’ doppelgangers. Here’s how it works. Build a database of every Major League Baseball player ever, more than 18,000 men. And include everything you know about those players: their height, age, and position; their home runs, batting average, walks, and strikeouts for each year of their careers. Now, find the twenty ballplayers who look most similar to Ortiz right up until that point in his career—those who played like he did when he was 24, 25, 26, 27, 28, 29, 30, 31, 32, and 33. In other words, find his doppelgangers. Then see how Ortiz’s doppelgangers’ careers progressed.
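Silver's actual PECOTA model is far more elaborate, but its core idea, a nearest-neighbor search over career statistics, can be sketched simply. Every player, stat, and prediction below is invented for illustration.

```python
import math

# Each player is a vector of (home runs, batting average) over his last
# three seasons. All names and numbers here are hypothetical.
players = {
    "comparable_A": [33, .280, 30, .264, 24, .240],
    "comparable_B": [31, .275, 29, .260, 23, .241],
    "unlike_player": [10, .310, 12, .305, 11, .300],
}
# What each of those players actually did the following season (home runs).
next_season_hr = {"comparable_A": 27, "comparable_B": 29, "unlike_player": 9}

# The career-to-date we want to project: a decline in progress.
target = [32, .276, 28, .262, 20, .238]

def distance(a, b):
    """Euclidean distance between two stat vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rank every player by similarity to the target, keep the nearest two
# (PECOTA kept about twenty), and average what they did next.
ranked = sorted(players, key=lambda name: distance(players[name], target))
doppelgangers = ranked[:2]
prediction = sum(next_season_hr[p] for p in doppelgangers) / len(doppelgangers)

print("nearest comparables:", doppelgangers)
print("projected next-season home runs:", prediction)
```

A real system would normalize each stat before computing distances, so that home-run counts don't swamp batting averages; this sketch skips that step.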
A doppelganger search is another example of zooming in: it zooms in on the small subset of people most similar to a given person. It turns out, Ortiz's doppelgangers gave a very different prediction for Ortiz's future. Ortiz's doppelgangers included Jorge Posada and Jim Thome. These players started their careers a bit slow; had amazing bursts in their late twenties, with world-class power; and then struggled in their early thirties.
Silver then predicted how Ortiz would do based on how these doppelgangers ended up doing. And here’s what he found: they regained their power. For trophy wives, Simmons may be right: when it goes, it goes. But for Ortiz’s doppelgangers, when it went, it came back.
The doppelganger search, the best methodology ever used to predict baseball player performance, said Boston should be patient with Ortiz. And Boston indeed was patient with their aging slugger. In 2010, Ortiz’s average rose to .270. He hit 32 home runs and made the All-Star team. This began a string of four consecutive All-Star games for Ortiz. In 2013, batting in his traditional third spot in the lineup, at the age of thirty-seven, Ortiz batted .688 as Boston defeated St. Louis, 4 games to 2, in the World Series. Ortiz was voted World Series MVP.*
As soon as I finished reading Nate Silver’s approach to predicting the trajectory of ballplayers, I immediately began thinking about whether I might have a doppelganger, too.
Doppelganger searches are promising in many fields, not just athletics. Could I find the person who shares the most interests with me? Maybe if I found the person most similar to me, we could hang out. Maybe he would know some restaurants we would like. Maybe he could introduce me to things I had no idea I might have an affinity for.
A doppelganger search zooms in on individuals and even on the traits of individuals. And, as with all zooming in, it gets sharper the more data you have. Suppose I searched for my doppelganger in a dataset of ten or so people. I might find someone who shared my interest in books. Suppose I searched for my doppelganger in a dataset of a thousand or so people. I might find someone who had a thing for popular physics books. But suppose I searched for my doppelganger in a dataset of hundreds of millions of people. Then I might be able to find someone who was really, truly similar to me.
One day, I went doppelganger hunting on social media. Using the entire corpus of Twitter profiles, I looked for the people on the planet who have the most common interests with me.
You can certainly tell a lot about my interests from whom I follow on my Twitter account. Overall, I follow some 250 people, showing my passions for sports, politics, comedy, science, and morose Jewish folksingers.
So is there anybody out there in the universe who follows all 250 of these accounts, my Twitter twin? Of course not. Doppelgangers aren’t identical to us, only similar. Nor is there anybody who follows 200 of the accounts I follow. Or even 150.
However, I did eventually find an account that followed an amazing 100 of the accounts I follow: Country Music Radio Today. Huh? It turns out, Country Music Radio Today was a bot (it no longer exists) that followed 750,000 Twitter profiles in the hope that they would follow back.
I have an ex-girlfriend who I suspect would get a kick out of this result. She once told me I was more like a robot than a human being.
All joking aside, my initial finding that my doppelganger was a bot that followed 750,000 random accounts does make an important point about doppelganger searches. For a doppelganger search to be truly accurate, you don’t want to find someone who merely likes the same things you like. You also want to find someone who dislikes the things you dislike.
My interests are apparent not just from the accounts I follow but from those I choose not to follow. I am interested in sports, politics, comedy, and science but not food, fashion, or theater. My follows show that I like Bernie Sanders but not Elizabeth Warren, Sarah Silverman but not Amy Schumer, the New Yorker but not the Atlantic, my friends Noah Popp, Emily Sands, and Josh Gottlieb but not my friend Sam Asher. (Sorry, Sam. But your Twitter feed is a snooze.)
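One simple score with this property is Jaccard similarity: the follows two users share, divided by the union of everything either follows. Following things your target doesn't follow counts against you, which is exactly what sinks a promiscuous bot. The handles below are invented for illustration.

```python
def jaccard(a, b):
    """Similarity between two sets of followed accounts: shared follows
    divided by the union of both users' follows."""
    return len(a & b) / len(a | b)

# Hypothetical follow lists.
me = {"mets", "cohen_fan", "science_blog", "comedy_acct", "polling_guru"}
twin = {"mets", "cohen_fan", "science_blog", "comedy_acct", "econ_blog"}
# A bot that follows 750,000 accounts, including every one of mine.
bot = {f"acct_{i}" for i in range(750_000)} | me

print(f"twin score: {jaccard(me, twin):.2f}")   # high: shared likes, few extras
print(f"bot score:  {jaccard(me, bot):.6f}")    # near zero, despite full overlap
```

The bot overlaps with every account I follow, yet scores worse than the twin, because its enormous follow list inflates the denominator.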
Of all 200 million people on Twitter, who has the most similar profile to me? It turns out my doppelganger is Vox writer Dylan Matthews. This was kind of a letdown, for the purposes of improving my media consumption, as I already follow Matthews on Twitter and Facebook and compulsively read his Vox posts. So learning he was my doppelganger hasn’t really changed my life. But it’s still pretty cool to know the person most similar to you in the world, especially if it’s someone you admire. And when I finish this book and stop being a hermit, maybe Matthews and I can hang out and discuss the writings of James Surowiecki.
The Ortiz doppelganger search was neat for baseball fans. And my doppelganger search was entertaining, at least to me. But what else can these searches reveal? For one thing, doppelganger searches have been used by many of the biggest internet companies to dramatically improve their offerings and user experience. Amazon uses something like a doppelganger search to suggest what books you might like. They see what people similar to you select and base their recommendations on that.
Pandora does the same in picking the songs you might want to listen to, and this is how Netflix figures out which movies you might like. The impact has been profound: when Amazon engineer Greg Linden first introduced doppelganger searches to predict readers' book preferences, the improvement in recommendations was so great that Amazon founder Jeff Bezos got to his knees and shouted, "I'm not worthy!" to Linden.
But what is really interesting about doppelganger searches, considering their power, is not how they’re commonly being used now. It is how frequently they are not used. There are major areas of life that could be vastly improved by the kind of personalization these searches allow. Take our health, for instance.
Isaac Kohane, a computer scientist and medical researcher at Harvard, is trying to bring this principle to medicine. He wants to organize and collect all of our health information so that instead of using a one-size-fits-all approach, doctors can find patients just like you. Then they can employ more personalized, more focused diagnoses and treatments.
Kohane considers this a natural extension for the medical field and not even a particularly radical one. “What is a diagnosis?” Kohane asks. “A diagnosis really is a statement that you share properties with previously studied populations. When I diagnose you with a heart attack, God forbid, I say you have a pathophysiology that I learned from other people means you have had a heart attack.”
A diagnosis is, in essence, a primitive kind of doppelganger search. The problem is that the datasets doctors use to make their diagnoses are small. These days a diagnosis is based on a doctor’s experience with the population of patients he or she has treated and perhaps supplemented by academic papers from small populations that other researchers have encountered. As we’ve seen, though, for a doppelganger search to really get good, it would have to include many more cases.
Here is a field where some Big Data could really help. So what's taking so long? Why isn't it already widely used? The problem lies with data collection. Most medical records still exist on paper, buried in files; those that are computerized are often locked up in incompatible formats. We often have better data, Kohane notes, on baseball than on health.
But simple measures would go a long way. Kohane talks repeatedly of "low-hanging fruit." He believes, for instance, that merely creating a complete dataset of children's height and weight charts and any diseases they might have would be revolutionary for pediatrics. Each child's growth path could then be compared to every other child's growth path. A computer could find children on a similar trajectory and automatically flag any troubling patterns. It might detect a child's height leveling off prematurely, which in certain scenarios points to one of two likely causes: hypothyroidism or a brain tumor. Early diagnosis in both cases would be a huge boon. "These are rare birds," according to Kohane, "one-in-ten-thousand kind of events. Children, by and large, are healthy. I think we could diagnose them earlier, at least a year earlier. One hundred percent, we could."
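A crude version of that flag, comparing a child's yearly growth against a peer average, might look like the sketch below. The peer curves and the half-the-average threshold are invented for illustration; real growth charts use clinically validated percentiles.

```python
# Heights in cm at ages four through eight for a hypothetical peer group
# and one patient whose growth stalls after age six.
peer_curves = [
    [102, 109, 115, 121, 127],
    [100, 107, 114, 120, 126],
    [104, 111, 117, 123, 129],
]
patient = [103, 110, 116, 117, 118]

def yearly_gains(curve):
    """Height gained in each one-year interval."""
    return [later - earlier for earlier, later in zip(curve, curve[1:])]

# Average yearly gain across all peers, and a crude "stalled" threshold
# at half that rate.
all_gains = [g for curve in peer_curves for g in yearly_gains(curve)]
peer_gain = sum(all_gains) / len(all_gains)
flags = [gain < peer_gain / 2 for gain in yearly_gains(patient)]

print(f"average peer gain: {peer_gain:.1f} cm/yr")
print(f"stalled years flagged: {flags}")
```

With the numbers above, the patient's last two years fall below the threshold and get flagged, which is the kind of automatic early warning Kohane has in mind.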
James Heywood is an entrepreneur who takes a different approach to the difficulties of linking medical data. He created a website, PatientsLikeMe.com, where individuals can report their own information—their conditions, treatments, and side effects. He has already had a lot of success charting the varying courses diseases can take and how they compare to our common understanding of them.
His goal is to recruit enough people, covering enough conditions, so that people can find their health doppelganger. Heywood hopes that you can find people of your age and gender, with your history, reporting symptoms similar to yours—and see what has worked for them. That would be a very different kind of medicine, indeed.
In many ways the act of zooming in is more valuable to me than the particular findings of a particular study, because it offers a new way of seeing and talking about life.
When people learn that I am a data scientist and a writer, they sometimes will share some fact or survey with me. I often find this data boring—static and lifeless. It has no story to tell.
Likewise, friends have tried to get me to join them in reading novels and biographies. But these hold little interest for me as well. I always find myself asking, “Would that happen in other situations? What’s the more general principle?” Their stories feel small and unrepresentative.
What I have tried to present in this book is something that, for me, is like nothing else. It is based on data and numbers; it is illustrative and far-reaching. And yet the data is so rich that you can visualize the people underneath it. When we zoom in on every minute of Edmonton’s water consumption, I see the people getting up from their couch at the end of the period. When we zoom in on people moving from Philadelphia to Miami and starting to cheat on their taxes, I see these people talking to their neighbors in their apartment complex and learning about the tax trick. When we zoom in on baseball fans of every age, I see my own childhood and my brother’s childhood and millions of adult men still crying over a team that won them over when they were eight years old.
At the risk of once again sounding grandiose, I think the economists and data scientists featured in this book are creating not only a new tool but a new genre. What I have tried to present in this chapter, and much of this book, is data so big and so rich, allowing us to zoom in so close that, without limiting ourselves to any particular, unrepresentative human being, we can still tell complex and evocative stories.