February 27, 2000, started as an ordinary day on Google’s Mountain View campus. The sun was shining, the bikers were pedaling, the masseuses were massaging, the employees were hydrating with cucumber water. And then, on this ordinary day, a few Google engineers had an idea that unlocked the secret that today drives much of the internet. The engineers found the best way to get you clicking, coming back, and staying on their sites.
Before describing what they did, we need to talk about correlation versus causality, a huge issue in data analysis—and one that we have not yet adequately addressed.
The media bombard us with correlation-based studies seemingly every day. For example, we have been told that those of us who drink a moderate amount of alcohol tend to be in better health. That is a correlation.
Does this mean drinking a moderate amount will improve one’s health—a causation? Perhaps not. It could be that good health causes people to drink a moderate amount. Social scientists call this reverse causation. Or it could be that there is an independent factor that causes both moderate drinking and good health. Perhaps spending a lot of time with friends leads to both moderate alcohol consumption and good health. Social scientists call this omitted-variable bias.
How, then, can we more accurately establish causality? The gold standard is a randomized, controlled experiment. Here’s how it works. You randomly divide people into two groups. One, the treatment group, is asked to do or take something. The other, the control group, is not. You then see how each group responds. The difference in the outcomes between the two groups is your causal effect.
For example, to test whether moderate drinking causes good health, you might randomly pick some people to drink one glass of wine per day for a year, randomly choose others to drink no alcohol for a year, and then compare the reported health of both groups. Since people were randomly assigned to the two groups, there is no reason to expect one group would have better initial health or have socialized more. You can trust that the effects of the wine are causal. Randomized, controlled experiments are the most trusted evidence in any field. If a pill can pass a randomized, controlled experiment, it can be dispensed to the general populace. If it cannot pass this test, it won’t make it onto pharmacy shelves.
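The logic of such an experiment can be sketched in a few lines of code. Everything here is hypothetical—the participant data and the `treat` and `measure` functions are stand-ins—but it shows the core move: random assignment, then a comparison of average outcomes.

```python
import random
import statistics

def run_experiment(participants, treat, measure):
    """Randomly split participants into treatment and control,
    apply the treatment to one group, and compare average outcomes."""
    random.shuffle(participants)
    half = len(participants) // 2
    treatment, control = participants[:half], participants[half:]
    treated_outcomes = [measure(treat(p)) for p in treatment]
    control_outcomes = [measure(p) for p in control]
    # Because assignment was random, the difference in averages
    # estimates the causal effect of the treatment.
    return statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
```

Because assignment is random, any systematic difference between the two averages can be attributed to the treatment rather than to preexisting differences between the groups.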
Randomized experiments have increasingly been used in the social sciences as well. Esther Duflo, a French economist at MIT, has led the campaign for greater use of experiments in developmental economics, a field that tries to figure out the best ways to help the poorest people in the world. Consider Duflo’s study, with colleagues, of how to improve education in rural India, where more than half of middle school students cannot read a simple sentence. One potential reason students struggle so much is that teachers don’t show up consistently. On a given day in some schools in rural India, more than 40 percent of teachers are absent.
Duflo’s test? She and her colleagues randomly divided schools into two groups. In one (the treatment group), in addition to their base pay, teachers were paid a small amount—50 rupees, or about $1.15—for every day they showed up to work. In the other, no extra payment for attendance was given. The results were remarkable. When teachers were paid, absenteeism fell by half. Student test performance also improved substantially, with the biggest effects on young girls. By the end of the experiment, girls in schools where teachers were paid to come to class were 7 percentage points more likely to be able to write.
According to a New Yorker article, when Bill Gates learned of Duflo’s work, he was so impressed he told her, “We need to fund you.”
So randomized experiments are the gold standard for proving causality, and their use has spread through the social sciences. Which brings us back to Google’s offices on February 27, 2000. What did Google do on that day that revolutionized the internet?
On that day, a few engineers decided to perform an experiment on Google’s site. They randomly divided users into two groups. The treatment group was shown twenty links on the search results pages. The control group was shown the usual ten. The engineers then compared the satisfaction of the two groups based on how frequently they returned to Google.
This is a revolution? It doesn’t seem so revolutionary. I already noted that randomized experiments have been used by pharmaceutical companies and social scientists. How can copying them be such a big deal?
The key point—and this was quickly realized by the Google engineers—is that experiments in the digital world have a huge advantage relative to experiments in the offline world. As convincing as offline randomized experiments can be, they are also resource-intensive. For Duflo’s study, schools had to be contacted, funding had to be arranged, some teachers had to be paid, and all students had to be tested. Offline experiments can cost thousands or hundreds of thousands of dollars and take months or years to conduct.
In the digital world, randomized experiments can be cheap and fast. You don’t need to recruit and pay participants. Instead, you can write a line of code to randomly assign them to a group. You don’t need users to fill out surveys. Instead, you can measure mouse movements and clicks. You don’t need to hand-code and analyze the responses. You can build a program to automatically do that for you. You don’t have to contact anybody. You don’t even have to tell users they are part of an experiment.
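In practice, the “line of code” that assigns a user to a group is often just a hash of the user’s ID, so the same visitor always lands in the same bucket. A minimal sketch—the function and experiment names here are made up for illustration:

```python
import hashlib

def assign_group(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    Hashing the experiment name together with the user ID gives each
    experiment an independent, roughly 50/50 split, and the same user
    always sees the same variant on every visit."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"
```

With assignment handled this way, no one has to recruit or even notify participants; logging clicks and return visits does the rest.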
This is the fourth power of Big Data: it makes randomized experiments, which can find truly causal effects, much, much easier to conduct—anytime, more or less anywhere, as long as you’re online. In the era of Big Data, all the world’s a lab.
This insight quickly spread through Google and then the rest of Silicon Valley, where randomized controlled experiments have been renamed “A/B testing.” In 2011, Google engineers ran seven thousand A/B tests. And this number is only rising.
If Google wants to know how to get more people to click on ads on their sites, they may try two shades of blue in ads—one shade for Group A, another for Group B. Google can then compare click rates. Of course, the ease of such testing can lead to overuse. Some employees felt that because testing was so effortless, Google was overexperimenting. In 2009, one frustrated designer quit after Google went through forty-one marginally different shades of blue in A/B testing. But this designer’s stand in favor of art over obsessive market research has done little to stop the spread of the methodology.
Facebook now runs a thousand A/B tests per day, which means that a small number of engineers at Facebook start more randomized, controlled experiments in a given day than the entire pharmaceutical industry starts in a year.
A/B testing has spread beyond the biggest tech firms. A former Google employee, Dan Siroker, brought this methodology to Barack Obama’s first presidential campaign, which A/B-tested home page designs, email pitches, and donation forms. Then Siroker started a new company, Optimizely, which allows organizations to perform rapid A/B testing. In 2012, Optimizely was used by Obama as well as his opponent, Mitt Romney, to maximize sign-ups, volunteers, and donations. It’s also used by companies as diverse as Netflix, TaskRabbit, and New York magazine.
To see how valuable testing can be, consider how Obama used it to get more people engaged with his campaign. Obama’s home page initially included a picture of the candidate and a button below the picture that invited people to “Sign Up.”
Was this the best way to greet people? With the help of Siroker, Obama’s team could test whether a different picture and button might get more people to actually sign up. Would more people click if the home page instead featured a picture of Obama with a more solemn face? Would more people click if the button instead said “Join Now”? Obama’s team showed users different combinations of pictures and buttons and measured how many of them clicked the button. See if you can predict the winning picture and winning button.
The winner was the picture of Obama’s family and the button “Learn More.” And the victory was huge. By using that combination, Obama’s campaign team estimated it got 40 percent more people to sign up, netting the campaign roughly $60 million in additional funding.
There is another great benefit to the fact that all this gold-standard testing can be done so cheaply and easily: it further frees us from our reliance upon our intuition, which, as noted in Chapter 1, has its limitations. A fundamental reason for A/B testing’s importance is that people are unpredictable. Our intuition often fails to predict how they will respond.
Was your intuition correct on Obama’s optimal website?
Here are some more tests for your intuition. The Boston Globe A/B-tests headlines to figure out which ones get the most people to click on a story. Try to guess the winners from these pairs:
Finished your guesses? The answers are in bold below.
I predict you got more than half right, perhaps by considering what you would click on. But you probably did not guess all of these correctly.
Why? What did you miss? What insights into human behavior did you lack? What lessons can you learn from your mistakes?
We usually ask questions such as these after making bad predictions.
But look how difficult it is to draw general conclusions from the Globe headlines. In the first headline test, changing a single word, “this” to “SnotBot,” led to a big win. This might suggest more details win. But in the second headline, “deflated balls,” the detailed term, loses. In the fourth headline, “makes bank” beats the number $179,000. This might suggest slang terms win. But the slang term “hookup contest” loses in the third headline.
The lesson of A/B testing, to a large degree, is to be wary of general lessons. Clark Benson is the CEO of ranker.com, a news and entertainment site that relies heavily on A/B testing to choose headlines and site design. “At the end of the day, you can’t assume anything,” Benson says. “Test literally everything.”
Testing fills in gaps in our understanding of human nature. These gaps will always exist. If we knew, based on our life experience, what the answer would be, testing would not be of value. But we don’t, so it is.
Another reason A/B testing is so important is that seemingly small changes can have big effects. As Benson puts it, “I’m constantly amazed with minor, minor factors having outsized value in testing.”
In December 2012, Google changed its advertisements. They added a rightward-pointing arrow surrounded by a square.
Notice how bizarre this arrow is. It points rightward to absolutely nothing. In fact, when these arrows first appeared, many Google customers were critical. Why, they wondered, was Google adding meaningless arrows to its ads?
Well, Google is protective of its business secrets, so they don’t say exactly how valuable the arrows were. But they did say that these arrows had won in A/B testing. The reason Google added them is that they got a lot more people to click. And this minor, seemingly meaningless change made Google and their ad partners oodles of money.
So how can you find these small tweaks that produce outsize profits? You have to test lots of things, even many that seem trivial. In fact, Google’s users have noticed numerous times that ads have been changed a tiny bit only to return to their previous form. They have unwittingly become members of treatment groups in A/B tests—at no cost beyond seeing these slight variations.
Centering Experiment (Didn’t Work)
Green Star Experiment (Didn’t Work)
New Font Experiment (Didn’t Work)
The above variations never made it to the masses. They lost. But they were part of the process of picking winners. The road to a clickable arrow is paved with ugly stars, faulty positionings, and gimmicky fonts.
It may be fun to guess what makes people click. And if you are a Democrat, it might be nice to know that testing got Obama more money. But there is a dark side to A/B testing.
In his excellent book Irresistible, Adam Alter writes about the rise of behavioral addictions in contemporary society. Many people are finding aspects of the internet increasingly difficult to turn off.
My favorite dataset, Google searches, can give us some clues as to what people find most addictive. According to Google, most addictions remain the ones people have struggled with for many decades—drugs, sex, and alcohol, for example. But the internet is starting to make its presence felt on the list—with “porn” and “Facebook” now among the top ten reported addictions.
TOP ADDICTIONS REPORTED TO GOOGLE, 2016
A/B testing may be playing a role in making the internet so darn addictive.
Tristan Harris, a “design ethicist,” was quoted in Irresistible explaining why people have such a hard time resisting certain sites on the internet: “There are a thousand people on the other side of the screen whose job it is to break down the self-regulation you have.”
And these people are using A/B testing.
Through testing, Facebook may figure out that making a particular button a particular color gets people to come back to their site more often. So they change the button to that color. Then they may figure out that a particular font gets people to come back to their site more often. So they change the text to that font. Then they may figure out that emailing people at a certain time gets them coming back to their site more often. So they email people at that time.
Pretty soon, Facebook becomes a site optimized to maximize how much time people spend on Facebook. In other words, find enough winners of A/B tests and you have an addictive site. It is the type of feedback that cigarette companies never had.
A/B testing is increasingly a tool of the gaming industry. As Alter discusses, World of Warcraft A/B-tests various versions of its game. One mission might ask you to kill someone. Another might ask you to save something. Game designers can give different samples of players different missions and then see which ones keep more people playing. They might find, for example, that the mission that asked you to save a person got people to return 30 percent more often. If they test many, many missions, they start finding more and more winners. These 30 percent wins add up, until they have a game that keeps many adult men holed up in their parents’ basements.
If you are a little disturbed by this, I am with you. And we will talk a bit more about the ethical implications of this and other aspects of Big Data near the end of this book. But for better or worse, experimentation is now a crucial tool in the data scientists’ tool kit. And there is another form of experimentation sitting in that tool kit. It has been used to ask a variety of questions, including whether TV ads really work.
It’s January 22, 2012, and the New England Patriots are playing the Baltimore Ravens in the AFC Championship game.
There’s a minute left in the game. The Ravens are down, but they’ve got the ball. The next sixty seconds will determine which team will play in the Super Bowl. The next sixty seconds will help seal players’ legacies. And the last minute of this game will do something that, for an economist, is far more profound: the last sixty seconds will help finally tell us, once and for all, Do advertisements work?
The notion that ads improve sales is obviously crucial to our economy. But it is maddeningly hard to prove. In fact, this is a textbook example of exactly how difficult it is to distinguish between correlation and causation.
There’s no doubt that products that advertise the most also have the highest sales. Twentieth Century Fox spent $150 million marketing the movie Avatar, which became the highest-grossing film of all time. But how much of the $2.7 billion in Avatar ticket sales was due to the heavy marketing? Part of the reason Twentieth Century Fox spent so much money on promotion was presumably that they knew they had a desirable product.
Firms believe they know how effective their ads are. Economists are skeptical they really do. University of Chicago economics professor Steven Levitt, while collaborating with an electronics company, was underwhelmed when the firm tried to convince him they knew how much their ads worked. How, Levitt wondered, could they be so confident?
The company explained that, every year, in the days preceding Father’s Day, they ramp up their TV ad spending. Sure enough, every year, before Father’s Day, they have their highest sales. Uh, maybe that’s just because a lot of kids buy electronics for their dads as Father’s Day gifts, regardless of advertising.
“They got the causality completely backwards,” says Levitt in a lecture. At least they might have. We don’t know. “It’s a really hard problem,” Levitt adds.
As important as this problem is to solve, firms are reluctant to conduct rigorous experiments. Levitt tried to convince the electronics company to perform a randomized, controlled experiment to precisely learn how effective their TV ads were. Since A/B testing isn’t possible on television yet, this would require seeing what happens without advertising in some areas.
Here’s how the firm responded: “Are you crazy? We can’t not advertise in twenty markets. The CEO would kill us.” That ended Levitt’s collaboration with the company.
Which brings us back to this Patriots-Ravens game. How can the results of a football game help us determine the causal effects of advertising? Well, it can’t tell us the effects of a particular ad campaign from a particular company. But it can give evidence on the average effects of advertisements from many large campaigns.
It turns out, there is a hidden advertising experiment in games like this. Here’s how it works. By the time these championship games are played, companies have purchased, and produced, their Super Bowl advertisements. When businesses decide which ads to run, they don’t know which teams will play in the game.
But the results of the playoffs will have a huge impact on who actually watches the Super Bowl. The two teams that ultimately qualify will bring with them an enormous number of viewers. If New England, which plays near Boston, wins, far more people in Boston will watch the Super Bowl than in Baltimore. And vice versa.
To the firms, it is the equivalent of a coin flip to determine whether tens of thousands of extra people in Baltimore or Boston will be exposed to their advertisement, a flip that will happen after their spots are purchased and produced.
Now, back to the field, where Jim Nantz on CBS is announcing the final results of this experiment.
Here comes Billy Cundiff, to tie this game, and, in all likelihood, send it to overtime. The last two years, sixteen of sixteen on field goals. Thirty-two yards to tie it. And the kick. Look out! Look out! It’s no good. . . . And the Patriots take the knee and will now take the journey to Indianapolis. They’re heading to Super Bowl Forty-Six.
Two weeks later, Super Bowl XLVI would score a 60.3 audience share in Boston and a 50.2 share in Baltimore. Sixty thousand more people in Boston would watch the 2012 advertisements.
The next year, the same two teams would meet for the AFC Championship. This time, Baltimore would win. The extra ad exposures for the 2013 Super Bowl advertisements would be seen in Baltimore.
Hal Varian, chief economist at Google; Michael D. Smith, economist at Carnegie Mellon; and I used these two games and all the other Super Bowls from 2004 to 2013 to test whether—and, if so, how much—Super Bowl ads work. Specifically, we looked at whether, when a company advertises a movie during the Super Bowl, it sees a big jump in ticket sales in the cities that had higher viewership for the game.
They indeed do. People in cities of teams that qualify for the Super Bowl attend movies that were advertised during the Super Bowl at a significantly higher rate than do those in cities of teams that just missed qualifying. More people in those cities saw the ad. More people in those cities decided to go to the film.
One alternative explanation might be that having a team in the Super Bowl makes you more likely to go see movies. However, we tested a group of movies that had similar budgets and were released at similar times but that did not advertise in the Super Bowl. There was no increased attendance in the cities of the Super Bowl teams.
Okay, as you might have guessed, advertisements work. This isn’t too surprising.
But it’s not just that they work. The ads were incredibly effective. In fact, when we first saw the results, we double- and triple- and quadruple-checked them to make sure they were right—because the returns were so large. The average movie in our sample paid about $3 million for a Super Bowl ad slot. They got $8.3 million in increased ticket sales, a 2.8-to-1 return on their investment.
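The return figure is just the ratio of the two numbers above:

```python
cost = 3.0           # average price of a Super Bowl ad slot, in $ millions
added_sales = 8.3    # estimated increase in ticket sales, in $ millions

roi = added_sales / cost
print(f"{roi:.1f}-to-1 return")  # 2.8-to-1 return
```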
This result was confirmed by two other economists, Wesley R. Hartmann and Daniel Klapper, who independently and earlier came up with a similar idea. These economists studied beer and soft drink ads run during the Super Bowl, while also utilizing the increased ad exposures in the cities of teams that qualify. They found a 2.5-to-1 return on investment. As expensive as these Super Bowl ads are, our results and theirs suggest they are so effective in upping demand that companies are actually dramatically underpaying for them.
And what does all of this mean for our friends back at the electronics company Levitt had worked with? It’s possible that Super Bowl ads are more cost-effective than other forms of advertising. But at the very least our study does suggest that all that Father’s Day advertising is probably a good idea.
One virtue of the Super Bowl experiment is that it wasn’t necessary to intentionally assign anyone to treatment or control groups. That happened based on the lucky bounces of a football game. It happened, in other words, naturally. Why is that an advantage? Because deliberate randomized, controlled experiments, while super-powerful and easier to run in the digital age, still are not always possible.
Sometimes we can’t get our act together in time. Sometimes, as with that electronics company that didn’t want to run an experiment on its ad campaign, we are too invested in the answer to test it.
Sometimes experiments are impossible. Suppose you are interested in how a country responds to losing a leader. Does it go to war? Does its economy stop functioning? Does nothing much change? Obviously, we can’t just kill a significant number of presidents and prime ministers and see what happens. That would be not only impractical but deeply unethical. Universities have built up, over many decades, institutional review boards (IRBs) that determine if a proposed experiment is ethical.
So if we want to know causal effects in a certain scenario and it is unethical or otherwise unfeasible to do an experiment, what can we do? We can utilize what economists—defining nature broadly enough to include football games—call natural experiments.
For better or worse (okay, clearly worse), there is a huge random component to life. Nobody knows for sure what or who is in charge of the universe. But one thing is clear: whoever is running the show—the laws of quantum mechanics, God, a pimply kid in his underwear simulating the universe on his computer—they, She, or he is not going through IRB approval.
Nature experiments on us all the time. Two people get shot. One bullet stops just short of a vital organ. The other doesn’t. These bad breaks are what make life unfair. But, if it is any consolation, the bad breaks do make life a little easier for economists to study. Economists use the arbitrariness of life to test for causal effects.
Of forty-three American presidents, sixteen have been victims of serious assassination attempts, and four have been killed. The reasons that some lived were essentially random.
Compare John F. Kennedy and Ronald Reagan. Both men had bullets headed directly for their most vulnerable body parts. JFK’s bullet exploded his brain, killing him shortly afterward. Reagan’s bullet stopped centimeters short of his heart, allowing doctors to save his life. Reagan lived, while JFK died, with no rhyme or reason—just luck.
These attempts on leaders’ lives and the arbitrariness with which they live or die are something that happens throughout the world. Compare Akhmad Kadyrov, of Chechnya, and Adolf Hitler, of Germany. Both men were inches away from a fully functioning bomb. Kadyrov died. Hitler had changed his schedule, left the booby-trapped room a few minutes early to catch a train, and thus survived.
And we can use nature’s cold randomness—killing Kennedy but not Reagan—to see what happens, on average, when a country’s leader is assassinated. Two economists, Benjamin F. Jones and Benjamin A. Olken, did just that. The control group here is any country in the years immediately after a near-miss assassination—for example, the United States in the mid-1980s. The treatment group is any country in the years immediately after a completed assassination—for example, the United States in the mid-1960s.
What, then, is the effect of having your leader murdered? Jones and Olken found that successful assassinations dramatically alter world history, taking countries on radically different paths. A new leader causes previously peaceful countries to go to war and previously warring countries to achieve peace. A new leader causes economically booming countries to start busting and economically busting countries to start booming.
In fact, the results of this assassination-based natural experiment overthrew a few decades of conventional wisdom on how countries function. Many economists previously leaned toward the view that leaders largely were impotent figureheads pushed around by external forces. Not so, according to Jones and Olken’s analysis of nature’s experiment.
Many would not consider this examination of assassination attempts on world leaders an example of Big Data. The number of assassinated or almost assassinated leaders in the study was certainly small—as was the number of wars that did or did not result. The economic datasets necessary to characterize the trajectory of an economy were large but for the most part predate digitalization.
Nonetheless, such natural experiments—though now used almost exclusively by economists—are powerful and will take on increasing importance in an era with more, better, and larger datasets. This is a tool that data scientists will not long forgo.
And yes, as should be clear by now, economists are playing a major role in the development of data science. At least I’d like to think so, since that was my training.
Where else can we find natural experiments—in other words, situations where the random course of events places people in treatment and control groups?
The clearest example is a lottery, which is why economists love them—not playing them, which we find irrational, but studying them. If a Ping-Pong ball with a three on it rises to the top, Mr. Jones will be rich. If it’s a ball with a six instead, Mr. Johnson will be.
To test the causal effects of monetary windfalls, economists compare those who win lotteries to those who buy tickets but lose. These studies have generally found that winning the lottery does not make you happy in the short run but does in the long run.*
Economists can also utilize the randomness of lotteries to see how one’s life changes when a neighbor gets rich. The data shows that your neighbor winning the lottery can have an impact on your own life. If your neighbor wins the lottery, for example, you are more likely to buy an expensive car, such as a BMW. Why? Almost certainly, economists maintain, the cause is jealousy after your richer neighbor purchased his own expensive car. Chalk it up to human nature. If Mr. Johnson sees Mr. Jones driving a brand-new BMW, Mr. Johnson wants one, too.
Unfortunately, Mr. Johnson often can’t afford this BMW, which is why economists found that neighbors of lottery winners are significantly more likely to go bankrupt. Keeping up with the Joneses, in this instance, is impossible.
But natural experiments don’t have to be explicitly random, like lotteries. Once you start looking for randomness, you see it everywhere—and can use it to understand how our world works.
Doctors are part of a natural experiment. Every once in a while, the government, for essentially arbitrary reasons, changes the formula it uses to reimburse physicians for Medicare patients. Doctors in some counties see their fees for certain procedures rise. Doctors in other counties see their fees drop.
Two economists—Jeffrey Clemens and Joshua Gottlieb, a former classmate of mine—tested the effects of this arbitrary change. Do doctors always give patients the same care, the care they deem most necessary? Or are they driven by financial incentives?
The data clearly shows that doctors can be motivated by monetary incentives. In counties with higher reimbursements, some doctors order substantially more of the better-reimbursed procedures—more cataract surgeries, colonoscopies, and MRIs, for example.
And then, the big question: do their patients fare better after getting all this extra care? Clemens and Gottlieb reported only “small health impacts.” The authors found no statistically significant impact on mortality. Give stronger financial incentives to doctors to order certain procedures, this natural experiment suggests, and some will order more procedures that don’t make much difference for patients’ health and don’t seem to prolong their lives.
Natural experiments can help answer life-or-death questions. They can also help with questions that, to some young people, feel like life-or-death.
Stuyvesant High School (known as “Stuy”) is housed in a ten-floor, $150 million tan brick building overlooking the Hudson River, a few blocks from the World Trade Center, in lower Manhattan. Stuy is, in a word, impressive. It offers fifty-five Advanced Placement (AP) classes, seven languages, and electives in Jewish history, science fiction, and Asian-American literature. Roughly one-quarter of its graduates are accepted to an Ivy League or similarly prestigious college. Stuyvesant trained Harvard physics professor Lisa Randall, Obama strategist David Axelrod, Academy Award–winning actor Tim Robbins, and novelist Gary Shteyngart. Its commencement speakers have included Bill Clinton, Kofi Annan, and Conan O’Brien.
The only thing more remarkable than Stuyvesant’s offerings and graduates is its cost: zero dollars. It is a public high school and probably the country’s best. Indeed, a recent study used 27 million reviews by 300,000 students and parents to rank every public high school in the United States. Stuy ranked number one. It is no wonder, then, that ambitious, middle-class New York parents and their equally ambitious progeny can become obsessed with Stuy’s brand.
For Ahmed Yilmaz,* the son of an insurance agent and teacher in Queens, Stuy was “the high school.”
“Working-class and immigrant families see Stuy as a way out,” Yilmaz explains. “If your kid goes to Stuy, he is going to go to a legit, top-twenty university. The family will be okay.”
So how can you get into Stuyvesant High School? You have to live in one of the five boroughs of New York City and score above a certain number on the admission exam. That’s it. No recommendations, no essay, no legacy admission, no affirmative action. One day, one test, one score. If your number is above a certain threshold, you’re in.
Each November, approximately 27,000 New York youngsters sit for the admission exam. The competition is brutal. Fewer than 5 percent of those who take the test get into Stuy.
Yilmaz explains that his mother had “worked her ass off” and put what little money she had into his preparation for the test. After months spending every weekday afternoon and full weekends preparing, Yilmaz was confident he would get into Stuy. He still remembers the day he received the envelope with the results. He missed by two questions.
I asked him what it felt like. “What does it feel like,” he responded, “to have your world fall apart when you’re in middle school?”
His consolation prize was hardly shabby—Bronx Science, another exclusive and highly ranked public school. But it was not Stuy. And Yilmaz felt Bronx Science was more a specialty school meant for technical people. Four years later, he was rejected from Princeton. He attended Tufts and has shuffled through a few careers. Today he is a reasonably successful employee at a tech company, although he says his job is “mind-numbing” and not as well compensated as he’d like.
More than a decade later, Yilmaz admits that he sometimes wonders how life would have played out had he gone to Stuy. “Everything would be different,” he says. “Literally, everyone I know would be different.” He wonders if Stuyvesant High School would have led him to higher SAT scores, a university like Princeton or Harvard (both of which he considers significantly better than Tufts), and perhaps more lucrative or fulfilling employment.
For human beings, playing out hypotheticals can be anything from entertainment to self-torture. What would my life be like if I made the move on that girl or that boy? If I took that job? If I went to that school? But these what-ifs seem unanswerable. Life is not a video game. You can’t replay it under different scenarios until you get the results you want.
Milan Kundera, the Czech-born writer, has a pithy quote about this in his novel The Unbearable Lightness of Being: “Human life occurs only once, and the reason we cannot determine which of our decisions are good and which bad is that in a given situation we can make only one decision; we are not granted a second, third or fourth life in which to compare various decisions.”
Yilmaz will never experience a life in which he somehow managed to score two points higher on that test.
But perhaps there’s a way we can gain some insight into how different his life might have been: by studying large numbers of Stuyvesant High School students.
The blunt, naïve methodology would be to compare all the students who went to Stuyvesant and all those who did not. We could analyze how they performed on AP tests and SATs—and what colleges they were accepted into. If we did this, we would find that students who went to Stuyvesant score much higher on standardized tests and get accepted to substantially better universities. But as we’ve seen already in this chapter, this kind of evidence, by itself, is not convincing. Maybe the reason Stuyvesant students perform so much better is that Stuy attracts much better students in the first place. Correlation here does not prove causation.
To test the causal effects of Stuyvesant High School, we need to compare two groups that are almost identical: one that got the Stuy treatment and one that did not. We need a natural experiment. But where can we find it?
The answer: students, like Yilmaz, who scored very, very close to the cutoff necessary to attend Stuyvesant.* Students who just missed the cutoff are the control group; students who just made the cut are the treatment group.
There is little reason to suspect students on either side of the cutoff differ much in talent or drive. What, after all, causes a person to score just a point or two higher on a test than another? Maybe the lower-scoring one slept ten minutes too little or ate a less nutritious breakfast. Maybe the higher-scoring one had remembered a particularly difficult word on the test from a conversation she had with her grandmother three years earlier.
In fact, this category of natural experiments—utilizing sharp numerical cutoffs—is so powerful that it has its own name among economists: regression discontinuity. Anytime there is a precise number that divides people into two different groups—a discontinuity—economists can compare—or regress—the outcomes of people very, very close to the cutoff.
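The logic of a regression discontinuity can be sketched in a few lines of code. The numbers below are entirely made up for illustration: we simulate applicants whose later outcomes depend only on underlying ability, so the “school” has no true effect, and then compare the naive admitted-versus-rejected gap with the near-cutoff gap.

```python
import random

random.seed(0)

CUTOFF = 560     # hypothetical admission threshold (not Stuy's real cutoff)
BANDWIDTH = 5    # how close to the cutoff counts as "just missed" / "just made it"

# Simulate applicants whose later outcome depends only on underlying ability,
# not on which side of the cutoff they land on -- i.e., zero treatment effect.
applicants = []
for _ in range(100_000):
    ability = random.gauss(0, 1)
    exam = 500 + 60 * ability + random.gauss(0, 20)       # noisy admission exam
    outcome = 1200 + 150 * ability + random.gauss(0, 50)  # e.g., a later SAT score
    applicants.append((exam, outcome))

def mean(xs):
    return sum(xs) / len(xs)

# Naive comparison: everyone admitted vs. everyone rejected.
admitted = [o for e, o in applicants if e >= CUTOFF]
rejected = [o for e, o in applicants if e < CUTOFF]
naive_gap = mean(admitted) - mean(rejected)

# Regression-discontinuity comparison: only applicants within BANDWIDTH of the cutoff.
just_made = [o for e, o in applicants if CUTOFF <= e < CUTOFF + BANDWIDTH]
just_missed = [o for e, o in applicants if CUTOFF - BANDWIDTH <= e < CUTOFF]
rd_gap = mean(just_made) - mean(just_missed)

print(f"naive gap (admitted vs. rejected): {naive_gap:.0f} points")
print(f"RD gap (just made vs. just missed): {rd_gap:.0f} points")
```

The naive gap comes out large, because admission selects for ability; the near-cutoff gap shrinks toward zero, because the two groups on either side of the threshold are nearly identical. Real regression-discontinuity studies add refinements (local regressions, bandwidth selection), but the core idea is just this comparison.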
Two economists, M. Keith Chen and Jesse Shapiro, took advantage of a sharp cutoff used by federal prisons to test the effects of rough prison conditions on future crime. Federal inmates in the United States are given a score, based on the nature of their crime and their criminal history. The score determines the conditions of their prison stay. Those with a high enough score will go to a high-security correctional facility, which means less contact with other people, less freedom of movement, and likely more violence from guards or other inmates.
Again, it would not be fair to compare the entire universe of prisoners who went to high-security prisons to the entire universe of prisoners who went to low-security prisons. High-security prisons will include more murderers and rapists, low-security prisons more drug offenders and petty thieves.
But those right above or right below the sharp numerical threshold had virtually identical criminal histories and backgrounds. This one measly point, however, meant a very different prison experience.
The result? The economists found that prisoners assigned to harsher conditions were more likely to commit additional crimes once they left. The tough prison conditions, rather than deterring them from crime, hardened them and made them more violent once they returned to the outside world.
So what did such a “regression discontinuity” show for Stuyvesant High School? A team of economists from MIT and Duke—Atila Abdulkadiroğlu, Joshua Angrist, and Parag Pathak—performed the study. They compared the outcomes of New York pupils on both sides of the cutoff. In other words, these economists looked at hundreds of students who, like Yilmaz, missed Stuyvesant by a question or two. They compared them to hundreds of students who had a better test day and made Stuy by a question or two. Their measures of success were AP scores, SAT scores, and the rankings of the colleges they eventually attended.
Their stunning results were made clear by the title they gave the paper: “Elite Illusion.” The effects of Stuyvesant High School? Nil. Nada. Zero. Bupkus. Students on either side of the cutoff ended up with indistinguishable AP scores and indistinguishable SAT scores and attended indistinguishably prestigious universities.
The entire reason that Stuy students achieve more in life than non-Stuy students, the researchers concluded, is that better students attend Stuyvesant in the first place. Stuy does not cause you to perform better on AP tests, do better on your SATs, or end up at a better college.
“The intense competition for exam school seats,” the economists wrote, “does not appear to be justified by improved learning for a broad set of students.”
Why might it not matter which school you go to? Some more stories can help get at the answer. Consider two more students, Sarah Kaufmann and Jessica Eng, two young New Yorkers who both dreamed from an early age of going to Stuy. Kaufmann’s score was just on the cutoff; she made it by one question. “I don’t think anything could be that exciting again,” Kaufmann recalls. Eng’s score was just below the cutoff; she missed by one question. Kaufmann went to her dream school, Stuy. Eng did not.
So how did their lives end up? Both have since had successful, and rewarding, careers—as do most people who score in the top 5 percent of all New Yorkers on tests. Eng, ironically, enjoyed her high school experience more. Bronx Science, which she attended, was the only high school with a Holocaust museum. Eng discovered she loved curation and studied anthropology at Cornell.
Kaufmann felt a little lost at Stuy, where students were heavily focused on grades and she felt there was too much emphasis on testing rather than teaching. She called her experience “definitely a mixed bag.” But it was a learning experience. She realized that, for college, she would apply only to liberal arts schools, which placed more emphasis on teaching. She got accepted to her dream school, Wesleyan University. There she found a passion for helping others, and she is now a public interest lawyer.
People adapt to their experience, and people who are going to be successful find advantages in any situation. The factors that make you successful are your talent and your drive. They are not who gives your commencement speech or other advantages that the biggest name-brand schools offer.
This is only one study, and it is probably weakened by the fact that most of the students who just missed the Stuyvesant cutoff ended up at another fine school. But there is growing evidence that, while going to a good school is important, there is little gained from going to the greatest possible school.
Take college. Does it matter if you go to one of the best universities in the world, such as Harvard, or a solid school such as Penn State?
Once again, there is a clear correlation between the ranking of one’s school and how much money people make. Ten years into their careers, the average graduate of Harvard makes $123,000. The average graduate of Penn State makes $87,800.
But this correlation does not imply causation.
Two economists, Stacy Dale and Alan B. Krueger, thought of an ingenious way to test the causal role of elite universities on the future earning potential of their graduates. They had a large dataset that tracked a whole host of information on high school students, including where they applied to college, where they were accepted to college, where they attended college, their family background, and their income as adults.
To get a treatment and control group, Dale and Krueger compared students with similar backgrounds who were accepted by the same schools but chose different ones. Some students who got into Harvard attended Penn State—perhaps to be nearer to a girlfriend or boyfriend or because there was a professor they wanted to study under. These students, in other words, were just as talented, according to admissions committees, as those who went to Harvard. But they had different educational experiences.
So when two students, from similar backgrounds, both got into Harvard but one chose Penn State, what happened? The researchers’ results were just as stunning as those on Stuyvesant High School. Those students ended up with more or less the same incomes in their careers. If future salary is the measure, similar students accepted to similarly prestigious schools who choose to attend different schools end up in about the same place.
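The heart of the Dale–Krueger design is grouping students by the set of schools that accepted them, then comparing outcomes within each group. A minimal sketch, using entirely made-up names and incomes (not the study's actual data), might look like this:

```python
from collections import defaultdict
from statistics import mean

# Toy records: (schools accepted to, school attended, income years later).
# All values are invented for illustration only.
students = [
    ({"Harvard", "Penn State"}, "Harvard",    123_000),
    ({"Harvard", "Penn State"}, "Penn State", 119_000),
    ({"Harvard", "Penn State"}, "Harvard",    118_000),
    ({"Harvard", "Penn State"}, "Penn State", 125_000),
    ({"Penn State"},            "Penn State",  88_000),
    ({"Penn State"},            "Penn State",  87_000),
]

# Group students by the exact set of schools that accepted them, so that within
# a group everyone looked equally promising to admissions committees.
groups = defaultdict(lambda: defaultdict(list))
for accepted, attended, income in students:
    groups[frozenset(accepted)][attended].append(income)

# Compare incomes only within groups whose members chose different schools.
for accepted, by_school in groups.items():
    if len(by_school) > 1:
        for school, incomes in sorted(by_school.items()):
            print(f"accepted by {sorted(accepted)}, attended {school}: "
                  f"mean income ${mean(incomes):,.0f}")
```

The key design choice is that acceptance sets, not attendance, define the comparison groups: admissions committees have already judged everyone in a group to be similarly talented, so any remaining income gap between those who chose different schools is a cleaner estimate of the school's causal effect.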
Our newspapers are peppered with articles about hugely successful people who attended Ivy League schools: people like Microsoft founder Bill Gates and Facebook founders Mark Zuckerberg and Dustin Moskovitz, all of whom attended Harvard. (Granted, they all dropped out, raising additional questions about the value of an Ivy League education.)
There are also stories of people who were talented enough to get accepted to an Ivy League school, chose to attend a less prestigious school, and had extremely successful lives: people like Warren Buffett, who started at the Wharton School at the University of Pennsylvania, an Ivy League business school, but transferred to the University of Nebraska–Lincoln because it was cheaper, he hated Philadelphia, and he thought the Wharton classes were boring. The data suggests, earnings-wise at least, that choosing to attend a less prestigious school is a fine decision, for Buffett and others.
This book is called Everybody Lies. By this, I mostly mean that people lie—to friends, to surveys, and to themselves—to make themselves look better.
But the world also lies to us by presenting us with faulty, misleading data. The world shows us a huge number of successful Harvard graduates but fewer successful Penn State graduates, and we assume that there is a huge advantage to going to Harvard.
By cleverly making sense of nature’s experiments, we can correctly make sense of the world’s data—to find what’s really useful and what is not.
Natural experiments relate to the previous chapter, as well. They often require zooming in—on the treatment and control groups: the cities in the Super Bowl experiment, the counties in the Medicare pricing experiment, the students close to the cutoff in the Stuyvesant experiment. And zooming in, as discussed in the previous chapter, often requires large, comprehensive datasets—of the type that are increasingly available as the world is digitized. Since we don’t know when nature will choose to run her experiments, we can’t set up a small survey to measure the results. We need a lot of existing data to learn from these interventions. We need Big Data.
There is one more important point to make about the experiments—either our own or those of nature—detailed in this chapter. Much of this book has focused on understanding the world—how much racism cost Obama, how many men are really gay, how insecure men and women are about their bodies. But these controlled or natural experiments have a more practical bent. They aim to improve our decision making, to help us learn interventions that work and those that do not.
Companies can learn how to get more customers. The government can learn how to use reimbursement to best motivate doctors. Students can learn what schools will prove most valuable. These experiments demonstrate the potential of Big Data to replace guesses, conventional wisdom, and shoddy correlations with what actually works—causally.