It was the kind of experimental result that one might expect to be trumpeted on every morning chat show. Twenty Penn undergrads were asked, in accordance with a carefully designed methodological protocol, to listen either to The Beatles’ “When I’m Sixty-Four” or “Kalimba,” an instrumental song that comes free with the Windows 7 operating system. Would one song make them feel younger? Or better yet, actually become younger? In a subsequent survey the participants indicated their father’s age, which was used to statistically control for variation in baseline age across the experimental groups. When the researchers performed a standard ANCOVA test (an analysis of covariance) on their data, it revealed the predicted effect: “According to their birth dates, people were nearly a year-and-a-half younger after listening to ‘When I’m Sixty-Four’ … rather than to ‘Kalimba.’” And the result was statistically significant—the be-all and end-all of modern scientific research.
If you haven’t raced across the room to cue up your copy of Sergeant Pepper’s, give yourself a pat on the back for common sense. But give Joseph Simmons and Uri Simonsohn, both associate professors in Wharton’s operations and information management department, credit for laying bare an underappreciated problem in contemporary social-science research: an alarming rate of false-positive results.
Along with Leif Nelson, an associate professor at UC-Berkeley’s Haas School of Business, Simmons and Simonsohn obtained their patently impossible result without doing anything that isn’t common practice in statistical analysis.
The problem boils down to the number of degrees of freedom that researchers allow themselves when analyzing their data. For instance, determining sample size only after beginning data collection—say, running standard statistical analysis on a sample size of 25, and if the result doesn’t quite rise to statistical significance, running it again after 30 subjects, then 35, and so on. Or—as in the sham experiment described above—deciding what variable to use as a statistical control only after the data is in, especially when the data collected enables a choice between several possible control variables.
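That first practice, peeking at the data and stopping as soon as significance appears, is easy to simulate. The Python sketch below is illustrative only (the peek points and helper names are my own, not from the paper): every data point is drawn from a null distribution with no effect at all, yet re-testing at each peek pushes the false-positive rate well past the nominal 5 percent.

```python
import math
import random
import statistics

def p_value(sample):
    """Two-sided one-sample test against a mean of 0, using a normal
    approximation to the t distribution (adequate for a sketch)."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    z = statistics.mean(sample) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeking_study(rng, peeks=(25, 30, 35, 40, 45, 50)):
    """Keep adding subjects and re-testing; stop as soon as p < .05."""
    data = []
    for n in peeks:
        while len(data) < n:
            data.append(rng.gauss(0, 1))  # the null is true: no real effect
        if p_value(data) < 0.05:
            return True  # a false positive, reported with a straight face
    return False

rng = random.Random(42)
runs = 2000
hits = sum(peeking_study(rng) for _ in range(runs))
print(f"false-positive rate with peeking: {hits / runs:.1%}")  # well above 5%
```

Each individual peek carries only a 5 percent error rate; it is the repeated testing of the same growing sample that compounds the error.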
Another problematic practice is testing several experimental conditions (Simmons’ group actually had a third song involved in their experiment) but selectively dropping some of those measures in the statistical analysis. Even controlling for the interaction between gender and the experimental treatment can help skew results toward false significance. Doing all four of those things can generate a staggering 61 percent probability of obtaining a false-positive result, Simmons and his colleagues found when they simulated the combined effect of these practices.
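Selective reporting of measures can be sketched the same way. In the hypothetical setup below (not the authors’ actual design), two conditions differ in nothing, each subject yields two measures, and the researcher gets three shots at significance: measure 1, measure 2, or their average, with only the “winner” reported.

```python
import math
import random
import statistics

def p_value(a, b):
    """Two-sided test of a difference in means via a normal approximation."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def cherry_picked_study(rng, n=20):
    """Two null conditions, two measures each; report whichever analysis 'works'."""
    a1 = [rng.gauss(0, 1) for _ in range(n)]
    a2 = [rng.gauss(0, 1) for _ in range(n)]
    b1 = [rng.gauss(0, 1) for _ in range(n)]
    b2 = [rng.gauss(0, 1) for _ in range(n)]
    avg_a = [(x + y) / 2 for x, y in zip(a1, a2)]
    avg_b = [(x + y) / 2 for x, y in zip(b1, b2)]
    best_p = min(p_value(a1, b1), p_value(a2, b2), p_value(avg_a, avg_b))
    return best_p < 0.05  # true conditions are identical, so any hit is false

rng = random.Random(7)
runs = 2000
hits = sum(cherry_picked_study(rng) for _ in range(runs))
print(f"false-positive rate with measure-picking: {hits / runs:.1%}")
```

Even this modest flexibility more than doubles the nominal error rate; stacking it with optional stopping, dropped conditions, and post-hoc covariates is how the combined rate climbs toward the paper’s 61 percent.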
Exacerbating the problem is the fact that computers have made statistical analysis so easy to perform.
“When you have your statistics software package open,” Simmons says, “you can run it without controls and you can run it with controls. And once you can run it with controls, you can put in as many controls as you want, and you can take out controls, and try it many different ways.”
Which is pretty much what they did. “We just wanted to show, with sort of a ridiculous example, that it’s not hard to run studies and find crazy things.”
In July Simmons spoke with Gazette associate editor Trey Popp about their “experiment”—the results of which, along with an analysis of the problem of false-positive results, were published in Psychological Science.
How did you get interested in this subject?
At the time, I was at Yale and Uri was here at Wharton. We both organized these weekly journal clubs where we read articles and discussed them with our colleagues. And it was getting frustrating, because we were reading the results of these papers and they just seemed impossible.
Why did you focus on unintentionally derived false-positive results?
Despite the recent press about fraud in psychology, we still believe that fraud is extremely rare and that it’s not the main problem in psychology. Most people are well-intentioned. We just started thinking, and we realized that everyone does these little things to their data sets to try to get them to work. And lots of these things can be justified after the fact. You can say, ‘Okay, well, it didn’t work, but I have a lot of outliers here, let me remove those outliers.’ Because we know that outliers are bad, and all these other papers say outliers are bad, and all these other papers say that removing outliers is okay, so now I’ll take mine out. You don’t even remember you made that decision later on. So, you do that and now it works.
How did removing outliers become the standard practice?
In some cases, removing outliers is perfectly acceptable. You could run a study where you asked people: How much are you willing to pay for a new car? And most people give you responses between $10,000 and $50,000. But imagine you have a couple people in your survey who clearly weren’t taking it seriously, and said a million dollars, or $3 million. The way statistics work, that would completely screw up your analysis, and make it nearly impossible for you to find a statistically significant effect. So in some cases it’s perfectly acceptable to remove outliers. The problem is, you have to decide it in advance. And that’s what people don’t do. All of our statistics assume that we’re running one analysis. Once you start doing multiple analyses [by making decisions after the fact], our statistics are not valid any more.
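A toy version of that car survey (with invented numbers) shows the mechanism Simmons is describing: a genuine $4,000 difference between two groups is easily detected in clean data, but a single $3 million joke response inflates the variance so much that the test comes up empty.

```python
import math
import statistics

def p_value(a, b):
    """Two-sided test of a difference in means via a normal approximation."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical willingness-to-pay answers, spread from $10,000 to $33,200
group_a = [10_000 + 800 * i for i in range(30)]
group_b = [x + 4_000 for x in group_a]  # a real $4,000 effect

print(f"clean data:   p = {p_value(group_a, group_b):.3f}")  # below .05

# One respondent wasn't taking the survey seriously
group_a_noisy = group_a + [3_000_000]
print(f"with outlier: p = {p_value(group_a_noisy, group_b):.3f}")  # far above .05
```

Removing that outlier is clearly defensible; the trouble starts when the exclusion rule is invented after seeing which rule makes the result significant.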
What makes it imperative to do just a single analysis?
What we try to do is make sure that our error rate is less than 5 percent. Imagine that you flip a coin five times in a row, and you get a heads each time. That’ll happen 5 percent of the time. Now, if 100 people do it, the probability that someone gets heads five times in a row is really high. And that’s basically the issue.
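The arithmetic behind the coin analogy is worth spelling out. Five heads in a row has probability (1/2)^5, about 3 percent, close to the 5 percent error rate under discussion; but give 100 people a try and a run of five heads becomes nearly certain:

```python
# Probability that a fair coin comes up heads five times in a row
p_five_heads = 0.5 ** 5

# Probability that at least one of 100 independent flippers manages it
p_someone = 1 - (1 - p_five_heads) ** 100

print(f"one flipper:  {p_five_heads:.1%}")  # 3.1%
print(f"100 flippers: {p_someone:.1%}")     # 95.8%
```

Swap “flippers” for “analyses of the same data set” and the inflation of the false-positive rate follows directly.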
So running each new analysis is kind of like giving yourself another chance to flip five heads in a row?
Are the social sciences unique in their susceptibility to this problem?
I don’t think it’s unique at all to the social sciences. We targeted the social sciences because we are in the social sciences, and particularly we care about psychology, and we wrote the article not only to talk about the problem but to talk about ways to fix it. Because we really want psychology to get better and we want this problem to stop. But after publishing our paper, we’ve gotten countless emails [from scientists who recognize this problem in their own disciplines]. I’ve never been emailed more about a paper in my life, and probably never will again. Lots of these emails came from people in totally different disciplines—the biological sciences; we even had archaeologists email us.
Sometimes medical studies [are prone to this]. Nutrition studies definitely do this—they run this big study, and then they look for interesting correlations. Now, if you find those interesting correlations and then publish them after doing that, that’s really bad. As a method of hypothesis generation, it’s probably pretty good. But then you need a confirmatory stage where you go test it for real, and basically do it again. Science should be all about replication.
A lot of people view replication in terms of a separate group of scientists being able to repeat an experiment and obtain the same result. But the incentive for scientists is to make their own discoveries, not go around repeating old experiments. Should the replication phase be built into original research?
Absolutely. I’m definitely changing the way I do things. So if I find something, and I know that I’ve picked around the data, I will immediately go and replicate that again. And if I don’t get it again, I stop studying it. It is amazing—there just hasn’t been, at least in recent history, an emphasis on making sure your findings are replicable and robust before you submit them. Instead there’s much more focus on novelty.
What’s to credit or blame for that shortfall?
When we talked to people about this, they said, ‘Yeah, I know I do these things and I know that they’re not perfect’—but they basically thought that those would have a trivial effect on the false-positive rate. And we thought the effects would be pretty small too, but when we simulated them and saw what those practices actually do to false-positive rates, it was scary—much higher than we expected.
What suggestions do you have to spread this message?
People shouldn’t be allowed to drop measures that they collected; they should at least have to report what they were so that people aren’t cherry-picking measures. They should have to say how they arrived at their final sample size. If they do remove outliers, they should have to report the results with the outliers included—then it’s the review team deciding whether that’s okay, rather than the researchers, who have a conflict of interest because they’re trying to publish. We were really hoping that journals would adopt some of these things. That hasn’t quite happened yet, although discussions are ongoing. The idea is that if you sign something that says, ‘I am reporting every single measure,’ and then later it is discovered that you did not do that, that is fraud.
We think that a lot of the change will come from the bottom up. Scientists are not interested themselves in publishing false papers. They’re interested in doing things right because they are good people and they do care about their profession.