
Researchers show ease of finding dubious results

Would you believe a scientific paper that said listening to the Beatles song "When I'm 64" made people get younger?

Researchers set up an experiment using the well-known Beatles song to show how too many dubious studies are getting published in respected journals. (AP Photo/Lefteris Pitarakis)

It was tested by experiment, and the result came out with "statistical significance," which is the gold standard for incorporating new findings into the established scientific literature.

The point of the experiment was not really to test the youth-restoring effects of the song, but to show how too many dubious studies in social sciences are getting published in respected journals. (If it were really true, Paul McCartney would get even richer.)

Wharton researcher Uri Simonsohn constructed the song experiment as a sort of test case, along with colleagues at the University of Pennsylvania and the University of California, Berkeley. They conducted a similar experiment showing that listening to the children's song "Hot Potato" made people feel older.

The researchers got both results using well-accepted practices for collecting, parsing, and analyzing statistical data, and both easily met qualifications for acceptance in peer-reviewed journals, said Simonsohn, a coauthor of the paper.

"It's unacceptably easy to publish statistically significant evidence consistent with any hypothesis," he said.

The problem isn't just that samples are too small, Simonsohn said. Researchers are also picking through their data in ways that make it too easy for them to find statistical flukes that look like real patterns. Current reporting practice allows this to go undetected.

"We didn't do a single thing that people aren't allowed to do all the time," he said.

In response, other scientists are looking at programs to reform their fields. Although Simonsohn doesn't want to point the finger at any of his colleagues, other researchers pointed to a number of studies, including one that purported to prove people can see the future with ESP.

University of Virginia psychologist Brian Nosek blames the publish-or-perish culture in science. To be successful, scientists have to get published in major journals, he said, and to be considered, a study needs more than just an important question and a rigorous design. The results also have to be surprising. And so, he said, scientists are motivated to sort through their data until they find something that looks surprising.

Such practices might explain the ESP study. There, Cornell University researcher Daryl Bem asked subjects to memorize some words, and one group was shown the words again after they'd all turned in their tests.

The group seeing the words after the fact did better - a result that supposedly proved there's ESP, even though the finding violated not only common sense but established physics. In fact, it was the product of mere coincidence, and the findings were debunked in March, when another team tried to replicate the finding and couldn't.

But most results never get such a second look, Nosek said. Unfortunately, he said, there's no ethic of replicating studies in psychology, so other equally flawed studies are incorporated into established wisdom.

"Lots of things are wrong, but we don't know which things," he said.

To change the situation, he's teamed up with 50 other psychologists to launch a project called the Open Science Collaboration. Their goal is to systematically repeat psychological experiments.

Simonsohn's approach was to create a couple of test cases to see what kinds of results could get through the filter.

He, Joseph Simmons of Penn, and Leif Nelson of Berkeley used a group of 20 student subjects from Penn, and in one study, had them listen to either "Kalimba," an instrumental song that comes free with Windows 7, or "Hot Potato." Then the subjects were asked how old they felt.

The results: They felt older after listening to "Hot Potato."

Did they really feel older? By the standard test, yes: the difference was statistically significant.

Using the exact same statistical methods, the team showed that "people were nearly a year and a half younger after listening to 'When I'm 64.'"

The way Simonsohn analyzed the data, the subjects didn't just feel younger - the study found they actually were younger - an impossibility.

Simonsohn took a statistical fluke (younger subjects happened to be in the group that listened to "When I'm 64"), played with the data to make it look statistically significant, and added the assumption that the song caused the subjects in that one group to get younger - a standard type of assumption for such experiments.

The problem is in the flexibility many scientists use when choosing and analyzing data, he said.

If you're collecting data on reaction times, he said, you might throw out extremely fast or slow times, but how do you honestly decide which data are outliers?

Scientists often decide when they have enough samples to stop taking data. Instead of going through 100 trials, say, they stop as soon as they get a statistically significant result, he said. (This is akin to leaving a poker game just after a lucky hand won you a big pot and put you ahead. There's a reason this is considered bad practice.)
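How much damage this "stop when significant" habit does can be shown with a quick simulation (a sketch for illustration, not Simonsohn's actual code; the test statistic and cutoff are simplified):

```python
import math
import random

def t_stat(xs):
    """One-sample t statistic against a true mean of zero."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean / math.sqrt(var / n)

def one_study(rng, n_max=100, peek=False, crit=2.0):
    """Simulate a null study: the data are pure noise, so any
    'effect' found is a fluke.

    With peek=True, the experimenter tests after every new subject
    (starting at n=10) and stops as soon as the result looks
    significant; with peek=False, the test runs once at n=n_max.
    """
    xs = [rng.gauss(0, 1) for _ in range(10)]
    while len(xs) < n_max:
        if peek and abs(t_stat(xs)) > crit:
            return True  # stopped early on a lucky streak
        xs.append(rng.gauss(0, 1))
    return abs(t_stat(xs)) > crit

trials = 2000
rng = random.Random(1)
fixed_rate = sum(one_study(rng) for _ in range(trials)) / trials
rng = random.Random(1)
peek_rate = sum(one_study(rng, peek=True) for _ in range(trials)) / trials

# The honest design stays near the nominal 5% false-positive rate;
# peeking after every subject pushes it several times higher.
print(f"fixed n: {fixed_rate:.3f}   peeking: {peek_rate:.3f}")
```

The only difference between the two runs is when the experimenter is allowed to look at the data, yet the peeking version "finds" effects in noise far more than 5 percent of the time.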

Scientists are allowed to drop measurements and report only the ones that fit their conclusion. They can try 20 or even 50 analyses and report only the ones that worked, Simonsohn said. Or they can cherry pick variables out of many possibilities.

Simonsohn's paper also included a simulation showing various accepted ways to manipulate - or fudge - a set of data. When scientists were able to choose from different variables, throw out data, add data, and control for gender, the rate of false positives rose to 61 percent.
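A crude version of that simulation (an illustration under simplifying assumptions, not the paper's actual procedure: here each analysis is an independent look at null data, rather than the overlapping choices of variables, covariates, and exclusions the paper modeled) lands in the same range, since 20 tries at a 5 percent threshold succeed about 1 − 0.95²⁰ ≈ 64 percent of the time:

```python
import random

def p_hacked_study(rng, n=20, analyses=20):
    """One null 'study' in which the researcher runs many analyses
    and reports success if ANY of them comes out significant.

    Each analysis is an independent look at pure noise; a real
    p-hacker's analyses overlap more, but the inflation is similar
    in spirit.
    """
    for _ in range(analyses):
        # crude two-sided z-test on the mean of n standard normals
        z = sum(rng.gauss(0, 1) for _ in range(n)) / n ** 0.5
        if abs(z) > 1.96:
            return True  # "found" an effect that isn't there
    return False

rng = random.Random(7)
trials = 2000
rate = sum(p_hacked_study(rng) for _ in range(trials)) / trials

# With 20 shots at a 5% target, roughly two-thirds of null studies
# yield at least one "significant" result to report.
print(f"false-positive rate with 20 analyses: {rate:.2f}")
```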

As a remedy, he suggests that scientific journals require more complete disclosure of how an experiment was really conducted. That way reviewers and journalists could better judge the study's validity.

Too much flexibility in data collection is not unique to psychology, said Stanford University researcher John Ioannidis, who has become a reformer of the way research is reported in medicine.

"We want to find new associations and correlations," he said. "The problem comes when the reported result is just a select view of what's been done and people are not aware of this."

In psychology and medicine, it's hard to apply the standard wisdom advocated by famed science communicator Carl Sagan - that extraordinary results require extraordinary evidence.

In physics, there's a concrete theoretical framework that makes it easier to identify what's extraordinary. A good example came up last year, when an experiment appeared to show that particles called neutrinos move faster than the speed of light. This would violate Einstein's relativity, and so it was considered a big enough deal that the community held off accepting the experiment without verification. The neutrinos failed to exceed the cosmic speed limit in subsequent experiments.

"For many scientific discoveries, we have no way to know if they make sense or they're absurd," Ioannidis said.

Despite uncovering a number of dubious studies in the medical literature, Ioannidis believes the scientific method is working. It's just that individual scientists have trouble applying it, and scientific communities don't always recognize when it's not properly followed.

"The glory of science is that it has that inbuilt uncertainty - it's not dogma," he said. "Observations and results can be refuted."