Science / Tech, Social Science

Why Citing a Scientific Study Does Not Finish An Argument

“Actually Studies Show…”

Chances are you’ve found yourself in a heated conversation among a group of friends, family, or colleagues when someone throws down the gauntlet: “Actually, studies show…” Some nod in silent agreement, others check their text messages, and finally someone changes the subject.

It’s hard to know what to say when people cite scientific studies to prove their point. Sometimes we know the study and its relative merits. But most of the time we just don’t know enough to confirm or refute the statement that the study is supposed to support. We are floating in a sea of information, and all we can do is flounder around for the nearest buoy to support a view that’s vaguely related to the conversation.

All of us lack the time to understand more than a small fraction of scientific research. For the most part, this works out well: scientists conduct research and publish papers, each new study adds another piece to the puzzle, and bit by bit we steadily increase the total stock of knowledge. Eventually, we hope, journalists and teachers will bring scientific knowledge together and distill it for the general public.

Of course, that’s not always how science works, or how knowledge is spread. A single study is rarely anything more than suggestive, and often it takes many replications under a variety of circumstances to provide strong justification for a conclusion. And yet, poorly supported studies often make their way into newspapers and conversations as if they are ironclad truths.

According to a spate of recent articles, many scientific results are difficult to replicate. The problem has been studied in detail by social psychologists, but it appears to be much more pervasive than initially thought. Some have argued that throughout the sciences most published research findings are false.

Correlations are Cheap, Patterns are Ubiquitous

Science typically involves gathering data, finding interesting correlations, and proposing hypotheses to explain the correlations. For example, suppose we find a set of sick people for whom antibiotics work, and a set for whom they don’t work. We might infer that those for whom antibiotics didn’t work had an antibiotic-resistant strain of a bacterial infection. Or we might think that the patients who didn’t recover had a different disease than those who did recover, perhaps a viral infection for which antibiotics don’t work.

Correlations are everywhere, and given enough data from enough studies, we will find correlations that are surprising and interesting. But as the sick patient example suggests, causation is difficult to infer, and some correlations are flukes that don’t admit of a common cause, or that can’t be consistently replicated.

We are pattern-seeking creatures, and correlations are patterns that cry out for explanation. But sometimes our political views infect our prior beliefs, and these beliefs lead us to look for patterns until we find them. Given enough tests and time, we will find them.
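To see how easily chance alone produces patterns, consider a small simulation (a hypothetical sketch, not drawn from any study in this article): generate one hundred pure-noise variables, test each against an unrelated outcome, and count how many correlations clear the conventional p < 0.05 bar.

```python
import random

random.seed(1)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# One hundred "variables" that are nothing but noise, 30 observations each,
# and an outcome that is equally meaningless.
noise = [[random.gauss(0, 1) for _ in range(30)] for _ in range(100)]
outcome = [random.gauss(0, 1) for _ in range(30)]

# For n = 30, |r| > 0.361 is roughly the p < 0.05 significance threshold.
hits = [i for i, xs in enumerate(noise) if abs(pearson_r(xs, outcome)) > 0.361]
print(f"'significant' correlations found in pure noise: {len(hits)} of 100")
```

On a typical run, a handful of variables cross the threshold, each one a publishable-looking “pattern” with no cause behind it.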

Consider the case of “stereotype threat.” The idea behind stereotype threat is that when a member of a group (e.g. race, sex, or religion) is asked to perform a task, but primed with information about how most people in their group perform that task, they will tend to perform in accordance with the group average rather than according to their own ability.

What the initial studies seemed to suggest was that stereotype threat is not just statistically significant, but large. It appeared as if when blacks were told that the test they were taking was an indicator of intellectual ability, they scored worse than whites. But when told that it was just a problem-solving exercise not indicative of ability, they scored about the same as whites.

Think about why people would be happy to find this result: if all we need to do to improve the outcomes of people in poorly performing groups is to prime them with certain kinds of information (or shield them from other kinds of information, such as negative stereotypes), we could dramatically improve test scores at school and productivity at work.

As you can imagine, the results were too good to be true, and stereotype threat has not stood up especially well to scientific scrutiny. It probably exists in some cases (some people gain or lose confidence when primed with certain kinds of information), but the magnitudes are usually small, and the social implications are unclear. Yet this hasn’t stopped universities and businesses from implementing training programs to combat the alleged evils of stereotype threat in the classroom and in the boardroom.

Publication Bias and Perverse Incentives

Researchers who discuss the “replication crisis” in science often emphasize publication bias and professional incentives as the primary culprits. Publication bias occurs when journal editors or ordinary readers place too much weight on a statistically significant study because they fail to think about the likely failure of many other attempts to find similar results. In other words, scientists often run tests and find data that don’t yield interesting results. But a “null result” is rarely published for the obvious reason that it’s not very surprising or interesting.

Even when a scientific study does find dramatic results, and the results can be replicated, subsequent results are generally less dramatic than the initial study. According to Brian Nosek, a social psychologist at the University of Virginia, we should predict that “replication effect sizes would be smaller than original studies on a routine basis—not because of differences in implementation but because the original study effect sizes are affected by publication and the replications are not.”
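Nosek’s prediction can be illustrated with a toy model (the numbers here are invented for illustration): suppose many labs study the same small true effect, but journals publish only the estimates that clear a “significance” cutoff. The published average will then overstate the effect, while unfiltered replications land back near the truth.

```python
import random
from statistics import mean

random.seed(0)

TRUE_EFFECT = 0.2   # the real (small) effect size every lab is studying
NOISE = 0.5         # sampling noise in each lab's estimate
CUTOFF = 0.8        # estimates below this don't look "significant" enough to publish

# Ten thousand labs each run one noisy study of the same true effect.
estimates = [random.gauss(TRUE_EFFECT, NOISE) for _ in range(10_000)]

# Journals publish only the dramatic results; replications face no such filter.
published = [e for e in estimates if e > CUTOFF]
replications = [random.gauss(TRUE_EFFECT, NOISE) for _ in published]

print(f"true effect:             {TRUE_EFFECT}")
print(f"mean published estimate: {mean(published):.2f}")    # inflated by the filter
print(f"mean replication:        {mean(replications):.2f}")  # near the true effect
```

The filter, not any difference in method, is what makes the replications look “weaker” than the originals.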

Researchers want to find interesting results, and are professionally rewarded for doing so. The rewards come in the form of career advancement, reputational enhancement, and a higher likelihood that journal editors will publish their results. These rewards usually help push science forward. But they can also slow science down rather than speed it up.

Consider the original study in the Lancet that linked the MMR vaccine to autism. The result was juicy, especially to journalists. The study found that taking the MMR vaccine significantly elevated the risk of autism in children, but that not taking the vaccine, or splitting it into three separate vaccines, would lower the risk of autism. If we could tackle autism this easily the world would be a much better place. The study fed into a widespread desire for an easy answer to a hard problem. But the study was wrong, and it took over a decade before the record was corrected.

In this case, the initial study itself turned out to be poorly designed. The publication of the autism study, and its promotion by journalists, probably cost lives as some parents declined to vaccinate their kids, and protested vaccine mandates.

But quite apart from the quality of the autism study, many studies that are reasonably well-designed are hard to replicate, and are probably either false or overblown.


The proliferation of scientific studies and the norms that make scientific journals more likely to publish surprising results than failed replication attempts are unnerving for a couple of reasons. First, politicians pass laws based on studies their advisors cite. Sometimes these laws are silly, and betray an absurd ignorance of science. For example, in 1998 the Governor of Georgia signed a law providing free classical music CDs to expectant mothers in order to boost their children’s IQ. Of course, this was based on, at best, weak evidence, which has no business informing any type of policy.

But sometimes these laws are far-reaching, like the macro-economic policies that governments pursue in a financial crisis. We can easily find studies suggesting that providing a “stimulus” to the economy by increasing government spending in the short run can jump-start the economy during a downturn. But we can also find plenty of studies suggesting that the opposite happens, and these arguments go back to the early days of economics.

The right answer, almost certainly, is that we don’t know. Both sides gather immense amounts of data and weave theory and data into an intricate tapestry, translated into the universal language of math. But the mathematical sophistication of modern economics often gives us the illusion that we know more than we do.

The second reason to worry about the replication problem in science is that it becomes all too easy for teachers, friends, and colleagues (quite apart from politicians) to fool us into accepting a poorly supported conclusion that is intuitively satisfying but ultimately wrong. Malcolm Gladwell is a master of this. He has made a career out of telling stories that make people feel good about themselves by cherry-picking scientific studies that produce surprise and hope rather than fear and anxiety. Yet we should expect science to do both, since the truth doesn’t care about our emotional reactions.

We’re not advising you to commit social suicide by interrupting every conversation with a demand for more evidence. But we do think the phrase “studies show…” should be met with cautious skepticism, especially when the study supports the politically motivated preconceptions of the person who’s talking.




Jonathan Anomaly is a core faculty member of the Freedom Center, and Assistant Professor in the PPEL Program, at the University of Arizona. Brian Boutwell is an Associate Professor of Criminology and Criminal Justice at Saint Louis University. Follow him on Twitter @fsnole1.


  1. Kat says

    why citing the Bible does not finish an argument.

  2. Philip Grimm says

    This article is esoteric and is probably of interest only to highly educated individuals of a certain age. I will bet that both of the authors are under the age of forty.

    As an avid consumer of peer reviewed articles for almost forty years, I can make a few anecdotal observations. My field was diagnostic radiology.

    Half of all journal articles are wrong and are going to be proven wrong within the first six months. Interesting articles produce fads in treatment plans for certain diseases. Probably less than ten percent of those fads will continue more than a year. Fads come and go, it seems to be a merry-go-round. I could give dozens of specific examples, but most are too esoteric for non-radiologists to fully understand. One example: the pros and cons of thoracoplasty. Another, less obvious and perhaps more revealing example is the pros and cons of feeding someone with pancreatitis.

    As for academics, which both of you clearly are: academics don’t know about real life, they know how to memorize, are adept at being surrounded by intellectuals, and can take tests. My experience with writing peer reviewed articles is that the poor schmuck in the lab doing the work spends half his time trying to make the data fit the curve.

    In my humble opinion, quoting or having dueling quotes about conflicting peer reviewed articles is just mental masturbation. It can be fun, but it doesn’t really do anything.

    • Joe Lammers says

      Actually I don’t think this is esoteric at all. We should have a healthy scepticism about a lot of research, particularly social science research that is being used to promote a particular political agenda. A lot of it should be considered junk science. And though I identify as a conservative, I mean this in a non-partisan manner.

  3. Unfortunately I don’t think there is an easy solution to this. There is no set of tips, no pieces of advice, no list of instructions that we can give to people to know when a study is good or bad. In the end, you just have to know what you’re doing (which is why scientists get paid the “big bucks”). We can tell people about logical fallacies, not confusing correlation with causation, and the like, but I’ve seen them all misused/misunderstood by people trying to make their case. What I’m saying is that there is just no substitute for cleverness.

    • Allencic says

      One of the best descriptions of what scientific integrity should be is the last chapter of Richard Feynman’s wonderful book “Surely You’re Joking Mr. Feynman.” The chapter, “Cargo Cult Science” should be required reading of all science students when they take their first course and it should be read again when they finish. If everyone read this they’d realize that most of what passes for serious science is poorly done and just plain unprovable nonsense. The last paragraph is priceless.

  4. Jon says

    I have two science degrees and worked in research at university for a couple of years, and my estimate is that about 90% of all new research papers will turn out to be irrelevant, misguided or just plain wrong. An easy way to test this for yourself is to get hold of some old copies of Scientific American or New Scientist and see what percentage of the ‘striking discoveries’ and ‘remarkable breakthroughs’ they describe have actually had any impact on anything ten years later.

    People should know that the newer science is, the more unreliable it is likely to be. And the publications of sponsored scientists almost always reflect the views of the sponsor.

    But someone who withdraws from an argument just because ‘a study’ says they’re in error is probably not overly committed to their position anyway.

    • As the work of Dr John Ioannidis revealed and as Dr Richard Horton, former editor of The Lancet, and Dr Marcia Angell, former editor of The New England Journal of Medicine, have both stated.

  5. A revealing instance of this problem was the study by Bossio, Pukall and Steele, Examining Penile Sensitivity in Neonatally Circumcised and Intact Men Using Quantitative Sensory Testing, Journal of Urology, Vol. 195 (June 2016), which appeared to show, and was widely reported as proving, that circumcision made no difference to the sensitivity of the penis and thus no difference to male sexual response and experience. For a start, a small-scale study (62 men in this case) cannot prove anything; all it can do, as Anomaly and Boutwell point out, is to make a contribution to a controversial and unsettled question that has been the focus of discussion for some time. Even leaving aside the vagueness of this measure (do the authors mean sensitivity to pain or to pleasure? does sensitivity to pain or heat have much to do with sexual satisfaction?) the media reports seriously misrepresented the findings of the study, which actually showed that men with foreskins had a lower threshold of sensitivity and circumcised men a higher threshold – in other words, that men with foreskins were more sensitive to touch, and circumcised men less sensitive. As the authors admitted in their reply to criticisms of their article published in a later issue of the journal, “the foreskin was observed to be most sensitive to fine touch pressure thresholds.”

    There were effective criticisms of the limitations and inconclusiveness of the paper by Brian Earp published in the Huffington Post and Trends in Men’s Health, but these have attracted far less attention than the media reports, and subsequent reiteration of the claim of “no difference” by the authors on popular science forums. On top of these, three letters commenting on the study were published in the Journal of Urology in December 2016, along with a reply by the authors, in which they concede the validity of some of the criticisms, deplore the politicisation of the debate, and call for objective investigation of this question. I don’t believe that the authors have any prior ideological motivation (i.e. that they desire to promote circumcision), but their claim of impartial scientific interest does seem a little too strident: the give-away is the conclusion that always accompanies studies of this kind – more research needed. The overriding imperative, in other words, is to keep the research grants coming so that the team can keep their jobs. While the question of where penile sensation comes from is quite interesting, however, it cannot be said that it is an urgent priority for health research, not when cures for cancer, dementia, Zika virus, HIV and dangerous new strains of influenza etc remain elusive. Men through the ages have managed to get sexual satisfaction without a full understanding of where it comes from, and it is hard to see how the results of “more research” into the question would bring significant benefit to anybody.

    The letters in reply are difficult to find and behind the journal’s paywall, but have been made available here:

  6. Science, very often, is not worth the paper on which it is printed.

    Former editors of The Lancet and The New England Journal of Medicine have both said much research is wrong, and other research has shown most published research is false.

    Science is a system of enquiry and as long as medical science is dictated to and funded by the pharmaceutical industry there will be no good science and no research worth reading or quoting.

  7. In fact, the Wakefield paper was cautious about an association between vaccination and adverse consequences, but the subsequent press conference was the source of the wilder claims.
    The paper itself was risible because it was a small case series, but it had impact for a reason which was not evident in the paper. The first line was a lie: they were not consecutive patients who had come through the hospital in the normal way. It was fraudulent. I saw the paper a year after publication, before the fraud was uncovered, and simply laughed at the notion that a small case series could be seen as evidence of a mass vaccination effect.

  8. Pingback: Citing scientific studies and the arrogance of ignorance | Open Parachute

  9. Douglas Redosh MD says

    A good article, worthy of thought and reflection, but the comment by rosross above shows an antiscience ignorance that is unfortunately too prevalent in the current administration.

  10. Scientist here. I agree with nearly everything in this piece.

    To interpret the scientific literature one almost always does need actual training — as much as we’d like to believe we’re all well-equipped to reason our way through a research paper in an area we have no background in, looking up words we don’t know on the fly, we’re not. Not by a long shot. Much of graduate school in science — and training to become a scientist more generally — is learning how to read, interrogate, and analyze others’ work. To do so requires not only that you understand the techniques they are using and the extant body of literature on the subject, and that you have a firm grasp of statistics (the vast majority of people lack even a basic understanding), but also that you become practiced in reading hundreds and hundreds of papers. This isn’t a trivial matter, and this point in particular seems to elude non-scientists. Even after one does cultivate genuine expertise in a narrow subject area, the work of others in unrelated fields can be impenetrable. Most of my colleagues are aware of this and exhibit considerable humility about what they are qualified to interpret and what they simply aren’t.

    The only bit I might disagree with here is the warning about “scientism”, which nevertheless wasn’t really addressed in the paragraph below the headline. Yes, there are other things that matter in addition to scientific evidence, and these concerns and values must be (and almost always are) taken seriously. For example, much of bioethics is dominated by religious scholars who lack any scientific training whatsoever but are afforded a seat at the table because their a- or even anti-scientific values are privileged in our society. But the charge of “scientism” is almost always hollow unless one is asserting that either science has no demonstrated epistemic prowess over non-evidence-based means of inquiry and reality-testing (it does), or one’s definition of knowledge or knowing is so impoverished as to maintain that “revelation” is as reliable as well-designed, properly-controlled experiments that yield data which makes possible meaningful statistical inference.

  11. Also, with respect to the point being raised vis a vis the so-called “reproducibility crisis” and the apparent breakthroughs that didn’t live up to the hype, a couple of remarks:

    The “reproducibility crisis” only affects certain fields of science. Biomedical research is often irreproducible not because the scientists are doing a bad job, but because of errors made in the statistical analysis, which many biomedical researchers are not sufficiently trained in — this is changing. Additionally, publication bias (novel, dramatic findings are published whereas negative results are rejected from journals) plays a role. Open-access publishing seeks to redress this complaint but the problem is difficult. Sometimes the irreproducibility is actually down to biology — biology is noisy, and experiments with living organisms are much more complex than those measuring subatomic particles. This doesn’t mean biology isn’t worth doing, it just means it’s harder. Lastly, the hype most often doesn’t come from researchers; it comes from journalists who are trying to get clicks and eyeballs by making over-the-top claims, and university PR departments who are trying to make their institution look good. If you bother to do some reading, you can see that 99% of the time when these claims are being made you can find what the actual researchers said about their own work, and it’s usually humble and cautious. Again, NOT the fault of the researchers, nor a flaw in their work.
