Bad Data Analysis and Psychology's Replication Crisis

In 2014, a study published in JAMA Pediatrics linked playing aggressive video games to real-life aggression in a large sample of Singaporean youth. The study attracted considerable news media attention. For instance, a sympathetic article in Time magazine breathlessly reported its findings and suggested that brain imaging research found aggressive games change kids’ brains. But was the evidence from the Singapore study reliable?

In recent years, concerns about the Singapore dataset have grown. UK scholars Andrew Przybylski and Netta Weinstein recently wrote that the way the dataset had been used by the primary authors was problematic. The analyses from the same dataset kept changing across published articles in suspicious ways. These shifting analyses can be a red flag for the data massaging that may produce statistically significant outcomes and hide outcomes that didn’t work out. Such practices may be unintentional or unconscious (scholars are only human after all). But they do suggest that the results could do with further scrutiny.

When the dataset became available to my colleague John Wang and me, we re-analyzed the data using more rigorous methods. We publicly pre-registered our analyses, which meant we couldn’t subsequently alter them to fit our hypotheses. Our results were strikingly different from the 2014 paper: in fact, there was no evidence in the dataset that aggressive game play was related to later aggression at all. So, what happened? How did a dataset come to show links that don’t exist between aggressive video games and youth aggression?

This isn’t the first time this has happened in video game research. Recently, another study appeared to link violent media to irresponsible gun behavior among youth. However, an independent researcher re-analyzed that dataset and found that the questionable elimination of a few inconvenient participants had transformed non-significant results into significant ones. Furthermore, recent brain imaging studies have not supported the claims made in Time’s 2014 article.

The problem is larger than video game research, though. For almost a decade now, psychological science has been undergoing a significant replication crisis in which many previously held truisms are proving to be false. Put simply, many psychological findings that make it into headlines, or even into policy statements issued by professional guilds like the American Psychological Association, are false and rely upon flimsy science. This happens for two reasons. The first is publication bias, wherein studies that find novel, exciting things are preferred to those that find nothing at all. The second is death by press release, or the tendency of psychology to market trivial and unreliable outcomes as having more impact than they actually do.

Publication Bias

The tendency to publish only or mainly statistically significant results and not null findings is due to a perverse incentive structure within academia. Most of our academic jobs depend more on producing research and getting grants than they do on teaching or committee work. This creates the infamous publish or perish structure, in which we either publish lots of science articles or we don’t get tenure, promotions, raises, prestige, etc. Scientific studies typically require an investment of months or even years. As academics, if we invest years in a study, we must get it published or we will lose our jobs or funding.

If journals only publish statistically significant findings, the outcome of that study must be statistically significant. Typically, scholars can choose from dozens or even hundreds of ways to analyze their data. Thus, if we need statistically significant results, we can simply cycle through these different analytical options until we get the outcomes we need to publish. Variations on this include p-hacking (when analyses are run in multiple ways, but only those that produce statistically significant findings are reported) and HARKing (Hypothesizing After the Results are Known). HARKing occurs when scholars run multiple analyses between numerous variables without any particular theory. When they publish their findings, they pretend the statistically significant ones were those they predicted all along. In fact, they are usually the product of random chance and therefore unreliable. This perverse incentive structure is thought to be the source of considerable scientific misconduct.
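A small simulation can show why cycling through analyses is so dangerous. The Python sketch below is an illustration under stated assumptions, not a reconstruction of any study mentioned here: every variable is pure random noise, the outcomes are independent of one another, and the p-value uses a large-sample normal approximation that is only rough for 50 participants. The function names and the default settings are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r, using a normal approximation (an assumption)."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))


def share_of_studies_with_a_hit(n_analyses, n_studies=300, n_people=50,
                                alpha=0.05, seed=1):
    """Fraction of null 'studies' that find at least one significant result.

    Each simulated study correlates one random predictor with n_analyses
    random outcomes, so any 'significant' result is a false positive.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        predictor = [rng.gauss(0, 1) for _ in range(n_people)]
        for _ in range(n_analyses):
            outcome = [rng.gauss(0, 1) for _ in range(n_people)]
            r = pearson_r(predictor, outcome)
            if approx_p_value(r, n_people) < alpha:
                hits += 1  # the study now has something "publishable"
                break
    return hits / n_studies


# Expected false-positive rate for 20 independent tests at alpha = .05
analytic_rate = 1 - (1 - 0.05) ** 20

print("one planned analysis:", share_of_studies_with_a_hit(1))
print("best of 20 analyses: ", share_of_studies_with_a_hit(20))
print("analytic expectation for 20 tests:", round(analytic_rate, 3))
```

With a single planned analysis, only a small share of these noise-only studies reach significance. When a researcher may try 20 analyses and report whichever works, most of them do, even though there is nothing to find.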

So why do journals tend to publish only statistically significant findings? Part of this has to do with the nature of traditional statistical analyses, which made it easy for scholars to dismiss non-significant results as lacking any meaning. Effectively, what is called Null-Hypothesis Significance Testing actually makes it difficult to prove a theory false, which is a way of turning science on its head. But also, statistically significant findings are more exciting, increase readership (this matters to academic journals too), and attract media interest. Dutch scholar Daniël Lakens recently said:

I agree.

Death by Press Release

Death by Press Release occurs when scholars fail to disclose the trivial or unreliable nature of some findings, often through a university press release. The recent imbroglio over whether social media leads to mental health problems in teens is an example. A recent study by Jean Twenge and colleagues claimed that social media use is associated with decreased mental health among youth. Pretty alarming! However, a re-analysis of the data by Oxford researchers found that, although statistically significant, the effect was no greater than the (also significant but obviously nonsense) correlation between eating bananas and mental health or wearing eyeglasses and mental health, neither of which produces anxious think-pieces.

In large samples (the studies mentioned above had hundreds of thousands of participants), very tiny correlations can become “statistically significant” even though the magnitude of the effect is tiny. Usually this magnitude or “effect size” is expressed as the proportion of variance in one variable explained by another. In other words, if the only thing you knew about people was variable X, how accurate would you be in predicting variable Y above chance? So, zero percent variance explained would be literally no better than a random coin toss, whereas 100 percent would be perfect predictive accuracy. The effects for screens on mental health suggest screens account for far less than one percent of variance in mental health, little better than a coin toss. Dr. Twenge has published defenses of her results, arguing that many important medical effects also have a small percentage of variance explained. However, these claims are quite simply based on miscalculations of the medical effects, which are actually much stronger in terms of variance explained. Although these miscalculations were revealed over a decade ago, this scientific urban legend is sometimes repeated by psychologists because it makes psychological research sound much stronger than it actually is.
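The gap between “statistically significant” and “meaningful” is easy to check with a few lines of arithmetic. The Python sketch below is illustrative only: the correlation of 0.01 and both sample sizes are assumed values, not figures taken from the studies discussed above, and `correlation_summary` is a hypothetical helper. It relies on a large-sample normal approximation to the t distribution, so the p-value for the small sample is rough.

```python
import math


def correlation_summary(r, n):
    """Return (variance_explained, two_sided_p) for a Pearson correlation.

    variance_explained is r squared. The p-value uses the usual t statistic
    for a correlation, evaluated with a normal approximation (an assumption
    that is reasonable only for large n).
    """
    variance_explained = r ** 2
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    p_value = math.erfc(abs(t) / math.sqrt(2))
    return variance_explained, p_value


# The same tiny correlation in a small and a very large (hypothetical) sample
for n in (100, 500_000):
    variance_explained, p_value = correlation_summary(0.01, n)
    print(f"n={n}: variance explained = {variance_explained:.4%}, p = {p_value:.3g}")
```

In both cases the correlation explains one hundredth of one percent of the variance. Only the sample size differs, yet the large sample yields a p-value far below .05 while the small one does not. The p-value reflects how much data was collected; it says nothing about whether the effect is big enough to matter.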
