The Great Statistical Schism
What is probability? This sounds like a discussion question for a philosophy class, one of those questions that's fun to think about but that doesn't have many practical consequences. Surprisingly, this is not the case. As it turns out, different answers to this question lead to completely different views of how to do statistics and data analysis in practice. In the early 20th century, this led to a split in the field of statistics, with intense debates taking place about whose methods and ways of thinking were better. Unfortunately, the wrong side won the debate and their ideas still dominate mainstream statistics, a situation which has exacerbated the reproducibility crises affecting science today [1].
Here's a common, standard statistical inference problem. An old drug successfully treats 70% of patients. To test a new drug, researchers give it to 100 patients, 83 of whom recover. Based on this evidence, how certain should we be that the new drug is worse than, identical to, or better than the old one?
If you think it is legitimate to use the mathematics of probability theory to study the concept of plausibility, you are a "Bayesian" (after Reverend Thomas Bayes, one of the first people to use probability this way). Faced with this question, your job is to use probability theory to calculate how plausible it is that the new drug is much worse, slightly worse, identical to, slightly better, or much better than the old one, taking into account the result of the experiment. The answer you get depends on both the experimental result itself and what you assume about how plausible all of those hypotheses were before you knew the result of the experiment: the dreaded "prior". Under standard assumptions, the "posterior" probability that the new drug is better than the old one is 0.89, the probability that it's the same is 0.11 (starting from a prior of 0.5), and the probability that it's worse is practically zero.
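The "standard assumptions" are not spelled out here, so the following Python sketch makes its own, and they should be read as assumptions rather than as the exact calculation behind the figures above: patients recover independently (a binomial model), the hypothesis "same" means a success rate of exactly 0.7 with prior probability 0.5, and the remaining 0.5 of prior probability is spread uniformly over all success rates between 0 and 1. All variable names are illustrative.

```python
from math import comb

n, k = 100, 83        # patients treated, patients who recovered
theta_old = 0.7       # success rate of the old drug
prior_same = 0.5      # prior probability that the new drug is identical

def likelihood(theta):
    """Binomial probability of k recoveries out of n at success rate theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# ASSUMPTION: under "not the same", theta has a uniform prior on (0, 1).
# Integrate the likelihood against that prior with a midpoint rule,
# keeping the "worse" (theta < 0.7) and "better" (theta > 0.7) parts apart.
grid_size = 20_000
width = 1.0 / grid_size
grid = [(i + 0.5) * width for i in range(grid_size)]
evidence_worse = sum(likelihood(t) for t in grid if t < theta_old) * width
evidence_better = sum(likelihood(t) for t in grid if t > theta_old) * width

# Bayes' rule: each hypothesis is weighted by prior times likelihood.
weight_same = prior_same * likelihood(theta_old)
weight_worse = (1 - prior_same) * evidence_worse
weight_better = (1 - prior_same) * evidence_better
total = weight_same + weight_worse + weight_better

post_same = weight_same / total
post_worse = weight_worse / total
post_better = weight_better / total

print(f"same: {post_same:.2f}  better: {post_better:.2f}  worse: {post_worse:.4f}")
```

Under these assumptions the sketch gives roughly 0.11 for "same" and 0.89 for "better", with "worse" well below 0.01. A different prior for the alternative would give different numbers, which is exactly the dependence on the prior described above.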
To a "frequentist", there's only one legitimate application of probabilities to the real world: to relative frequencies or proportions (plausibility is off limits because it's too vague to be subject to concrete mathematical rules). Think of the joke "84% of statistics are made up". It is true that probability theory works here. But can we apply this to the drug example? We can, but it requires some imagination. The standard method is to imagine repeating the experiment many times. If we did this, the number of patients who recovered would not be 83 every single time. Sometimes we'd just happen to get more patients who recover (for whatever reason) and sometimes we'd get fewer. Interestingly, by "repeating the experiment" we actually mean "repeating a similar experiment"; nobody has the power to repeat anything down to every detail. With this concept of repetitions, we can talk about what fraction of the time a certain experimental result would be observed. If our observed result is unusual under the assumption of some hypothesis, we have evidence against that hypothesis. We just need to figure out how to quantify how unusual our result is.
For example, consider the "null" hypothesis that the new drug has exactly the same effectiveness as the old drug. Then, under standard assumptions, we'd typically observe about 70 or so of the patients recovering. The fraction of the time we'd observe exactly 83 recoveries (our actual result) is 0.0012. This number is pretty small, but in general, any actual experimental result is so specific it would hardly ever be observed again. Therefore, 0.0012 can't be a good way of quantifying how unusual our result is. To "solve" this problem, the influential mathematical biologist R. A. Fisher suggested calculating the fraction of the time we would see our exact result or one that's even more different from what the null hypothesis would predict. In our case, we need the fraction of the time we'd see 83 or more, or 57 or fewer recoveries, which is 0.006. This quantity is called a p-value. An oft-criticised convention is that a p-value less than 0.05 implies evidence against the null hypothesis, and our result of 0.006 certainly qualifies as such.
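Both numbers in this paragraph can be checked with a few lines of Python. This is a minimal sketch that assumes the 100 patients recover independently, each with probability 0.7 under the null hypothesis (a binomial model); the variable names are illustrative.

```python
from math import comb

n = 100             # patients in the trial
observed = 83       # recoveries actually seen
theta_null = 0.7    # success rate under the null hypothesis (same as old drug)

def binom_pmf(k):
    """Probability of exactly k recoveries out of n under the null hypothesis."""
    return comb(n, k) * theta_null**k * (1 - theta_null)**(n - k)

# Fraction of imagined repetitions that give exactly the observed result.
prob_exact = binom_pmf(observed)

# Fisher's p-value: results at least as far from the expected 70 as 83 is,
# in either direction, i.e. 83 or more, or 57 or fewer.
upper_tail = sum(binom_pmf(k) for k in range(83, n + 1))
lower_tail = sum(binom_pmf(k) for k in range(0, 58))
p_value = upper_tail + lower_tail

print(f"P(exactly 83) = {prob_exact:.4f}")
print(f"two-sided p-value = {p_value:.3f}")
```

The two tails are summed separately because the binomial distribution with a success rate of 0.7 is not symmetric: the lower tail (57 or fewer) contributes more than the upper tail (83 or more).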
Look at the numbers produced by the two different methods. The frequentist p-value of 0.006 is much lower than the probability the new drug is the same, 0.11. The two quantities are defined and calculated differently, so of course they're different. But look how big the difference is. Doesn't 0.006 just feel more compelling? It's 6 out of 1000, so close to zero! The Bayesian probability of 0.11 (that the drugs are the same) is a much more moderate number. It's quite common for the frequentist's p-value to be much closer to zero than the Bayesian's posterior probability. So any time you see a scientific paper claiming a result with p < 0.05, this is nothing like 95% certainty that the result is real.
This is one reason why so many experimental results aren't being reproduced; the evidence for them was never strong anyway. In fact, many studies reporting p < 0.05 may actually have found evidence in favour of the null hypothesis, meaning a "failed" replication attempt is actually a success!
Good statisticians know that p-values and posterior probabilities are not the same thing, and try to impress it upon their students and scientific colleagues, often without much success. However, a simpler solution exists. If we want to know the plausibility of a hypothesis, why don't we just calculate it? If you want to know X, calculate X. Don't calculate Y instead, try to use it for insight about X, and then complain about all the people interpreting Y as if it were X. It's not their fault.
The main criticism of Bayesian methods is that they're subjective; the results depend not just on the data being analysed but also on the other ingredients you must supply, the prior probabilities (how plausible were your hypotheses before taking into account the data?). It's true these extra assumptions are needed, but this is just accepting reality. For example, if we had assumed the new drug's effectiveness is probably extremely close to that of the old drug, the data would have been uninformative about the question of "equal or not", and the posterior probability would have equalled the prior probability. This is not a problem. It is logic.
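The effect of the prior can be illustrated with a short Python sketch. It is not a calculation from this article: it assumes a binomial model, a prior probability of 0.5 that the drugs are identical, and, for the "not identical" case, a uniform prior over a chosen range of success rates. The function `posterior_same` and the two ranges tried below are hypothetical choices made only for illustration.

```python
from math import comb

n, k = 100, 83        # patients treated, patients who recovered
theta_old = 0.7       # success rate of the old drug
prior_same = 0.5      # prior probability that the drugs are identical

def likelihood(theta):
    """Binomial probability of k recoveries out of n at success rate theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def posterior_same(lo, hi, steps=20_000):
    """Posterior probability of 'same' when, under 'not the same',
    the success rate is assumed uniform on the interval (lo, hi)."""
    width = (hi - lo) / steps
    # Midpoint-rule average of the likelihood over the interval.
    avg_like = sum(likelihood(lo + (i + 0.5) * width) for i in range(steps)) / steps
    same = prior_same * likelihood(theta_old)
    different = (1 - prior_same) * avg_like
    return same / (same + different)

wide = posterior_same(0.0, 1.0)        # new drug could be anywhere
narrow = posterior_same(0.699, 0.701)  # new drug assumed almost identical

print(f"wide alternative:   P(same | data) = {wide:.2f}")
print(f"narrow alternative: P(same | data) = {narrow:.2f}")
```

With the wide range the data pull the probability of "same" down to roughly 0.11; with the narrow range it stays at about 0.5, the prior value, because 83 recoveries are almost equally probable under "same" and under "very slightly different".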
Things have improved markedly for Bayesian statisticians over the last few decades. These days, you're unlikely to encounter any hostility by doing a Bayesian analysis. Pragmatism is the most popular attitude. It's possible to do interesting and useful analyses using tools arising from frequentist thinking, Bayesian thinking, creative invention, or a mixture of all of these, and most statisticians are happy to do so.
One downside of this ecumenicalism is a reluctance to ask fundamental questions: having a strong opinion on this matter has gone out of fashion. Who's to say one statistical philosophy is better than another? Aren't all statistical philosophies equally valid paths to good data analysis? Frequentism is "true for me". As in religion, so in statistics. If you criticise a colleague for using p-values when posterior probabilities are clearly more appropriate, it will lead to accusations of being a "zealot" [2] who should stop "crusading".
A year ago I went to a talk by prominent "skeptic" Steven Novella, in which he advocated for a Bayesian approach to judging the plausibility of medical hypotheses. During the question and answer session, a statistics department colleague of mine raised his hand and said Bayesian statistics was "bullshit" because degrees of plausibility are not empirically measurable. I disagree strongly, but it was refreshing to see someone willing to argue for a view.
Another, more consequential downside is a reluctance to abandon bad ideas. Frequentist confidence intervals and p-values should still be taught to some extent; since so much research is based on them, our students need to know what they are. Other ideas that arose from frequentism, such as maximum likelihood estimation and the notion of "bias", are still useful for obtaining fast approximate results and for computational algorithms respectively. In some difficult research problems they might help if a Bayesian solution is too mathematically or computationally difficult to obtain.
Yet it is simply false to claim that we teach these methods because they're better or easier than Bayesian ones. I would argue they are both worse and more difficult. In physics, undergraduate students learn Newton's ideas about gravity before Einstein's, because they're much easier conceptually and mathematically, and give the right answer on many problems even though Einstein's theory is more correct. If Einstein's theory were easier (as well as being more accurate), teaching Newton's would be silly. Yet that's the way most statistics curricula are structured. The only reason statisticians think frequentist ideas are easier is that they are used to them. The only reason they think Bayesian ideas must wait until graduate school is that the lower-level textbooks haven't been written (although some easier material exists, such as the books by Allen Downey [3] and John Kruschke [4], and my lecture notes [5]).
It's time to change this. Let's teach Bayes first.