
The Great Statistical Schism

What is probability? This sounds like a discussion question for a philosophy class, one of those questions that’s fun to think about but that doesn’t have many practical consequences. Surprisingly, this is not the case. As it turns out, different answers to this question lead to completely different views of how to do statistics and data analysis in practice. In the early 20th century, this led to a split in the field of statistics, with intense debates taking place about whose methods and ways of thinking were better. Unfortunately, the wrong side won the debate and their ideas still dominate mainstream statistics, a situation which has exacerbated the reproducibility crises affecting science today [1].

Here’s a common, standard statistical inference problem. An old drug successfully treats 70% of patients. To test a new drug, researchers give it to 100 patients, 83 of whom recover. Based on this evidence, how certain should we be that the new drug is worse than, identical to, or better than the old one?

If you think it is legitimate to use the mathematics of probability theory to study the concept of plausibility, you are a ‘Bayesian’ (after Reverend Thomas Bayes, one of the first people to use probability this way). Faced with this question, your job is to use probability theory to calculate how plausible it is that the new drug is much worse, slightly worse, identical to, slightly better, or much better than the old one, taking into account the result of the experiment. The answer you get depends both on the experimental result itself and on what you assume about how plausible all of those hypotheses were before you knew the result of the experiment — the dreaded ‘prior’. Under standard assumptions, the ‘posterior’ probability that the new drug is better than the old one is 0.89, the probability that it’s the same is 0.11 (starting from a prior of 0.5), and the probability that it’s worse is practically zero.
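
For readers who want to see where numbers like these come from, here is a minimal sketch of one such calculation in Python. It assumes a 50/50 prior between ‘the drugs are identical’ and ‘they differ’, with a uniform prior on the new drug’s success rate in the latter case; these are illustrative choices standing in for the ‘standard assumptions’ above, and different priors would give different numbers.

```python
# A minimal sketch of the Bayesian calculation for the drug example.
# Assumed (illustrative) priors: probability 0.5 that the drugs are
# identical, and a uniform prior on the new drug's success rate if not.
import numpy as np
from scipy import stats
from scipy.special import comb, betaln

n, k = 100, 83        # patients treated, patients who recovered
p_old = 0.7           # success rate of the old drug
prior_same = 0.5      # prior probability the drugs are identical

# Probability of observing 83/100 recoveries if the drugs are identical
m_same = stats.binom.pmf(k, n, p_old)                     # about 0.0012

# Average probability of the data if they differ, integrating the
# binomial likelihood over a uniform prior; this works out to 1/(n + 1)
m_diff = comb(n, k) * np.exp(betaln(k + 1, n - k + 1))    # = 1/101

# Bayes' rule: posterior probability that the drugs are identical
post_same = (prior_same * m_same) / (prior_same * m_same +
                                     (1 - prior_same) * m_diff)

# If they differ, the new success rate has a Beta(84, 18) posterior,
# which puts essentially all of its mass above 0.7
post_better = (1 - post_same) * stats.beta.sf(p_old, k + 1, n - k + 1)

print(round(post_same, 2), round(post_better, 2))   # roughly 0.11 and 0.89
```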

To a ‘frequentist’, there’s only one legitimate application of probabilities to the real world: to relative frequencies or proportions (plausibility is off limits because it’s too vague to be subject to concrete mathematical rules). Think of the joke “84% of statistics are made up”. It is true that probability theory works here. But can we apply this to the drug example? We can, but it requires some imagination. The standard method is to imagine repeating the experiment many times. If we did this, the number of patients who recovered would not be 83 every single time. Sometimes we’d just happen to get more patients who recover (for whatever reason) and sometimes we’d get fewer. Interestingly, by ‘repeating the experiment’ we actually mean ‘repeating a similar experiment’ — nobody has the power to repeat anything down to every detail. With this concept of repetitions, we can talk about what fraction of the time a certain experimental result would be observed. If our observed result is unusual under the assumption of some hypothesis, we have evidence against that hypothesis. We just need to figure out how to quantify how unusual our result is.

For example, consider the ‘null’ hypothesis that the new drug has exactly the same effectiveness as the old drug. Then, under standard assumptions we’d typically observe about 70 or so of the patients recovering. The fraction of the time we’d observe 83 recoveries (our actual result) is 0.0012. This number is pretty small, and in general, any actual experimental result is so specific it would hardly ever be observed again. Therefore, 0.0012 can’t be a good way of quantifying how unusual our result is. To “solve” this problem, the influential statistician and geneticist R. A. Fisher suggested calculating the fraction of the time we would see our exact result or one that’s even more different from what the null hypothesis would predict. In our case, we need the fraction of the time we’d see 83 or more, or 57 or fewer recoveries, which is 0.006. This quantity is called a p-value. An oft-criticised convention is that a p-value less than 0.05 implies evidence against the null hypothesis, and our result of 0.006 certainly qualifies as such.
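
For comparison, here is a sketch of that p-value calculation using the same Binomial(100, 0.7) null model; it adds nothing beyond what the paragraph above spells out.

```python
# A sketch of the frequentist calculation: under the null hypothesis,
# the number of recoveries out of 100 is Binomial(100, 0.7).
from scipy import stats

n, k = 100, 83
p_old = 0.7

prob_exact = stats.binom.pmf(k, n, p_old)        # P(exactly 83), about 0.0012

# Results at least as far from the expected 70 as the observed 83:
# 83 or more recoveries, or 57 or fewer
upper_tail = stats.binom.sf(k - 1, n, p_old)     # P(X >= 83)
lower_tail = stats.binom.cdf(57, n, p_old)       # P(X <= 57)
p_value = upper_tail + lower_tail                # about 0.006

print(prob_exact, p_value)
```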

Look at the numbers produced by the two different methods. The frequentist p-value of 0.006 is much lower than the Bayesian probability that the new drug is the same as the old one, 0.11. The two quantities are defined and calculated differently, so of course they’re different. But look how big the difference is. Doesn’t 0.006 just feel more compelling? It’s 6 out of 1000, so close to zero! The Bayesian probability of 0.11 (that the drugs are the same) is a much more moderate number. It’s quite common for the frequentist’s p-value to be much closer to zero than the Bayesian’s posterior probability. So any time you see a scientific paper claiming a result with p < 0.05, remember that this is nothing like 95% certainty that the result is real.

This is one reason why so many experimental results aren’t being reproduced; the evidence for them was never strong anyway. In fact many studies reporting p < 0.05 may actually have found evidence in favour of the null hypothesis, meaning a “failed” replication attempt is actually a success!

Good statisticians know that p-values and posterior probabilities are not the same thing, and try to impress this upon their students and scientific colleagues, often without much success. However, a simpler solution exists. If we want to know the plausibility of a hypothesis, why don’t we just calculate it? If you want to know X, calculate X. Don’t calculate Y instead, try to use it for insight about X, and then complain about all the people interpreting Y as if it were X. It’s not their fault.

The main criticism of Bayesian methods is that they’re subjective; the results depend not just on the data being analysed but also on the other ingredients you must supply, the prior probabilities (how plausible were your hypotheses before taking into account the data?). It’s true these extra assumptions are needed, but this is just accepting reality. For example, if we had assumed the new drug’s effectiveness is probably extremely close to that of the old drug, the data would have been uninformative about the question of “equal or not”, and the posterior probability would have equalled the prior probability. This is not a problem. It is logic.
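
To see that last point concretely, here is a small sketch (again with assumed, illustrative priors) in which the alternative hypothesis gives the new drug’s success rate a Beta prior. With a uniform Beta(1, 1) prior the data are informative, but with a Beta(7000, 3000) prior that pins the rate extremely close to 0.7, the posterior probability of ‘identical’ barely moves from the prior of 0.5.

```python
# Sketch: how the prior on the alternative hypothesis affects the answer.
# The Beta(a, b) priors below are assumed ingredients, not the article's.
import numpy as np
from scipy import stats
from scipy.special import comb, betaln

n, k = 100, 83
p_old = 0.7
prior_same = 0.5

def posterior_prob_same(a, b):
    """Posterior probability that the drugs are identical, when the
    alternative gives the new success rate a Beta(a, b) prior."""
    m_same = stats.binom.pmf(k, n, p_old)
    m_diff = comb(n, k) * np.exp(betaln(a + k, b + n - k) - betaln(a, b))
    return prior_same * m_same / (prior_same * m_same + (1 - prior_same) * m_diff)

print(posterior_prob_same(1, 1))        # ~0.11: uniform prior, data informative
print(posterior_prob_same(7000, 3000))  # ~0.5: alternative squashed onto 0.7,
                                        # so the posterior stays near the prior
```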

Things have improved markedly for Bayesian statisticians over the last few decades. These days, you’re unlikely to encounter any hostility for doing a Bayesian analysis. Pragmatism is the most popular attitude. It’s possible to do interesting and useful analyses using tools arising from frequentist thinking, Bayesian thinking, creative invention, or a mixture of all of these, and most statisticians are happy to do so.

One downside of this ecumenicalism is a reluctance to ask fundamental questions: having a strong opinion on this matter has gone out of fashion. Who’s to say one statistical philosophy is better than another? Aren’t all statistical philosophies equally valid paths to good data analysis? Frequentism is “true for me”. As in religion, so in statistics. If you criticise a colleague for using p-values when posterior probabilities are clearly more appropriate, it will lead to accusations of being a ‘zealot’ [2] who should stop ‘crusading’.

A year ago I went to a talk by prominent ‘skeptic’ Steven Novella, in which he advocated for a Bayesian approach to judging the plausibility of medical hypotheses. During the question and answer session, a statistics department colleague of mine raised his hand and said Bayesian statistics was ‘bullshit’ because degrees of plausibility are not empirically measurable. I disagree strongly, but it was refreshing to see someone willing to argue for a view.

Another, more consequential downside is a reluctance to abandon bad ideas. Frequentist confidence intervals and p-values should still be taught to some extent; since so much research is based on them, our students need to know what they are. Other ideas that arose from frequentism, such as maximum likelihood estimation and the notion of ‘bias’, are still useful for obtaining fast approximate results and for computational algorithms respectively. In some difficult research problems they might help if a Bayesian solution is too mathematically or computationally difficult to obtain.

Yet it is simply false to claim that we teach these methods because they’re better or easier than Bayesian ones. I would argue they are both worse and more difficult. In physics, undergraduate students learn Newton’s ideas about gravity before Einstein’s, because they’re much easier conceptually and mathematically, and give the right answer on many problems even though Einstein’s theory is more correct. If Einstein’s theory were easier (as well as being more accurate), teaching Newton’s would be silly. Yet that’s the way most statistics curricula are structured. The only reason statisticians think frequentist ideas are easier is that they are used to them. The only reason they think Bayesian ideas must wait until graduate school is that the lower-level textbooks haven’t been written (although some easier material exists, such as the books by Allen Downey [3] and John Kruschke [4], and my lecture notes [5]).

It’s time to change this. Let’s teach Bayes first.

Brendon Brewer is a senior lecturer in the Department of Statistics at the University of Auckland. Follow him on Twitter @brendonbrewer

References

  1. http://www.nature.com/news/over-half-of-psychology-studies-fail-reproducibility-test-1.18248
  2. http://simplystatistics.org/2013/11/26/statistical-zealots/
  3. http://greenteapress.com/thinkbayes/
  4. Kruschke, John. Doing Bayesian Data Analysis: A Tutorial Introduction with R. Academic Press, 2010.
  5. http://github.com/eggplantbren/STATS331

21 Comments

  1. So Brendon. You know I don’t fundamentally disagree with you. How do you reconcile firstly that the decision process would be the same – under both circumstances you would hope a practitioner would choose the new drug? Secondly, if we want to quantify the success rate, then the credible interval with a conjugate uniform prior (0.744, 0.890) is essentially the same as that calculated by normal sampling theory (0.755, 0.904).

    Neither interval contains the old success rate, and both are above it, leading the practitioner to (perhaps incorrectly in the frequentist case) conclude that the new method is at least 4-5% better.

    • I don’t think the decision process would be the same, because the success rate is only one factor we should consider. Other factors, including cost and side effects, should also be part of the decision process. To do this kind of cost/risk/benefit analysis, we need the entire distribution of effect size, which is exactly what the Bayesian analysis produces. The conventional analysis provides almost no guidance for this kind of decision making.

    • Richard D. Morey says

      James, the practice of computing a Bayesian credible interval, looking to see whether a value is in it, then accepting/rejecting that value, is incorrect. Or, as Berger (2006) put it, “simply wrong,” though some Bayesians (including Kruschke) advocate it. It is a fundamental confusion between the total belief in the interval (95%), and the probability of its constituents (which are all exactly 0; no value in a continuous interval has *any* probability, so accepting any value is just wrong). This fallacy is a variant of the so-called fallacy of division (“the interval has high probability, thus the values in it do too”).

      In order to “accept” a value in the Bayesian paradigm, that specific value must be given prior probability to begin with. The “reject if outside the interval” is justified as a frequentist procedure, but not as a Bayesian one. So there are real differences here.

      (Berger, J. O. (2006), “Bayes Factors”, in Kotz, S., Balakrishnan, N., Read, C. B., Vidakovic, B., and Johnson, N. L. (eds), Encyclopedia of Statistical Sciences, Second edition, Vol. 1, pp. 378-386, John Wiley & Sons, Hoboken, New Jersey)

      • And Richard’s broader point has been known since Wrinch and Jeffreys (1919, 1921, 1923). That is, for any law to ever possibly gain appreciable final probability it must have been assigned some finite initial probability. Then Jeffreys followed this principle in the development of significance tests (Bayes factors).

        Apparently nobody reads the foundations anymore!

      • Kruschke does not actually fall into the problem you state, because he relies on an interval of practical equivalence around zero (“Region of Practical Equivalence (ROPE)”, Section 5.3 in his book), which means that his null hypothesis is an interval, not just a point, and thus a probability can be ascribed to it.

  2. Hi James,

    I’m not sure I totally understand your first comment about decisions. Perhaps we can discuss that offline over a beer one day in the near future. With regard to confidence intervals, it’s true they produce the same numerical results as Bayesian credible intervals in many circumstances. Those circumstances (flattish priors, Gaussian likelihood function) are very common so pragmatically there won’t be much difference.

    The wrong message to draw from that is that they’re the same concept, and of course you know this. My favourite example is one due to Jaynes (1976) where the whole confidence interval is known from the data to be impossible. Trying to be right 95% of the time is a good goal but in general it conflicts with the goal of reasoning well from the particular data you have.

  3. Pingback: 1 – The Great Statistical Schism - Exploding Ads

  4. Pingback: Article in Quillette | Plausibility Theory

  5. I don’t understand where the 0.5 prior came from in the Bayesian analysis of the drug test. If you’re used to testing 1000’s of compounds for each useful drug you produce, wouldn’t you expect the prior to be much lower? This seems as susceptible to p-hacking as anything that can happen with informed use of confidence intervals. So the frequency approach seems to produce results starting from fewer questionable presumptions.

    • Hi Paul. Good question that really gets to the heart of people’s misgivings about Bayesian inference (I tried to address this briefly in my paragraph beginning “The main criticism…”). A prior probability of 0.5 for some proposition and its negation (e.g. ‘the drugs are the same’ vs. ‘the drugs are different’) is the simplest example of a so-called (I don’t like the name) ‘objective prior’. It implies the prior information has a symmetry, almost as if you don’t know the meaning of the proposition yet. You sometimes hear people joke “either it’s true or it’s not, so it’s 50/50”. That’s what I did.

      Of course, you’re welcome to question that assumption and argue for something different, especially in a complex real life situation. But it’s an inescapable property of reasoning under uncertainty that the answer you’re looking for (how plausible is a proposition now) depends on the prior information you have. Admitting this is a matter of intellectual honesty and transparency. To resolve debates about what the prior probabilities should be, you are allowed to look at everything that is known except the data. If you switch to an alternative calculation on the same data, you are no longer talking about the plausibility of hypotheses.

      • Bertrand Russell once wrote something like this –

        There are exactly two possibilities. Either the next man I meet is named Ebenezer Wilkes Smith or he is not so named. Therefore there is a 50% chance that the next man I meet will be named Ebenezer Wilkes Smith.

  6. Pingback: Stop Teaching Frequentism; More On That “Autism” Study; Etc. | William M. Briggs

  7. Bill R says

    Brendon,

    You “disagree strongly” that degrees of plausibility are not measurable? How does one measure them, then? Or do you mean that there is no need for empirical meaning?

    Bill

    • Hi Bill,

      I disagreed strongly with the notion that not being measurable implied it was bullshit. That sentence could have been more clear. I do think plausibilities are kinda measurable in the ‘subjective Bayesian’ sense (where they interview subject matter experts and use it to assign probabilities that are a good model of what the expert thinks).

  8. Bill R says

    Hi Brendon,

    Thanks for the clarification. If you are using these in a science/engineering context, how do you explain this to hard-core empiricists? Their notions of probability tend to revolve around repeated experiments or the results of process changes, both of which are very frequentist.

    Regards

  9. How were the Bayesian (0.11) and frequentist (0.006) numbers arrived at? Could you give the calculation steps? Thanks.

  10. Thanks for this. Especially the link to your course notes. I’ve printed them out for reading. I’ve wanted to get some idea of Bayesian stuff but, as you say, all the courses are at grad-level and outside my ability.

  11. Pingback: Outside in - Involvements with reality » Blog Archive » Chaos Patch (#89)

  12. I don’t want to be a Debbie Downer, but this post was rather unclear, especially for those without any prior knowledge of statistics. I’ve read many explanations of Bayesian theory before (on Less Wrong, etc), and this one was not clear. If these posts are intended for lay audiences, the writing needs to pass under the eye of a discerning editor.
