## Decisions are not evidence

The thinking that lead to this post began with trying to write something about what Kline (2013) calls the filter myth. The filter myth is the arguably – in the sense that it depends on who you ask – mistaken belief in NHST practice that the p-value discriminates between effects that are due to chance (null-hypothesis not rejected) and those that are real (null-hypothesis rejected). The question is whether decisions to reject or not reject can serve as evidence for the existence of an effect.

Reading about the filter myth made me wonder whether NHST can be viewed as a screening test (diagnostic test), much like those used in medical practice. The basic idea is that if the screening test for a particular condition gives a positive result, follow-up medical research will be undertaken to figure out whether that condition is actually present. (We can immediately see, by the way, that this metaphor does not really apply to NHST, because the presumed detection of the effect is almost never followed up by trying to figure out whether the effect actually exists, but the detection itself is, unlike the screening test, taken as evidence that the effect really exists; this is simply the filter myth in action).

Let’s focus on two properties of screening tests. The first property is the Positive Likelihood Ratio (PLR). The PLR is the ratio of the probability of a correct detection to the probability of a false alarm. In NHST-as-screening-test, the PLR  equals the ratio of the power of the test to the probability of a type I error: PLR = (1 – β) / α. A high value of the PLR means, basically, that a rejection is more likely to be a rejection of a false null than a rejection of a true null, thus the PLR means that a rejection is more likely to be correct than incorrect.

As an example, if β = .20, and α = . 05, the PLR equals 16. This means that a rejection is 16 times more likely to be correct (the null is false) than incorrect (the null is true).

The second property I focused on is the Negative Likelihood Ratio (NLR). The NLR is the ratio of the frequency of incorrect non-detections to the frequency of correct non-detections. In NHST-as-screening-test, the NLR equals the ratio of the probability of a type II error to the probability of a correct non-rejection: NLR = β / (1 – α). A small value of the NLR means, in essence, that a non-rejection is less likely to occur when the null-hypothesis is false than when it is true.

As an example, if β = .20, and α = . 05, the NLR equals .21. This means that a non-rejection is .21 times more likely (or 4.76 (= 1/.21) times less likely) to occur when the null-hypothesis is false, than when it is true.

The PLR and the NLR can be used to calculate the likelihood ratio of the alternative hypothesis to the null-hypothesis given that you have rejected or given that you have not-rejected, the posterior odds of the alternative to the null. All you need is the likelihood ratio of the alternative to the null before you have made a decision and you multiply this by the PLR after you have rejected, and by the NLR after you have not rejected.

Suppose that we repeatedly (a huge number of times) take a random sample from a population of null-hypotheses in which 60% of them are false and 40% true. If we furthermore assume that a false null means that the alternative must be true, so that the null and the alternative cannot both be false, the prior likelihood of the alternative to the null equals p(H1)/p(H0) = .60/.40 = 1.5. Thus, of all the randomly selected null-hypotheses, the proportion of them that are false is 1.5 times larger than the proportion of  null-hypotheses that are true. Let’s also repeatedly sample (a huge number of times) from the population of decisions. Setting β = .20, and α = . 05, the proportion of rejections equals p(H1)*(1 – β) + p(H0)*α = .60*.80 + .40*.05 = .50 and the proportion of non-rejections equals p(H1)*β + p(H0)*(1 – α) = .60*.20 + .40*.95 = .50. Thus, if we sample repeatedly from the population of decisions 50% of them are rejections and 50% of them are non-rejections.

First, we focus only on the rejections. So, the probability of a rejection is now taken to be 1.0.  The posterior odds of the alternative to the null, given that the probability of a rejection is 1.0, is the prior likelihood ratio multiplied by the PLR: 1.5 * 16 = 24. Thus, we have a huge number of rejections (50% of our huge number of randomly sampled decisions) and within this huge number of rejections the proportion of rejections of false nulls is 24 times larger than the proportion of rejections of true nulls. The proportion of rejections of false nulls equals the posterior odds / (1 + posterior odds) = 24 / 25 = .96. (Interpretation: If we repeatedly sample a null-hypothesis from our huge number of rejected null-hypotheses, 96% of those samples are false null-hypotheses).

Second, we focus only on the non-rejections. Thus, the probability of a non-rejection is now taken to be 1.0. The posterior odds of the alternative to the null, given that the probability of a non-rejection is 1.0, is the prior odds multiplied by the NLR: 1.5 * 0.21 = 0.32. In other words, we have a huge number of non-rejections (50% of our huge sample of randomly selected decisions) and the proportion of non-rejections of false nulls is 0.32 times as large as the proportion of non-rejections of true nulls. The proportion of non-rejections of false nulls equals 0.32 / ( 1 + 0.32) =  .24. (Interpretation: If we repeatedly sample a null-hypothesis from our huge number of non-rejected hypotheses, 24% of them are false nulls).

So, based on the assumptions we made, NHST seems like a pretty good screening test, although in this example NHST is much better at detecting false null-hypothesis than ruling out false alternative hypotheses. But how about the question of decisions as evidence for the reality of an effect? I will first write a little bit about the interpretation of probabilities, then I will show you that decisions are not evidence.

Sometimes, these results are formulated as follows. The probability that the alternative is true given a decision to reject is .96 and the probability that the alternative hypothesis is true given a decision to not-reject  is .24.  If you want to correctly interpret such a statement, you have to keep in mind what “probability” means in the context of this statement, otherwise it is very easy to misinterpret the statement’s meaning. That is why I included interpretations of these results that are consistent with the meaning of the term probability as it used in our example. (In conceptual terms, the limit of the relative frequency of an event (such as reject or not-reject) as the number of random samples (the number of decisions) goes to infinity).

A common (I believe) misinterpretation (given the sampling context described above) is that rejecting a null-hypothesis makes the alternative hypothesis likely to be true. This misinterpretation is easily translated to the incorrect conclusion that a significant test result (that leads to a rejection) makes the alternative hypothesis likely to be true. Or, in other words, that a significant result is some sort of evidence for  the alternative hypothesis (or against the null-hypothesis).

The mistake can be described as confusing the probability of a single result with the long term (frequentist) probabilities associated with the decision or estimation procedure. For example, the incorrect interpretation of the p-value as the probability of a type I error or the incorrect belief that an obtained 95% confidence interval contains the true value of a parameter with  probability .95.

A quasi-simple example may serve to make the mistake clear. Suppose I flip a fair coin, keep the result hidden from you, and let you guess whether the result is heads or tails (we assume that the coin will not land on it’s side). What is the probability that your guess is correct?