Type I error probability does not destroy the evidence in your data

 Have you heard about that experimental psychologist? He decided that his participants did not exist, because the probability of selecting them, assuming they exist, was very small indeed (p < .001). Fortunately, his colleagues were quick to reply that he was mistaken. He should decide that they do exist, because the probability of selecting them, assuming they do not exist, is very remote (p < .001). Yes, even unfunny jokes can be telling about the silliness of significance testing.

But sometimes the silliness is more subtle, for instance in a recent blog post by Daniel Lakens, the 20% Statistician with the title “Why Type I errors are more important than Type 2 errors (if you care about evidence).” The logic of his post is so confused, that I really do not know where to begin. So, I will aim at his main conclusion that type I error inflation quickly destroys the evidence in your data.

(Note: this post uses mathjax and I’ve found out that this does not really work well on a (well, my) mobile device. It’s pretty much unreadable).

Lakens seems to believe that the long term error probabilities associated with decision procedures, has something to do with the actual evidence in your data. What he basically does is define evidence as the ratio of power to size (i.e. the probability of a type I error), it’s basically a form of the positive likelihood ratio

    \[PLR = \frac{1 - \beta}{\alpha},\]

which makes it plainly obvious that manipulating \alpha (for instance by multiplying it with some constant c) influences the PLR more than manipulating \beta by the same amount.  So, his definition of  “evidence” makes part of his conclusion true, by definition:  \alpha has more influence on the PLR than \beta,  But it is silly to reason on the basis of this that the type I error rate destroys the evidence in your data.

The point is that \alpha  and \beta (or the probabilities of type I errors and type II errors) have nothing to say about the actual evidence in your data. To be sure, if you commit one of these errors, it is the data (in NHST combined with arbitrary i,e, unjustified cut-offs) that lead you to these errors. Thus, even \alpha = .01 and \beta = .01, do not guarantee that actual data lead to a correct decision.

Part of the problem is that Lakens confuses evidence and decisions, which is a very common confusion in NHST practice. But, deciding to reject a null-hypothesis, is not the same as having evidence against it (there is this thing called type I error). It seems that NHST-ers and NHST apologists find this very very hard to understand. As my grandmother used to say: deciding that something is true, does not make it true

I will try to make plausible that decisions are not evidence (see also my previous post here). This should be enough to show you that error probabilities associated with the decision procedure tells you nothing about the actual evidence in your data. In other words, this should be enough to convince you that Type 1 error rate inflation does not destroy the evidence in your data, contrary to the 20% Statistician’s conclusion.

Let us consider whether the frequency of correct (or false) decisions is related to the evidence in the data. Suppose I tell you that I have a Baloney Detection Kit (based for example on the baloney detection kit at skeptic.com) and suppose I tell you that according to my Baloney Detection Kit the 20% Statistician’s post is, well, Baloney. Indeed, the quantitative measure (amount of Baloneyness) I use to make the decision is well above the critical value. I am pretty confident about my decision to categorize the post as Baloney as well, because my decision procedure rarely leads to incorrect decisions. The probability that I decide that something is Baloney when it is not is only \alpha = .01 and the probability that I decide that something is not-Baloney when it is in fact Baloney is only 1% as well (\beta = .01).

Now, the 20% Statistician’s conclusion states that manipulating \alpha, for instance by setting \alpha = .10 destroys the evidence in my data. Let’s see. The evidence in my data is of course the amount of Baloneyness of the post. (Suppose my evidence is that the post contains 8 dubious claims). How does setting \alpha have any influence on the amount of Baloneyness? The only thing setting \alpha does is influence the frequency of incorrect decisions to call something Baloney when it is not. No matter what value of \alpha (or \beta, for that matter) we use, the amount of Baloneyness in this particular post (i.e. the evidence in the data) is 8 dubious claims.

To be sure, if you tell the 20% Statistician that his post is Baloney, he will almost certainly not ask you how many times you are right and wrong on the long run (characteristics of the decision procedure), he will want to see your evidence. Likewise, he will probably not argue that your decision procedure is inadequate for the task at hand (maybe it is applicable to science only and not to non-scientific blog posts), but he will argue about the evidence (maybe by simply deciding (!) that what you are saying is wrong; or by claiming that the post does not contain 8 dubious claims, but only 7).

The point is, of course, this: the long term error probabilities \alpha and \beta associated with the decision procedure, have no influence on the actual evidence in your data.  The conclusion of the 20% Statistician is simply wrong. Type I error inflation does not destroy the evidence in your data, nor does type II error inflation.

Decisions are not evidence

The thinking that lead to this post began with trying to write something about what Kline (2013) calls the filter myth. The filter myth is the arguably – in the sense that it depends on who you ask – mistaken belief in NHST practice that the p-value discriminates between effects that are due to chance (null-hypothesis not rejected) and those that are real (null-hypothesis rejected). The question is whether decisions to reject or not reject can serve as evidence for the existence of an effect.

Reading about the filter myth made me wonder whether NHST can be viewed as a screening test (diagnostic test), much like those used in medical practice. The basic idea is that if the screening test for a particular condition gives a positive result, follow-up medical research will be undertaken to figure out whether that condition is actually present. (We can immediately see, by the way, that this metaphor does not really apply to NHST, because the presumed detection of the effect is almost never followed up by trying to figure out whether the effect actually exists, but the detection itself is, unlike the screening test, taken as evidence that the effect really exists; this is simply the filter myth in action).

Let’s focus on two properties of screening tests. The first property is the Positive Likelihood Ratio (PLR). The PLR is the ratio of the probability of a correct detection to the probability of a false alarm. In NHST-as-screening-test, the PLR  equals the ratio of the power of the test to the probability of a type I error: PLR = (1 – β) / α. A high value of the PLR means, basically, that a rejection is more likely to be a rejection of a false null than a rejection of a true null, thus the PLR means that a rejection is more likely to be correct than incorrect.

As an example, if β = .20, and α = . 05, the PLR equals 16. This means that a rejection is 16 times more likely to be correct (the null is false) than incorrect (the null is true).

The second property I focused on is the Negative Likelihood Ratio (NLR). The NLR is the ratio of the frequency of incorrect non-detections to the frequency of correct non-detections. In NHST-as-screening-test, the NLR equals the ratio of the probability of a type II error to the probability of a correct non-rejection: NLR = β / (1 – α). A small value of the NLR means, in essence, that a non-rejection is less likely to occur when the null-hypothesis is false than when it is true.

As an example, if β = .20, and α = . 05, the NLR equals .21. This means that a non-rejection is .21 times more likely (or 4.76 (= 1/.21) times less likely) to occur when the null-hypothesis is false, than when it is true.

The PLR and the NLR can be used to calculate the likelihood ratio of the alternative hypothesis to the null-hypothesis given that you have rejected or given that you have not-rejected, the posterior odds of the alternative to the null. All you need is the likelihood ratio of the alternative to the null before you have made a decision and you multiply this by the PLR after you have rejected, and by the NLR after you have not rejected.

Suppose that we repeatedly (a huge number of times) take a random sample from a population of null-hypotheses in which 60% of them are false and 40% true. If we furthermore assume that a false null means that the alternative must be true, so that the null and the alternative cannot both be false, the prior likelihood of the alternative to the null equals p(H1)/p(H0) = .60/.40 = 1.5. Thus, of all the randomly selected null-hypotheses, the proportion of them that are false is 1.5 times larger than the proportion of  null-hypotheses that are true. Let’s also repeatedly sample (a huge number of times) from the population of decisions. Setting β = .20, and α = . 05, the proportion of rejections equals p(H1)*(1 – β) + p(H0)*α = .60*.80 + .40*.05 = .50 and the proportion of non-rejections equals p(H1)*β + p(H0)*(1 – α) = .60*.20 + .40*.95 = .50. Thus, if we sample repeatedly from the population of decisions 50% of them are rejections and 50% of them are non-rejections.

First, we focus only on the rejections. So, the probability of a rejection is now taken to be 1.0.  The posterior odds of the alternative to the null, given that the probability of a rejection is 1.0, is the prior likelihood ratio multiplied by the PLR: 1.5 * 16 = 24. Thus, we have a huge number of rejections (50% of our huge number of randomly sampled decisions) and within this huge number of rejections the proportion of rejections of false nulls is 24 times larger than the proportion of rejections of true nulls. The proportion of rejections of false nulls equals the posterior odds / (1 + posterior odds) = 24 / 25 = .96. (Interpretation: If we repeatedly sample a null-hypothesis from our huge number of rejected null-hypotheses, 96% of those samples are false null-hypotheses).

Second, we focus only on the non-rejections. Thus, the probability of a non-rejection is now taken to be 1.0. The posterior odds of the alternative to the null, given that the probability of a non-rejection is 1.0, is the prior odds multiplied by the NLR: 1.5 * 0.21 = 0.32. In other words, we have a huge number of non-rejections (50% of our huge sample of randomly selected decisions) and the proportion of non-rejections of false nulls is 0.32 times as large as the proportion of non-rejections of true nulls. The proportion of non-rejections of false nulls equals 0.32 / ( 1 + 0.32) =  .24. (Interpretation: If we repeatedly sample a null-hypothesis from our huge number of non-rejected hypotheses, 24% of them are false nulls).

So, based on the assumptions we made, NHST seems like a pretty good screening test, although in this example NHST is much better at detecting false null-hypothesis than ruling out false alternative hypotheses. But how about the question of decisions as evidence for the reality of an effect? I will first write a little bit about the interpretation of probabilities, then I will show you that decisions are not evidence.

Sometimes, these results are formulated as follows. The probability that the alternative is true given a decision to reject is .96 and the probability that the alternative hypothesis is true given a decision to not-reject  is .24.  If you want to correctly interpret such a statement, you have to keep in mind what “probability” means in the context of this statement, otherwise it is very easy to misinterpret the statement’s meaning. That is why I included interpretations of these results that are consistent with the meaning of the term probability as it used in our example. (In conceptual terms, the limit of the relative frequency of an event (such as reject or not-reject) as the number of random samples (the number of decisions) goes to infinity).

A common (I believe) misinterpretation (given the sampling context described above) is that rejecting a null-hypothesis makes the alternative hypothesis likely to be true. This misinterpretation is easily translated to the incorrect conclusion that a significant test result (that leads to a rejection) makes the alternative hypothesis likely to be true. Or, in other words, that a significant result is some sort of evidence for  the alternative hypothesis (or against the null-hypothesis).

The mistake can be described as confusing the probability of a single result with the long term (frequentist) probabilities associated with the decision or estimation procedure. For example, the incorrect interpretation of the p-value as the probability of a type I error or the incorrect belief that an obtained 95% confidence interval contains the true value of a parameter with  probability .95.

A quasi-simple example may serve to make the mistake clear. Suppose I flip a fair coin, keep the result hidden from you, and let you guess whether the result is heads or tails (we assume that the coin will not land on it’s side). What is the probability that your guess is correct?

My guess is that you believe that the probability that your guess is correct is .50. And my guess is also that you will find it hard to believe that you are mistaken. Well, you are mistaken if we define probabilities as long term relative frequencies. Why? The result is either heads or tails. If the result is heads, the long term relative frequency of that result is heads. That is to say, the result is a constant and constants do not vary in the long run. Your guess is also a constant, if you guess heads, it will stay heads in the long run. So, if the result is heads, the probability that your guess (heads) is correct is 100%, however, if your guess is tails, the probability of you being correct is 0%. Likewise, if the result is tails, it will stay tails forever, and the probability that your guess is correct is 0% if your guess is heads and 100% if your guess is tails. So, the probability that your guess is correct is either 0 or 1.0, depending on the result of the coin flip, and not .50.

The probability of guessing correct is .50, however, if we repeatedly (a huge number of times) play our game and both the result of the coin flip and your guess are the result of random sampling. Let’s assume that of all the guesses you do 50% are heads and 50% are tails.  In the long run, then, there is a probability of .25 of the result being heads and your guess being heads and a probability of .25 of the result being tails and your guess being tails. The probability of a correct decision is therefore, .25 + .25 = .50

Thus, if  both the results and your guesses are the result of random sampling and we repeated the game a huge number of times, the probability that you are correct is .50. But if we play our game only once, the probability of you being correct is 0 or 1.0, depending on the result of the coin flip.

Let’s return to the world of hypotheses and decisions. If we play the decision game once, the probability that your decision is correct is 0 or 1.0, depending on whether the null-hypothesis in question is true (with probability 0 or 1.0) or false (with probability 0 or 1.0). Likewise, the probability that the null-hypothesis is true given that you have rejected is also 0 or 1.0, depending on whether the null-hypothesis in question is true or false. But if we play the decision game a huge number of times, the probability that a null-hypothesis is false, given that you have decided to reject is .96 (in the context of the situation described above).

In sum, from the frequentist perspective we can only assign probabilities 0 or 1.0 to a single hypothesis given we have a made single decision about it, and this probability depends on whether that single hypothesis is true or false.  For this reason, a significant result cannot be magically translated to the probability that the alternative hypothesis is true given that the test result is significant. That probability is 0 or 1.0 and there’s nothing that can change that.

The consequence of all this is as follows. If we define the evidence for or against our alternative hypothesis in terms of the likelihood ratio of the alternative to the null-hypothesis after obtaining the evidence, no decision can serve as evidence if our decision procedure is based on frequentist probabilities. Decisions are not evidence.

References 
Kline, R.B. (2013). Beyond significance testing. Statistics reform in the behavioral sciences. Second Edition. Washington: APA.

Scientific with a small s

My inspiration for this blog’s motto comes from Zilliak & McCloskey (2004). They quote from Bob Solow’s Nobel Prize acceptance speech, after which they write:

“Solow recommends we “try very hard to be scientific with a small s”; but the authors we have surveyed in the AER [American Economic Review, GM], by contrast, are trying to be scientific with a small t.” (p. 544).

Their “small t” refers to the t statistic on the basis of which researchers determine the p-values they use to assess the statistical significance of their findings. A small p (smaller than .05) is usually taken to mean that the test result is statistically significant.

There are a lot of reasons to believe that null-hypothesis significance testing (NHST) is basically unscientific. That’s why I got convinced that you cannot do science with a small p (significance testing). I hope that after reading the blog posts yet to come, you will be convinced as well.  (If you can’t wait: Kline (2014) (see below) is a good place to start getting convinced).

What does it mean to be scientific with a small s? To Solow (as cited in Zilliak & McCloskey, 2004) it simply means thinking logically and respecting the facts.  To my mind, thinking logically as a prerequisite of being scientific (with a small s) includes thinking logically about the results of statistical analyses. For instance, that you should not mistakenly believe that a small p value means that it is unlikely that a result is due to chance, or that you should not mistakenly believe that the long term behavior of a decision procedure has anything to do with the evidence in your actual data (the facts).

Zilliak & McCloskey (2004) write about economic research, but significance testing is of course not limited to economic research. Kline (2013, p. 118-199) concludes in his chapter about cognitive distortions in significance testing (and he is putting it mildly):

“Significance testing has been like a collective Rorschach inkblot test for the behavioral sciences: What we see in it has more to do with wish fulfillment than reality. This magical thinking has impeded the development of psychology and other disciplines as cumulative sciences. […] the gap between what is required for significance tests to be accurate and characteristics of real world studies is just too great.”

So, this blog is about being scientific with a small s, with a main focus on the logic and illogic of NHST, because you simply cannot do science with only a small p.

References
Kline, R.B. (2013). Beyond significance testing. Statistics reform in the behavioral sciences. Second Edition. Washington: APA.
Zilliak, S.T., & McCloskey, D.N. (2004). Size matters: the standard error of regressions in the American Economic Review, Journal of Socio-Economics, 33, 527-547.