The omnibus F-test may be ignored if you use multiple comparison procedures

I think  trying to be scientific with a small s involves asking critical questions about  common wisdom or common practice. In this post, I would like to focus on multiple comparisons in the context of ANOVA. What does common practice indicate?

Common wisdom suggests doing multiple comparisons only if the F-test is significant

Let’s have a look on some practical advice considering multiple comparisons found on the web (R-bloggers.com) and in Field (2015).

“One way to begin an ANOVA is to run a general omnibus test. The advantage to starting here is that if the omnibus test comes up insignificant, you can stop your analysis and deem all pairwise comparisons insignificant. If the omnibus test is significant, you should continue with pairwise comparisons” (https://www.r-bloggers.com/r-tutorial-series-one-way-anova-with-pairwise-comparisons/)

“When we have a statistically significant effect in ANOVA and an independent variable of more than two levels, we typically want to make follow-up comparisons. There are numerous methods for making pairwise comparisons and this tutorial will demonstrate how to execute several different techniques in R.” (https://www.r-bloggers.com/r-tutorial-series-anova-pairwise-comparison-methods/)
And have a look at how the text book I used to use in my statistics course explains it.

“It might seem a bit unhelpful that an ANOVA doesn’t tell you which groups are different from which, given that having gone to the trouble of running an experiment, you probably need to know more than ‘there’s some difference somewhere or other’. You might wonder, therefore, why we don’t just carry out a lot of t-tests, which would tell us very specifically whether pairs of group means differ. Actually, the reason has already been explained in Section 2.1.6.7: every time you run multiple tests on the same data you inflate the potential Type I errors that you make. However, we’ll return to this point in Section 11.5 when we look at how we follow up an ANOVA to discover where the group difference lie.” (Field, 2015, p. 442).
Although, in honesty, on p. 459 Field writes:

“The least significance difference (LSD) pairwise comparison makes no attempt to control Type I error and is equivalent to performing multiple t-tests on the data. The only difference is that LSD requires the overall ANOVA to be significant.”

This is meant to inform about the relative merits of one post hoc procedure to another in terms of Type I and Type II error.  Crucially, it is not mentioned that the other post-hoc procedures require that the overall ANOVA be significant. (As common wisdom seems to suggest). However,  his flow-chart of the ANOVA procedure (p. 460) clearly suggests multiple comparison procedures should be used as post-hoc procedures (after the ANOVA is significant).

Thus, common “statistical” wisdom seem to suggest that multiple comparison procedures are to be used as post hoc procedures following up a significant omnibus F-test. And the reason is that this two-stepped procedure minimizes the probability of type I errors.

Now, let’s ask ourselves whether this common sense is, well, sensible.

Multiple comparisons only after significant F-test affects power negatively

Wilcox (2017) contains some useful information regarding our question. In his discussion of the much used Tukey-HSD procedure (the Tukey-Kramer Method), he references Bernhardson, (1975) who shows that the probability of at least 1 type I error among pairwise comparisons of estimates of equal population means (i.e. true null-hypotheses) is no longer equal to \alpha if the procedure is only carried out following a significant omnibus test. That is, if we use our beloved two step procedure.

The consequence of the two step procedure for the Tukey-HSD is that \alpha is reduced. Thus, if we want our multiple comparisons procedure to generate one type I error or more at most with a probability of  \alpha = .05, using the 2 step procedure leads to a lowered \alpha. This is of course, bad news, because in the event that not all of the null-hypotheses are true, lowering \alpha increases \beta, the probability of not rejecting when the null-hypothesis is false (keeping the sample size constant, of course). In other words, the two step procedure decreases the power of the multiple comparison procedure.

In the words of Wilcox (2017):

“In practical terms, when it comes to controlling the probability of at least one type I error, there is no need to first reject with the ANOVA F test to justify using the Tukey-Kramer method. If the Tukey-Kramer method is used only after the F test rejects, power can be reduced. Currently, however, common practice is to use the Tukey-Kramer method only if the F-test rejects. That is, the insight reported by Bernhardson is not yet well known.”  (p. 385).

In conceptual terms,  the fact that the probability of at least one type I error in the multiple comparison procedure is  smaller than \alpha if the F-test rejects is pretty clear, at least to me it is. Suppose we reject if the p-value of the F-test is smaller or equal to 5%. This will also be the probability that we conduct the multiple comparison test over repeated replications of the same experiment. Of that 5%, not every application of the procedure will result in at least one type I error. Indeed, a puzzling fact for many beginning researchers is that the F-test is significant while none of the pairwise comparisons is. In other words, some of those 5%  of the cases in which we perform the procedure following a significant F-test will probably not reject any of the pairwise null-hypotheses, unless it is guaranteed that at least one type I error per application will be made.

(With no adjustment of \alpha for multiple comparisons, this will happen (with high probability so no guarantee) if a huge number of pairwise comparisons are made. For instance, with 99 unadjusted multiple comparisons the probability of at least one type I error is 99%.; this is why it makes sense to demand that the F-test is significant before testing multiple comparisons with the LSD procedure. Although the latter seems to run into trouble with more than 3 groups (Wilcox, 2017).

A quick simulation study

My hunch is that the two-step procedure is unnecessary for the Tukey-Kramer method as well as for other multiple comparison procedures (the exception Fisher’s LSD procedure which was designed as a post hoc procedure to be used as a follow up after a significant F-test, as Field (2015) rightly points out), but I only focused on the Tukey-Kramer method. What I did was a simple simulation study with a four group between subjects design (all \mu‘s equal) and estimated the probability of at least type I error both with and without using the 2 step procedure.

set.seed(456)
#number of groups
ngr = 4

#number of participants
n = 40

#group is a factor
gr <- factor(rep(1:ngr, each=n))

#vector for storing rejections F-test
Reject <- rep(0, 10000)

#vector for storing #rejections multiple
#comparisons
RejectHSD <- rep(0, 10000)

for (i in 1:10000) {

y = rnorm(ngr*n)
mod = aov(y ~ gr)
Reject[i] = anova(mod)$"Pr(>F)"[1] <= .05
PS <- TukeyHSD(mod)$gr[,4]
RejectHSD[i] = sum(PS <=.05)
}

#probability type I error F-test
sum(Reject)/length(Reject)
## [1] 0.0515
#probability at least one type I error Tukey HSD
sum(RejectHSD > 0) / length(RejectHSD)
## [1] 0.0503
#probability at least one type I error given F-tests Rejects
sum(RejectHSD[Reject==TRUE] > 0) / length(RejectHSD)
## [1] 0.0424

Even though a single (relatively tiny) simulation (which, by the way, takes a long time to run, nonetheless), is not necessarily convincing, it does  illustrate the main points of this post. First, the probability of at least one incorrect rejection using the TukeyHSD function is close to .05. With this particular random seed it even performs a little better than the ANOVA F-test: .0503 versus .0515. This illustrates that even without considering whether the omnibus test is significant the main demand of not rejecting too many true null-hypotheses is completely satisfied. So, in practical terms, you can safely ignore the omnibus test if your concerns are about  \alpha.

Second, the probability of incorrectly rejecting at least one true pair-wise null-hypothesis after the ANOVA F-test is significant is estimated to be .0424. This shows, that the two-step procedure leads to a larger decrease in the actual type I error probability than is wanted. Even though this may seem good news from the perspective of avoiding type I errors, the down side is that pair wise null-hypotheses that are false (and potentially important) may not be detected.

Conclusion

Common wisdom and practice suggest that multiple comparisons procedures should be done only after a significant omnibus test. We have seen that this is not at all necessary if we use a multiple comparisons procedure that is designed to control the type I error probability. To my knowledge, most of the procedures conventionally thought of as post hoc tests are designed in this manner, the exception being the LSD procedure which does require a significant F-test. For practical purposes, then, do not bother with the omnibus test (note the exception) if you are planning to pair wise compare all the treatment means.
This practical advice does not mean, of course, that I am suggesting you spend your time comparing all treatment means. Most of the time, focused comparisons are a more fruitful way of analysing your data. But I’ll leave that topic for another time.

References

Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics. 4th Edition. London: Sage.
Wilcox, R. (2017). Understanding and Applying Basic Statistical Methods Using R. Hoboken, NJ: Wiley,

Planning for Precision: simulation results for four designs with four conditions

This is the third post about the Planning for Precision app (in the future I’ll explain the difference between Planning for Precision and Precision for Planning). Some background information about the application can be found here: http://the-small-s-scientist.blogspot.com/2017/04/planning-for-precision.html. 

In this post, I want to present the simulation results for 4 designs with 4 conditions. The designs are: the counter balanced design (see previous post), the fully-crossed design, the stimulus-within-condition design, and the stimulus-and-participant-within-condition design (the both-within-condition design). I have not included the participants-within-condition design, because this is simply the mirror-image (so to say) of the stimulus-within-condition design.

In one of my next posts, I will describe some more background information about planning for precision, but some of the basics are as follows. We have a design with 4 treatment conditions, and what we want do is to estimate differences between these condition means by using contrasts. For instance, we may be interested in the (amount of) difference between the first mean, maybe because it is a control-condition with the average of the other three conditions: μ1 – (μ2 + μ3 + μ4)/3 =  1*μ1 – 1/3*μ2 -1/3*μ3 – 1/3*μ4.  The values {1, -1/3, -1/3, -1/3} are the contrast weights, and for the result we use the term ψ.

The value of ψ is estimated on the basis of estimates of the population means, that is, the sample means or condition means. Due to sampling error, the contrast estimate varies from sample to sample and the amount of sampling error can be expressed by means of a confidence interval. Conceptually, the confidence interval expresses the precision of the estimate: the wider the confidence interval, the less precise the estimate is.

The Margin of Error (MOE) of an estimate is the half-width of the confidence interval, so the confidence interval is the estimate plus or minus MOE. We will take MOE as an expression of the precision of the estimate (the less the value of MOE the more precise the estimate).  Now, if you want to estimate an effect size, more precision (lower value of MOE; less wide confidence interval) is better than less precision (higher value of MOE; wider confidence interval).  The app let’s you specify the design and the contrast weights and helps you find the minimum required sample sizes (for participants and stimuli) for a given target MOE. (You can also play with the designs to see which design gives you smallest expected MOE).

Crucially, if you plan for precision, you also want to have some assurance that the MOE you are likely to obtain in you actual experiment will not be larger than you target MOE. Compare this with power: 80% power means that the probability that you will reject the null-hypothesis is 80%. Likewise, assurance MOE of 80% means that there is an 80% probability that your obtained MOE will be no larger than assurance MOE.

The simulations (with N = 10000 replications) estimate Expected MOE as well as Assurance MOE for assurances of .80, .90, .95, and .99, for 4 designs with 4 treatment conditions, with a total number of 48 participants and 24 stimuli (items).  The MOEs are given for three standard constrasts: 1) the difference between the first mean and the mean of the other three, with weights {1, -1/3, -1/3, -1/3}; 2) the difference between the second mean and the mean of conditions three and four, with weights {0, 1, -1/2, -1/2}; 3) the difference between the third and fourth condition means, with weights {0, 0, 1, -1}.

I will present the results in  separate tables for the 4 designs considered and include percentage difference between expected values of assurance MOE and the estimated values estimated values.

The fully crossed design 

The results are in the following table.

The percentage difference between the expected quantiles (= assurance MOEs for given insurance;  i.e. q.80 is expected or estimated  80% Assurance MOE) and the estimated quantiles are: .80: 0.11%; .90: 0.05%; .95: -0.14%; 99: -0.05%.

The counter balanced design 

The results are presented in the following table. 

The percentage difference between the expected quantiles and the estimated quantiles are: .80: 0.03%; .90: 0.13%;  .95: 0.09%, .99: -0.23%.

The stimulus-within-condition design 

The following table contains the details. 
The percentage difference between the expected quantiles and the estimated quantiles are: .80: -0.11%; .90: -0.33%;  .95: -0.55%, .99: -0.70%.  

Both-participant-and-stimulus-within-condition design 

Here is the table. 
And the percentage differences are: .80: -0.34%; .90: -0.59%;  .95: -0.82%;  .99: -1.06%. 

Conclusion

The results show that the simulation results are quite consistent with the expected values based on mixed model ANOVA. We can see that the differences between expected and estimated values increase the less the number of participants and items per condition. For instance, in the both within condition design 12 participants respond to 6 stimuli in one of the four treatment conditions. The fact that even with these small samples sizes the results seem to agree to an acceptable degree is (to my mind) encouraging. Note that with small samples the expected assurance MOES are slightly lower than the estimates, but the largest difference is -1.06% (see the MOE for 99% assurance). 

Planning for Precision: first simulation results

In this post, I want to share the results of the first simulation study to “test” my Planning for Precision app. More details about the app can be found in a previous post: here.

I have included the basic logic of the simulations (including R code) in a document that you can download: https://drive.google.com/open?id=0B4k88F8PMfAhSlNteldYRWFrQTg.

The simulation study simulates responses from a four condition counter balanced design, with p = 48 participants and q = 24 stimuli/items. Here, we will focus on expected and assurance MOE for three contrasts. The first contrast estimates the difference between the first mean and the average of the other three, the second contrast the difference between the second mean and the average of the third and fourth means, and the final contrast the difference between the means of the third and fourth contrasts.

Expected MOE is compared to the mean of the estimated MOE for each of the contrasts (based on 10000 replications). Assurance MOE is judged for assurance of .80, .90, .95 and .99, by comparing the calculations in the app with the corresponding quantile estimates of the simulated distributions.

Results 

Note that in the above table, the Expected Mean MOE is what I have called Expected MOE, and the q.80 through q.99 are quantiles of the distribution of MOE. As an example, q.80 is the quantile corresponding to assurance MOE with 80% assurance, Expected q.80 is the value of assurance MOE calculated with the theoretical approach, and Estimated q.80 is the estimated quantile based on the simulation studies. 
Importantly, we can see that most of the figures agree to a satisfying degree. If we look at the relative differences, expressed in percentages for the assurance MOEs, we get 0.0325% for q.80,  0.1260% for q.90, 0.0933% for q.95, and  -0.2324% for q.99. 

Conclusion

The first simulation results seem promising. But I still have a lot of work to do for the rest of the designs.