Sample size planning for precision: the basics

In this post, I will introduce some of the ideas underlying sample size planning for precision. The ideas are illustrated with a shiny-application which can be found here: https://gmulder.shinyapps.io/PlanningApp/. The app illustrates the basic theory of sample size planning for two independent groups. (If the app is no longer available, because my allotted active monthly hours on shinyapps.io are limited, contact me and I’ll send you the code.)

The basic idea

The basic idea is that we are planning an experiment to estimate the difference in population means of an experimental and a control group. We want to know how many observations per group we have to make in order to estimate the difference between the means with a given target precision.
Our measure of precision is the Margin of Error (MOE).  In the app, we specify our target MOE as a fraction (f) of the population standard deviation. However, we do not only specify our target MOE, but also our desired level of assurance. The assurance is the probability that our obtained MOE will not exceed our target MOE. Thus, if the assurance is .80 and our target MOE is f = .50, we have a probability of 80% that our obtained MOE will not exceed f = .50.
The only part of the app you need for sample size planning is the “Sample size planning”-form. Specify f, and the assurance, and the app will give you the desired sample size.
If you do that with the default values f = .50 and Assurance  = .80, the app will give you the following results on the Planning Results-tab:  Sample Size: 36.2175, Expected MOE (f): 0.46. This tells you that you need to sample 37 participants (for instance) per group and then the Expected MOE (the MOE you will get on average) will equal 0.46 (or even a little less, since you sample more than 36.2175 participants).
The Planning-Results-tab also gives you a figure for the power of the t-test, testing the NHST nil-hypothesis for the effect size (Cohen’s d) specified in the “Set population values”-form. Note that this form, like the rest of the app, provides details that are not necessary for sample size planning for precision, but that make the theoretical concepts clear. So, let’s turn to those details.

The population

Even though it is not at all necessary to specify the population values in detail, considering the population helps to realize the following. The sample size calculations and the figures for expected MOE and power, are based on the assumption that we are dealing with random samples from normal populations with equal variances (standard deviations).
From these three assumptions, all the results follow deductively. The following is important to realize: if these assumptions do not obtain, the truth of the (statistical) conclusions we derive by deduction is no longer guaranteed. (Maybe you have never before realized that sample size planning involves deductive reasoning; deductive reasoning is also required for the calculation of p-values and to prove that 95% confidence intervals contain the value of the population parameter in 95% of the cases; without these assumptions it is uncertain what the true p-value is and whether or not the 95% confidence interval is in fact a 95% confidence interval).

In general, then, you should try to show (to others, if not to yourself) that it is reasonable to assume normally distributed populations, with equal variances and random sampling, before you decide that the p-value of your t-test, the width of your confidence interval, and the results of sample size calculations are believable.

The populations in the app are normal distributions. By default, the app shows two such distributions. One of the distributions, the one I like to think about as corresponding to the control condition, has μ = 0, the other one has μ = 0.5. Both distributions have a standard deviation of σ = 1. The standardized difference between the means is therefore equal to δ = 0.50.

The default populations are presented in Figure 1 below.

 Figure 1: Two normal distributions. The distribution to the left has μ = 0, the one to the right has μ = 0.5. The standard deviation in both distributions equals σ = 1. The standardized difference δ and the unstandardized difference between the means both equal 0.50.

The sampling distribution of the mean difference

The other default setting in the app is a sample size (per group) of n = 20.  From the sample size and the specification of the populations, we can deduce the probability density of the different values of the estimates of the difference between the population means. The estimate is simply the difference between the sample means.

This so-called sampling distribution of the mean difference is depicted on the tab next to the population. Figure 2 shows what the sampling distribution looks like if we repeatedly draw random samples of size n = 20 per group from our populations and keep track of the difference between the sample means we get in each repetition.

 Figure 2: Sampling distribution of the difference between two sample means based on samples of n = 20 per group and random sampling from the populations described in Figure 1.

Note that the mean of the sampling distribution equals 0.5 (as indicated by the middle vertical line). This is of course the (default) difference between the population means in the app. So, on average, estimates of the population difference equal the population difference.

The lines to the left and the right of the mean indicate the mean plus or minus the Margin of Error (MOE). The values corresponding to the lines are 0.5 ± MOE. 95% of estimates of the population mean difference have a value between these lines.
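
The position of these lines can be sketched in R. This is only a minimal sketch: it assumes the lines are based on the known population standard deviation (σ = 1) and a normal sampling distribution of the mean difference, and the function name moe_known_sigma is my own label for illustration, not something taken from the app.

```r
# Margin of error of the difference between two independent sample means,
# assuming a known common population standard deviation sigma.
moe_known_sigma <- function(n, sigma = 1, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # about 1.96 for a 95% interval
  z * sigma * sqrt(2 / n)          # z times the standard error of the difference
}

moe_20 <- moe_known_sigma(20)      # default sample size per group in the app
moe_37 <- moe_known_sigma(37)      # sample size obtained by planning with the defaults

# Positions of the left and right lines around the true difference of 0.5
limits_20 <- 0.5 + c(-1, 1) * moe_20

round(c(moe_20 = moe_20, moe_37 = moe_37), 3)
round(limits_20, 3)
```

Under these assumptions the lines for n = 20 lie roughly 0.62 to either side of the population mean difference.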

Conceptually, the purpose of planning for precision is to decrease the (horizontal) distance between these lines and the population mean difference. In other words, we would like the left and right lines as close to the mean of the distribution as is practically acceptable and possible.

The distribution of the t-statistic

The tab next to the sampling distribution tab contains a figure representing the sampling distribution of the t-statistic. The sampling distribution of t can be deduced on the basis of the population values and the sample size.  In the app, it is assumed that t is calculated under the assumption that the null-hypothesis of zero difference between the means is true. The sampling distribution of t is what you get if you repeatedly sample from the populations as specified, calculate the t-statistic and keep a record of the values of the t-statistic.

The sampling distribution of the t-statistic presented in Figure 3 contains two vertical lines. These lines are located (horizontally) on the value of t that would lead to rejection of the null-hypothesis of equal population means. In other words, the lines are located at the critical value of t (for a two-tailed test).

 Figure 3: Distribution of the t-statistic testing the null-hypothesis of equal population means. The distribution is based on sampling from the populations described in Figure 1. The sample size is n = 20 per group. The lines represent the critical value of t for a two-sided t-test. The area between the vertical lines is the probability of a type II error. The combined areas to the left of the left line and to the right of the right line is the power of the test.

The area between the lines is the probability that the null-hypothesis will not be rejected. In the case of a true population mean difference (which is the default assumption in the app), that probability is the probability of an error of the second kind: a type II error.

The complement of that probability is called the power of the test. This is, of course, the area to the left of the left vertical line added to the area to the right of the right vertical line. Conceptually, the power of the test is the probability of rejecting the null-hypothesis when in fact it is false.

Figure 3 clearly demonstrates that if the true mean difference equals 0.50 and the sample size (per group) equals n = 20, there is a large probability that the null-hypothesis will not be rejected. Actually, the probability of a type II error equals .66. (So, the power of the test is .34).
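
These areas can be approximated in R with the noncentral t-distribution. The sketch below assumes equal group sizes and a two-sided test with α = .05; the helper t_test_power is a hypothetical name of my own, not a function from the app.

```r
# Power of the two-sided independent-samples t-test, computed as the areas
# under the noncentral t-distribution beyond the two critical values.
t_test_power <- function(n, d, alpha = 0.05) {
  df     <- 2 * (n - 1)               # degrees of freedom for two groups of size n
  ncp    <- d * sqrt(n / 2)           # noncentrality parameter for effect size d
  t_crit <- qt(1 - alpha / 2, df)     # critical value (the vertical lines)
  upper  <- pt(t_crit, df, ncp = ncp, lower.tail = FALSE)  # area right of the right line
  lower  <- pt(-t_crit, df, ncp = ncp)                     # area left of the left line
  upper + lower
}

power_20 <- t_test_power(n = 20, d = 0.5)
type2_20 <- 1 - power_20              # area between the two lines

round(c(power = power_20, type_II = type2_20), 2)
```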

Sample size planning for precision

With respect to sample size planning for precision, the app by default takes half of a standard deviation (f = .50) as the target MOE. In addition, planning is done with 80% assurance. This means that the default settings search for a sample size (per group) such that, with 80% probability, MOE will not exceed 0.50. (Note that the default value of the standard deviation is 1, so an f of .50 corresponds to a target MOE of 0.50 on the scale of the data; likewise, were the standard deviation equal to 2, an f of .50 would correspond to a target MOE of 1.0.)

As described above, planning with the default values gives us a sample size of  n = 37 per group, with an expected MOE of 0.46. In the tab next to the planning results, a figure displays what you can expect to find on average, given the planned sample size and the specification of the population. That figure is repeated here as Figure 4.
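
To make the planning step concrete, here is a minimal R sketch of how such a sample size could be found. It assumes that the MOE reached with a given assurance is obtained by replacing the sample variance with the corresponding quantile of its scaled chi-square sampling distribution; I have not checked this against the app’s source code, and the function names assurance_moe and plan_n are my own.

```r
# MOE (in units of sigma) that will not be exceeded with probability `assurance`,
# for two independent groups of size n (n may be fractional during the search).
assurance_moe <- function(n, assurance = 0.80, conf = 0.95) {
  df     <- 2 * (n - 1)
  t_crit <- qt(1 - (1 - conf) / 2, df)
  # the `assurance` quantile of the pooled sample SD, relative to sigma
  sd_quantile <- sqrt(qchisq(assurance, df) / df)
  t_crit * sd_quantile * sqrt(2 / n)
}

# Smallest (fractional) n for which the assurance MOE equals the target f
plan_n <- function(f, assurance = 0.80, conf = 0.95) {
  uniroot(function(n) assurance_moe(n, assurance, conf) - f,
          interval = c(2, 1e5))$root
}

n_required  <- plan_n(f = 0.50, assurance = 0.80)
n_per_group <- ceiling(n_required)   # round up to whole participants

n_required
n_per_group
```

Under these assumptions the search ends up close to the 36.2 reported on the Planning Results-tab, which rounds up to 37 per group.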

 Figure 4: Expected results in terms of point and interval estimates (95% confidence intervals). This is what you will find on average given the population specification in Figure 1 and using the default values for sample size planning.

Figure 4 displays point and interval estimates of the group means and the difference between the means. The interval estimates are 95% confidence intervals. The figure clearly shows that on average, our estimate of the difference is very imprecise. That is, the expected 95% confidence interval ranges from almost 0 (0.50 – 0.46 = 0.04) to almost 1 (0.50 + 0.46 = 0.96). Of course, using n = 20, would be worse still.

A nice thing about the app (well, I for one think it’s pretty cool) is that as soon as you ask for the sample sizes, the sample size in the set population values form is automatically updated. Most importantly, this will also update the sampling distribution graphs of the difference between the means and the t-statistic. So, it provides an excellent way of showing what the updated sample size means in terms of MOE and the power of the t-test.

Let’s have a look at the sampling distribution of the mean difference, see Figure 5.

 Figure 5: Sampling distribution of the mean difference with n = 37 per group. Compare with Figure 2 to see the (small) difference in the Margin of Error compared to n = 20.

If you compare Figures 5 and 2, you see that the vertical lines corresponding to the mean plus and minus MOE have shifted somewhat towards the mean. So here you can see that almost doubling the sample size (from 20 to 37) had the desired effect of making MOE smaller.

I would like to point out the similarity between the sampling distribution of the difference and the expected results plot in Figure 4. If you look at the expected results for our estimate of the population difference, you see that the point estimate corresponds to the mean of the sampling distribution, which is of course equal to the population mean difference, and that the limits of the expected confidence interval correspond to the left and right vertical lines in Figure 5. Thus, on average the limits of the confidence interval correspond to the values that mark the middle 95% of the sampling distribution of the sample mean difference.

Since we specified an assurance of 80%, there is at least an 80% probability that in repeated sampling from the populations (see Figure 1) with n = 37 per group, our (estimated) MOE will not exceed half a standard deviation. Thus, whatever the true value of the population mean difference is, there is a high probability that our estimate will not be more than half a standard deviation away from it. This is, I think, one of the major advantages of sample size planning for precision: we do not have to specify the unknown population mean difference. This is in contrast to sample size planning for power, where we do have to specify a specific population mean difference.
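
The assurance for n = 37 can be checked with a few lines of R. This is a sketch under the assumptions of the post (normal populations, equal variances, random sampling), using the fact that the pooled sample variance, multiplied by its degrees of freedom and divided by σ², follows a chi-square distribution; the variable names are my own.

```r
n  <- 37      # planned sample size per group
f  <- 0.50    # target MOE as a fraction of sigma
df <- 2 * (n - 1)
t_crit <- qt(0.975, df)

# The obtained MOE is t_crit * s_pooled * sqrt(2 / n). It stays at or below
# f * sigma as long as s_pooled^2 / sigma^2 does not exceed ratio_max.
ratio_max <- (f / (t_crit * sqrt(2 / n)))^2

# df * s_pooled^2 / sigma^2 follows a chi-square distribution with df degrees of freedom
p_within_target <- pchisq(df * ratio_max, df)

round(p_within_target, 2)
```

Because 37 is rounded up from the fractional sample size, the probability comes out somewhat above the requested 80%.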

Speaking of power, the results of the sample size planning suggest that for our specification of the population mean difference (Cohen’s delta = 0.50) the power of the test equals 0.56. Thus, there is a probability of 56% that with n = 37 per group the t-test will reject the null-hypothesis. The probability of a type II error is therefore 44%.

Figure 6 shows the distribution of the t statistic with n = 37 per group and a standardized effect size of 0.50.

 Figure 6. The distribution of the t-statistic testing the null-hypothesis of equal population means. The distribution is based on the population specification in Figure 1 and sample sizes of n = 37 per group, with true effect size equal to 0.50. The probability of a type II error is the area under the curve between the two vertical lines. The power is the area under the curve beyond the two lines. Compare with Figure 3 to see the differences in these probabilities compared to n = 20.

Power versus precision

Now suppose that the unstandardized mean difference between the population means equals 2 and that the standard deviation equals 2.5.  I just filled in the set population values form, setting the mean of population 2 to 2.0 and the standard deviation to 2.5. And I clicked set values.

Let us plan for a target MOE of  f = 0.5 standard deviations with 80% assurance. Click get sample sizes in the sample size planning form. In this case, target MOE equals 1.25.

The results are not very surprising. Since the f did not change compared to the previous time, the results as regards the sample size are exactly the same. We  need n = 37. Again, this is what I like about sample size planning, no matter what the unknown situation in the population is, I just want my margin of error to be no more than half a standard deviation (for example).

But the power did change (of course). Since the standardized population mean difference is now 0.80 (= 2.0 / 2.5) instead of 0.50, and all the other specifications remained the same, the power increases from 56% to 92%. That’s great.

However, the high probability of rejecting the null-hypothesis does not mean that we get precise estimates. On average, the point estimate of the difference equals 2 and the 95% confidence limits are  0.85 and 3.15 (the point estimate plus or minus 0.46 times the standard deviation of 2.5). See Figure 7.
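
The contrast between power and precision can be reproduced approximately in R. This sketch uses the built-in power.t.test function and, for the interval, plugs in the population standard deviation where the sample standard deviation would normally be used, so the limits differ slightly from the rounded 0.85 and 3.15 above. The variable names are my own.

```r
n         <- 37     # planned sample size per group
sigma     <- 2.5    # population standard deviation
mean_diff <- 2.0    # unstandardized population mean difference

# Power of the two-sided t-test for a standardized difference of 2.0 / 2.5 = 0.80
power <- power.t.test(n = n, delta = mean_diff, sd = sigma,
                      sig.level = 0.05)$power

# Approximate expected MOE and the corresponding 95% confidence limits
df  <- 2 * (n - 1)
moe <- qt(0.975, df) * sigma * sqrt(2 / n)
ci  <- mean_diff + c(-1, 1) * moe

round(power, 2)
round(ci, 2)
```

The power is high, yet the interval still spans more than two units on the scale of the data.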

 Figure 7: Expected results using n = 37 when sampling from two normal populations with equal standard deviations (σ = 2.5) and mean difference of 2.0. The standardized effect size equals 0.80. Note the imprecision of the estimates even though the power of the t-test equals .92.

In short, even though there is a high probability of  (correctly) rejecting the null-hypothesis of equal population means, we are still not in the position to confidently conclude what the size of the difference is: the expected confidence interval is very wide.

Planning for a precise contrast estimate: the mixed model case

In a previous post (here), we saw how we can determine sample size for obtaining, with assurance, a precise interaction contrast estimate. In that post we considered a 2 x 2 factorial design. In this post, I will extend the discussion to the mixed model case. That is, we will consider sample size planning for a precise interaction estimate in case of a design with 2 fixed factors and two random factors: participant and stimulus (item). (A pdf version of this post can be found here: view pdf. )

In order to keep things relatively simple, we will focus on a design where both participants and items are nested under condition. So, each treatment condition has a unique sample of participants and items. We will call this design the both-within-condition design  (see, for instance, Westfall et al. 2014, for detailed descriptions of this design). We will analyse the 2 x 2 factorial design as a single factor design (the factor has a = 4 levels) and formulate an interaction contrast.

Let’s start with p participants and q stimuli. We randomly assign n = p/a participants and m = q/a stimuli to each of the a treatment levels. The ANOVA table for the design is presented in Table 1.

We will use the ANOVA table to illustrate a few concepts that are important to consider when analysing data using mixed modeling. Maybe you remember that in the previous post, we used the ANOVA source table to obtain an expression for the variance of a contrast. In particular, we used the error variance (MSerror) that is also used to form the F-ratio for testing the interaction effect.

Obtaining an appropriate error term

Now, the inclusion of the second random factor (i.e. stimulus in addition to participant) leads, in comparison to the design in the previous post, to a complication. In order to see this, take a look again at the ANOVA table we get when we use SPSS univariate (see Figure 1 and SPSS syntax below). (Important: do not use SPSS GLM Univariate for estimating contrasts in this design; the procedure uses the incorrect standard error; I am using the procedure now just for illustrating a few key concepts).

UNIANOVA score BY cond pp ss
/RANDOM=pp ss
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/CRITERIA=ALPHA(0.05)
/DESIGN=cond pp WITHIN cond ss WITHIN cond
/CONTRAST(cond) = SPECIAL(1, -1, -1, 1).

 Figure 1: SPSS GLM ANOVA output
We can see that the effect of condition is not tested against MSError but against an error term formed by linearly combining MSpp, MSss, and MSerror. In particular, MSpp and MSss are added and MSerror is subtracted. See footnote a below the source table. It’s a bit hard to explain why that is done, but I’ll have a go at an explanation nonetheless.

Take a look at Table 1 and focus on the Participant row. The expected Mean Square (EMS) associated with Participant is $\sigma^2_e + m\sigma^2_p$. Now, suppose that due to some freak accident of nature there are no differences in the mean scores (averaged over stimuli) of each participant. In that case, $\sigma^2_p = 0$. This means that under these circumstances the mean square associated with participants is simply an estimate of the error variance with $a(n - 1)$ degrees of freedom, because $\sigma^2_e + m\sigma^2_p = \sigma^2_e$, if $\sigma^2_p = 0$. Of course, the other estimate of the Error variance is MSError and this estimate is based on $a(n - 1)(m - 1)$ degrees of freedom. The logic of the F-test is that under the null-hypothesis, in our case that $\sigma^2_p = 0$, the ratio of these two estimates of the error variance follows an F-distribution with $a(n - 1)$ and $a(n - 1)(m - 1)$ degrees of freedom.

Now focus on the Treatment row in Table 1. The expected mean square associated with Treatment equals $\sigma^2_e + m\sigma^2_p + n\sigma^2_s + nm\theta^2_T$, where $\theta^2_T$ reflects the differences between the treatment means. If we now suppose that there is no difference between the treatment means, that is $\theta^2_T = 0$, MSTreatment does not estimate $\sigma^2_e$, but $\sigma^2_e + m\sigma^2_p + n\sigma^2_s$. Note that no other source of variance has an expected mean square that is equal to the latter figure. That is, in contrast to our test of the Participant factor, where under the null-hypothesis two Mean Squares estimate the error variance, i.e. MSParticipant and MSError, no mean square is available to form an F-ratio to test the Treatment effect.

But a linear combination of MSParticipant, MSStimulus and MSError does provide an estimate with expected value $\sigma^2_e + m\sigma^2_p + n\sigma^2_s$. Namely, the sum of the participant and stimulus mean squares minus mean square error: $MS_p + MS_s - MS_e$. It is exactly this linear combination of mean squares that is used in the F-ratio to obtain an error term against which to test the Treatment effect in Figure 1: $6.403 + 10.137 - 1.470 = 15.07$. We will also use this figure to obtain the variance (and standard error) of our contrast estimate.

Degrees of freedom

If you take a closer look at Figure 1, in particular the degrees of freedom column, you will notice that the degrees of freedom associated with the error term that is used to test the Treatment effect is a fractional number and not a nice round number that you would expect to get if you only consider the degrees of freedom in Table 1. The cause of this fractional number is that we cannot simply use the degrees of freedom of the mean square used to test the treatment effect, because that mean square does not exist. Indeed, we had to combine three mean squares in order to obtain an estimate of the error term for the Treatment effect. The consequence of this is that we will have to use an approximation of the degrees of freedom associated with that error term.

SPSS (and my precision app) use the Satterthwaite procedure to approximate the degrees of freedom of the error term. That approximation is as follows (notice that the numerator is the square of the linear combination of mean squares used to obtain the error term).

$$df = \frac{(MS_p + MS_s - MS_e)^2}{\frac{MS_p^2}{df_p} + \frac{MS_s^2}{df_s} + \frac{MS_e^2}{df_e}}$$

Thus, using the results in Figure 1:

MSp = 6.403
MSs = 10.137
MSe = 1.470
dfp = 44
dfs = 20
dfe = 220
df = (MSp + MSs - MSe)^2 / (MSp^2/dfp + MSs^2/dfs + MSe^2/dfe)
df

## [1] 37.35559


The margin of error of a contrast estimate

Now that we have obtained the error variance of a treatment effect by using a linear combination of mean squares and a Satterthwaite approximation of the degrees of freedom we are able to figure out the margin of error (MOE) of our contrast estimate. Just as in the simple between subjects design we discussed previously we obtain MOE by multiplying the standard error of the estimate with a critical value of t. The critical value of t is the .975 quantile of the central t-distribution with the Satterthwaite approximated degrees of freedom (if you are looking for something other than a 95% confidence interval, you will have to use another critical value, of course). The following code gives the critical value of t for a 95% confidence interval (change the value of C if you want something other than a 95% interval).

C = .95
alpha = 1 - C
critT = qt(1 - alpha/2, df)
critT

## [1] 2.025542


The standard error of the contrast estimate can be obtained as follows.

$$\hat{\sigma}_{\hat{\psi}} = \sqrt{\sum_{i} c_i^2 \, \hat{\sigma}^2_{rel}}$$

where I have used the symbol $\hat{\sigma}^2_{rel}$ to refer to the relative error variance of the treatment mean (which in this design is equal to the absolute error variance, but that’s another story), and $c_i$ refers to the contrast weight of treatment mean i. The relative error variance of the treatment mean is obtained by dividing the error variance that is used to test the treatment effect by the total number of observations in each treatment, $nm$. Thus, using the results in Figure 1: $\hat{\sigma}^2_{rel} = (6.403 + 10.137 - 1.470) / (12 \times 6) = 0.2093$.

If we want to estimate an interaction contrast for the 2 x 2 design, we may, for example, specify contrast weights $\{1, -1, -1, 1\}$. Let’s use the results in Figure 1 to calculate what MOE is for this particular contrast.

#sample sizes per treatment
n = 12
m = 6

#obtained mean squares (see Figure 1):
MSp = 6.403
MSs = 10.137
MSe = 1.470

#Relative error variance:
VarT = (MSp + MSs - MSe) / (n*m)

#contrast weights:
weights = c(1, -1, -1, 1)

#standard error of contrast estimate
SEcontrast = sqrt(sum(weights^2)*VarT)

#Satterthwaite degrees of freedom:
dfp = 44
dfs = 20
dfe = 220
df = (MSp + MSs - MSe)^2 / (MSp^2/dfp + MSs^2/dfs + MSe^2/dfe)

#critical T
critT = qt(.975, df)

#Margin of Error
MOE = critT * SEcontrast

SEcontrast; MOE

## [1] 0.9149985

## [1] 1.853368


SPSS GLM Univariate uses the wrong standard error for a mixed model contrast estimate

Even though SPSS GLM Univariate allows you to specify a mixed model design and tests the treatment effect with a linear combination of mean squares, the procedure does not use the correct error variance if you want to estimate the value of a contrast (using the CONTRAST subcommand); it uses MSError instead. In our example, then, SPSS uses an error variance that is an order of magnitude smaller than the correct error variance: $MS_e = 1.470$, with 220 degrees of freedom, and not $MS_p + MS_s - MS_e = 15.07$, with 37.357 degrees of freedom (see Figure 1). The consequence of this is, of course, that the 95% CI is much narrower than it should be.

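Running the syntax above Figure 1 gives the output in Figure 2. The results can be reproduced as follows. The standard error of the contrast is the result of $\sqrt{\sum_i c_i^2 \, MS_e / (nm)} = \sqrt{4 \times 1.470 / 72} = 0.2858$, the critical value of t is the .975 quantile of the central t distribution with $df = 220$, which equals $1.9708$. The value of MOE is therefore $1.9708 \times 0.2858 = 0.563$. The 95% CI in Figure 2 is the contrast estimate plus or minus this value. In comparison, using the correct value of MOE gives us the contrast estimate plus or minus 1.853, an interval more than three times as wide.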
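
The two margins of error can be put side by side in R. This sketch assumes that the CONTRAST subcommand simply uses MSError with its 220 degrees of freedom, which is my reading of the output rather than documented behaviour; the variable names are my own.

```r
# sample sizes per treatment and contrast weights
n <- 12
m <- 6
weights <- c(1, -1, -1, 1)

# mean squares and degrees of freedom (see Figure 1)
MSp <- 6.403; MSs <- 10.137; MSe <- 1.470
dfp <- 44;    dfs <- 20;     dfe <- 220

# MOE based on MSError only (assumed to be what GLM Univariate does)
se_glm  <- sqrt(sum(weights^2) * MSe / (n * m))
moe_glm <- qt(.975, dfe) * se_glm

# MOE based on the linear combination of mean squares and Satterthwaite df
ms_comb   <- MSp + MSs - MSe
df_satt   <- ms_comb^2 / (MSp^2 / dfp + MSs^2 / dfs + MSe^2 / dfe)
se_mixed  <- sqrt(sum(weights^2) * ms_comb / (n * m))
moe_mixed <- qt(.975, df_satt) * se_mixed

round(c(moe_glm = moe_glm, moe_mixed = moe_mixed,
        ratio = moe_mixed / moe_glm), 3)
```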

 Figure 2: SPSS GLM Univariate Contrasts Output
Thus, even though SPSS GLM Univariate gives us the ingredients to work with, i.e. an estimate of the error variance and approximate degrees of freedom, it should not be used for obtaining contrast estimates if you have a mixed model. SPSS Mixed does a much better job and the MIXED output also contains other useful data we can use for sample size planning. (In practice, I use the linear mixed effects modeling package LME4 and not so much SPSS). Have a quick look at Figure 3 for the contrast estimate obtained with the mixed procedure. (Note how the numbers are essentially the same as the ones we obtained when using the ANOVA source table of SPSS GLM Univariate (Figure 1)).
MIXED score BY cond
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1)
SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE)
LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED= cond | SSTYPE(3)
/METHOD=REML
/TEST= 'interaction' cond 1 -1 -1 1
/RANDOM=INTERCEPT | SUBJECT(pp) COVTYPE(VC)
/RANDOM=INTERCEPT | SUBJECT(ss) COVTYPE(VC).

 Figure 3: Contrast Estimate SPSS Mixed

Planning for precision

Even though the result in Figure 3 is hard to interpret without substantive detail (the data are made up), it is clear that the precision of the estimate is, well, suboptimal. As an indication: the estimated within treatment standard deviation is about 1.74, so expressed as Cohen’s d the estimated difference between differences (interaction) is, according to the rules of thumb, a medium negative effect, but it is consistent with anything from a huge negative effect to a large positive effect in the population, as the approximate 95% CI shows. (I have divided the point estimate and the confidence interval in Figure 2 by 1.74, to obtain Cohen’s d and an approximate confidence interval). Clearly, then, our precision can be optimized.

Suppose that you are very fond of the both-within-condition design (BwC-design) and you plan to use it again in a replication study. You could of course opt for a design with better expected precision but, based on the data and estimates at hand, that involves a lot of assumptions; I will show you how you can do it in one of the next posts. If you plan for precision using the BwC-design, you need the following ingredients.

1. A figure for your target MOE. Let’s set target MOE to .40.

2. A specification of the percentage of assurance. Let’s say we want 80% assurance that the obtained MOE will not exceed the target MOE of .40.

3. Estimates (or guesstimates) of the person variance $\sigma^2_p$, the stimulus variance $\sigma^2_s$, and the error variance $\sigma^2_e$. We will have a look in the next section.

4. Functions for calculating the relative error variance, degrees of freedom, MOE and determining the required sample sizes for Participants and Stimuli. These are all present in the Precision App, so I will use the application, but I will show how the results of the sample sizes relate to the information above.

Obtaining estimates of the variance components

We need to specify the values of three variance components. These variance components can be estimated on the basis of the mean squares and sample sizes obtained with SPSS GLM Univariate, or we can use SPSS MIXED to obtain direct estimates, or any other way to estimate variance components, such as GLM VARCOMPS (which has several estimation procedures). I like to use SPSS MIXED or LME4, and not a dedicated program for variance components, because most of the time the main purpose of the analysis I am doing is obtaining contrast estimates or F-tests, so the variance components estimates are usually a handy by-product of my main analysis. For demonstrative purposes, I will show how it can be done with the GLM univariate output and I will show how the results match those of SPSS MIXED.

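Take a look at Figure 1. The estimate of the error variance $\sigma^2_e$ is simply MS(Error) = 1.47. For obtaining an estimate for the variance component associated with participants, we set the obtained mean square equal to the expected mean square (see Table 1). Thus, $6.403 = \hat{\sigma}^2_e + 6\hat{\sigma}^2_p$. Rearranging and using 1.47 as an estimate for $\sigma^2_e$ leads to $\hat{\sigma}^2_p = (6.403 - 1.47) / 6 = 0.822$. Likewise, the estimate for $\sigma^2_s$ is $(10.137 - 1.47) / 12 = 0.722$. Thus, our estimates are $\hat{\sigma}^2_e = 1.47$, $\hat{\sigma}^2_p = 0.822$, and $\hat{\sigma}^2_s = 0.722$.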
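
The same arithmetic in R, as a small sketch. It sets each obtained mean square equal to its expected mean square for the both-within-condition design (error variance plus m times the participant variance, and error variance plus n times the stimulus variance); the variable names are my own.

```r
# sample sizes per treatment
n <- 12   # participants
m <- 6    # stimuli

# obtained mean squares (see Figure 1)
MSp <- 6.403
MSs <- 10.137
MSe <- 1.470

# Solve MS = EMS for each variance component
var_e <- MSe                  # error variance
var_p <- (MSp - var_e) / m    # participant variance
var_s <- (MSs - var_e) / n    # stimulus variance

# Within treatment standard deviation implied by the three components
sd_within <- sqrt(var_e + var_p + var_s)

round(c(var_e = var_e, var_p = var_p, var_s = var_s, sd_within = sd_within), 3)
```

The implied within treatment standard deviation is about 1.74, the value used earlier to express the contrast estimate as Cohen’s d.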

In order to obtain direct estimates you can use SPSS Mixed (or GLM Varcomps, or whatever you like). If you run the SPSS syntax in Figure 3, you will find estimates of the variance components under the heading Covariance Parameters in your SPSS output. See Figure 4. Note that the standard errors are pretty large, so the point estimates are not very precise. But since it is the only information we have, we will consider the point estimates to be the best we have.

 Figure 4: Variance components estimates

Getting sample sizes with the Precision application

Let’s use the Precision app (https://gmulder.shinyapps.io/precision/) for sample size planning. Set the design to Stimulus and Participant within condition, the number of conditions to 4 and in the options for contrast 3 fill in the weights $\{1, -1, -1, 1\}$ (Note: it is not necessary to fill it in in contrast 3).

For target MOE fill in .40, for assurance the value .80 and the values 1.47, 0.822, and 0.722 for, respectively, Residual variance, Participant intercept variance and Stimulus intercept variance. Fill in the value 0 for all the other variances. See Figure 5.

 Figure 5: Setting values in the Precision App.

Press the button “Get Sample Sizes”. The calculations take a while, so make yourself some coffee (or anything else you like) and when you return the screen should show something like Figure 6a.
 Figure 6a: Output for planning target MOE = 0.40

 Figure 6b: Output for planning target MOE = 0.50

By the way, if you wonder why you can simply set the three interaction variance components to zero, then it may be nice to know that the variance components estimates obtained from the both-within-condition design already include them. For example, the estimate of the residual variance obtained with the both-within-condition design is actually the sum of the residual variance component and the interaction component of participant and stimulus. These latter components can only be separated in a fully-crossed-design where all participants respond to all stimuli in all conditions. Thus, if we use the symbol $\sigma^2_{e'}$ to refer to the residual variance in the BwC-design, we can say $\sigma^2_{e'} = \sigma^2_e + \sigma^2_{ps}$. Normally, the precision app sums these two components to get a value for the residual variance in the BwC-design, and you will obviously get the same result if you specify the residual variance as the sum and the participant-by-stimulus variance as 0. Likewise, $\sigma^2_{p'} = \sigma^2_p + \sigma^2_{tp}$, and $\sigma^2_{s'} = \sigma^2_s + \sigma^2_{ts}$, where $\sigma^2_{tp}$ and $\sigma^2_{ts}$ are the variances associated with the interaction of treatment and participant and treatment and stimulus, respectively.

If you look at the sample sizes in Figure 6a, you may notice that the numbers look odd. For example, the app says that the smallest number of stimuli is 877 but it also says that you only need 500 stimuli if you select 802 participants. Something similar happens with the participants: the output says that the smallest number of participants is 802, but it also suggests using 500 of them if you use 877 stimuli, which is clearly smaller than 802. To me this seems a little inconsistent, but I think I figured out what’s going on. The application minimizes each sample size subject to a maximum of 500 for the other sample size. So, the smallest number of stimuli is 877 given that the maximum number of participants is 500. In other words, a smaller number of stimuli is possible, but then we have to increase the maximum number of participants. Put differently, in order to have 80% assurance to obtain a MOE of no more than .40, we need at least 500 stimuli or at least 500 participants. If you look at Figure 6b, you will not notice these inconsistencies. The difference between the left and right sample sizes is that the sizes on the right are based on a target MOE of .50 instead of .40.

According to Figure 6a, we can obtain our target if we use 802 participants and 500 stimuli. Since we are planning for an experiment with 4 treatment conditions, these total sample sizes need to be divided by 4 to get the sample sizes per treatment condition. Thus, n = 804 / 4 = 201 participants, and m = 500 / 4 = 125 stimuli per treatment condition (I’ve increased the participant sample size from 802 to 804 to make it divisible by 4). For many experiments these numbers are impractically large, of course, so in this case you would probably either consider an alternative design or else live with the message that you may not get the precision you want or need.

Checking the sample size suggestions using what we know

If we fill in the sample sizes (804 participants and 500 stimuli) in the Precision app we get the results presented in Figure 7 for the interaction contrast (contrast 3): the expected MOE and the MOE that will not be exceeded with 80% assurance. Note, again, that the assurance MOE is somewhat larger than target MOE, because a sample of 804 participants requires a sample of more than 500 stimuli to get the target MOE with 80% assurance, and 500 stimuli is the maximum number of stimuli the app considers when minimizing the number of participants.

 Figure 7: Expected and Assurance MOE for the interaction contrast (contrast 3) using 804 participants and 500 stimuli
Let’s see if we can reconstruct the figures using what we know from previous sections. First the relative error variance of the treatment mean. That relative error variance is

The degrees of freedom can be calculated by first filling in the expected mean squares and the degrees of freedom presented in Table 1: , , , , , and . The Satterthwaite degrees of freedom are . The standard error of the contrast equals . The critical value for t equals . Expected MOE is, therefore, (the tiny difference with the results from the app is due to rounding errors).

For the calculation of assurance MOE we need to take the sampling distribution of the relative error variance of the treatment mean into account. The app uses the (scaled) $\chi^2$-distribution. That is, we assume that, with assurance $\gamma$, the estimated relative error variance will not exceed the $\gamma$ quantile of its sampling distribution, which is the expected relative error variance multiplied by $\chi^2_\gamma(df)/df$. Now, the degrees of freedom are 1092.66 and the assurance is $\gamma = .80$, so we need the .80 quantile of $\chi^2$ with 1092.66 degrees of freedom. Multiplying the relative error variance by this quantile divided by the degrees of freedom gives the .80 quantile of the error variance, and using that quantile instead of the expected error variance in the calculation of MOE gives assurance MOE. Again, the difference with the results from the Precision App is due to rounding error.
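The step from expected MOE to assurance MOE can be sketched in R. The snippet uses only the degrees of freedom (1092.66) and the assurance (.80) mentioned above; the variable names are mine, and I do not fill in a value for the relative error variance.

```r
# Sketch: by what factor does assurance MOE exceed expected MOE?
df <- 1092.66      # Satterthwaite degrees of freedom quoted in the text
assurance <- .80

# .80 quantile of the chi-squared distribution with df degrees of freedom
q <- qchisq(assurance, df)

# With probability .80 the estimated error variance is at most (q / df) times
# its expected value, so MOE is inflated by the square root of that factor
inflation <- sqrt(q / df)
inflation
```

With this many degrees of freedom the factor is only slightly larger than 1, which is why expected MOE and assurance MOE are so close in this example.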

The omnibus F-test may be ignored if you use multiple comparison procedures

I think trying to be scientific with a small s involves asking critical questions about common wisdom or common practice. In this post, I would like to focus on multiple comparisons in the context of ANOVA. What does common practice indicate?

Common wisdom suggests doing multiple comparisons only if the F-test is significant

Let’s have a look at some practical advice concerning multiple comparisons found on the web (R-bloggers.com) and in Field (2013).

“One way to begin an ANOVA is to run a general omnibus test. The advantage to starting here is that if the omnibus test comes up insignificant, you can stop your analysis and deem all pairwise comparisons insignificant. If the omnibus test is significant, you should continue with pairwise comparisons” (https://www.r-bloggers.com/r-tutorial-series-one-way-anova-with-pairwise-comparisons/)

“When we have a statistically significant effect in ANOVA and an independent variable of more than two levels, we typically want to make follow-up comparisons. There are numerous methods for making pairwise comparisons and this tutorial will demonstrate how to execute several different techniques in R.” (https://www.r-bloggers.com/r-tutorial-series-anova-pairwise-comparison-methods/)
And have a look at how the textbook I used to use in my statistics course explains it.

“It might seem a bit unhelpful that an ANOVA doesn’t tell you which groups are different from which, given that having gone to the trouble of running an experiment, you probably need to know more than ‘there’s some difference somewhere or other’. You might wonder, therefore, why we don’t just carry out a lot of t-tests, which would tell us very specifically whether pairs of group means differ. Actually, the reason has already been explained in Section 2.1.6.7: every time you run multiple tests on the same data you inflate the potential Type I errors that you make. However, we’ll return to this point in Section 11.5 when we look at how we follow up an ANOVA to discover where the group difference lie.” (Field, 2015, p. 442).
Although, in honesty, on p. 459 Field writes:

“The least significance difference (LSD) pairwise comparison makes no attempt to control Type I error and is equivalent to performing multiple t-tests on the data. The only difference is that LSD requires the overall ANOVA to be significant.”

This is meant to inform about the relative merits of one post hoc procedure compared to another in terms of Type I and Type II error. Crucially, it is not mentioned that the other post hoc procedures require the overall ANOVA to be significant (as common wisdom seems to suggest). However, his flow chart of the ANOVA procedure (p. 460) clearly suggests that multiple comparison procedures should be used as post hoc procedures (after the ANOVA is significant).

Thus, common “statistical” wisdom seems to suggest that multiple comparison procedures are to be used as post hoc procedures following up a significant omnibus F-test. And the reason is that this two-step procedure minimizes the probability of type I errors.

Now, let’s ask ourselves whether this common sense is, well, sensible.

Multiple comparisons only after significant F-test affects power negatively

Wilcox (2017) contains some useful information regarding our question. In his discussion of the much used Tukey-HSD procedure (the Tukey-Kramer Method), he references Bernhardson (1975), who shows that the probability of at least one type I error among pairwise comparisons of estimates of equal population means (i.e. true null-hypotheses) is no longer equal to $\alpha$ if the procedure is only carried out following a significant omnibus test. That is, if we use our beloved two-step procedure.

The consequence of the two-step procedure for the Tukey-HSD is that $\alpha$ is reduced. Thus, if we want our multiple comparisons procedure to generate one type I error or more with a probability of at most $\alpha$, using the two-step procedure leads to a lowered $\alpha$. This is, of course, bad news, because in the event that not all of the null-hypotheses are true, lowering $\alpha$ increases $\beta$, the probability of not rejecting when the null-hypothesis is false (keeping the sample size constant, of course). In other words, the two-step procedure decreases the power of the multiple comparison procedure.

In the words of Wilcox (2017):

“In practical terms, when it comes to controlling the probability of at least one type I error, there is no need to first reject with the ANOVA F test to justify using the Tukey-Kramer method. If the Tukey-Kramer method is used only after the F test rejects, power can be reduced. Currently, however, common practice is to use the Tukey-Kramer method only if the F-test rejects. That is, the insight reported by Bernhardson is not yet well known.”  (p. 385).

In conceptual terms, the fact that the probability of at least one type I error in the multiple comparison procedure is smaller than $\alpha$ if the procedure is only used when the F-test rejects is pretty clear, at least to me it is. Suppose we reject if the p-value of the F-test is smaller than or equal to 5%. This will also be the probability that we conduct the multiple comparison test over repeated replications of the same experiment. Of that 5%, not every application of the procedure will result in at least one type I error. Indeed, a puzzling fact for many beginning researchers is that the F-test is significant while none of the pairwise comparisons is. In other words, some of those 5% of the cases in which we perform the procedure following a significant F-test will probably not reject any of the pairwise null-hypotheses, unless it is guaranteed that at least one type I error per application will be made.

(With no adjustment of $\alpha$ for multiple comparisons, this will happen (with high probability, so no guarantee) if a huge number of pairwise comparisons are made. For instance, with 99 unadjusted multiple comparisons the probability of at least one type I error is 99%. This is why it makes sense to demand that the F-test is significant before testing multiple comparisons with the LSD procedure, although the latter seems to run into trouble with more than 3 groups (Wilcox, 2017).)
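The 99% figure can be checked with a one-line calculation. The sketch below assumes independent tests, each carried out at $\alpha = .05$; pairwise comparisons on the same data are not independent, so treat the result as an approximation. The function name fwer is mine.

```r
# Probability of at least one type I error among k independent tests,
# each carried out at level alpha without any adjustment
fwer <- function(k, alpha = .05) {
  1 - (1 - alpha)^k
}

# one test, three tests, six tests (all pairs of four groups), 99 tests
fwer(c(1, 3, 6, 99))
```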

A quick simulation study

My hunch is that the two-step procedure is unnecessary for the Tukey-Kramer method as well as for other multiple comparison procedures (the exception being Fisher’s LSD procedure, which was designed as a post hoc procedure to be used as a follow-up after a significant F-test, as Field (2013) rightly points out), but I only focused on the Tukey-Kramer method. What I did was a simple simulation study with a four group between subjects design (all $\mu$’s equal), in which I estimated the probability of at least one type I error both with and without using the two-step procedure.

set.seed(456)
#number of groups
ngr = 4

#number of participants
n = 40

#group is a factor
gr <- factor(rep(1:ngr, each=n))

#vector for storing rejections F-test
Reject <- rep(0, 10000)

#vector for storing #rejections multiple
#comparisons
RejectHSD <- rep(0, 10000)

for (i in 1:10000) {

  y = rnorm(ngr*n)
  mod = aov(y ~ gr)
  #does the omnibus F-test reject?
  Reject[i] = anova(mod)$"Pr(>F)"[1] <= .05
  #adjusted p-values of the pairwise comparisons
  PS <- TukeyHSD(mod)$gr[,4]
  RejectHSD[i] = sum(PS <=.05)
}

#probability type I error F-test
sum(Reject)/length(Reject)

## [1] 0.0515

#probability at least one type I error Tukey HSD
sum(RejectHSD > 0) / length(RejectHSD)

## [1] 0.0503

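#probability that the F-test rejects AND Tukey HSD makes at least one
#type I error, i.e. the type I error rate of the two-step procedure
#(note the division by the total number of replications)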
sum(RejectHSD[Reject==TRUE] > 0) / length(RejectHSD)

## [1] 0.0424


Even though a single (relatively tiny) simulation (which, by the way, still takes a long time to run) is not necessarily convincing, it does illustrate the main points of this post. First, the probability of at least one incorrect rejection using the TukeyHSD function is close to .05. With this particular random seed it even performs a little better than the ANOVA F-test: .0503 versus .0515. This illustrates that even without considering whether the omnibus test is significant, the main demand of not rejecting too many true null-hypotheses is completely satisfied. So, in practical terms, you can safely ignore the omnibus test if your concerns are about $\alpha$.

Second, the probability of incorrectly rejecting at least one true pairwise null-hypothesis when the comparisons are only carried out after a significant ANOVA F-test is estimated to be .0424. This shows that the two-step procedure leads to a larger decrease in the actual type I error probability than is wanted. Even though this may seem good news from the perspective of avoiding type I errors, the downside is that pairwise null-hypotheses that are false (and potentially important) may not be detected.
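It may help to split the three estimates from the simulation into their parts. The sketch below only uses the printed results (.0515, .0503 and .0424); because the last line of the simulation code divides by the total number of replications, .0424 is the probability that the F-test rejects and Tukey HSD rejects at least once. The variable names are mine.

```r
p_F     <- 0.0515  # F-test rejects
p_HSD   <- 0.0503  # at least one Tukey HSD rejection, F-test ignored
p_joint <- 0.0424  # F-test rejects AND at least one Tukey HSD rejection

# proportion of significant F-tests that are followed by at least one
# significant pairwise comparison
p_cond <- p_joint / p_F

# proportion of replications in which Tukey HSD rejects although the
# F-test does not; the two-step procedure never looks at these cases
p_lost <- p_HSD - p_joint

c(conditional = p_cond, lost = p_lost)
```

So in roughly one in six of the replications with a significant F-test no pairwise comparison is significant, and the two-step procedure discards a small set of Tukey rejections altogether.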

Conclusion

Common wisdom and practice suggest that multiple comparison procedures should be done only after a significant omnibus test. We have seen that this is not at all necessary if we use a multiple comparison procedure that is designed to control the type I error probability. To my knowledge, most of the procedures conventionally thought of as post hoc tests are designed in this manner, the exception being the LSD procedure, which does require a significant F-test. For practical purposes, then, do not bother with the omnibus test (note the exception) if you are planning to make pairwise comparisons of all the treatment means.
This practical advice does not mean, of course, that I am suggesting you spend your time comparing all treatment means. Most of the time, focused comparisons are a more fruitful way of analysing your data. But I’ll leave that topic for another time.

References

Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics. 4th Edition. London: Sage.
Wilcox, R. (2017). Understanding and Applying Basic Statistical Methods Using R. Hoboken, NJ: Wiley.

Planning for a precise slope estimate in simple regression

In this post, I will show you a way of determining a sample size for obtaining a precise estimate of the slope $\beta_1$ of the simple linear regression equation $Y = \beta_0 + \beta_1 X + \epsilon$. The basic ingredients we need for sample size planning are a measure of the precision, a way to determine the quantiles of the sampling distribution of our measure of precision, and a way to calculate sample sizes.

As our measure of precision we choose the Margin of Error (MOE), which is the half-width of the 95% confidence interval of our estimate (see: Cumming, 2012; Cumming & Calin-Jageman, 2017; see also www.thenewstatistics.com).

The distribution of the margin of error of the regression slope

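In the case of simple linear regression, assuming normality and homogeneity of variance, MOE is $t_{.975}(df_e)\,\hat\sigma_{\hat\beta_1}$, where $t_{.975}(df_e)$ is the .975 quantile of the central t-distribution with $df_e = N - 2$ degrees of freedom, and $\hat\sigma_{\hat\beta_1}$ is the standard error of the estimate of $\beta_1$.
An expression of the squared standard error of the estimate of $\beta_1$ is (Wilcox, 2017): the variance of Y given X divided by the sum of squared errors of X (the squared deviations of X from its mean). The variance $\sigma^2_{Y|X}$ equals $\sigma^2_Y(1 - \rho^2_{YX})$, the variance of Y multiplied by 1 minus the squared population correlation between Y and X, and it is estimated with the residual variance $\hat\sigma^2_{Y|X} = \sum (Y_i - \hat{Y}_i)^2 / df_e$, where $df_e = N - 2$.
The estimated squared standard error is given in (1)

$$\hat\sigma^2_{\hat\beta_1} = \frac{\hat\sigma^2_{Y|X}}{\sum (X_i - \bar{X})^2} \qquad (1)$$

With respect to the sampling distribution of MOE, we first note the following. The distribution of estimates of the residual variance in the numerator of (1) is a scaled $\chi^2$-distribution:

$$\frac{df_e\,\hat\sigma^2_{Y|X}}{\sigma^2_{Y|X}} \sim \chi^2(df_e),$$

thus

$$\hat\sigma^2_{Y|X} \sim \frac{\sigma^2_{Y|X}\,\chi^2(df_e)}{df_e}.$$

Second, we note that

$$\sum (X_i - \bar{X})^2 = df_x\,\hat\sigma^2_X,$$

where $df_x = N - 1$, therefore

$$\sum (X_i - \bar{X})^2 \sim \sigma^2_X\,\chi^2(df_x).$$

Alternatively, since $\hat\sigma^2_X \sim \sigma^2_X\,\chi^2(df_x)/df_x$, and multiplying by 1 ($df_x/df_x$), $\sum (X_i - \bar{X})^2 \sim df_x\,\sigma^2_X\,\chi^2(df_x)/df_x$.

In terms of the sampling distribution of (1), then, we have the ratio of two (scaled) $\chi^2$ distributions, one with $df_e = N - 2$ degrees of freedom, and one with $df_x = N - 1$ degrees of freedom. Or something like:

$$\hat\sigma^2_{\hat\beta_1} \sim \frac{\sigma^2_{Y|X}\,\chi^2(df_e)/df_e}{df_x\,\sigma^2_X\,\chi^2(df_x)/df_x} = \frac{\sigma^2_{Y|X}}{df_x\,\sigma^2_X}F(df_e, df_x),$$

which means that the sampling distribution of MOE is:

$$MOE \sim t_{.975}(df_e)\sqrt{\frac{\sigma^2_{Y|X}}{df_x\,\sigma^2_X}F(df_e, df_x)} \qquad (2)$$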

This last equation, that is (2), can be used to obtain quantiles of the sampling distribution of MOE, which enables us to determine assurance MOE, that is the value of MOE that under repeated sampling will not exceed a target value with a given probability. For instance, if we want to know the .80 quantile of estimates of MOE, that is, assurance is .80, we determine the .80 quantile of the (central) F-distribution with N – 2 and N – 1 degrees of freedom and fill in (2) to obtain a value of MOE that will not be exceeded in 80% of replication experiments.
For instance, suppose $\sigma^2_Y = 1$, $\sigma^2_X = 1$, $\rho_{YX} = .5$, $N = 100$, and assurance is .80, then according to (2), 80% of estimated MOEs will not exceed the value given by:
vary = 1
varx = 1
rho = .5
N = 100
dfe = N - 2
dfx = N - 1
assu = .80
t = qt(.975, dfe)
MOE.80 = t*sqrt(vary*(1 - rho^2)*qf(.80, dfe, dfx)/(dfx*varx))
MOE.80

## [1] 0.1880535


What does a quick simulation study tell us?

A quick simulation study may be used to check whether this is at all accurate. And, yes, the estimated quantile from the simulation study is pretty close to what we would expect based on (2). If you run the code below, the estimate equals 0.1878628.
library(MASS)
set.seed(355)
m = c(0, 0)

# note: s below is the variance-covariance matrix. In this case,
# rho and the cov(y, x) have the same values
# otherwise: rho = cov(x, y)/sqrt(varY*VarX) (to be used in the
# functions that calculate MOE)
# equivalently, cov(x, y) = rho*sqrt(varY*varX) (to be used
# in the specification of the variance-covariance matrix for
#generating bivariate normal variates)

s = matrix(c(1, .5, .5, 1), 2, 2)
se <- rep(0, 10000)
for (i in 1:10000) {
  theData <- mvrnorm(100, m, s)
  mod <- lm(theData[,1] ~ theData[,2])
  #element 4 of the coefficients matrix is the standard error of the slope
  se[i] <- summary(mod)$coefficients[4]
}
MOE = qt(.975, 98)*se
quantile(MOE, .80)

##       80%
## 0.1878628

Planning for precision

If we want to plan for precision we can do the following. We start by making a function that calculates the assurance quantile of the sampling distribution of MOE described in (2). Then we formulate a squared cost function, which we will optimize for the sample size using the optimize function in R.
Suppose we want to plan for a target MOE of .10 with 80% assurance. We may do the following.
vary = 1
varx = 1
rho = .5
assu = .80
tMOE = .10

MOE.assu = function(n, vary, varx, rho, assu) {
  varY.X = vary*(1 - rho^2)
  dfe = n - 2
  dfx = n - 1
  t = qt(.975, dfe)
  q.assu = qf(assu, dfe, dfx)
  MOE = t*sqrt(varY.X*q.assu/(dfx * varx))
  return(MOE)
}

cost = function(x, tMOE) {
  cost = (MOE.assu(x, vary=vary, varx=varx, rho=rho, assu=assu) - tMOE)^2
}

#note samplesize is at least 40, at most 5000.
#note that since we already know that N = 100 is not enough
#instead of 40 we might just as well set N = 100 at the lower
#limit of the interval
(samplesize = ceiling(optimize(cost, interval=c(40, 5000), tMOE = tMOE)$minimum))

## [1] 321

#check the result:
MOE.assu(samplesize, vary, varx, rho, assu)

## [1] 0.09984381
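As a cross-check on the optimize-based answer, the sample size can also be found by simply evaluating (2) for every candidate sample size and taking the first one that meets the target. This is only a sketch: moe_assurance, candidates and n_min are names I made up, the defaults repeat the example values above, and it relies on assurance MOE decreasing as the sample size grows.

```r
# Assurance MOE of the slope as a function of sample size, same formula as (2)
moe_assurance <- function(n, vary = 1, varx = 1, rho = .5, assu = .80) {
  dfe <- n - 2
  dfx <- n - 1
  qt(.975, dfe) * sqrt(vary * (1 - rho^2) * qf(assu, dfe, dfx) / (dfx * varx))
}

target <- .10
candidates <- 40:5000

# first candidate sample size whose assurance MOE does not exceed the target
n_min <- candidates[which(moe_assurance(candidates) <= target)[1]]
n_min
moe_assurance(n_min)
```

Because the search works on whole numbers, there is no need to round the result, and it should agree with the value found by optimize.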


Let’s simulate with the proposed sample size

Let’s check it with a simulation study. The estimated .80 quantile of the estimates of MOE is 0.1007269 (if you run the code below with random seed 355), which is pretty close to what we would expect based on (2).
set.seed(355)
m = c(0, 0)

# note: s below is the variance-covariance matrix. In this case,
# rho and the cov(y, x) have the same values
# otherwise: rho = cov(x, y)/sqrt(varY*VarX) (to be used in the
# functions that calculate MOE)
# equivalently, cov(x, y) = rho*sqrt(varY*varX) (to be used
# in the specification of the variance-covariance matrix for
# generating bivariate normal variates)

s = matrix(c(1, .5, .5, 1), 2, 2)
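se <- rep(0, 10000)
samplesize = 321
for (i in 1:10000) {
  theData <- mvrnorm(samplesize, m, s)
  mod <- lm(theData[,1] ~ theData[,2])
  se[i] <- summary(mod)$coefficients[4]
}
MOE = qt(.975, samplesize - 2)*se
quantile(MOE, .80)

##       80%
## 0.1007269
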
## $minimum
## [1] 246.4563
##
## $objective
## [1] 8.591375e-18

Thus, according to the optimize function we need 247 participants (per group; total N = 988), to get an expected MOE equal to our target MOE. The expected MOE equals 0.4553, which you can confirm by using the MOE function we made above.

Planning with assurance

Although expected MOE is close to our target MOE, there is a probability of 50% that the obtained MOE will be larger than our target MOE. In other words, under repeated sampling about half of the obtained MOEs will be larger than what we want. That is to say, we have 50% assurance that our obtained MOE will be at least as small as our target MOE.
Planning with assurance means that we aim for a certain specified assurance that our obtained MOE will not exceed our target MOE. For instance, we may want to have 80% assurance that our obtained MOE will not exceed our target MOE. Basically, what we need to do is take the sampling distribution of the estimate of Mean Square Error into account. We use the following formula (see also my post introducing the Precision App for the general formulae: https://the-small-s-scientist.blogspot.nl/2017/04/planning-for-precision.html).

$$MOE_\gamma = t_{.975}(df)\,\sigma_{\hat\psi}\sqrt{\frac{\chi^2_\gamma(df)}{df}},$$

where $\sigma_{\hat\psi}$ is the standard error of the contrast estimate and $\gamma$ is the assurance expressed as a probability between 0 and 1. Let’s do it in R. Again, the function that calculates assurance MOE is tailored for the specific situation, but it is relatively easy to formulate these functions in a generally applicable way.

MOE.gamma = function(n) {
  df = 4*(n-1)
  MOE = 2*qt(.975, df)*sqrt(3.324/n*qchisq(.80, df)/df)
}

loss <- function(n) {
  (MOE.gamma(n) - 0.4558)^2
}

optimize(loss, c(100, 1000))

## $minimum
## [1] 255.576
##
## $objective
## [1] 2.900716e-18


Thus, according to the results, we need 256 persons per group (N = 1024 in total) to have an 80% probability of obtaining a MOE not larger than our target MOE. In that case, our expected MOE will be 0.4472.
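The two expected MOEs mentioned above (0.4553 with 247 and 0.4472 with 256 persons per group) can be checked with a small function. This is a sketch under the assumption that expected MOE is simply MOE.gamma without the $\chi^2$ factor; MOE.expected is my own name for it and is not necessarily the MOE function referred to earlier.

```r
# Expected MOE for the contrast: the same ingredients as MOE.gamma,
# but without the chi-squared quantile that adds the assurance
MOE.expected <- function(n) {
  df <- 4 * (n - 1)
  2 * qt(.975, df) * sqrt(3.324 / n)
}

# per-group sample sizes without (247) and with (256) 80% assurance
round(MOE.expected(c(247, 256)), 4)
```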

Planning for precision with samples of participants and items

Many experiments involve the (quasi-)random selection of both participants and items. Westfall et al. (2014) provide a Shiny-app for power-calculations for five different experimental designs with selections of participants and items. Here I want to present my own Shiny-app for planning for precision of contrast estimates (for the comparison of up to four groups) in these experimental designs.  The app can be found here: https://gmulder.shinyapps.io/precision/

(Note: I have taken the code of Westfall’s app and added code or modified existing code to get precision estimates instead of power; so, without Westfall’s app, my own modified version would never have existed).

The plan for this post is as follows. I will present the general theoretical background (mixed model ANOVA combined with ideas from Generalizability Theory) by considering the comparison of three groups in a counterbalanced design.
Note 1: This post uses mathjax, so it’s probably unreadable on mobile devices. Note: a (tidied up) version (pdf) of this post can be downloaded here: download the pdf
Note 2: For simulation studies testing the procedure go here: https://the-small-s-scientist.blogspot.nl/2017/05/planning-for-precision-simulation.html
Note 3: I use the terms stimulus and item interchangeably; have to correct this to make things more readable and comparable to Westfall et al. (2014).
Note 4: If you do not like the technical details you can skip to an illustration of the app at the end of the post.

The general idea

The focus of planning for precision is to try to minimize the half-width of a 95%-confidence interval for a comparison of means (in our case). Following Cumming’s (2012) terminology I will call this half-width the Margin of Error (MOE). The actual purpose of the app is to find required sample sizes for participants and items that have a high probability (‘assurance’) of obtaining a MOE of some pre-specified value.

Expected MOE for a contrast

For a contrast estimate $\hat\psi$ we have the following expression for the expected MOE.

$$E(MOE) = t_{.975}(df)\,\sigma_{\hat\psi},$$

where $\sigma_{\hat\psi}$ is the standard error of the contrast estimate. Of course, both the standard error and the df are functions of the sample sizes.

For the standard error of a contrast with contrast weights $c_1$ through $c_a$, where a is the number of treatment conditions, we use the following general expression.

$$\sigma_{\hat\psi} = \sqrt{\sum_{i=1}^{a} c_i^2\,\frac{\sigma^2_w}{n}},$$

where n is the per treatment sample size (i.e. the number of participants per treatment condition times the number of items per treatment condition) and $\sigma^2_w$ the within treatment variance (we assume homogeneity of variance).

For a simple example take an independent samples design with n = 20 participants responding to 1 item in one of two possible treatment conditions (this is basically the set-up for the independent t-test). Suppose we have contrast weights $c_1 = 1$ and $c_2 = -1$, and a within treatment variance of $\sigma^2_w = 20$; the standard error for this contrast then equals $\sqrt{2 \times 20/20} = 1.4142$. (Note that this is simply the standard error of the difference between two means as used in the independent samples t-test).

In this simple example, df is the total sample size (N = n*a) minus the number of treatment conditions (a), thus $df = 40 - 2 = 38$. The expected MOE for this design is therefore $t_{.975}(38) \times 1.4142 = 2.0244 \times 1.4142 = 2.8629$. Note that using these figures entails that 95% of the contrast estimates will take values between the true contrast value plus and minus the expected MOE: $\psi \pm 2.8629$.

For the three groups case, and contrast weights {1, -1/2, -1/2}, the same sample sizes and within treatment variance give $E(MOE) = t_{.975}(57) \times \sqrt{1.5 \times 20/20} = 2.0025 \times 1.2247 = 2.4525$.
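The two-groups example can be reproduced in a few lines of R. A word of caution: the within treatment variance of 20 is an assumption on my part, chosen because together with n = 20 and weights 1 and -1 it reproduces the expected MOE of 2.8629 that is used in the section on planning with assurance; the variable names are mine as well.

```r
a <- 2                  # number of treatment conditions
n <- 20                 # participants per condition (1 item each)
weights <- c(1, -1)     # contrast weights
var_w <- 20             # within treatment variance (assumed, see above)

# standard error of the contrast estimate
se <- sqrt(sum(weights^2) * var_w / n)

# degrees of freedom: total sample size minus number of conditions
df <- n * a - a

# expected MOE: critical t value times the standard error
moe <- qt(.975, df) * se
c(se = se, df = df, expected_MOE = moe)
```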

(If you like, I’ve written a little document with derivation of the variance of selected contrast estimates in the fully crossed design for the comparison of two and three group means. That document can be found here: https://drive.google.com/open?id=0B4k88F8PMfAhaEw2blBveE96VlU)

The focus of planning for precision is to try to find the smallest sample sizes for which expected MOE equals a certain target MOE. The app uses an optimization function that minimizes the squared difference between expected MOE and target MOE to find the optimal (minimal) sample sizes required.

Planning with assurance

If the expected MOE is equal to target MOE,  the sample estimate of MOE will be larger than your target MOE in 50% of replication experiments. This is why we plan with assurance (terminology from Cumming, 2012).  For instance, we may want to have a 95% probability (95% assurance) that the estimated MOE will not exceed our target MOE.

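In order to plan with assurance, we need (an approximation of) the sampling distribution of MOE. In the ANOVA approach that underlies the app, this boils down to the distribution of estimates of the within treatment variance $\sigma^2_w$:

$$\hat\sigma^2_w \sim \frac{\sigma^2_w\,\chi^2(df)}{df},$$

thus

$$MOE_\gamma = t_{.975}(df)\sqrt{\sum_{i=1}^{a} c_i^2\,\frac{\sigma^2_w}{n}\cdot\frac{\chi^2_\gamma(df)}{df}},$$

where the $c_i$ are the contrast weights, n is the per treatment sample size, and $\chi^2_\gamma(df)$ is the $\gamma$ quantile of the $\chi^2$-distribution with df degrees of freedom.

In terms of the two-groups independent samples design above: the expected MOE equals 2.8629. But, with df = 38, there is an 80% probability (assurance) that the estimated MOE will be no larger than:

$$2.8629 \times \sqrt{\frac{45.07628}{38}} = 3.1181.$$

Note that the 45.07628 is the .80 quantile in the chi-squared (df = 38) distribution. That is, $\chi^2_{.80}(38) = 45.07628$.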
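In R the step from expected MOE to assurance MOE for this example looks as follows. The sketch takes the standard error of the contrast to be $\sqrt{2}$, which is an assumption that reproduces the expected MOE of 2.8629 mentioned above; the variable names are mine.

```r
df <- 38
gamma <- .80                               # assurance

# expected MOE; a standard error of sqrt(2) is assumed (gives 2.8629)
expected_moe <- qt(.975, df) * sqrt(2)

# .80 quantile of the chi-squared distribution with 38 degrees of freedom
q80 <- qchisq(gamma, df)

# MOE that will not be exceeded in 80% of replication experiments
assurance_moe <- expected_moe * sqrt(q80 / df)
c(quantile = q80, expected = expected_moe, assurance = assurance_moe)
```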

The app lets you specify a target MOE and a value for the desired assurance ($\gamma$) and will find the combination of number of participants and items that will give an estimated MOE no larger than target MOE in $100\gamma$% of the replication experiments.

The mixed model ANOVA approach

Basically, what we need to plan for precision is to be able to specify the within treatment variance $\sigma^2_w$ and the degrees of freedom. We will specify $\sigma^2_w$ as a function of variance components and use the Satterthwaite procedure to approximate the degrees of freedom by means of a linear combination of expected mean squares. I will illustrate the approach with a counterbalanced design with three treatment conditions.

A description of the design

Suppose we are interested in estimating the differences between three group means. We formulate two contrasts: one contrast estimates the mean difference between the first group and the average of the means of the second and third groups, the other estimates the mean difference between the second and third groups. The weights of the contrasts are respectively {1, -1/2, -1/2}, and {0, 1, -1}.

We are planning to use a counterbalanced design with a number of participants equal to p and a sample of items of size q. In the design we randomly assign participants to a groups, where a is the number of conditions, and randomly assign items to a lists (see Westfall et al., 2014 for more details about this design). All the groups are exposed to all lists of stimuli, but the groups are exposed to different lists in each condition. The number of group by list combinations equals $a^2$, and the number of observations in each group by list combination equals $\frac{p}{a} \times \frac{q}{a} = \frac{pq}{a^2}$. The condition means are estimated by combining a group by list combinations, each of which is composed of different participants and stimuli. The total number of observations per condition is therefore $a \times \frac{pq}{a^2} = \frac{pq}{a}$.
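The bookkeeping for the counterbalanced design is easy to get wrong, so here is a small sketch of it in R, using p = 30 participants, q = 15 items and a = 3 conditions (the values of the example further on). The variable names are mine.

```r
p <- 30   # participants
q <- 15   # items (stimuli)
a <- 3    # treatment conditions = number of groups = number of lists

n_cells <- a^2                        # group by list combinations
obs_per_cell <- (p / a) * (q / a)     # observations per combination
obs_per_condition <- a * obs_per_cell # a combinations make up one condition

c(cells = n_cells, per_cell = obs_per_cell, per_condition = obs_per_condition)
```

Note that the number of observations per condition equals p*q/a, which is the per treatment sample size n used in the formula for the standard error of a contrast.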

The ANOVA model

The ANOVA model for this design is

$$Y_{ijk} = \mu + \alpha_i + P_j + I_k + (TP)_{ij} + (TI)_{ik} + e_{ijk},$$

where the effect $\alpha_i$ is a constant treatment effect (it’s a fixed effect), and the other effects are random effects with zero mean and variances $\sigma^2_P$ (participants), $\sigma^2_I$ (items), $\sigma^2_{TP}$ (person by treatment interaction), $\sigma^2_{TI}$ (item by treatment interaction) and $\sigma^2_e$ (error variance confounded with the person by item interaction). Note: in Table 1 below, $\sigma^2_e$ is (for technical reasons not important for this blogpost) presented as this confounding.

We make use of the following restrictions (Sahai & Ageel, 2000): $\sum_i \alpha_i = 0$, $\sum_i (TP)_{ij} = 0$, and $\sum_i (TI)_{ik} = 0$. The latter two restrictions make the interaction effects correlated across conditions (i.e. the interaction effects of person and treatment are correlated across conditions for the same person; likewise, the interaction effects of item and treatment are correlated across conditions for the same item. Interaction effects of different participants and items are uncorrelated). The covariances between the random effects are assumed to be zero.

Under this model (and restrictions) $Var((TP)_{ij}) = \frac{a-1}{a}\sigma^2_{TP}$, and $Var((TI)_{ik}) = \frac{a-1}{a}\sigma^2_{TI}$. Furthermore, the covariances of the interactions between treatment and participant or between treatment and item for the same participant or item are $-\frac{1}{a}\sigma^2_{TP}$ for participants and $-\frac{1}{a}\sigma^2_{TI}$ for items.

Within treatment variance

In order to obtain an expectation for MOE, we take the expected mean squares to get an expression for the expected within treatment variance $\sigma^2_w$. These expected mean squares are presented in Table 1.

The expected within treatment variance can be found in the Treatment row in Table 1. It is comprised of all the components to the right of the component associated with the treatment effect. Thus, $\sigma^2_w = \sigma^2_e + \frac{q}{a}\sigma^2_{TP} + \frac{p}{a}\sigma^2_{TI}$, where $\sigma^2_{TP}$ and $\sigma^2_{TI}$ are the treatment by participant and treatment by item interaction components and $\sigma^2_e$ is the error component. Note that the latter equals the sum of the expected mean squares of the Treatment by Participant ($EMS_{TP}$) and the Treatment by Item ($EMS_{TI}$) interactions, minus the expected mean square associated with Error ($EMS_e$).

Degrees of freedom

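The second ingredient we need in order to obtain expected MOE are the degrees of freedom that are used to estimate the within treatment variance. In the ANOVA approach the within treatment variance is estimated by a linear combination of mean squares (as described in the last sentence of the previous section). This linear combination is also used to obtain approximate degrees of freedom using the Satterthwaite procedure:

$$df = \frac{(MS_{TP} + MS_{TI} - MS_e)^2}{\frac{MS_{TP}^2}{df_{TP}} + \frac{MS_{TI}^2}{df_{TI}} + \frac{MS_e^2}{df_e}},$$

where $MS_{TP}$, $MS_{TI}$ and $MS_e$ are the mean squares of the Treatment by Participant interaction, the Treatment by Item interaction and Error, with their respective degrees of freedom (for planning we fill in the expected mean squares).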
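The Satterthwaite approximation for a linear combination of mean squares is easily written as an R function. The sketch below is generic; the function name and the numbers in the illustration are made up by me and are not the values from Table 1.

```r
# Satterthwaite approximate degrees of freedom for the linear combination
# sum(coef * ms) of independent mean squares with degrees of freedom df
satterthwaite_df <- function(ms, df, coef = rep(1, length(ms))) {
  sum(coef * ms)^2 / sum((coef * ms)^2 / df)
}

# Illustration with hypothetical values: two interaction mean squares
# minus the error mean square
satterthwaite_df(ms   = c(1.15, 1.90, 0.40),
                 df   = c(54, 24, 300),
                 coef = c(1, 1, -1))
```

A quick sanity check of the function is that for a single mean square it simply returns the degrees of freedom of that mean square.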

Expected MOE

(Note: I can’t seem to get mathjax to generate align environments or equation arrays, so the following is ugly; Note to self: next time use R-studio or Lyx to generate R-html or an equivalent format).

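The expected value of MOE for the contrasts in the counterbalanced design is:

$$E(MOE) = t_{.975}(df)\sqrt{\sum_{i=1}^{a} c_i^2\,\frac{\sigma^2_w}{n}}, \quad \text{with } n = \frac{pq}{a},$$

where $\sigma^2_w$ is the expected within treatment variance and df are the Satterthwaite degrees of freedom.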

Finally an example

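Suppose the scores in three conditions are normally distributed with (total) variances $\sigma^2_1 = \sigma^2_2 = \sigma^2_3 = 1$. Suppose furthermore that 10% of the variance can be attributed to the treatment by participant interaction, 10% of the variance to the treatment by item interaction and 40% of the variance to the error confounded with the participant by item interaction (which leaves 40% of the total variance attributable to participant and item variance).

Thus, we have a treatment by participant interaction variance of .10, a treatment by item interaction variance of .10, and $\sigma^2_e = .40$. Our target MOE is .25, and we plan to use the counterbalanced design with p = 30 participants, and q = 15 items (stimuli).

Due to the model restrictions presented above we have $\sigma^2_{TP} = \frac{a}{a-1} \times .10 = .15$ (treatment by participant), $\sigma^2_{TI} = \frac{a}{a-1} \times .10 = .15$ (treatment by item), and $\sigma^2_e = .40$.

The value of $\sigma^2_w$ is therefore $.40 + \frac{15}{3} \times .15 + \frac{30}{3} \times .15 = 2.65$, and the approximate df equal about 40.

For the first contrast, with weights {1, -1/2, -1/2}, then, the expected value for the Margin of Error is $t_{.975}(40)\sqrt{1.5 \times 2.65/150} \approx 0.329$.

For the second contrast, with weights {0, 1, -1}, the expected value of the Margin of Error is $t_{.975}(40)\sqrt{2 \times 2.65/150} \approx 0.38$.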

Thus, using p = 30 participants, and q = 15 items (stimuli) will lead to an expected MOE larger than the target MOE of .25.
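The calculation for this example can be sketched in R. Two things in the sketch are assumptions of mine and not output of the app: the restricted-model interaction components of .15 (the .10 of the example multiplied by a/(a - 1)), and the approximate degrees of freedom of 40, which I chose because they reproduce the expected MOE of 0.3290 that the app reports for the first contrast.

```r
p <- 30; q <- 15; a <- 3

var_tp <- .15   # treatment by participant component (assumed, see above)
var_ti <- .15   # treatment by item component (assumed, see above)
var_e  <- .40   # error confounded with participant by item interaction

# expected within treatment variance and per treatment sample size
var_w <- var_e + (q / a) * var_tp + (p / a) * var_ti
n <- p * q / a

df <- 40        # approximate Satterthwaite df (assumed, see above)

# expected MOE for a contrast with weights w
expected_moe <- function(w) {
  qt(.975, df) * sqrt(sum(w^2) * var_w / n)
}

moe1 <- expected_moe(c(1, -1/2, -1/2))
moe2 <- expected_moe(c(0, 1, -1))
round(c(contrast1 = moe1, contrast2 = moe2), 2)
```

Both values are clearly larger than the target MOE of .25, and the contrast comparing only two means is the least precise one.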

We can use the app to find the required number of participants and items for a given target MOE. If the number of groups is larger than two, the app uses the contrast estimate with the largest expected MOE to calculate the sample sizes (in the default setting the one comparing only two group means). The reasoning is that if the least precise estimate (in terms of MOE) meets our target precision, the other ones meet our target precision as well.

Using the app

I’ve included lots of comments in the app itself, but please ignore references to a manual (does not exist, yet, except in Dutch) or an article (no idea whether or not I’ll be able to finish the write-up anytime soon). I hope the app is pretty straightforward. Just take a look at https://gmulder.shinyapps.io/precision/, but the basic idea is:
– Choose one of five designs
– Supply the number of treatment conditions
– Specify contrast(weights) (or use the default ones)
– Supply target MOE and assurance
– Supply values of variance components (read (e.g.) Westfall et al., 2014, for more details).
– Supply a number of participants and items
– Choose run precision analysis with current values or
– Choose get sample sizes. (The app gives two solutions: one minimizes the number of participants and the other minimizes the number of stimuli/items). NOTE: the number of stimuli is always greater than or equal to 10 and the number of participants is always greater than or equal to 20.

An illustration

Take the example above. Our target MOE equals .25, and we want assurance of .80 to get an estimated MOE of no larger than .25. We use a counterbalanced design with three conditions, and want to estimate two contrasts: one comparing the first mean with the average of means two and three, and the other comparing the second mean with the third mean. We can use the default contrasts.
For the variance components, we use the default values provided by Westfall et al. (2014). These are also the default values in the app (so we don’t need to change anything now).
Let’s see what happens when we propose to use p = 30 participants and q = 15 items/stimuli.
Here is part of a screenshot from the app:
These results show that the expected MOE for the first contrast (comparing the first mean with the average of the other means) equals 0.3290, and the assurance MOE for the same contrast equals 0.3576. Remember that we specified the assurance as .80. So, this means that 80% of the replication experiments give an estimated MOE of 0.3576 or smaller. But we want that to be at most 0.2500. Thus, 30 participants and 15 items do not suffice for our purposes.
Let’s use the app to get sample sizes. The results are as follows.

The app promises that using 25 stimuli combined with 290 participants, or 25 participants and 290 items, will do the trick (the symmetry of these results is due to the fact that the interaction components are equal; both the treatment by participant and the treatment by stimulus interaction component equal .10). Since we have 3 treatment conditions, using 290 participants or stimuli is a little awkward, so I suggest using 291 (that is, 97 participants per group or 97 items per list). (300 is a much nicer figure, of course.) Likewise, as it is hard to divide 25 stimuli or participants equally over three lists or groups, use a multiple of three (say, 27).
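The rounding in the previous paragraph is simple arithmetic, sketched here in Python. The helper name round_up_to_multiple is mine and is not part of the app. It rounds a suggested sample size up to the nearest multiple of the number of treatment conditions.

```python
def round_up_to_multiple(n, k):
    """Smallest multiple of k that is greater than or equal to n."""
    # -(-n // k) is ceiling division using integers only.
    return -(-n // k) * k

n_conditions = 3

participants = round_up_to_multiple(290, n_conditions)
stimuli = round_up_to_multiple(25, n_conditions)

print(participants, participants // n_conditions)  # 291 in total, 97 per group
print(stimuli, stimuli // n_conditions)            # 27 in total, 9 per list
```

Rounding up rather than down means the resulting sample sizes are never smaller than the ones the app suggests.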

If we enter the suggested sample sizes in the app and choose “run precision analysis with current values”, we see the following results.

As you can see, the assurance MOE is close to 0.25 (.24) for the second contrast (the least precise one), so 80% of replication experiments will get an estimated MOE of 0.25 (.24) or smaller. The expected MOE is 0.22. The first contrast (which can be estimated with more precision) has an assurance MOE of 0.21 and an expected MOE of approximately 0.19. Thus, the sample sizes lead to the results we want.

References

Cumming, G. (2012). Understanding the New Statistics. New York/London: Routledge.

Sahai, H., & Ageel, M. I. (2000). The analysis of variance: Fixed, random, and mixed models. Boston/Basel/Berlin: Birkhäuser.

Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020-2045.