Planning for Precise Contrast Estimates: Introduction and Tutorial (Preprint)

I just finished a preprint of an introduction and tutorial to sample size planning for precision of contrast estimates. The tutorial focuses on single factor between and within subjects designs, and mixed factorial designs with one within and one between factor. The tutorial contains R-code for sample size planning in these designs.

The preprint is availabe on researchgate: Click (but I am just as happy to send it to you if you like; just let me know).

Planning for a precise contrast estimate: the mixed model case

In a previous post (here), we saw how we can determine sample size for obtaining, with assurance, a precise interaction contrast estimate. In that post we considered a 2 x 2 factorial design. In this post, I will extend the discussion to the mixed model case. That is, we will consider sample size planning for a precise interaction estimate in case of a design with 2 fixed factors and two random factors: participant and stimulus (item). (A pdf version of this post can be found here: view pdf. )

In order to keep things relatively simple, we will focus on a design where both participants and items are nested under condition. So, each treatment condition has a unique sample of participants and items. We will call this design the both-within-condition design  (see, for instance, Westfall et al. 2014, for detailed descriptions of this design). We will analyse the 2 x 2 factorial design as a single factor design (the factor has a = 4 levels) and formulate an interaction contrast.


Let’s start with p participants and q stimuli. We randomly assign n= p/a participants and m = q/a stimuli to each of the a treatment levels. The ANOVA table for the design is presented in Table 1.

Expected Mean Squares Both Within Condition Design
 
We will use the ANOVA table to illustrate a few concepts that are important to consider when analysing data using mixed modeling. Maybe you remember that in the previous post, we used the ANOVA source table to obtain an expression for the variance of a contrast. In particular, we used the error variance (MSerror) that is also used to form the F-ratio for testing the interaction effect.
 

Obtaining an apropriate error term

Now, the inclusion of the second random factor (i.e. stimulus in addition to participant) leads, in comparison to the design in the previous post, to a complication. In order to see this, take a look again at the ANOVA table we get when we use SPSS univariate (see Figure 1 and SPSS syntax below). (Important: do not use SPSS GLM Univariate for estimating contrasts in this design; the procedure uses the incorrect standard error; I am using the procedure now just for illustrating a few key concepts).
 
UNIANOVA score BY cond pp ss   
/RANDOM=pp ss   
/METHOD=SSTYPE(3)   
/INTERCEPT=INCLUDE   
/CRITERIA=ALPHA(0.05)   
/DESIGN=cond pp WITHIN cond ss WITHIN cond 
/CONTRAST(cond) = SPECIAL(1, -1, -1, 1).
Figure 1: SPSS GLM ANOVA output
We can see that the effect of condition is not tested against MSError but against an errorterm formed by linearily combining MSpp, MSss, and MSerror. In particular, MSpp and MSss are added and MSerror is subtracted. See footnote a below the source table. It’s a bit hard to explain why that is done, but I’ll have a go at an explanation nonetheless.

 

Take a look at Table 1 and focus on the Participant row. The expected Mean Square (EMS) associated with Participant is m\sigma^2_p + \sigma^2_e. Now, suppose that due to some freak accident of nature there are no differences in the mean scores (averaged of stimuli) of each participant. In that case, \sigma^2_p = 0. This means that under these circumstances the expected mean square associated with participants is simply an estimate of the error variance with p - a degrees of freedom, because m\sigma^2_p + \sigma^2_e = m*0 + \sigma^2_e = \sigma^2_e, if \sigma^2_p = 0. Of course, the other estimate of the Error variance is MSError and this estimate is based on a(n - 1)(m - 1) degrees of freedom. The logic of the F-test is that under the null-hypothesis, in our case that \sigma^2_p = 0, the ratio of these two estimates of the error variance follows an F-distribution with p - a and a(n - 1)(m - 1) degrees of freedom.

Now focus on the Treatment row in Table 1. The expected mean square associated with Treatment equals nm\theta^2_T + m\sigma^2_p + n\sigma^2_s + \sigma^2_e. If we now suppose that there is no difference between the treatment means, that is \theta^2_T = 0, MSTreatment does not estimate \sigma^2_e, but m\sigma^2_p + n\sigma^2_s + \sigma^2_e. Note that no other source of variance has an expected mean square that is equal to the latter figure. That is, in contrast to our test of the Participant factor, where under the null-hypotheses two Mean Squares estimate the error variance, i.e. MSParticipant and MSError, no mean square is available to form an F-ratio to test the Treatment effect.

But a linear combination of MSParticipant, MSStimulus and MSError, does provide an estimate with expected value m\sigma^2_p + n\sigma^2_s + \sigma^2_e. Namely, the sum of the participant and stimulus mean squares minus mean square error: [m\sigma^2_p + \sigma^2_e] + [m\sigma^2_s + \sigma^2_e] - [\sigma^2_e] = m\sigma^2_p + n\sigma^2_s + \sigma^2_e. It is exactly this linear combination of mean squares that is used in the F-ratio to obtain an error term against which to test the Treatment effect in Figure 1: 6.403 + 10.137 - 1.470 = 15.070. We will also use this figure to obtain the variance (and standard error) of our contrast estimate.

Degrees of freedom

If you take a closer look at Figure 1, in particular the degrees of freedom column, you will notice that the degrees of freedom associated with the error term that is used to test the Treatment effect is a fractional number and not a nice round number that you would expect to get if you only consider the degrees of freedom in Table 1. The cause of this fractional number is that we cannot simply use the degrees of freedom of the mean square used to test the treatment effect, because that mean square does not exist. Indeed, we had to combine three mean squares in order to obtain an estimate of the error term for the Treatment effect. The consequence of this is that we will have to use an approximation of the degrees of freedom associated with that error term.

SPSS (and my precision app) use the Satterthwaite procedure to approximate the degrees of freedom of the error term. That approximation is as follows (notice that the numerator is equal to the linear combination of mean squares used to obtain the error term).

    \[df=\frac{(MSp+MSs-MSe)^{2}}{\frac{MSp^{2}}{df_{p}}+\frac{MSs^{2}}{df_{s}}+\frac{MSe^{2}}{df_{e}}}.\]

Thus, using the results in Figure 1.

MSp = 6.403
MSs = 10.137
MSe = 1.470
dfp = 44
dfs = 20
dfe = 220
df = (MSp + MSs - MSe)^2 / (MSp^2/dfp + MSs^2/dfs + MSe^2/dfe)
df
## [1] 37.35559

The margin of error of a contrast estimate

Now that we have obtained the error variance of a treatment effect by using a linear combination of mean squares and a Satterthwaite approximation of the degrees of freedom we are able to figure out the margin of error (MOE) of our contrast estimate. Just as in the simple between subjects design we discussed previously we obtain MOE by multiplying the standard error of the estimate with a critical value of t. The critical value of t is the .975 quantile of the central t-distribution with the Satterthwaite approximated degrees of freedom (if you are looking for something other than a 95% confidence interval, you will have to use another critical value, of course). The following code gives the critical value of t for a 95% confidence interval (change the value of C if you want something other than a 95% interval).

C = .95
alpha = 1 - C
critT = qt(1 - alpha/2, df)
critT
## [1] 2.025542

The standard error of the contrast estimate \hat{\psi} can be obtained as follows.

    \[\hat{\sigma}_{\hat{\psi}}=\sqrt{\sum c_{i}^{2}\hat{\sigma}_{\bar{X},Rel}^{2}},\]

where I have used the symbol \sigma^2_{\bar{X},Rel} to refer to the relative error variance of the treatment mean (which in this design is equal to the absolute error variance, but that’s another story), and c_i refers to the contrast weight of treatment mean i. The relative error variance of the treatment mean is obtained by dividing the error variance that is used to test the treatment effect by the total number of observations in each treament, nm. Thus, using the results in Figure 1.

    \[\hat{\sigma}_{\bar{X},Rel}^{2}=\frac{MS_{p}+MS_{s}-MS_{e}}{nm}=\frac{15.07}{72}=0.2093.\]

If we want to estimate an interaction contrast for the 2 x 2 design, we may, for example, specify contrasts weights {1, -1, -1, 1}. Let’s use the results in Figure 1 to calculate what MOE is for this particular contrast.

#sample sizes per treatment
n = 12
m = 6

#obtained mean squares (see Figure 1): 
MSp = 6.403
MSs = 10.137
MSe = 1.470

#Relative error variance: 
VarT = (MSp + MSs - MSe) / (n*m)

#contrast weights: 
weights = c(1, -1, -1, 1)

#standard error of contrast estimate
SEcontrast = sqrt(sum(weights^2)*VarT)

#Satterthwaite degrees of freedom: 
dfp = 44
dfs = 20
dfe = 220
df = (MSp + MSs - MSe)^2 / (MSp^2/dfp + MSs^2/dfs + MSe^2/dfe)

#critical T 
critT = qt(.975, df)

#Margin of Error 
MOE = critT * SEcontrast

SEcontrast; MOE
## [1] 0.9149985
## [1] 1.853368

SPSS GLM Univariate uses the wrong standard error for a mixed model contrast estimate

Even though SPSS GLM Univariate allows you to specify a mixed model design and tests the treatment effect with a linear combination of mean squares, the procedure does not use the correct error variance if you want to estimate the value of a contrast (using the CONTRAST subcommand), it uses MSError instead. In our example, then, SPSS uses an error variance that is an order of magnitude smaller than the correct error variance: 1.47, with 220 degrees of freedom and not 15.07, with 37.357 (see Figure 1). The consequence of this is, of course, that the 95% CI is much narrower than it should be.

Running the syntax above Figure 1 gives the output in Figure 2. The results can be reproduced as follows. The standard error of the contrast is the result of SE = \sqrt{\sum{c_i^2}\frac{MSe}{nm}} = \sqrt{4*1.47/72} = 0.2858, the critical value of t is the .975 quantile of the central t distribution with df = 220, which equals 1.9708. The value of MOE is therefore MOE = 0.5633. With a contrast estimate of -0.587, the 95% CI equals -0.587 \pm 0.5633 = [-1.1503, -0.0237]. In comparison, using the correct value of MOE gives us [−2.4404, 1.2664].

Figure 2: SPSS GLM Univariate Contrasts Output
Thus, even though SPSS GLM Univariate gives us the ingredients to work with, i.e. an estimate of the error variance and approximate degrees of freedom, it should not be used for obtaining contrast estimates if you have a mixed model. SPSS Mixed does a much better job and the MIXED output also contains other useful data we can use for sample size planning. (In practice, I use the linear mixed effects modeling package LME4) and not so much SPSS). Have a quick look at Figure 3 for the contrast estimate obtained with the mixed procedure. (Note how the numbers are essentially the same as the ones we obtained when using the ANOVA source table of SPSS GLM Univariate (Figure 1)).
MIXED score BY cond   
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) 
SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) 
LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)   
/FIXED= cond | SSTYPE(3)   
/METHOD=REML  
/TEST= 'interaction' cond 1 -1 -1 1   
/RANDOM=INTERCEPT | SUBJECT(pp) COVTYPE(VC)  
/RANDOM=INTERCEPT | SUBJECT(ss) COVTYPE(VC).
Figure 3: Contrast Estimate SPSS Mixed

 

Planning for precision

Even though the result in Figure 3 is hard to interpret without substantive detail (the data are made up) it is clear that the precision of the estimate is, well, suboptimal. As an indication: the estimated within treatment standard deviaion is about 1.74, so the estimated difference between differences (interaction) is close to a value of Cohen’s d of -.30, approximate 95% CI [-1.40, 0.73], which according to the rules of thumb is a medium negative effect, but consistent with anytihing from a huge negative effect to a large positive effect in the population, as the approximate CI shows. (I have divided the point estimate and the confidence interval in Figure 2 by 1.74, to obtain Cohen’s d and an approximate confidence interval). Clearly, then, our precision can be optimized.

Suppose that you are very fond of the both-within-condition design (BwC-design) and you plan to use it again in a replication study, You could of courseopt for a design with better expected precision, but based on the data and estimates at hand, that involves a lot of assumptions, but I will show you how you can do it in one of the next posts. If you plan for precision using the BwC-design, you need the following ingrediënts.

1. A figure for your target MOE. Let’s set target MOE to .40.

2. A specification of the percentage of assurance. Let’s say we want 80% assurance that target MOE will not exceed .40.

3. Estimates (or guesstimates) of the person variance \sigma^2_p, the stimulus variance \sigma^2_s, and the error variance \sigma^2_e. We will have a look in the next section.

4. Functions for calculating the relative error variance, degrees of freedom, MOE and determining the required sample sizes for Participants and Stimuli. These are all present in the Precision App, so I will use the application, but I will show how the results of the sample sizes relate to the information above.

Obtaining estimates of the variance components

We need to specify the values of three variance components. These variance components can be estimated on the basis of the mean squares and sample sizes obtained with SPSS GLM Univariate, we can use SPSS MIXED to obtain direct estimates or any other way to estimate variance components, such as GLM VARCOMPS (which has several estimation procedures). I like to use SPSS MIXED or LME4. and not a dedicated program for variance components, because most of the times the main purpose of the analysis I am doing is obtaining contrast estimate or F-tests, so most of the times variance components estimates are a handy by-product of my main analysis. For demonstrative purposes, I will show how it can be done with the GLM univariate output and I will show how the results match those of SPSS MIXED.

Take a look at Figure 1. The estimate of \sigma^2_e is simply MS(Error) = 1.47. For obtaining an estimate for the variance component associated with participants, we set the obtained mean square equal to the expected mean square (see Table 1). Thus, 6.403 = m\sigma^2_p + \sigma^2_e. Rearranging and using 1.47 as an estimate for \sigma^2_e leads to \sigma^2_p = (6.403 - \sigma^2_e) / m = (6.403 - 1.47) / 6 = 0.8222. Likewise, the estimate for \sigma^2_s = (10.137 - 1.47) / 12 = .7223. Thus, our estimates are \hat{\sigma}^2_e = 1.47, \hat{\sigma}^2_p = 0.8222, and \hat{\sigma}^2_s = 0.7223.

In order to obtain direct estimates you can use SPSS Mixed (or GLM Varcomps, or whatever you like). If you run the SPSS syntax in Figure 3, you will find estimates of the variance components under the heading Covariance Parameters in your SPSS output. See Figure 4. Note that the standard errors are pretty large, so the point estimates are not very precise. But since it is the only information we have, we will consider the point estimates to be the best we have.

Figure 4: Variance components estimates

 

Getting sample sizes with the Precision application

 
Let’s use the Precision app (https://gmulder.shinyapps.io/precision/) for sample size planning. Set the design to Stimulus and Participant within condition, the number of conditions to 4 and in the options for contrast 3 fill in the weights {1, -1, -1, 1} (Note: it is not necessary to fill it in in contrast 3).
 
For target MOE fill in 0.4, for assurance the value .80 and the values 1.47, 0.82, and 0.72 for, respectively, Residual variance, Participant intercept variance and Stimulus intercept variance. Fill in the value 0 for all the other variances. See Figure 5.
 
 
Figure 5: Setting values in the Precision App.
 
Press the button “Get Sample Sizes”. The calculations take a while, so make yourself some coffee (or anything else you like) and when you return the screen should show something like Figure 6a.
Figure 6a: Output for planning target MOE = 0.40
 
Figure 6b: Outpur for planning target MOE = 0.50
 
By the way, if you wonder why you can simply set the three interaction variance components to zero, then it may be nice to know that the variance components estimates obtained from the both-within-condition design already include them. For example, the estimate of the resiidual variance obtained with the both-within-condition design is actually the sum of the residual variance component and the interaction conponent of participant and stimulus. These latter components can only be separated in a fully-crossed-design where all participants respond to all stimuli in all conditions. Thus, if we use the symbol \sigma^2_{e, bwc}, to refer to the residual variance in the BwC-design, we can say \sigma^2_{e, bwc} = \sigma^2_{ps} + \sigma^2_e. Normally, the precision app sums these two components to get a value for the residual variance in the BwC-design, and you will obviously get the same result if you specify the residual variance as the sum and the participant-by-stimulus variance as 0. Likewise, \sigma^2_{p, bwc} = \sigma^2_p + \sigma^2_{cp}, and \sigma^2_{s, bwc} = \sigma^2_s + \sigma^2_{cs}, where \sigma^2_{cp} and \sigma^2_{cs} are the variances associated with the interaction of treatment and participant and treatment and stimulus, respectively.
 
If you look at the sample sizes in Figure 6a, you may notice that the numbers look odd. For example, the app says that the smallest number of stimuli is 877 but it also says that you only need 500 stimuli if you select 802 participants. And something like that happens to the participants as well. The output says that the smaller number of participants is 802, but it also suggest using 500 of them if you use 877 stimuli, which is clearly smaller than 802. To me this seems a little inconsistent. But I think I figured out what’s going on. The reason for these inconsistencies is that the application minimizes the sample sizes, but with a maximum of 500 for the other sample sizes. So, the smallest number of stimuli is 877 given that the maximum number of participants is 500. In other words, a smaller sample size is possible, but then we have to increase the maximum number of participants. In other words, in order to have 80% assurance to obtain a target MOE of no more than .40, we need at least 500 stimuli or at least 500 participants. If you look at Figure 6b, you will not notice these inconsistencies. The difference between the left and right sample sizes is that sizes on the right are based on a target MOE of .50 instead of .40.
 
According to Figure 6a, we can obtain our target if we use 802 participants and 500 stimuli. Since we are planning for an experiment with 4 treatment conditions, these total sample sizes need to be divided by 4 to get the sample sizes per treatment conditions. Thus, n = 804/4 = 201 participants, and m = 500 / 4 = 125 stimuli per treatment condition (I’ve increased the participants sample size to make it divisible by 4). For many experiments these numbers are impractically large, of course, so in this case you would probably either consider an alternative design or else you have to live with the message that you may not get the precision you want or need.
 

Checking the sample size suggestions using what we know

If we fill in the sample sizes (802 participants and 500 stimuli) in the Precision app we get the results presented in Figure 7 for the interaction contrast (contrast 3). Expected MOE equals 0.3927, and there is 80% assurance that MOE will not exceed 0.4065. Note, again, that the assurance MOE is somewhat larger than target MOE, because a sample of 804 participants requires a sample of more than 500 stimuli to get the target MOE with 80% assurance and 500 stimuli is the maximum number of stimuli the app considers when minimizing the number of participants.
 
Figure 7: Expected and Assurance MOE for the interaction contrast (contrast 3) using 804 participants and 500 stimuli
Let’s see if we can reconstruct the figures using what we know from previous sections. First the relative error variance of the treatment mean. That relative error variance is (m\sigma^2_p + n\sigma^2_s + \sigma^2_e)/ nm = 0.0099.
 
The degrees of freedom can be calculated by first filling in the expected mean squares and the degrees of freedom presented in Table 1: MS_p = 125*.82 + 1.47 = 103.96, df_p = 800, MS_s = 201*.72 + 1.47 = 146.19, df_s = 496, MS_e = 1.47, and df_e = 4*(201 - 1)*(125 - 1) = 99200. The Satterthwaite degrees of freedom are (MS_p + MS_s - MS_e)^2 / (MS_p^2/df_p + MS_s^2/df_s + MS_e^2/df_e) = 1092.66. The standard error of the contrast equals \sqrt{4*.0099} = 0.1990. The critical value for t equals 1.9621. Expected MOE is, therefore, 0.3905 (the tiny difference with the results from the app is due to rounding errors).
 
For the calculation of assurance MOE we need to take the sampling distribution of the relative error variance of the treatment mean into account. The app uses the (scaled) \chi^2-distribution. That is, we assume with assurance \gamma, that the \gamma quantile of the sampling distribution of the relative error error variance is \sigma^2_{\bar{X}, rel}*\chi^2_{\gamma, df}/df. Now, the degrees of freedom are 1092.66, the assurance \gamma = .80, and the .80 quantile of \chi^2 with 1092.66 degrees of freedom equals 1131.7966. Since the relative error variance equals 0.0099, the .80 quantile of the error variance equals 0.0099*1131.7966/1092.66 = 0.0103. And this means that assurance MOE equals 1.9621*\sqrt{4*0.0103} = 0.3982. Again, the difference with the results from the Precision App are due to rounding error.

Planning for a precise interaction contrast estimate

In my previous post (here),  I wrote about obtaining a confidence interval for the estimate of an interaction contrast. I demonstrated, for a simple two-way independent factorial design, how to obtain a confidence interval by making use of the information in an ANOVA source table and estimates of the marginal means and how a custom contrast estimate can be obtained with SPSS.

One of the results of the analysis in the previous post was that the 95% confidence interval for the interaction was very wide. The estimate was .77, 95% CI [0.04, 1.49]. Suppose that it is theoretically or practically important to know the value of the contrast to a more precise degree.  (I.e. some researchers will be content that the CI allows for a directional qualitative interpretation: there seems to exist a positive interaction effect, but others, more interested in the quantitative questions may not be so easily satisfied).  Let’s see how we can plan the research to obtain a more precise estimate. In other words, let’s plan for precision.

Of course, there are several ways in which the precision of the estimate can be increased. For instance, by using measurement procedures that are designed to obtain reliable data, we could change the experimental design, for example switching to a repeated measures (crossed) design, and/or increase the number of observations. An example of the latter would be to increase the number of participants and/or the number of observations per participant.  We will only consider the option of increasing the number of participants, and keep the independent factorial design, although in reality we would of course also strive for a measurement instrument that generally gives us highly reliable data. (By the way, it is possible to use my Precision application to investigate the effects of changing the experimental design on the expected precision of contrast estimates in studies with 1 fixed factor and 2 random factors).

The plan for the rest of this post is as follows. We will focus on getting a short confidence interval for our interaction estimate, and we will do that by considering the half-width of the interval, the Margin of Error (MOE). First we will try to find a sample size that gives us an expected MOE (in repeated replication of the experiment with new random samples) no more than a target MOE. Second, we will try to find a sample size that gives a MOE smaller than or equal to our target MOE in a specifiable percentage (say, 80% or 90%) of replication experiments. The latter approach is called planning with assurance.

Let us get back to some of the SPSS output we considered in the previous post to get the ingredients we need for sample size planning. First, the ANOVA table.

Table 1. ANOVA source table

We are interested in estimating and optimizing the precision of an interaction contrast estimate. The first things we need are an expression of the error variance needed to calculate the standard error of the estimate and the degrees of freedom that were used in estimating the error variance. In general, the error variance needed is the same error variance you would use in performing an F-test for the specific effect, in this case the interaction effect.

Thus, we note the error variance used to test the interaction effect, i.e. mean square error, and the degrees of freedom. The value of mean square error is 3.324, and the degrees of freedom are 389. Note that this value is the total sample sizes minus the number of conditions (393 – 4 = 389), or, equivalently, the total sample sizes minus the degrees of freedom of the intercept, the main effects, and the interaction (393 – (1 + 1 + 1 + 1) = 389).  I will call these degrees of freedom the error degrees of freedom, dfe.

MOE can be obtained by multiplying a critical t-value with the same degrees of freedom as the error degrees of freedom with the standard error of the estimate.

The standard error of the contrast estimate is

    \[\hat{\sigma_\psi}= \sqrt{\sum{c_i^2MS_e/n_i}},\]

where c_i is the contrast weight for the i-th condition mean, and n_i the number of observations (in our example participants) in treatment condition i.  Note that MS_e / n_i is the variance of  treatment mean i, the square root of which gives the familiar standard error of the mean.

The contrast weights we used to estimate the 2 x 2 interaction were {-1, 1, 1, -1}. So, the expression for MOE becomes

    \[MOE =  t_{.975}(df_e)\sqrt{\sum{c_i^2MS_e/n_i}}=t_{.975}(df_e)\sqrt{4MS_e/n_i} = 2t_{.975}(df_e)\sqrt{MS_e/n_i}.\]

Thus, suppose we have the independent 2×2 factorial design, n_i = 100, and the true value of Mean Square Error is 3.324, then MOE for the contrast estimate equals

    \[MOE = 2*t_{.975}(396)*\sqrt{3.324/100} = 0.7071\]

.
Note that this is the value of MOE we obtain on average in repeated replications with new samples, if we use sample sizes of 100 (total number of participants is 400) and if the true value of the error variance is 3.324.  The value is close to the value we obtained in the previous post (MOE = 0.72) because the sample sizes were very close to 100 per group.

Now, we found the original confidence interval too wide, and we have just seen how 100 participants per group does not really help. MOE is only slightly smaller than our originally obtained MOE. We need to set a target MOE and then figure out how many participants we need to get that target MOE.

Intermezzo: Rules of thumb for target MOE

(Here are some updated rules of thumb: https://the-small-s-scientist.blogspot.com/2018/11/contrast-tutorial.html)

In the absence of theoretical or practical considerations about the precision we want, we may want to use rules of thumb. My (very first proposal for) rules of thumb are based on the default interpretations of Cohen’s d. Considering the absolute values of d ≤ .10 to be negligible d = .20 small, d = .50 medium and d = .80 large. (I really do not like rules-of-thumb, because using them is a sign that you are not thinking).

Now, suppose that we interpret the confidence interval as a range of plausible values for the true value of the effect size. It is not at all clear to me what such a supposition entails, but let’s simply take it for granted right now (please don’t). Then, I think it is reasonable to say that being able to distinguish between small and negligible effects sizes is relatively precise. Thus a MOE of .05 (pooled) standard deviations  can be considered precise because (on average) the 95% CI for the small effect sizes is [.15, .25], assuming we know the value of the standard deviation, so negligible effects will not be deemed plausible values on average, since effect sizes smaller than .10 are outside the interval.

By essentially the same reasoning. if we cannot distinguish between large and negligible effects, we are not estimating things very precisely. Therefore, a MOE of .80 standard deviations can be considered to be not very precise. On average, the CI for an existing large effect, will be [0,  1.60], so it includes both negligible and very large effects as plausible values.

For medium (does it make sense to speak of medium precision?) precision I would like to suggest .20-.25 standard deviations. On average, with this value for MOE, if there is a medium effect, small effects and large effects are relatively implausible.  In the case of small effects, medium precision entails that on average both effects in the opposite direction and medium effects are among the plausible values.

Of course, I am interpreting the d-values as strict boundaries, but the scale is not categorical, but continuous. So instead of small, large effect sizes, it’s better to speak of smallish and largish effect sizes. And as soon as I find a variant for medium effects sizes I will also include that term in the list.

Note: sample size planning may indicate that precision of MOE = .20-.25 standard deviations is unattainable. In that case, we will simply have to accept that our precision does not lead to confident conclusions about the population effect size. (Once I showed one of my colleagues my precision app, during which he said: “that amount of precision requires a very large sample. I do not like your ideas about sample size planning”).

(By the way, I am also considering rules-of-thumb for target MOE that include assurance. Something like: high precision is when repeated experiments have a high probability of distinguishing small and negligible effects; in that case the average MOE will be smaller than .05).

Planning for precision

Let’s plan for a precision of 0.25 standard deviation. In our case, that standard deviation is the pooled standard deviation: the square root of Mean Square Error. The (estimated) value of  Mean Square Error is 3.324 (see Table 1), so our value for the standard deviation is 1.8232.  Our target MOE is, therefore, 0.4558.
Let’s make things very clear. Here we are planning for a target MOE based on an estimate of the pooled standard deviation (and on assumptions about the population distribution). In order for our planning to be of practical value, we need some reassurance that that estimate is trustworthy. One way of doing that is to consider the CI for the standard deviation. I will not discuss that topic, and simply give you a CI: [2.90,  3.86].
Take a look at the expression for MOE.

    \[MOE = 2*t_{.975}(df_e)\sqrt{(MS_w / n_i)},\]

where df_e = 4(n_i - 1), since we are considering the 2×2 design.

Since our target MOE equals .4588, our goal becomes to solve the following equation for n_i, since we want the sample size:
 

    \[0.4558 = 2*t_{.975}(4(n_i - 1)\sqrt{(MS_w / n_i)},\]

However, because n_i determines both the standard error and the degrees of freedom (and thereby the critical value of t), the equation may be a little hard to solve.  So, I will create a function in R that enables me to quite easily get the required sample size. (It is relatively easy to create a more general function (see the Precision App), but here I will give an example tailored to the specific situation at hand).

First we create a function to calculate MOE:

MOE = function(n) {
  MOE = 2*qt(.975, 4*(n - 1))*sqrt(3.324/n)
}

Next, we will define a loss function and use R’s built-in optimize function to determine the sample size. Note that the loss-function calculates the squared difference between MOE based on a sample size n and our target MOE. The optimize function minimizes that squared difference in terms of sample size n (starting with n = 100 and stopping at n = 1000).

loss <- function(n) {
  (MOE(n) - 0.4558)^2
}

optimize(loss, c(100, 1000))
## $minimum
## [1] 246.4563
## 
## $objective
## [1] 8.591375e-18

Thus, according to the optimize function we need 247 participants (per group; total N = 988), to get an expected MOE equal to our target MOE. The expected MOE equals 0.4553, which you can confirm by using the MOE function we made above.

Planning with assurance

Although expected MOE is close to our target MOE, there is a probability 50% that the obtained MOE will be larger than our target MOE.  In other words, repeated sampling will lead to obtained MOEs larger than what we want. That is to say, we have 50% assurance that our obtained MOE will be at least as small as our target MOE.
Planning with assurance means that we aim for a certain specified assurance that our obtained MOE will not exceed our target MOE. For instance, we may want to have 80% assurance that our obtained MOE will not exceed our target MOE.
Basically, what we need to do is take the sampling distribution of the estimate of  Mean Square Error into account. We use the following formula (see also my post introducing the Precision App for the general formulae: https://the-small-s-scientist.blogspot.nl/2017/04/planning-for-precision.html).

    \[MOE_{\gamma} = 2*t_{.975}(df)*\sqrt{MS_w/n_i*\chi^2_{\gamma}(df)/df},\]

where gamma is the assurance expressed in a probability between 0 and 1.

Let’s do it in R. Again, the function that calculates assurance MOE is  tailored for the specific situation, but it is relatively easy to formulate these functions in a generally applicable way,
MOE.gamma = function(n) {
  df = 4*(n-1)
  MOE = 2*qt(.975, df)*sqrt(3.324/n*qchisq(.80, df)/df)
}
loss <- function(n) {
  (MOE.gamma(n) - 0.4558)^2
}

optimize(loss, c(100, 1000))
## $minimum
## [1] 255.576
## 
## $objective
## [1] 2.900716e-18

Thus, according to the results, we need 256 persons per group (N = 1024 in total) to have a 80% probability of obtaining a MOE not larger than our target MOE. In that case, our expected MOE will be 0.4472.

Planning for Precision: Introduction to variance components

Both theory underlying the Precision application and the use of the app in practice rely for a large part on specifying variance components. In this post, I will give you some more details about what these components are, and how they relate to the analysis of variance model underlying the app.

What is variance?

Let’s start with a relatively simple conceptual explanation of variance. The key ideas are expected value and error.  Suppose you randomly select a single score from a population of possible values. Let’s suppose furthermore that the population of values can be described with a normal distribution. There is actually no need to suppose a normal distribution, but it makes the explanation relatively easy to follow.

As you probably know, the normal distribution is centered around its mean value, which is (equal to) the parameter μ. We call this parameter the population mean.

Now, we select a single random value from the population. Let’s call this value X.  Because we know something about the probability distribution of the population values, we are also in the position to specify an expected value for the score X. Let’s use the symbol E(X) for this expected value. The value of E(X) proves (and can be proven) to be equal to the parameter μ. (Conceptually, the expectation of a variable can be considered as its long run average).

Of course, the actual value obtained will in general not be equal to the expected value, at least not if we sample from continuous distributions like the normal distribution. Let’s call the difference between the value X and it’s expectation E(X) = μ. an error, deviation or residual: e = X – E(X) = X –  μ.

We would like to have some indication of the extent to which X differs from its expectation, especially when E(X) is estimated on the basis of  a statistical model.  Thus, we would like to have something like E(X – E(X)) = E(X –  μ). The variance gives us such an indication, but does so in squared units, because working with the expected error itself always leads to the value 0  E(X –  μ) = E(X) – E( μ) =  μ –  μ = 0. (This simply says that on average the error is zero; the standard explanation is that negative and positive errors cancel out in the long run).

The variance is the expected squared deviation (mean squared error) between X and its expectation: E((X – E(X))2) = E(X –  μ)2), and the symbol for the population value is σ2.

Some examples of variances (remember we are talking conceptually here):
– the variance of the mean, is the expected squared deviation between a sample mean and its expectation the population mean.
– the variance of the difference between two means:  the expected squared deviation between the sample difference and the population difference between two means.
– the variance of a contrast: the expected squared deviation between the sample value of the contrast and the population value of the contrast.

It’s really not that complicated, I believe.

Note: in the calculation of t-tests (and the like), or in obtaining confidence intervals, we usually work with the square root of the variance: the root mean squared error (RMSE), also known as the standard deviation (mostly used when talking about individual scores) or standard error (when talking about estimating parameter values such as the population mean or the differences between population means).

What are variance components?  

In order to appreciate what the concept of a variance component entails, imagine an experiment with nc treatment conditions in which np participants respond to ni items (or stimuli) in all conditions. This is called a fully-crossed experimental design.

Now consider the variable Xcpi. a random score in condition t, of participant p responding to item i. The variance of this variable is σ2(Xcpi) = E((Xcpi –  μ)2). But note that just as the single score is influenced by e.g. the actual treatment condition, the particular person or item, so too can this variance be decomposed into components reflecting the influence of these factors. Crucially, the total variance  σ2(Xcpi) can be considered as the sum of independent variance components, each reflecting the influence of some factor or interaction of factors.

The following figure represents the components of the total variance in the fully crossed design (as you can see the participants are now promoted to actual persons…)

The symbols used in this figure represent the following. Θ2 is used to indicate that a component is considered to be a so-called fixed variance component (the details of which are beyond the scope of this post), and the symbol σ2 is used to indicate components associated with random effects. (The ANOVA model contains a mixture of fixed and random effects, that’s why we call such models mixed effects models or mixed model ANOVAs). Components with a single subscript represent variances associated with main effects, components with two subscripts two-way interactions, and the only component with three subscripts represents a three-way-interaction confounded with error.

Let’s consider one of these variance components (you can also refer to them simply as variances) to see in more detail how they can be interpreted. Take σ2(p), the person-variance. Note that this is an alternative symbol for the variance, in the figure, the p is in the subscript in stead of between brackets (I am sorry for the inconvenience of switching symbols, but I do not want to rely too much on mathjax ($sigma^2_p = sigma^2(p)$) and I do not want to change the symbols in the figure).

The variance σ2(p) is the expected squared deviation of the score of an idealized randomly selected person and the expectation of this score, the population mean. This score is the person score averaged of the conditions of the experiment and all of the items that could have been selected for the experiment. (This conceptualization is from Generalizability Theory; the Venn-diagram representation as well), the person score is also called the universe score of the person).

The component represents E((μp – μ)2), the expected squared person effect. Likewise, the variance associated with items σ2(i) is the expected squared item effect, and σ2(cp), is the expected squared interaction effect of condition and person.

The figure indicates that the total variance is (modeled as) the sum of seven independent variance components. The Precision app asks you to supply the values for six of these components (the components associated with the random effects), and now I hope it is a little clearer why these components are also referred to as expected squared effects.

Relative error variance of a treatment mean

The basic goal of contrast analysis is to compare a mean (or more) relative to one (or more) other treatment means. That is, we are interested in the relative position of a mean compared to the rest.  Due to sampling error our estimate of the relative position of a mean differs from the ‘true’ relative position. The relative error variance of a treatment mean is the expected squared deviation between the obtained relative position and the expected relative position. 
Let’s consider the Venn-diagram above again. In particular, take a look at the condition-circle and the components contained in it. A total of four variance components are included in the circle. The component  Θ2(c), is the component associated with the treatment effect (μc – μ).  All the other components contained in the circle contribute to the relative error variance. Thus, the interaction of treatment and participant, the interaction of treatment and stimulus, and the error contribute to the deviation between the true effect (relative position) and the estimated effect. Or, in other words, the relative error variance of the treatment mean consists of the three variance components associated with the treatment by participant interaction, the treatment by stimulus interaction, and error.

But these components specify the variance in terms of individual measurements, whereas the treatment mean is obtained on the basis of averaging over the np*ni measurements we have in the corresponding treatment condition. So let’s see how we can take into account the number of measurements.

Unfortunately, things get a little complicated to explain, but I’ll have a go nonetheless.  The explanation takes two steps: 1) We consider how to specify expected mean squares in terms of the components contained in the condition-circle 2) We’ll see how to get from the formulation of the expected mean square to the formulation of the relative error variance of a treatment mean.

Obtaining an expected mean square

I will use the term Mean Square (MS) to refer to a variance estimate. For instance, MST is an estimate of the variance associated with the treatment condition. The expected MS (EMS) is the average value of these estimates.

We can use the Venn-diagram to obtain an EMS for the treatment factor and all the other factors in the design, but we will focus on the EMS associated with treatment. However, we cannot use the variance components directly, because the mixed ANOVA model I have been using for the application contains sum-to-zero restrictions on the treatment effects and the two-way interaction-effects of treatment and participant and treatment and stimulus (item).  The consequence of this is that we will have to multiply the variance components associated with the treatment-by-participant and treatment-by-stimulus with a constant equal to nc / (nc – 1), where nc is the number of conditions.(This is the hard part to explain, but I didn’t really explain it, but simply stated it).

The second step of obtaining the EMS is to multiply the components with the number of participants and items, as follows. Multiply the component by the sample size if and only if the subscript of the component does not contain a reference to the particular sample size. That is, for instance, multiply a component by the number of participants np, if and only if the subscript of the component does not contain a subscript associated with participants.

This leads to the following:

 E(MST) = npniΘ2(T) + ni(nc / (nc – 1))σ2(cp) + np(nc / (nc – 1))σ2(ci) + σ2(cpi, e).

Obtaining the relative error variance of the treatment mean

Notice that  ni(nc / (nc – 1))σ2(cp) + np(nc / (nc – 1))σ2(ci) + σ2(cpi, e) contains the components associated with the relative error variance. Because the treatment mean is based on np*ni scores, to obtain the relative error variance for the treatment mean, we divide by np*ni to obtain.

Relative error variance of the treatment mean = (nc / (nc – 1))σ2(cp) / np + (nc / (nc – 1))σ2(ci) / ni+ σ2(cpi, e) /  (npni).

As an aside, in the post describing the app, I have used the symbols σ2(αβ) to refer to nc / (nc – 1))σ2(cp), and σ2(αγ) to refer to (nc / (nc – 1))σ2(ci).

Comparing means: the error variance of a contrast 

From the relative error variance of a treatment mean, we can get to the variance of a contrast, simply by multiplying the relative error variance by the sum of the squared contrast weights. For instance, if we want to compare two  treatment means we can do so by estimating the contrast ψ = 1*μ1 + (-1)*μ2, where the values 1 and (-1) are the contrast weights. The sum of the squared contrast weights equals 2, the error variance of the contrast is therefore 2*(nc / (nc – 1))σ2(cp) / np + (nc / (nc – 1))σ2(ci) / ni+ σ2(cpi, e) /  (npni).

Note that the latter gives us the expected squared deviation between the estimated contrast value and the true contrast value (see also the explanation of the concept of variance above).

It should be noted that for the calculation of a 95% confidence interval for the contrast estimate (or for the Margin of Error; the half-width of the confidence interval) we make use of the square root of the error variance of the contrast. This square root is the standard error of the contrast estimate. The calculation of MOE also requires a value for the degrees of freedom. I will write about forming a confidence interval for a contrast estimate in one of the next posts. 

Planning for Precision: simulation results for four designs with four conditions

This is the third post about the Planning for Precision app (in the future I’ll explain the difference between Planning for Precision and Precision for Planning). Some background information about the application can be found here: http://the-small-s-scientist.blogspot.com/2017/04/planning-for-precision.html. 

In this post, I want to present the simulation results for 4 designs with 4 conditions. The designs are: the counter balanced design (see previous post), the fully-crossed design, the stimulus-within-condition design, and the stimulus-and-participant-within-condition design (the both-within-condition design). I have not included the participants-within-condition design, because this is simply the mirror-image (so to say) of the stimulus-within-condition design.

In one of my next posts, I will describe some more background information about planning for precision, but some of the basics are as follows. We have a design with 4 treatment conditions, and what we want do is to estimate differences between these condition means by using contrasts. For instance, we may be interested in the (amount of) difference between the first mean, maybe because it is a control-condition with the average of the other three conditions: μ1 – (μ2 + μ3 + μ4)/3 =  1*μ1 – 1/3*μ2 -1/3*μ3 – 1/3*μ4.  The values {1, -1/3, -1/3, -1/3} are the contrast weights, and for the result we use the term ψ.

The value of ψ is estimated on the basis of estimates of the population means, that is, the sample means or condition means. Due to sampling error, the contrast estimate varies from sample to sample and the amount of sampling error can be expressed by means of a confidence interval. Conceptually, the confidence interval expresses the precision of the estimate: the wider the confidence interval, the less precise the estimate is.

The Margin of Error (MOE) of an estimate is the half-width of the confidence interval, so the confidence interval is the estimate plus or minus MOE. We will take MOE as an expression of the precision of the estimate (the less the value of MOE the more precise the estimate).  Now, if you want to estimate an effect size, more precision (lower value of MOE; less wide confidence interval) is better than less precision (higher value of MOE; wider confidence interval).  The app let’s you specify the design and the contrast weights and helps you find the minimum required sample sizes (for participants and stimuli) for a given target MOE. (You can also play with the designs to see which design gives you smallest expected MOE).

Crucially, if you plan for precision, you also want to have some assurance that the MOE you are likely to obtain in you actual experiment will not be larger than you target MOE. Compare this with power: 80% power means that the probability that you will reject the null-hypothesis is 80%. Likewise, assurance MOE of 80% means that there is an 80% probability that your obtained MOE will be no larger than assurance MOE.

The simulations (with N = 10000 replications) estimate Expected MOE as well as Assurance MOE for assurances of .80, .90, .95, and .99, for 4 designs with 4 treatment conditions, with a total number of 48 participants and 24 stimuli (items).  The MOEs are given for three standard constrasts: 1) the difference between the first mean and the mean of the other three, with weights {1, -1/3, -1/3, -1/3}; 2) the difference between the second mean and the mean of conditions three and four, with weights {0, 1, -1/2, -1/2}; 3) the difference between the third and fourth condition means, with weights {0, 0, 1, -1}.

I will present the results in  separate tables for the 4 designs considered and include percentage difference between expected values of assurance MOE and the estimated values estimated values.

The fully crossed design 

The results are in the following table.

The percentage difference between the expected quantiles (= assurance MOEs for given insurance;  i.e. q.80 is expected or estimated  80% Assurance MOE) and the estimated quantiles are: .80: 0.11%; .90: 0.05%; .95: -0.14%; 99: -0.05%.

The counter balanced design 

The results are presented in the following table. 

The percentage difference between the expected quantiles and the estimated quantiles are: .80: 0.03%; .90: 0.13%;  .95: 0.09%, .99: -0.23%.

The stimulus-within-condition design 

The following table contains the details. 
The percentage difference between the expected quantiles and the estimated quantiles are: .80: -0.11%; .90: -0.33%;  .95: -0.55%, .99: -0.70%.  

Both-participant-and-stimulus-within-condition design 

Here is the table. 
And the percentage differences are: .80: -0.34%; .90: -0.59%;  .95: -0.82%;  .99: -1.06%. 

Conclusion

The results show that the simulation results are quite consistent with the expected values based on mixed model ANOVA. We can see that the differences between expected and estimated values increase the less the number of participants and items per condition. For instance, in the both within condition design 12 participants respond to 6 stimuli in one of the four treatment conditions. The fact that even with these small samples sizes the results seem to agree to an acceptable degree is (to my mind) encouraging. Note that with small samples the expected assurance MOES are slightly lower than the estimates, but the largest difference is -1.06% (see the MOE for 99% assurance). 

Planning for Precision: first simulation results

In this post, I want to share the results of the first simulation study to “test” my Planning for Precision app. More details about the app can be found in a previous post: here.

I have included the basic logic of the simulations (including R code) in a document that you can download: https://drive.google.com/open?id=0B4k88F8PMfAhSlNteldYRWFrQTg.

The simulation study simulates responses from a four condition counter balanced design, with p = 48 participants and q = 24 stimuli/items. Here, we will focus on expected and assurance MOE for three contrasts. The first contrast estimates the difference between the first mean and the average of the other three, the second contrast the difference between the second mean and the average of the third and fourth means, and the final contrast the difference between the means of the third and fourth contrasts.

Expected MOE is compared to the mean of the estimated MOE for each of the contrasts (based on 10000 replications). Assurance MOE is judged for assurance of .80, .90, .95 and .99, by comparing the calculations in the app with the corresponding quantile estimates of the simulated distributions.

Results 

Note that in the above table, the Expected Mean MOE is what I have called Expected MOE, and the q.80 through q.99 are quantiles of the distribution of MOE. As an example, q.80 is the quantile corresponding to assurance MOE with 80% assurance, Expected q.80 is the value of assurance MOE calculated with the theoretical approach, and Estimated q.80 is the estimated quantile based on the simulation studies. 
Importantly, we can see that most of the figures agree to a satisfying degree. If we look at the relative differences, expressed in percentages for the assurance MOEs, we get 0.0325% for q.80,  0.1260% for q.90, 0.0933% for q.95, and  -0.2324% for q.99. 

Conclusion

The first simulation results seem promising. But I still have a lot of work to do for the rest of the designs. 

Planning for precision with samples of participants and items

Many experiments involve the (quasi-)random selection of both participants and items. Westfall et al. (2014) provide a Shiny-app for power-calculations for five different experimental designs with selections of participants and items. Here I want to present my own Shiny-app for planning for precision of contrast estimates (for the comparison of up to four groups) in these experimental designs.  The app can be found here: https://gmulder.shinyapps.io/precision/

(Note: I have taken the code of Westfall’s app and added code or modified existing code to get precision estimates in stead of power; so, without Westfall’s app, my own modified version would never have existed).

The plan for this post is as follows. I will present the general theoretical background (mixed model ANOVA combined with ideas from Generalizability Theory) by considering comparing three groups in a counter balanced design.
Note 1: This post uses mathjax, so it’s probably unreadable on mobile devices. Note: a (tidied up) version (pdf) of this post can be downloaded here: download the pdf
Note 2: For simulation studies testing the procedure go here: https://the-small-s-scientist.blogspot.nl/2017/05/planning-for-precision-simulation.html
Note 3: I use the terms stimulus and item interchangeably; have to correct this to make things more readable and comparable to Westfall et al. (2014).
Note 4: If you do not like the technical details you can skip to an illustration of the app at the end of the post.

The general idea

The focus of planning for precision is to try to minimize the half-width of a 95%-confidence interval for a comparison of means (in our case). Following Cumming’s (2012) terminology I will call this half-width the Margin of Error (MOE). The actual purpose of the app is to find required sample sizes for participants and items that have a high probability (‘assurance’) of obtaining a MOE of some pre-specified value.

Expected MOE for a contrast

For a contrast estimate \hat{\psi}  we have the following expression for the expected MOE. 

    \[E(MOE) = t_{.975}(df)*\sigma_{\hat{\psi}},\]

where \sigma_{\hat{\psi}} is the standard error of the contrast estimate. Of course, both the standard error and the df are functions of the sample sizes.

For the standard error of a contrast with contrast weights c_i through c_a, where a is the number of treatment conditions,  we use the following general expression.

    \[\sigma_{\hat{\psi}} = \sqrt{\sum c^2_i \frac{\sigma^2_w}{n}},\]

where n is the per treatment sample size (i.e. the number of participants per treatment condition times the number of items per treatment condition) and \sigma^2_w the within treatment variance (we assume homogeneity of variance).

For a simple example take an independent samples design with n = 20 participants responding to 1 item in one of two possible treatment conditions (this is basically the set up for the independent t-test). Suppose we have contrast weights c_1 = 1 and c_2 = -1, and \sigma^2_w = 20, the standard error for this contrast equals \sigma_{\hat{\psi}} = \sqrt{\sum c^2_i\frac{\sigma^2_w}{n}} = \sqrt{2*20/20} = \sqrt{2}.  (Note that this is simply the standard error of the difference between two means as used in the independent samples t-test).

In this simple example, df is the total sample size (N = n*a) minus the number of treatment conditions (a), thus df =  N - a = 38. The expected MOE for this design is therefore, E(MOE) = t_{.975}(38)*\sqrt{2} = 2.0244*1.4142 = 2.8629. Note that using these figures entails that 95% of the contrast estimates will take values between the true contrast value plus and minus the expected MOE: \psi \pm 2.8629.

For the three groups case, and contrast weights {1, -1/2, -1/2}, the same sample sizes and within treatment variance gives E(MOE) = t_{.975}(57)*\sqrt{1.5*\frac{20}{20}} =  2.4525.

(If you like, I’ve written a little document with derivation of the variance of selected contrast estimates in the fully crossed design for the comparison of two and three group means. That document can be found here: https://drive.google.com/open?id=0B4k88F8PMfAhaEw2blBveE96VlU)

The focus of planning for precision is to try to find sample sizes that minimize expected MOE to a certain target MOE.  The app uses an optimization function that minimizes the squared difference between expected MOE and target MOE to find the optimal (minimal) sample sizes required.

Planning with assurance

If the expected MOE is equal to target MOE,  the sample estimate of MOE will be larger than your target MOE in 50% of replication experiments. This is why we plan with assurance (terminology from Cumming, 2012).  For instance, we may want to have a 95% probability (95% assurance) that the estimated MOE will not exceed our target MOE.

In order to plan with assurance, we need (an approximation of) the sampling distribution of MOE. In the ANOVA approach that underlies the app, this boils down to the distribution of estimates of \sigma^2_w

    \[MS_w \sim \sigma^2_w*\chi^2(df)/df,\]

thus

    \[\hat{MOE} \sim t_{.975}(df)\sqrt{\frac{1}{n}\sum{c_i^2}\sigma^2_w*\chi^2(df)/df}.\]

In terms of the two-groups independent samples design above: the expected MOE equals 2.8629. But, with df = 38, there is an 80% probability (assurance) that the estimated MOE will be no larger than:

    \[\hat{MOE}_{.80}  = 2.0244 * \sqrt{1/20*2*20*45.07628/38} = 3.1181.\]

Note that the 45.07628 is the quantile q_{.80} in the chi-squared (df = 38) distribution. That is P(\chi^2(38) \leq 45.07628) = .80.

The app let’s  you specify a target MOE and a value for the desired assurance (\gamma) and will find the combination of number of participants and items that will give an estimated MOE no larger than target MOE in \gamma% of the replication experiments.

The mixed model ANOVA approach

Basically, what we need to plan for precision is to able to specify \sigma^2_w and the degrees of freedom. We will specify \sigma^2_w as a function of variance components and use the Satterthwaite procedure to approximate the degrees of freedom by means of a linear combination of expected mean squares. I will illustrate the approach with a three-treatment conditions counterbalanced design.

A description of the design

Suppose we are interested in estimating the differences between three group means. We formulate two contrasts: one contrast estimates the mean difference between the first group and the average of the means of the second and third groups. The weights of the contrasts are respectively {1, -1/2, -1/2}, and {0, 1, -1}.

We are planning to use a counterbalanced design with a number of participants equal to p and a sample of items of size q. In the design we randomly assign participants to a groups, where a is the number of conditions, and randomly assign items to a lists (see Westfall et al., 2014 for more details about this design). All the groups are exposed to all lists of stimuli, but the groups are exposed to different lists in each condition. The number of group by list combinations equals a^2, and the number of observations in each group by list combination equals \frac{1}{a^2}pq. The condition means are estimated by combining a group by list combinations each of which composed of different participants and stimuli. The total number of observations per condition is therefore, \frac{a}{a^2}pq = \frac{1}{a}pq.

The ANOVA model

The ANOVA model for this design is

    \[Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k +  (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + e_{ijk},\]

where the effect \alpha_i is a constant treatment effect (it’s a fixed effect), and the other effect are random effects with zero mean and variances \sigma^2_\beta (participants), \sigma^2_\gamma (items), \sigma^2_{\alpha\beta} (person by treatment interaction), \sigma^2_{\alpha\gamma} (item by treatment interaction) and \sigma^2_e (error variance confounded with the person by item interaction). Note: in Table 1 below, \sigma^2_e is (for technical reasons not important for this blogpost) presented as this confounding [\sigma^2_{\beta\gamma} + \sigma^2_e].

We make use of the following restrictions (Sahai & Ageel, 2000): \sum_{i = 1}^a = 0, and \sum_{i=1}^a(\alpha\beta)_{ij} = \sum_{i=1}^a(\alpha\gamma)_{ik} = 0. The latter two restrictions make the interaction-effects correlated across conditions (i,e. the effects of person and treatment are correlated across condition for the same person, likewise the interaction effects of item and treatment are correlated across conditons for the same item. Interaction effects of different participants and items are uncorrelated). The covariances between the random effects \beta_j, \gamma_i, (\alpha\beta)_{ij}, (\alpha\gamma)_{ik}, e_{ijk} are assumed to be zero.

Under this model (and restrictions) E((\alpha\beta)^2) = \frac{a - 1}{a}\sigma^2_{\alpha\beta}, and E((\alpha\gamma)^2) = \frac{a - 1}{a}\sigma^2_{\alpha\gamma}. Furthermore, the covariance of the interactions between treatment and participant or between treatment and item for the same participant or item are -\frac{1}{a}\sigma^2_{\alpha\beta} for participants and -\frac{1}{a}\sigma^2_{\alpha\gamma} for items.

Within treatment variance

In order to obtain an expectation for MOE, we take the expected mean squares to get an expression or the expected within treatment variance \sigma^2_w. These expected means squares are presented in Table 1.

The expected within treatment variance can be found in the Treatment row in Table 1. It is comprised of all the components to the right of the component associated with the treatment effect (\theta^2_a). Thus, \sigma^2_w = \frac{q}{a}\sigma^2_{\alpha\beta} + \frac{p}{a}\sigma^2_{\alpha\gamma}+[\sigma^2_{\beta\gamma} + \sigma^2_e]. Note that the latter equals the sum of the expected mean squares of the Treatment by Participant (E(MS_{tp})) and the Treatment by Item (E(MS_{ti})) interactions, minus the expected mean square associated with Error (E(MS_e)).

Degrees of freedom

The second ingredient we need in order to obtain expected MOE are the degrees of freedom that are used to estimate the within treatment variance. In the ANOVA approach the within treatment variance is estimated by a linear combination of mean squares (as described in the last sentence of the previous section. This linear combination is also used to obtain approximate degrees of freedom using the Satterthwaite procedure:

    \[df =\frac{(E(MS_{tp}) + E(MS_{ti}) - E(MS_e)^2}{\frac{E(MS_{tp}))^2}{(a - 1)(p-a)}+\frac{E(MS_{ti})^2}{(a - 1)(q-a)}+\frac{E(MS_e)^2}{(p-a)(q-a)}}\]

Expected MOE

(Note: I can’t seem to get mathjax to generate align environments or equation arrays, so the following is ugly; Note to self: next time use R-studio or Lyx to generate R-html or an equivalent format).

The expected value of MOE for the contrasts in the counter balanced design is:

    \[E(MOE) = t(df)*\sqrt{(\sum_{i=1}^a c^2_i)(\frac{1}{a}pq)^{-1}\sigma^2_w}\]

    \[= t(df)*\sqrt{(\sum_{i=1}^a c^2_i)(\frac{1}{a}pq)^{-1}(\frac{q}{a}\sigma^2_{\alpha\beta} + \frac{p}{a}\sigma^2_{\alpha\gamma}+[\sigma^2_{\beta\gamma} + \sigma^2_e])}\]

    \[=t(df)*\sqrt{(\sum_{i=1}^a c^2_i)(pq)^{-1}(q\sigma^2_{\alpha\beta} + p\sigma^2_{\alpha\gamma}+a[\sigma^2_{\beta\gamma} + \sigma^2_e])}\]

    \[=t(df)*\sqrt{(\sum_{i=1}^a c^2_i)(\sigma^2_{\alpha\beta}/p + \sigma^2_{\alpha\gamma}/q +a[\sigma^2_{\beta\gamma} + \sigma^2_e]/pq)}\]

Finally an example

Suppose we the scores in three conditions are normally distributed with (total) variances \sigma^2_1 = \sigma^2_2 = \sigma^2_3 = 1.0. Suppose furthermore, that 10% of the variance can be attributed to treatment by participant interaction, 10% of the variance to the treatment by item interaction and 40% of the variance to the error confounded with the participant by item interaction. (which leaves 40% of the total variance attributable to participant and item variance.

Thus, we have \sigma^2_{tp} = E((\alpha\beta)^2) = .10, \sigma^2_{ti} = E((\alpha\gamma)^2) = .10, and \sigma^2_e = .40. Our target MOE is .25, and we plan to use the counterbalanced design with p = 30 participants, and q = 15 items (stimuli).

Due to the model restrictions presented above we have \sigma^2_{\alpha\beta} = \frac{a}{a - 1}\sigma^2_{tp} = \frac{3}{2}*.10 = .15, \sigma^2_{\alpha\gamma} = .15, and \sigma^2_e = .40.

The value of \sigma^2_w is therefore, 5*.15 + 10*.15 +.40 = 2.65, and the approximate df equal df = (2.65^2) / ((5*.15 + .40)^2/(2*27) + (10*.15+.40)^2/(2*12) + .40^2/(27*12)) = 74.5874.

For the first contrast, with weights {1, -1/2. -1/2}, then, the Expected value for the Margin of Error is E(MOE) = t_{.975}(74.5874)*\sqrt{(1.5 * (.15/30 + .15/15 + 3*.40/(30*15))} = 0.3243.

For the second contrast, with weights {0, 1, -1}, the Expected value of the Margin of Error is t_{.975}(74.5874)*\sqrt{(2*(.15/30 + .15/15 + 3*.40/(30*15))} = 0.3745

Thus, using p = 30 participants, and q = 15 items (stimuli) will not lead to an expected MOE larger than the target MOE of .25.

We can use the app to find the required number of participants and items for a given target MOE. If the number of groups is larger than two, the app uses the contrast estimate with the largest expected MOE to calculate the sample sizes (in the default setting the one comparing only two group means). The reasoning is that if the least precise estimate (in terms of MOE) meets our target precision, the other ones meet our target precision as well.

Using the app

I’ve included lot’ of comments in the app itself, but please ignore references to a manual (does not exist, yet, except in Dutch) or an article (no idea whether or not I’ll be able to finish the write-up anytime soon). I hope the app is pretty straightforward. Just take a look at  https://gmulder.shinyapps.io/precision/, but the basic idea is:
– Choose one of five designs
– Supply the number of treatment conditions
– Specify contrast(weights) (or use the default ones)
– Supply target MOE and assurance
– Supply values of variance components (read (e,g,) Westfall, et al, 2014, for more details).
– Supply a number of participants and items
– Choose run precision analysis with current values or
– Choose get sample sizes. (The app gives two solutions: one minimizes the number of participants and the other minimizes the number of stimuli/items). NOTE: the number of stimuli is always greater than or equal to 10 and the number of participants is always greater than or equal to 20.

An illustration

Take the example above. Out target MOE equals .25, and we want insurance of .80 to get an estimated MOE of no larger than .25. We use a counter-balanced design with three conditions, and want to estimate two contrasts: one comparing the first mean with the average of means two and three, and the other contrast compares the second mean with the third mean. We can use the default contrasts.
For the variance components, we use the default values provided by Westfall et al. (2014) for the variance components. These are also the default values in the app (so we don’t need to change anything now).
Let’s see what happens when we propose to use p = 30 participants and q = 15 items/stimuli.
Here is part of a screenshot from the app:
These results show that the expected MOE for the first contrast (comparing the first mean with the average of the other means) equals 0.3290, and assurance MOE for the same contrasts equals 0.3576. Remember that we specified the assurance as .80. So, this means that 80% of the replication experiments give estimated MOE as large as or smaller than 0.3576. But we want that to be at most 0.2500.  Thus, 30 participants and 15 items do suffice for our purposes.
Let’s use to app to get sample sizes. The results are as follows.

The app promises that using 25 stimuli combined with 290 participants or 25 participants and 290 items will do the trick (the symmetry of these results are due to the fact that the interaction components are equal; both the treatment by participant and the treatment by stimulus interaction component equal .10).  Since we have 3 treatment conditions using 290 participants or stimuli is a little awkward, so I suggest to use 291 (equals 97 participants per group or 97 items per list). (300 is a much nicer figure of course). Likewise, as it is hard to equally divide 25 stimuli or participants over three lists or groups, use a multiple of three (say: 27).

If we input the suggest sample sizes in the app, we see the following results if we choose the run precision analysis  with current values.

As you can see: Assurance MOE is close to 0.25 (.24) for the second contrast (the least precise one), so 80% of replication experiments will get estimated MOE of 0.25 (.24) or smaller. The expected precision is 0.22. The first contrast (which can be estimated with more precision) has assurance MOE of 0.21 and expected MOE of approximately 0.19.  Thus, the sample sizes lead to the results we want.

References

Cumming, G. (2012). Understanding the New Statistics. New York/London: Routledge.

Sahai, H., & Ageel, M. I. (2000). The analysis of variance. Fixed, Random, and Mixed Models. Boston/Basel/Berlin: Birkhäuser.

Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020-2045.