Planning with assurance, with assurance

Planning for precision requires that we choose a target Margin of Error (MoE; see this post for an introduction to the basic concepts) and a value for assurance, the probability that MoE will not exceed our target MoE.  What your exact target MoE will be depends on your research goals, of course.

Cumming and Calin-Jageman (2017, p. 277) propose a strategy for determining target MoE. You can use this strategy if your research goal is to provide strong evidence that the effect size is non-zero. The strategy is to divide the expected value of the difference by two, and to use that result as your target MoE.

Let’s restrict our attention to the comparison of two means. If the expected difference between the two means is Cohens’s d = .80, the proposed strategy is to set your target MoE at f = .40, which means that your target MoE is set at .40 standard deviations. If you plan for this value of target MoE with 80% assurance, the recommended sample size is n = 55 participants per group. These results are guaranteed to be true, if it is known for a fact that Cohen’s d is .80 and all statistical assumptions apply.

But it is generally not known for a fact that Cohen’s d has a particular value and so we need to answer a non-trivial question: what effect size can we reasonably expect? And, how can we have assurance that the MoE will not exceed half the unknown true effect size? One of the many options we have for answering this question is to conduct a pilot study, estimate the plausible values of the effect size and use these values for sample size planning.  I will describe a strategy that basically mirrors the sample size planning for power approach described by Anderson, Kelley, and Maxwell (2017).

The procedure is as follows. In order to plan with approximately 80% assurance, estimate on the basis of your pilot the 80% confidence interval for the population effect size and use half the value of the lower limit for sample size planning with 90% assurance. This will give you 81% assurance that assurance MoE is no larger than half the unknown true effect size.

The logic of planning with assurance, with assurance

There are two “problems” we need to consider when estimating the true effect size. The first problem is that there is at least 50% probability of obtaining an overestimate of the true effect size. If that happens, and we take the point estimate of the effect size as input for sample size planning, what we “believe” to be a sample size sufficient for 80% assurance will be a sample size that has less than 80% assurance at least 50% of the times. So, using the point estimate gives assurance MoE for the unknown effect size with less than 50% assurance.

To make it more concrete: suppose the true effect equals .80, and we use n = 25 participants in both groups of the pilot study, the probability is  approximately 50% that the point estimate is above .80. This implies, of course, that we will plan for a value of f > .40, approximately 50% of the times, and so the sample we get will only give us 80% assurance 50% of the times.

The second problem is that the small sample sizes we normally use for pilot studies may give highly imprecise estimates. For instance, with n = 25 participants per group, the expected MoE is f = 0.5687. So, even if we accept 50% assurance, it is highly likely that the point estimate is rather imprecise.

Since we are considering a pilot study,  one of the obvious solutions, increasing the sample size so that expected MoE is likely to be small, is not really an option. But what we can do is to use an estimate that is unlikely to be an overestimate of the true effect size. In particular, we can use as our estimate the lower limit of a confidence interval for the effect size.

Let me explain, by considering the 80% CI  of the effect size estimate. From basic theory it follows that the “true” value of the effect size will be smaller than the lower limit of the 80% confidence interval with probability  equal to 10%. That is, if we calculate a huge number of 80% confidence intervals, each time on the basis of new random samples from the population, the true value of the effect size will be below the lower limit in 10% of the cases. This also means that the lower limit of the interval has 90% probability to not overestimate the true effect size.

This means that  if we take the lower limit of the 80% CI of the pilot estimate as input for our sample size calculations, and if we plan with assurance of .90, we will have 90%*90% = 81% assurance that using the sample size we get from our calculations will have  MoE  no larger than half the true effect size. (Note that for 80% CI’s with negative limits you should choose the upper limit).

Sample Size planning based on a pilot study

Student of mine recently did a pilot study.  This was a pilot for an experiment investigating the size of the effect of fluency of delivery of a spoken message in a video on Comprehensibility, Persuasiveness and viewers’ Appreciation of the video. The pilot study used two groups of size n = 10, one group watched the fluent video (without ‘eh’) and the other group watched the disfluent video where the speaker used ‘eh’ a lot. The dependent variables were measured on 7-point scales.

Let’s look at the results for the Appreciation variable. The (biased) estimate of Cohen’s d (based on the pooled standard deviation) equals 1.09, 80% CI [0.46, 1.69] (I’ve calculated this using the ci.smd function from the MBESS-package. According to the rules-of-thumb for interpreting Cohen’s d, this can be considered a large effect. (For communication effect studies it can be considered an insanely large effect). However, the CI shows the large imprecision of the result, which is of course what we can expect with sample sizes of n = 10. (Average MoE equals f = 0.95, and according to my rules-of-thumb that is well below what I consider to be borderline precise).

If we use the lower limit of the interval (d = 0.46),  sample size planning with 90% assurance for half that effect (f = 0.23) gives us a sample size equal to n = 162. (Technical note: I planned  for the half-width of the standardized CI of the unstandardized effect size, not for the CI of the standardized effect size; I used my Shiny App for planning assuming an independent groups design with two groups).  As explained, since we used the lower limit of the 80% CI of the pilot and used 90% assurance in planning the sample size, the assurance that MoE will not exceed half the unknown true effect size equals 81%.

Sample size planning for precision: the basics

In this post, I will introduce some of the ideas underlying sample size planning for precision. The ideas are illustrated with a shiny-application which can be found here: The app illustrates the basic theory considering sample size planning for two independent groups. (If the app is no longer available (my allotted active monthly hours are limited on, contact me and I’ll send you the code).

The basic idea

The basic idea is that we are planning an experiment to estimate the difference in population means of an experimental and a control group. We want to know how many observations per group we have to make in order to estimate the difference between the means with a given target precision. 
Our measure of precision is the Margin of Error (MOE).  In the app, we specify our target MOE as a fraction (f) of the population standard deviation. However, we do not only specify our target MOE, but also our desired level of assurance. The assurance is the probability that our obtained MOE will not exceed our target MOE. Thus, if the assurance is .80 and our target MOE is f = .50, we have a probability of 80% that our obtained MOE will not exceed f = .50. 
The only part of the app you need for sample size planning is the “Sample size planning”-form. Specify f, and the assurance, and the app will give you the desired sample size. 
If you do that with the default values f = .50 and Assurance  = .80, the app will give you the following results on the Planning Results-tab:  Sample Size: 36.2175, Expected MOE (f): 0.46. This tells you that you need to sample 37 participants (for instance) per group and then the Expected MOE (the MOE you will get on average) will equal 0.46 (or even a little less, since you sample more than 36.2175 participants). 
The Planning-Results-tab also gives you a figure for the power of the t-test, testing the NHST nil-hypothesis for the effect size (Cohen’s d) specified in the “Set population values”-form. Note that this form, like the rest of the app provides details that are not necessary for sample size planning for precision, but make the theoretical concepts clear. So, let’s turn to those details. 

The population

Even though it is not at all necessary to specify the population values in detail, considering the population helps to realize the following. The sample size calculations and the figures for expected MOE and power, are based on the assumption that we are dealing with random samples from normal populations with equal variances (standard deviations). 
From these three assumptions, all the results follow deductively.  The following is important to realize:  if these assumptions do not obtain, the truth of the (statistical) conclusions we derive by deduction is no longer guaranteed. (Maybe you have never before realized that sample size planning involves deductive reasoning; deductive reasoning is also required for the calculation of p-values and to prove that 95% confidence intervals contain the value of the population parameter in 95% of the cases; without these assumptions is it uncertain what the true p-value is and whether or not the 95% confidence interval is in fact a 95% confidence interval).

In general, then, you should try to show (to others, if not to yourself) that it is reasonable to assume normally distributed populations, with equal variances and random sampling, before you decide that the p-value of your t-test, the width of your confidence interval, and the results of sample size calculations are believable.

The populations in the app are normal distributions. By default, the app shows two such distributions. One of the distributions, the one I like to think about as corresponding to the control condition, has μ = 0, the other one has μ = 0.5. Both distributions have a standard deviation (σ = 1). The standardized difference between the means is therefore equal to δ = 0.50.

The default populations are presented in Figure 1 below.

normal populations
Figure 1: Two normal distributions. The distribution to the left has μ = 0, the one to the right has μ = 0.5 The standard deviation in both distributions equals σ = 1. The standardized difference δ and the unstandardized difference between the means both equal 0.50. 

The sampling distribution of the mean difference 

The other default setting in the app is a sample size (per group) of n = 20.  From the sample size and the specification of the populations, we can deduce the probability density of the different values of the estimates of the difference between the population means. The estimate is simply the difference between the sample means.

This so-called sampling distribution of the mean difference is depicted on the tab next to the population. Figure 2 shows what the sampling distribution looks like if we repeatedly draw random samples of size n = 20 per group from our populations and keep track of the difference between the sample means we get in each repetition.

sampling distribution of difference
Figure 2: Sampling distribution of the difference between two sample means based on samples of n = 20 per group and random sampling from the populations described in Figure 1. 

Note that the mean of the sampling distribution equals 0.5 (as indicated by the middle vertical line). This is of course the (default) difference between the population means in the app. So, on average, estimates of the population difference equal the population difference.

The lines to the left and the right of the mean indicate the mean plus or minus the Margin of Error (MOE). The values corresponding to the lines are 0.5 ± MOE. 95% of estimates of the population mean difference have a value between these lines.

Conceptually, the purpose of planning for precision is to decrease the (horizontal) distance between these lines and the population mean difference. In other words, we would like the left and right lines as close to the mean of the distribution as is practically acceptable and possible.

The distribution of the t-statistic 

The tab next to the sampling distribution tab contains a figure representing the sampling distribution of the t-statistic. The sampling distribution of t can be deduced on the basis of the population values and the sample size.  In the app, it is assumed that t is calculated under the assumption that the null-hypothesis of zero difference between the means is true. The sampling distribution of t is what you get if you repeatedly sample from the populations as specified, calculate the t-statistic and keep a record of the values of the t-statistic.

The sampling distribution of the t-statistic presented in Figure 3 contains two vertical lines. These lines are located (horizontally) on the value of t that would lead to rejection of the null-hypothesis of equal population means. In other words, the lines are located at the critical value of t (for a two-tailed test).

distribution of t
Figure 3: Distribution of the t-statistic testing the null-hypothesis of equal population means. The distribution is based on sampling from the populations described in Figure 3. The sample size is n = 20 per group. The lines represent the critical value of t for a two sided t-test. The area between the vertical lines is the probability of a type II error. The combined areas to the left of the left line and to the right of the right line is the power of the test. 

The area between the lines is the probability that the null-hypothesis will not be rejected. In the case of a true population mean difference (which is the default assumption in the app), that probability is the probability of an error of the second kind: a type II error.

The complement of that probability is called the power of the test. This is, of course, the area to the left of the left vertical line added to the area to the right of the right vertical line. Conceptually, the power of the test is the probability of rejecting the null-hypothesis when in fact it is false.

Figure 3 clearly demonstrates that if the true mean difference equals 0.50 and the sample size (per group) equals n = 20, that there is a large probability that the null-hypothesis will not be rejected. Actually, the probability of a type II error equals .66. (So, the power of the test is .34).

Sample size planning for precision

With respect to sample size planning for precision, the app by default takes half of a standard deviation (f = .50) as the target MOE. Besides, planning is with 80% assurance. This means that the default settings search for a sample size (per group), so that with 80%  probability MOE will not exceed 0.50 (Note that the default value of the standard deviation is 1, so an f of .50 corresponds to a target MOE of  0.50 on the scale of the data; Likewise, were the standard deviation equal to 2, an f of .50 would correspond to a target MOE of 1.0).

As described above, planning with the default values gives us a sample size of  n = 37 per group, with an expected MOE of 0.46. In the tab next to the planning results, a figure displays what you can expect to find on average, given the planned sample size and the specification of the population. That figure is repeated here as Figure 4.

Expected results
Figure 4: Expected results in terms of point and interval estimates (95% confidence intervals). This is what you will find on average given the population specification in Figure 1 and using the default values for sample size planning. 

Figure 4 displays point and interval estimates of the group means and the difference between the means. The interval estimates are 95% confidence intervals. The figure clearly shows that on average, our estimate of the difference is very imprecise. That is, the expected 95% confidence interval ranges from almost 0 (0.50 – 0.46 = 0.04) to almost 1 (0.50 + 0.46 = 0.96). Of course, using n = 20, would be worse still.

A nice thing about the app (well, I for one think it’s pretty cool) is that as soon as you ask for the sample sizes, the sample size in the set population values form is automatically updated. Most importantly, this will also update the sampling distribution graphs of the difference between the means and the t-statistic. So, it provides an excellent way of showing what the updated sample size means in terms of MOE and the power of the t-test.

Let’s have a look at the sampling distribution of the mean difference, see Figure 5.

Sampling distribution of the difference.
Figure 5: Sampling distribution of the mean difference with n = 37 per group. Compare with Figure 2 to see the (small) difference in the Margin of Error compared to n = 20.  

If you compare Figures 5 and 2, you see that the vertical lines corresponding to the mean plus and minus MOE have shifted somewhat towards the mean. So here you can see, that almost doubling the sample size (from 20 to 37) had the desired effect of making MOE smaller.

I would like to point out the similarity between the sampling distribution of the difference and the expected results plot in Figure 4. If you look at the expected results for our estimate of the population difference, you see that the point estimate corresponds to the mean of the sampling distribution, which is of course equal to the populations mean difference and that the limits of the expected confidence interval correspond to the left and right vertical lines in Figure 5. Thus, on average the limits of the confidence interval correspond to the values that mark the middle 95% of the sampling distribution of the samples mean difference.

Since we specified an assurance of 80%, there is an 80% probability that in repeated sampling from the populations (see Figure 1) with n = 37 per group, our (estimated) MOE will not exceed half a standard deviation. Thus, whatever the true value of the populations mean difference is, there is a high probability that our estimate will not be more than half a standard deviation away from the mean. This is, I think, one of the major advantages of sample size planning for precision: we do not have to specify the unknown population mean difference. This is in contrast to sample size planning for power, where we do have to specify a specific population mean difference.

Speaking of power, the results of the sample size planning suggest that for our specification of the populations mean difference (Cohen’s delta = 0.50) the power of the test equals 0.56. Thus, there is a probability of 56% that with n = 37 per group the t-test will reject. The probability of a type II error is therefore 44%.

Figure 6 shows the distribution of the t statistic with n = 37 per group and a standardized effect size of 0.50.

Distribution of the t statistic
Figure 6. The distribution of the t-statistic testing the null-hypothesis of equal population means. The distribution is based on the population specification in Figure 1 and sample sizes of n = 37 per group, with true effect size equal to 0.50. The probability of a type II error is the area of under the curve between the two vertical lines. The power is the area under the curve beyond the two lines. Compare with Figure 3 to see the differences in these probabilities compared to n = 20.

Power versus precision

Now suppose that the unstandardized mean difference between the population means equals 2 and that the standard deviation equals 2.5.  I just filled in the set population values form, setting the mean of population 2 to 2.0 and the standard deviation to 2.5. And I clicked set values.

Let us plan for a target MOE of  f = 0.5 standard deviations with 80% assurance. Click get sample sizes in the sample size planning form. In this case, target MOE equals 1.25.

The results are not very surprising. Since the f did not change compared to the previous time, the results as regards the sample size are exactly the same. We  need n = 37. Again, this is what I like about sample size planning, no matter what the unknown situation in the population is, I just want my margin of error to be no more than half a standard deviation (for example).

But the power did change (of course). Since the standardized population mean difference is now 0.80 (= 2.0 / 2.5) in stead of 0.50, and all the other specifications remained the same, the power increases from 56% to 92%. That’s great.

However, the high probability of rejecting the null-hypothesis does not mean that we get precise estimates. On average, the point estimate of the difference equals 2 and the 95% confidence limits are  0.85 and 3.15 (the point estimate plus or minus 0.46 times the standard deviation of 2.5). See Figure 7.

Expected results large standardized effect
Figure 7: Expected results using n = 37 when sampling from two normal populations with equal standard deviations (σ = 2.5) and mean difference of 2.0. The standardized effect size equals 0.80. Note the imprecision of the estimates even though the power of the t-test equals .92.

In short, even though there is a high probability of  (correctly) rejecting the null-hypothesis of equal population means, we are still not in the position to confidently conclude what the size of the difference is: the expected confidence interval is very wide. 

Planning for precision with samples of participants and items

Many experiments involve the (quasi-)random selection of both participants and items. Westfall et al. (2014) provide a Shiny-app for power-calculations for five different experimental designs with selections of participants and items. Here I want to present my own Shiny-app for planning for precision of contrast estimates (for the comparison of up to four groups) in these experimental designs.  The app can be found here:

(Note: I have taken the code of Westfall’s app and added code or modified existing code to get precision estimates in stead of power; so, without Westfall’s app, my own modified version would never have existed).

The plan for this post is as follows. I will present the general theoretical background (mixed model ANOVA combined with ideas from Generalizability Theory) by considering comparing three groups in a counter balanced design.
Note 1: This post uses mathjax, so it’s probably unreadable on mobile devices. Note: a (tidied up) version (pdf) of this post can be downloaded here: download the pdf
Note 2: For simulation studies testing the procedure go here:
Note 3: I use the terms stimulus and item interchangeably; have to correct this to make things more readable and comparable to Westfall et al. (2014).
Note 4: If you do not like the technical details you can skip to an illustration of the app at the end of the post.

The general idea

The focus of planning for precision is to try to minimize the half-width of a 95%-confidence interval for a comparison of means (in our case). Following Cumming’s (2012) terminology I will call this half-width the Margin of Error (MOE). The actual purpose of the app is to find required sample sizes for participants and items that have a high probability (‘assurance’) of obtaining a MOE of some pre-specified value.

Expected MOE for a contrast

For a contrast estimate \hat{\psi}  we have the following expression for the expected MOE. 

    \[E(MOE) = t_{.975}(df)*\sigma_{\hat{\psi}},\]

where \sigma_{\hat{\psi}} is the standard error of the contrast estimate. Of course, both the standard error and the df are functions of the sample sizes.

For the standard error of a contrast with contrast weights c_i through c_a, where a is the number of treatment conditions,  we use the following general expression.

    \[\sigma_{\hat{\psi}} = \sqrt{\sum c^2_i \frac{\sigma^2_w}{n}},\]

where n is the per treatment sample size (i.e. the number of participants per treatment condition times the number of items per treatment condition) and \sigma^2_w the within treatment variance (we assume homogeneity of variance).

For a simple example take an independent samples design with n = 20 participants responding to 1 item in one of two possible treatment conditions (this is basically the set up for the independent t-test). Suppose we have contrast weights c_1 = 1 and c_2 = -1, and \sigma^2_w = 20, the standard error for this contrast equals \sigma_{\hat{\psi}} = \sqrt{\sum c^2_i\frac{\sigma^2_w}{n}} = \sqrt{2*20/20} = \sqrt{2}.  (Note that this is simply the standard error of the difference between two means as used in the independent samples t-test).

In this simple example, df is the total sample size (N = n*a) minus the number of treatment conditions (a), thus df =  N - a = 38. The expected MOE for this design is therefore, E(MOE) = t_{.975}(38)*\sqrt{2} = 2.0244*1.4142 = 2.8629. Note that using these figures entails that 95% of the contrast estimates will take values between the true contrast value plus and minus the expected MOE: \psi \pm 2.8629.

For the three groups case, and contrast weights {1, -1/2, -1/2}, the same sample sizes and within treatment variance gives E(MOE) = t_{.975}(57)*\sqrt{1.5*\frac{20}{20}} =  2.4525.

(If you like, I’ve written a little document with derivation of the variance of selected contrast estimates in the fully crossed design for the comparison of two and three group means. That document can be found here:

The focus of planning for precision is to try to find sample sizes that minimize expected MOE to a certain target MOE.  The app uses an optimization function that minimizes the squared difference between expected MOE and target MOE to find the optimal (minimal) sample sizes required.

Planning with assurance

If the expected MOE is equal to target MOE,  the sample estimate of MOE will be larger than your target MOE in 50% of replication experiments. This is why we plan with assurance (terminology from Cumming, 2012).  For instance, we may want to have a 95% probability (95% assurance) that the estimated MOE will not exceed our target MOE.

In order to plan with assurance, we need (an approximation of) the sampling distribution of MOE. In the ANOVA approach that underlies the app, this boils down to the distribution of estimates of \sigma^2_w

    \[MS_w \sim \sigma^2_w*\chi^2(df)/df,\]


    \[\hat{MOE} \sim t_{.975}(df)\sqrt{\frac{1}{n}\sum{c_i^2}\sigma^2_w*\chi^2(df)/df}.\]

In terms of the two-groups independent samples design above: the expected MOE equals 2.8629. But, with df = 38, there is an 80% probability (assurance) that the estimated MOE will be no larger than:

    \[\hat{MOE}_{.80}  = 2.0244 * \sqrt{1/20*2*20*45.07628/38} = 3.1181.\]

Note that the 45.07628 is the quantile q_{.80} in the chi-squared (df = 38) distribution. That is P(\chi^2(38) \leq 45.07628) = .80.

The app let’s  you specify a target MOE and a value for the desired assurance (\gamma) and will find the combination of number of participants and items that will give an estimated MOE no larger than target MOE in \gamma% of the replication experiments.

The mixed model ANOVA approach

Basically, what we need to plan for precision is to able to specify \sigma^2_w and the degrees of freedom. We will specify \sigma^2_w as a function of variance components and use the Satterthwaite procedure to approximate the degrees of freedom by means of a linear combination of expected mean squares. I will illustrate the approach with a three-treatment conditions counterbalanced design.

A description of the design

Suppose we are interested in estimating the differences between three group means. We formulate two contrasts: one contrast estimates the mean difference between the first group and the average of the means of the second and third groups. The weights of the contrasts are respectively {1, -1/2, -1/2}, and {0, 1, -1}.

We are planning to use a counterbalanced design with a number of participants equal to p and a sample of items of size q. In the design we randomly assign participants to a groups, where a is the number of conditions, and randomly assign items to a lists (see Westfall et al., 2014 for more details about this design). All the groups are exposed to all lists of stimuli, but the groups are exposed to different lists in each condition. The number of group by list combinations equals a^2, and the number of observations in each group by list combination equals \frac{1}{a^2}pq. The condition means are estimated by combining a group by list combinations each of which composed of different participants and stimuli. The total number of observations per condition is therefore, \frac{a}{a^2}pq = \frac{1}{a}pq.

The ANOVA model

The ANOVA model for this design is

    \[Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k +  (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + e_{ijk},\]

where the effect \alpha_i is a constant treatment effect (it’s a fixed effect), and the other effect are random effects with zero mean and variances \sigma^2_\beta (participants), \sigma^2_\gamma (items), \sigma^2_{\alpha\beta} (person by treatment interaction), \sigma^2_{\alpha\gamma} (item by treatment interaction) and \sigma^2_e (error variance confounded with the person by item interaction). Note: in Table 1 below, \sigma^2_e is (for technical reasons not important for this blogpost) presented as this confounding [\sigma^2_{\beta\gamma} + \sigma^2_e].

We make use of the following restrictions (Sahai & Ageel, 2000): \sum_{i = 1}^a = 0, and \sum_{i=1}^a(\alpha\beta)_{ij} = \sum_{i=1}^a(\alpha\gamma)_{ik} = 0. The latter two restrictions make the interaction-effects correlated across conditions (i,e. the effects of person and treatment are correlated across condition for the same person, likewise the interaction effects of item and treatment are correlated across conditons for the same item. Interaction effects of different participants and items are uncorrelated). The covariances between the random effects \beta_j, \gamma_i, (\alpha\beta)_{ij}, (\alpha\gamma)_{ik}, e_{ijk} are assumed to be zero.

Under this model (and restrictions) E((\alpha\beta)^2) = \frac{a - 1}{a}\sigma^2_{\alpha\beta}, and E((\alpha\gamma)^2) = \frac{a - 1}{a}\sigma^2_{\alpha\gamma}. Furthermore, the covariance of the interactions between treatment and participant or between treatment and item for the same participant or item are -\frac{1}{a}\sigma^2_{\alpha\beta} for participants and -\frac{1}{a}\sigma^2_{\alpha\gamma} for items.

Within treatment variance

In order to obtain an expectation for MOE, we take the expected mean squares to get an expression or the expected within treatment variance \sigma^2_w. These expected means squares are presented in Table 1.

The expected within treatment variance can be found in the Treatment row in Table 1. It is comprised of all the components to the right of the component associated with the treatment effect (\theta^2_a). Thus, \sigma^2_w = \frac{q}{a}\sigma^2_{\alpha\beta} + \frac{p}{a}\sigma^2_{\alpha\gamma}+[\sigma^2_{\beta\gamma} + \sigma^2_e]. Note that the latter equals the sum of the expected mean squares of the Treatment by Participant (E(MS_{tp})) and the Treatment by Item (E(MS_{ti})) interactions, minus the expected mean square associated with Error (E(MS_e)).

Degrees of freedom

The second ingredient we need in order to obtain expected MOE are the degrees of freedom that are used to estimate the within treatment variance. In the ANOVA approach the within treatment variance is estimated by a linear combination of mean squares (as described in the last sentence of the previous section. This linear combination is also used to obtain approximate degrees of freedom using the Satterthwaite procedure:

  1.     \[df =\frac{(E(MS_{tp}) + E(MS_{ti}) - E(MS_e))^2}{\frac{E(MS_{tp})^2}{(a - 1)(p-a)}+\frac{E(MS_{ti})^2}{(a - 1)(q-a)}+\frac{E(MS_e)^2}{(p-a)(q-a)}}\]

Expected MOE

(Note: I can’t seem to get mathjax to generate align environments or equation arrays, so the following is ugly; Note to self: next time use R-studio or Lyx to generate R-html or an equivalent format).

The expected value of MOE for the contrasts in the counter balanced design is:

    \[E(MOE) = t(df)*\sqrt{(\sum_{i=1}^a c^2_i)(\frac{1}{a}pq)^{-1}\sigma^2_w}\]

    \[= t(df)*\sqrt{(\sum_{i=1}^a c^2_i)(\frac{1}{a}pq)^{-1}(\frac{q}{a}\sigma^2_{\alpha\beta} + \frac{p}{a}\sigma^2_{\alpha\gamma}+[\sigma^2_{\beta\gamma} + \sigma^2_e])}\]

    \[=t(df)*\sqrt{(\sum_{i=1}^a c^2_i)(pq)^{-1}(q\sigma^2_{\alpha\beta} + p\sigma^2_{\alpha\gamma}+a[\sigma^2_{\beta\gamma} + \sigma^2_e])}\]

    \[=t(df)*\sqrt{(\sum_{i=1}^a c^2_i)(\sigma^2_{\alpha\beta}/p + \sigma^2_{\alpha\gamma}/q +a[\sigma^2_{\beta\gamma} + \sigma^2_e]/pq)}\]

Finally an example

Suppose we the scores in three conditions are normally distributed with (total) variances \sigma^2_1 = \sigma^2_2 = \sigma^2_3 = 1.0. Suppose furthermore, that 10% of the variance can be attributed to treatment by participant interaction, 10% of the variance to the treatment by item interaction and 40% of the variance to the error confounded with the participant by item interaction. (which leaves 40% of the total variance attributable to participant and item variance.

Thus, we have \sigma^2_{tp} = E((\alpha\beta)^2) = .10, \sigma^2_{ti} = E((\alpha\gamma)^2) = .10, and \sigma^2_e = .40. Our target MOE is .25, and we plan to use the counterbalanced design with p = 30 participants, and q = 15 items (stimuli).

Due to the model restrictions presented above we have \sigma^2_{\alpha\beta} = \frac{a}{a - 1}\sigma^2_{tp} = \frac{3}{2}*.10 = .15, \sigma^2_{\alpha\gamma} = .15, and \sigma^2_e = .40.

The value of \sigma^2_w is therefore, 5*.15 + 10*.15 +.40 = 2.65, and the approximate df equal df = (2.65^2) / ((5*.15 + .40)^2/(2*27) + (10*.15+.40)^2/(2*12) + .40^2/(27*12)) = 74.5874.

For the first contrast, with weights {1, -1/2. -1/2}, then, the Expected value for the Margin of Error is E(MOE) = t_{.975}(74.5874)*\sqrt{(1.5 * (.15/30 + .15/15 + 3*.40/(30*15))} = 0.3243.

For the second contrast, with weights {0, 1, -1}, the Expected value of the Margin of Error is t_{.975}(74.5874)*\sqrt{(2*(.15/30 + .15/15 + 3*.40/(30*15))} = 0.3745

Thus, using p = 30 participants, and q = 15 items (stimuli) will not lead to an expected MOE larger than the target MOE of .25.

We can use the app to find the required number of participants and items for a given target MOE. If the number of groups is larger than two, the app uses the contrast estimate with the largest expected MOE to calculate the sample sizes (in the default setting the one comparing only two group means). The reasoning is that if the least precise estimate (in terms of MOE) meets our target precision, the other ones meet our target precision as well.

Using the app

I’ve included lot’ of comments in the app itself, but please ignore references to a manual (does not exist, yet, except in Dutch) or an article (no idea whether or not I’ll be able to finish the write-up anytime soon). I hope the app is pretty straightforward. Just take a look at, but the basic idea is:
– Choose one of five designs
– Supply the number of treatment conditions
– Specify contrast(weights) (or use the default ones)
– Supply target MOE and assurance
– Supply values of variance components (read (e,g,) Westfall, et al, 2014, for more details).
– Supply a number of participants and items
– Choose run precision analysis with current values or
– Choose get sample sizes. (The app gives two solutions: one minimizes the number of participants and the other minimizes the number of stimuli/items). NOTE: the number of stimuli is always greater than or equal to 10 and the number of participants is always greater than or equal to 20.

An illustration

Take the example above. Out target MOE equals .25, and we want insurance of .80 to get an estimated MOE of no larger than .25. We use a counter-balanced design with three conditions, and want to estimate two contrasts: one comparing the first mean with the average of means two and three, and the other contrast compares the second mean with the third mean. We can use the default contrasts.
For the variance components, we use the default values provided by Westfall et al. (2014) for the variance components. These are also the default values in the app (so we don’t need to change anything now).
Let’s see what happens when we propose to use p = 30 participants and q = 15 items/stimuli.
Here is part of a screenshot from the app:
These results show that the expected MOE for the first contrast (comparing the first mean with the average of the other means) equals 0.3290, and assurance MOE for the same contrasts equals 0.3576. Remember that we specified the assurance as .80. So, this means that 80% of the replication experiments give estimated MOE as large as or smaller than 0.3576. But we want that to be at most 0.2500.  Thus, 30 participants and 15 items do suffice for our purposes.
Let’s use to app to get sample sizes. The results are as follows.

The app promises that using 25 stimuli combined with 290 participants or 25 participants and 290 items will do the trick (the symmetry of these results are due to the fact that the interaction components are equal; both the treatment by participant and the treatment by stimulus interaction component equal .10).  Since we have 3 treatment conditions using 290 participants or stimuli is a little awkward, so I suggest to use 291 (equals 97 participants per group or 97 items per list). (300 is a much nicer figure of course). Likewise, as it is hard to equally divide 25 stimuli or participants over three lists or groups, use a multiple of three (say: 27).

If we input the suggest sample sizes in the app, we see the following results if we choose the run precision analysis  with current values.

As you can see: Assurance MOE is close to 0.25 (.24) for the second contrast (the least precise one), so 80% of replication experiments will get estimated MOE of 0.25 (.24) or smaller. The expected precision is 0.22. The first contrast (which can be estimated with more precision) has assurance MOE of 0.21 and expected MOE of approximately 0.19.  Thus, the sample sizes lead to the results we want.


Cumming, G. (2012). Understanding the New Statistics. New York/London: Routledge.

Sahai, H., & Ageel, M. I. (2000). The analysis of variance. Fixed, Random, and Mixed Models. Boston/Basel/Berlin: Birkhäuser.

Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020-2045.