## The independent and paired T-test in jamovi

This is a short tutorial on comparing two means with the independent and paired t-test in jamovi.

## Description of the data

I have downloaded the dataset from the Mathematics and Statistics Help (MASH) site. The dataset contains data on 78 persons following one of three diets. I will use the dataset to show you how to estimate the difference between two means with the independent t-test analysis and the dependent t-test analysis in jamovi. I will ignore the different diets and focus on gender differences and pre- and post-weight differences instead.

I will focus on two substantive research questions.

1. To what extent does following a diet lead to weight loss?
2. To what extent does the weight lost differ between females and males?

## The paired t-test in jamovi

Let’s start with the first research question. Statistical analysis can be useful for answering the research question, but then we will first have to translate the substantive question into a statistical one. Now, it seems reasonable to assume that if following a diet leads to weight loss, that the typical weight before the diet differs from the typical weight after following the diet. If we assume furthermore that the mean provides a good representation of what is typical, then it is plausible that the mean weight before the diet differs from the mean weight after the diet.

If we are interested in the extent to which means differ, we are – statistically speaking – interested in the extent to which expected values differ. So, our analysis will focus on finding out what our data have to say about the difference between the expected values. More concrete: we focus on the difference between the expected values for weight (measured in kg) before and after the diet. Using conventional symbols, we aim at uncovering quantitative information about the difference \mu_{pre\ weight} – \mu_{post\ weight}.

Since all persons were measured pre and post diet, the measurements are likely to be correlated. Indeed, the sample correlation equals r = .96. We need to take this correlation into account and that is why we use the statistical techniques for estimation and testing that are available in the paired t-test analysis in jamovi.

### Doing the paired t-test in jamovi

In a real research situation, we would of course start with descriptive analyses to figure out what the data seem to suggest about the extent to which a diet leads to weight loss. But now we are just looking at how to obtain the relevant inferential information from jamovi.

I have chosen the following options for the analysis.

Since we are interested in the extent to which following a diet leads to weight loss, it is important to realize that the t-test in itself does not necessarily provide useful information. Why? Because the t-test gives us input to make the decision whether to reject the null-hypothesis that the expected values are equal, and only indirectly provides us with the information about the extent to which the expected values differ, and the latter is of course what we are interested in: we want quantitative information! The more useful information is provided by the estimate of the mean difference and the 95% Confidence Interval.

### The paired t-test output

The relevant output for the t-test and the estimation results are presented in Figure 2.

Let’s start with the t-test. The conventional null-hypothesis is that the expected values of the two variables are equal ( \mu_{pre\ weight} = \mu_{post\ weight}, or \mu_{pre\ weight} – \mu_{post\ weight} = 0. . The alternative hypothesis is that the expected values are not equal. Following convention, we use a significance level of \alpha = .05, so that our decision rule is to reject the null-hypothesis if the p-value is smaller than .05 and to not reject but also not accept the null-hypothesis if the p-value is .05 or larger.

The result of the t-test is t(77) = 13.3, p < .001. Since the p-value is smaller than .05 we reject the null-hypothesis and we decide that the expected values are not equal. In other words, we decide that the population means are not equal. As we said above, this does not answer our research question, so we’d better move on to the estimation results.

The estimated difference between the expected values equals 3.84 kg, 95% CI [3.27, 4.42]. So, we estimate that the difference in expected weights after 10 weeks of following the diet is somewhere between 3.72 and 4.42 kilograms.

### Cohen’s d for the paired design

Jamovi also provides point and interval estimates for Cohen’s d. This version of Cohen’s d is derived by standardizing the mean difference using the standard deviation of the difference scores. Figure 3 presents the relevant output. The estimated value of Cohen’s delta equals 1.51, 95% CI [1.18, 1.83]. Using rules of thumb for the interpretation of Cohen’s d, these results suggest that there may be a very large difference in mean weights of the pre- and postdiet measurements.

There is an alternative conceptualisation of the standardized mean difference. Instead of using the SD of the difference scores, we may use the average of the SDs of the two measurements. See https://small-s.science/2020/12/cohens-d-for-paired-designs/ for an explanation and R-code for the calculation of the CI.

## The independent t-test in jamovi

To answer the second research question, we will have to reframe the substantive question into a statistical one. Just like we assumed above, we will consider the difference between the expected values (or population means) to be the statistical quantity of interest.

The conventional statistical null-hypothesis of the t-test is that the expected value of the variable does not differ between the groups or conditions. In other words, the null-hypothesis is that the two population means are equal. If the test result is significant, we will reject the null-hypothesis and decide that the population means are not equal. Note that this does not really answer the research question. Indeed, we are interested in the extent to which the expected values differ and not in whether we can decide that the difference is not zero. For that reason, estimation results are usually more informative than the results of a significance test.

### Doing the independent t-test in jamovi.

I have chosen the following options for the independent t-test analysis in jamovi.

The above options will give you the results of the independent t-test and, more importantly, the estimation results, both unstandardized and standardized (Cohen’s d).

### The independent t-test output

The relevant output is presented in Figure 5.

The result of the t-test is t(74) = -0.21, p = 0.84. This test result is clearly not significant, so we cannot decide that the population means differ. Importantly, we can also not decide that the population means are equal. That would be an instance of accepting the null-hypothesis and that is not allowed in NHST.

The estimation results (i.e. -0.12 kg, 95% CI [-1.29, 1.04]) make it clear that we should not necessarly believe that the population means are equal. Indeed, even though the estimated difference is only 0.12 kilograms, the CI shows the data to be consistent with differences up to 1 kg in either direction, i.e. with women showing more average weight loss than men or women showing less average weight loss than men.

### Cohen’s d for the independent design

We can also find the standardized mean difference and its CI in the independent t-test output: Cohen’s d = -0.08, 95% CI [-0.50, 0.41]. In this case, Cohen’s d is based on the pooled standard deviaton. According to rules-of-thumb that are used frequently in psychology the estimated effect is negligible to small, but the CI shows the data to be consistent with medium effect sizes in either direction.

## A confidence interval for the correlation coefficient

A confidence interval for the  population correlation coefficient \rho can be obtained with the Fisher-r-to-z transformation.   The steps are as follows.

1. Transform r to a standard normal deviate Z
Z_{xy} = \frac{1}{2}ln\left(\frac{1 + r}{1  –  r}\right), \tag{1}
which is equal to:
Z_{xy} = arctanh(r). \tag{2}
2. Determine the standard error for Z:
s_Z = \sqrt\frac{1}{N  –  3}. \tag{3}
3. Calculate the Margin of Error (MoE) for Z:
MOE_Z = 1.96*s_z. \tag{4}
4. Add to and substract MoE  from Z to obtain a 95% Confidence Interval for Z.
5. Transform the upper and lower limits of the CI for Z to obtain the corresponding limits for \rho, using:
r_Z = \frac{e^{2Z} –  1}{e^{2Z} + 1}, \tag{4}
which is equal to:
r_Z = tanh(Z). \tag{5}

The following R-code does all the work:

conf.int.rho <- function(r, N) {
lims.rho =  tanh(atanh(r) + c(qnorm(.025),
qnorm(.975)) * sqrt(1/(N - 3)))
return(lims.rho)
}



So, if you have r = .50 and N = .50, just run the above function in R to obtain a confidence interval for the correlation coefficient.

conf.int.rho(.50, 50)

## [1] 0.2574879 0.6832563


## Cohen’s d for paired designs

For the paired design, which is traditionally used to obtain data for the paired t-test, we can calculate a standardized mean difference, Cohen’s d, using the average of the standard deviations of the two conditions. Cohen’s d for paired designs can be calculated as follows.

d_{av} =\frac{(M_1 - M_2)}{s_{av}}, \tag{1}

where s_{av} equals

s_{av}= \sqrt{\frac{1}{2}(S^2_1+S^2_2)}. \tag{2}

Now, (1) is of course an estimate of (3), the population value of Cohen’s d for paired designs, but we do not only need a point estimate, but also a confidence interval.

The following R-code can be used to obtain a 95% confidence interval for the estimate of \delta_{av}, the population mean difference standardized by using the average of the two standard deviations:

\delta_{av} = \frac{\mu_1 - \mu_2}{\sqrt{\frac{1}{2}(\sigma_1^2 + \sigma_2^2)}} , \tag{3}

This procedure uses the approximate procedure by Algina & Keselman (2003), which is also used by ESCI (Cumming, 2012; Cumming and Calin-Jageman, 2017), as Kline (2013) explains. The following steps are performed to obtain the 95% confidence interval.

1. Use the obtained t-value of the paired t-test to estimate the non-centrality parameter \lambda. Steps 2 and 3 are for calculating a 95% confidence interval for the non-centrality parameter.
2. Use an iterative procedure to find the non-centrality parameter of the t-distribution for which the observed t-value is the .025 quantile. This is the upper limit of the confidence interval for the non-centrality parameter.
3. Use an iterative procedure to find the non-centrality parameter of the t-distribution for which the observed t-value is the .975 quantile. This is the lower limit of the confidence interval for the non-centrality parameter.
4. To obtain a CI for \delta_{av} multiply the limits of the confidence interval for the non-centrality parameter by the value \sqrt{\frac{2S_D^2}{n(S_1^2+S_2^2)}} where n is the sample size, S_D^2 the variance of the difference scores , and S_1^2 and S_2^2 the variances of the two variables.

The following R-function does all the work. Note that with large potential values for the noncentrality parameter R issues warnings that “full precision has not been achieved in ‘pnt{final}'”. These warnings can be ignored ( I checked many examples against ESCI’s output), but in order to prevent them, I let the optimize function search from the observed value of t to maximally five times its value, and I have included the option to suppress warnings or not (just set warn = TRUE to get the warnings).

ci.d.av <- function(t, n, s.1, s.2, s.diff, warn = FALSE) {
if (t < 1) {
lims <- c(-5, 5)
} else {
lims <- c(-5*abs(t), 5*abs(t))
}
df = n - 1
multiplier = sqrt((2*s.diff^2) / (n*(s.1^2 +  s.2^2)) )
loss <- function(x, prob) (pt(t, df, x) - prob)^2
if (warn == FALSE) {
ul <- suppressWarnings(optimize(loss, lims, prob=.025))$minimum ll <- suppressWarnings(optimize(loss, lims, prob=.975))$minimum
} else {
ul <- optimize(loss, lims, prob=.025)$minimum ll <- optimize(loss, lims, prob=.975)$minimum
}
return(round(c(ll, ul), 4)*multiplier)
}


The arguments of the function are t, the t-value of the t-test testing the hypothesis of equal population means, the sample size (n), the standard deviations (s.1 and s.2) of the two variables and the standard deviation of the difference scores (s.diff).

## Calculating Cohen’s d for paired designs: an example

Here is a quick example.

library(MASS)
set.seed(1234)

# generate random multivariate normal data with sample size n = 20

theData <- mvrnorm(20, c(.5, .0), matrix(c(1, .8, .8, 1), 2, 2))

# calculate the standard deviations

sds <- apply(theData, 2, sd)

# calculate the standard deviation of difference scores

sDiff <- sd(theData[,1] - theData[,2])

# get t.value and a value for d.av
# here I use the output of the t-test in R to obtain t and the mean
# difference score (needed for calculating d.av)

theTest <- t.test(theData[,1], theData[,2], paired=TRUE)
t = theTest$statistic d.av = theTest$estimate / mean(sds)

ci.d.av(t = t, n = 20, s.1 = sds[1], s.2 = sds[2], sDiff)


The results are that the estimate equals d_{av} = 0.87, 95\% \text{CI} [0.51, 1.22].

Alternatively, we can make use of the conf.limits.nct function of the MBESS package (Kelley, 2007a, 2007b), and proceed as follows (using the data generated above).

library(MBESS)

ci.d.av.2 <- function(t, n, s.1, s.2, s.diff) {
df = n - 1
multiplier = sqrt((2*s.diff^2) / (n*(s.1^2 +  s.2^2)) )
unlist(conf.limits.nct(t, df)[c(1,3)])*multiplier
}

ci.d.av.2(t = t, n = 20, s.1 = sds[1], s.2 = sds[2], s.diff = sDiff).


## References

Algina, J. & Keselman, H. J. (2003). Approximate confidence intervals for effect sizes. Educational and Psychological Measurement, 63, 721-734.

Cumming, G. (2012). Understanding the New Statistics. Effect Sizes, Confidence Interval, and Meta-Analysis. New York: Routledge.

Cumming, G. & Calin-Jageman, R. (2017). Introduction fo the New Statistics. Estimation, Open Science, and Beyond. New York: Routledge.

Kelley, K. (2007b). Confidence intervals for standardized effect sizes: Theory, application, and implementation. Journal of Statistical Software20(8), 1-24.

Kline, R.b. (2013). Beyond Significance Testing. Statistics Reform in the Behavioral Sciences. (Second Edition). Washington: APA.

## Linear Trend Analysis with R and SPSS

This is an introduction to contrast analysis for estimating the  linear trend among condition means with R and SPSS . The tutorial focuses on obtaining point and confidence intervals.  The contents of this introduction is based on Maxwell, Delaney, and Kelley (2017) and Rosenthal, Rosnow, and Rubin (2000). I have taken the (invented) data from Haans (2018). The estimation perspective to statistical analysis is aimed at obtaining point and interval estimates of effect sizes. Here, I will use the frequentist perspective of obtaining a point estimate and a 95% Confidence Interval of the relevant effect size. For linear trend analysis, the relevant effect size is the slope coefficient of the linear trend, so, the purpose of the analysis is to estimate the value of the slope and the 95% confidence interval of the estimate. We will use contrast analysis to obtain the relevant data.

[Note: A pdf-file that differs only slightly from this blogpost can be found on my Researchgate page: here; I suggest Haans (2018) for an easy to follow introduction to contrast analysis, which should really help understanding what is being said below].

The references cited above are clear about how to construct contrast coefficients (lambda coefficients) for linear trends (and non-linear trends for that matter) that can be used to perform a significance test for the null-hypothesis that the slope equals zero. Maxwell, Delaney, and Kelley (2017) describe how to obtain a confidence interval for the slope and make clear that to obtain interpretable results from the software we use, we should consider how the linear trend contrast values are scaled. That is, standard software (like SPSS) gives us a point estimate and a confidence interval for the contrast estimate, but depending on how the coefficients are scaled, these estimates are not necessarily interpretable in terms of the slope of the linear trend, as I will make clear
momentarily.

So our goal of the data-analysis is to obtain a point and interval estimate of the slope of the linear trend and the purpose of this contribution is to show how to obtain output that is interpretable as such.

## Planning with assurance, with assurance

Planning for precision requires that we choose a target Margin of Error (MoE; see this post for an introduction to the basic concepts) and a value for assurance, the probability that MoE will not exceed our target MoE.  What your exact target MoE will be depends on your research goals, of course.

Cumming and Calin-Jageman (2017, p. 277) propose a strategy for determining target MoE. You can use this strategy if your research goal is to provide strong evidence that the effect size is non-zero. The strategy is to divide the expected value of the difference by two, and to use that result as your target MoE.

Let’s restrict our attention to the comparison of two means. If the expected difference between the two means is Cohens’s d = .80, the proposed strategy is to set your target MoE at f = .40, which means that your target MoE is set at .40 standard deviations. If you plan for this value of target MoE with 80% assurance, the recommended sample size is n = 55 participants per group. These results are guaranteed to be true, if it is known for a fact that Cohen’s d is .80 and all statistical assumptions apply.

But it is generally not known for a fact that Cohen’s d has a particular value and so we need to answer a non-trivial question: what effect size can we reasonably expect? And, how can we have assurance that the MoE will not exceed half the unknown true effect size? One of the many options we have for answering this question is to conduct a pilot study, estimate the plausible values of the effect size and use these values for sample size planning.  I will describe a strategy that basically mirrors the sample size planning for power approach described by Anderson, Kelley, and Maxwell (2017).

The procedure is as follows. In order to plan with approximately 80% assurance, estimate on the basis of your pilot the 80% confidence interval for the population effect size and use half the value of the lower limit for sample size planning with 90% assurance. This will give you 81% assurance that assurance MoE is no larger than half the unknown true effect size.

## The logic of planning with assurance, with assurance

There are two “problems” we need to consider when estimating the true effect size. The first problem is that there is at least 50% probability of obtaining an overestimate of the true effect size. If that happens, and we take the point estimate of the effect size as input for sample size planning, what we “believe” to be a sample size sufficient for 80% assurance will be a sample size that has less than 80% assurance at least 50% of the times. So, using the point estimate gives assurance MoE for the unknown effect size with less than 50% assurance.

To make it more concrete: suppose the true effect equals .80, and we use n = 25 participants in both groups of the pilot study, the probability is  approximately 50% that the point estimate is above .80. This implies, of course, that we will plan for a value of f > .40, approximately 50% of the times, and so the sample we get will only give us 80% assurance 50% of the times.

The second problem is that the small sample sizes we normally use for pilot studies may give highly imprecise estimates. For instance, with n = 25 participants per group, the expected MoE is f = 0.5687. So, even if we accept 50% assurance, it is highly likely that the point estimate is rather imprecise.

Since we are considering a pilot study,  one of the obvious solutions, increasing the sample size so that expected MoE is likely to be small, is not really an option. But what we can do is to use an estimate that is unlikely to be an overestimate of the true effect size. In particular, we can use as our estimate the lower limit of a confidence interval for the effect size.

Let me explain, by considering the 80% CI  of the effect size estimate. From basic theory it follows that the “true” value of the effect size will be smaller than the lower limit of the 80% confidence interval with probability  equal to 10%. That is, if we calculate a huge number of 80% confidence intervals, each time on the basis of new random samples from the population, the true value of the effect size will be below the lower limit in 10% of the cases. This also means that the lower limit of the interval has 90% probability to not overestimate the true effect size.

This means that  if we take the lower limit of the 80% CI of the pilot estimate as input for our sample size calculations, and if we plan with assurance of .90, we will have 90%*90% = 81% assurance that using the sample size we get from our calculations will have  MoE  no larger than half the true effect size. (Note that for 80% CI’s with negative limits you should choose the upper limit).

## Sample Size planning based on a pilot study

Student of mine recently did a pilot study.  This was a pilot for an experiment investigating the size of the effect of fluency of delivery of a spoken message in a video on Comprehensibility, Persuasiveness and viewers’ Appreciation of the video. The pilot study used two groups of size n = 10, one group watched the fluent video (without ‘eh’) and the other group watched the disfluent video where the speaker used ‘eh’ a lot. The dependent variables were measured on 7-point scales.

Let’s look at the results for the Appreciation variable. The (biased) estimate of Cohen’s d (based on the pooled standard deviation) equals 1.09, 80% CI [0.46, 1.69] (I’ve calculated this using the ci.smd function from the MBESS-package. According to the rules-of-thumb for interpreting Cohen’s d, this can be considered a large effect. (For communication effect studies it can be considered an insanely large effect). However, the CI shows the large imprecision of the result, which is of course what we can expect with sample sizes of n = 10. (Average MoE equals f = 0.95, and according to my rules-of-thumb that is well below what I consider to be borderline precise).

If we use the lower limit of the interval (d = 0.46),  sample size planning with 90% assurance for half that effect (f = 0.23) gives us a sample size equal to n = 162. (Technical note: I planned  for the half-width of the standardized CI of the unstandardized effect size, not for the CI of the standardized effect size; I used my Shiny App for planning assuming an independent groups design with two groups).  As explained, since we used the lower limit of the 80% CI of the pilot and used 90% assurance in planning the sample size, the assurance that MoE will not exceed half the unknown true effect size equals 81%.

## Contrast Analysis with R for factorial designs: A Tutorial

In this post, I want to show how to do contrast analysis with R for factorial designs. We focus on a 2-way between subjects design. A tutorial for factorial within-subjects designs can be found here: https://small-s.science/2019/01/contrast-analysis-with-r-repeated-measures/ . A tutorial for mixed designs (combining within and between subjects factors can be found here: https://small-s.science/2019/04/contrast-analysis-with-r-mixed-design/.

I want to show how we can use R for contrast analysis of an interaction effect in a 2 x 4 between subjects design. The analysis onsiders the effect of students’ seating distance from the teacher and the educational performance of the students: the closer to the teacher the student is seated, the higher the performance. A “theory “explaining the effect is that the effect is mainly caused by the teacher having decreased levels of eye contact with the students sitting farther to the back in the lecture hall.

To test that theory, a experiment was conducted with N = 72 participants attending a lecture. The lecture was given to two independent groups of 36 participants. The first group attended the lecture while the teacher was wearing dark sunglasses, the second group attented the lecture while the teacher was not wearing sunglasses. All participants were randomly assigned to 1 of 4 possible rows, with row 1 being closest to the teacher and row 4 the furthest from the teacher The dependent variable was the score on a 10-item questionnaire about the contents of the lecture. So, we have a 2 by 4 factorial design, with n = 9 participants in each combination of the factor levels.

Here we focus on obtaining an interaction contrast: we will estimate the extent to which the difference between the mean retention score of the participants on the first row and those on the other rows differs between the conditions with and without sunglasses.

## The interaction contrast with SPSS

I’ve downloaded a dataset from the supplementary materials accompanying Haans (2018) from http://pareonline.net/sup/v23n9.zip (Between2by4data.sav) and I ran the following syntax in SPSS:

UNIANOVA retention BY sunglasses location
/LMATRIX = "Interaction contrast"
sunglasses*location 1 -1/3 -1/3 -1/3 -1 1/3 1/3 1/3 intercept 0
/DESIGN= sunglasses*location.

Table 1 is the relevant part of the output.

So, the estimate of the interaction contrasts equals 1.00, 95% CI [-0.332, 2.332]. (See this post for optimizing the sample size to get a more precise estimate than this).

## Contrast analysis with R for factorial designs

Let’s see how we can get the same results with R.

library(MASS)
library(foreign)

theData <- as.data.frame(theData)

attach(theData)

# setting contrasts
contrasts(sunglasses) <- ginv(rbind(c(1, -1)))
contrasts(location)  <- ginv(rbind(c(1, -1/3, -1/3, -1/3),
c(0, 1, -1/2, -1/2), c(0, 0, 1, -1)))

# fitting model

myMod <- lm(retention ~ sunglasses*location)

The code above achieves the following. First the relevant packages are loaded. The MASS package provides the function ginv, which we need to specify custom contrasts and the Foreign package contains the function read.spss, which enables R to read SPSS .sav datafiles.

Getting custom contrast estimates involves calculating the generalized inverse of the contrast matrices for the two factors. Each contrast is specified on the rows of these contrast matrices. For instance, the contrast matrix for the factor location, which has 4 levels, consists of 3 rows and 4 columns. In the above code, the matrix is specified with the function rbind, which basically says that the three contrast weight vectors c(1, -1/3, -1/3, -1/3), c(0, 1, -1/2, -1/2), c(0, 0, 1, -1) form the three rows of the contrast matrix that we use as an argument of the ginv function. (Note that the set of contrasts consists of orthogonal Helmert contrasts).

The last call is our call to the lm-function which estimates the contrasts. Let’s have a look at these estimates.

summary(myMod)
##
## Call:
## lm(formula = retention ~ sunglasses * location)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##     -2     -1      0      1      2
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)             5.3750     0.1443  37.239  < 2e-16 ***
## sunglasses1             1.2500     0.2887   4.330 5.35e-05 ***
## location1               2.1667     0.3333   6.500 1.39e-08 ***
## location2               1.0000     0.3536   2.828  0.00624 **
## location3               2.0000     0.4082   4.899 6.88e-06 ***
## sunglasses1:location1   1.0000     0.6667   1.500  0.13853
## sunglasses1:location2   3.0000     0.7071   4.243 7.26e-05 ***
## sunglasses1:location3   2.0000     0.8165   2.449  0.01705 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.225 on 64 degrees of freedom
## Multiple R-squared:  0.6508, Adjusted R-squared:  0.6126
## F-statistic: 17.04 on 7 and 64 DF,  p-value: 1.7e-12

For the present purposes, we will consider the estimate of the first interaction contrast, which estimates the difference between the means of the first and  the other rows between the with and without sunglasses conditions. So, we will have to look at the sunglasses1:location1 row of the output.

Unsurprisingly, the estimate of the contrast and its standard error are the same as in the SPSS ouput in Table 1. The estimate equals 1.00 and the standard error equals 0.6667.

Note that the residual degrees of freedom equal 64. This is equal to the product of the number of levels of each factor, 2 and 4, and the number of participants (9) per combination of the levels minus 1: df =  2*4*(9 – 1) = 64. We will use these degrees of freedom to obtain a confidence interval of the estimate.

We will calculate the confidence interval by first extracting the contrast estimate and the standard error,  after which we multiply the standard error by the critical value of t with df = 64 and add the result to and substract it from the contrast estimate:

## [1] 246.4563
##
## $objective ## [1] 8.591375e-18  Thus, according to the optimize function we need 247 participants (per group; total N = 988), to get an expected MOE equal to our target MOE. The expected MOE equals 0.4553, which you can confirm by using the MOE function we made above. #### Planning with assurance Although expected MOE is close to our target MOE, there is a probability 50% that the obtained MOE will be larger than our target MOE. In other words, repeated sampling will lead to obtained MOEs larger than what we want. That is to say, we have 50% assurance that our obtained MOE will be at least as small as our target MOE. Planning with assurance means that we aim for a certain specified assurance that our obtained MOE will not exceed our target MOE. For instance, we may want to have 80% assurance that our obtained MOE will not exceed our target MOE. Basically, what we need to do is take the sampling distribution of the estimate of Mean Square Error into account. We use the following formula (see also my post introducing the Precision App for the general formulae: https://the-small-s-scientist.blogspot.nl/2017/04/planning-for-precision.html). where is the assurance expressed in a probability between 0 and 1. Let’s do it in R. Again, the function that calculates assurance MOE is tailored for the specific situation, but it is relatively easy to formulate these functions in a generally applicable way, MOE.gamma = function(n) { df = 4*(n-1) MOE = 2*qt(.975, df)*sqrt(3.324/n*qchisq(.80, df)/df) } loss <- function(n) { (MOE.gamma(n) - 0.4558)^2 } optimize(loss, c(100, 1000))  ##$minimum
## [1] 255.576
##
## \$objective
## [1] 2.900716e-18


Thus, according to the results, we need 256 persons per group (N = 1024 in total) to have a 80% probability of obtaining a MOE not larger than our target MOE. In that case, our expected MOE will be 0.4472.

## Planning for Precision: A confidence interval for the contrast estimate

In a previous post, which can be found here, I described how the relative error variance of a treatment mean can be obtained by combining variance components.  I concluded that post by mentioning how this relative error variance for the treatment mean can be used to obtain the variance of a contrast estimate. In this post, I will discuss a little more how this latter variance can be used to obtain a confidence interval for the contrast estimate, but we take a few steps back and consider a relatively simple study.

The plan of this post is as follows. We will have a look at the analysis of a factorial design and focus on estimating an interaction effect. We will consider both the NHST approach and an estimation approach. We will use both ‘hand calculations’ and SPSS.

An important didactic aspect of this post is to show the connection between the ANOVA source table and estimates of the standard error of a contrast estimate. Understanding that connections helps in understanding one of my planned posts on how obtaining these estimates work in the case of mixed model ANOVA.  See the final section of this post.

The data we will be analyzing are made up. They were specifically designed for an exam in one of the undergraduate courses I teach. The story behind the data is as follows.

### Description of the study

A researcher investigates the extent to which the presence of seductive details in a text influences text comprehension and motivation to read the text. Seductive details are pieces of information in a text that are included to make the text more interesting (for instance by supplying fun-facts about the topic of the text) in order to increase the motivation of the reader to read on in the text. These details are not part of the main points in the text. The motivation to read on may lead to increased understanding of the main points in the text. However, readers with much prior knowledge about the text topic may not profit as much as readers with little prior knowledge with respect to their understanding of the text, simply because their prior knowledge enables them to comprehend the text to an acceptable degree even without the presence of seductive details.

The experiment has two independent factors, the readers’ prior knowledge (1 = Little,  2 =  Much) and the presence of seductive details (1 = Absent, 2 = Present) and two dependent variables, Text comprehension and Motivation.  The experiment has a between participants design (i.e. participant nested within condition).

The research question is how much the effect of seductive details differs between readers with much and readers with little prior knowledge. This means that we are interested in estimating the interaction effect of presence of seductive details and prior knowledge on text comprehension.

### The NHST approach

In order to appreciate the different analytical focus between traditional NHST (as practiced) and an estimation approach, we will first take a look at the NHST approach to the analysis. It may be expected that researchers using that approach perform an ANOVA ritual as a means of answering the research question. Their focus will be on the statistical significance of the interaction effect, and if that interaction is significant the effect of seductive details will be investigated separately for participants with little and participants with much prior knowledge. The latter analysis focuses on whether these simple effects are significant or not. If the interaction effect is not significant, it will be concluded that there is no interaction effect. Of course, besides the interaction effect, the researcher performing the ANOVA ritual will also report the significance of the main effects and will conclude that main effects exist or not exist depending on whether they are significant or not. The more sophisticated version of NHST will also include an effect size estimate (if the corresponding significance test is significant) that is interpreted using rules of thumb.
The two way ANOVA output (including partial eta squared) is as follows.
 Table 1. Output of traditional two-way ANOVA

The results of the analysis will probably be reported as follows.

There was a significant main effect of prior knowledge (F(1, 393) = 39.26, p < .001, partial η2 = .09). Participants with much prior knowledge had a higher mean text comprehension score than the participants with little prior knowledge.  There was no effect of the presence of seductive details (F < 1).  The interaction effect was significant (F(1, 393) = 4.33, p < .05, partial η= .01).

Because of the significant interaction effect, simple effects analyses  were performed to further investigate the interaction. These results show a significant effect of the presence of seductive details for the participants with little knowledge (p < .05), with a higher mean score in the condition with seductive details, but for the participants with much prior knowledge no effect of seductive details was found (p = .38), which explains the interaction. (Note: with a Bonferroni correction for the two simple effects analyses the p-values are p = .08 and p = .74; this will be interpreted as that neither readers with little knowledge nor with much knowledge benefit from the presence of seductive details).

The conclusion from the traditional analysis is that the effect of seductive details differs between readers with little and readers with much prior knowledge. The presence of seductive details only has an effect on the comprehension scores of readers with little prior knowledge of the text topic, in the presence of seductive details text comprehension is higher than in the absence of seductive details. Readers with much prior knowledge do not benefit from the presence of seductive details in a text.

#### Comment on the NHST analysis

The first thing to note is that the NHST conclusion does not really answer the research question. Whereas the research question ask how much the effects differ, the answer to the research question is that a difference exists. This answer is further specified as that there exists an effect in the little knowledge group, but that there is no effect in the much knowledge group.
The second thing to note is that although there is a simple research questions, the report of the results includes five significance tests, while none of them actually address the research question. (Remember it is an how-much question and not a whether-question, the significance tests do not give useful information about the how-much question).
The third thing to notice is that although effect sizes estimates are included (for the significant effects only) they are not interpreted while drawing conclusions. Sometimes you will encounter such interpretations, but usually they have no impact on the answer to the research question. That is, the researcher may include in the report that there is a small interaction effect (using rules-of-thumb for the interpretation of partial eta-squared; .01 = small; .06 = medium, .14 = large), but the smallness of the interaction effect does not play a role in the conclusion (which simply reformulates the (non)significance of the results without mentioning numbers; i.e. that the effect exists (or was found) in one group but not in the other).

As an aside, the null-hypothesis test for the effect of prior knowledge i.e. that the mean comprehension score of readers with little knowledge are equal to the mean comprehension score of readers with much prior knowledge about the text topic seems to me an excellent example of a null-hypothesis that is so implausible that rejecting it is not really informative. Even if used as some sort of manipulation check the real question is the extent to which the groups differ and not whether the difference is exactly zero. That is to say, not every non-zero difference is as reassuring as every other non-zero difference: there should be an amount of difference between the groups below which the group performances can be considered to be practically the same. If a significance tests is used at all, the null-hypothesis should specify that minimum amount of difference.

### Estimating the interaction effect

We will now work towards estimating the interaction effect. We will do that in a number of steps. First, we will estimate the value of the contrasts on the basis of the estimated marginal means provided by the two-way ANOVA and show how the confidence interval of that estimate can be obtained. Second, we will use SPSS to obtain the contrast estimate.
Table 2 contains the descriptives and samples sizes for the groups and the estimated marginal means are presented in Table 3.
 Table 2. Descriptive Statistics
 Table 3. Estimated Marginal Means

Let’s spend a little time exploring the contents of Table 3. The estimated means speak for themselves, hopefully. These are simply estimates of the population means.

The standard errors following the means are used to calculate confidence intervals for the population means. The standard error is based on an estimate of the common population variance (the ANOVA model assumes homogeneity of variance and normally distributed residuals). That estimate of the common variance can be found in Table 1: it is the Mean Square Error. Its estimated value is 3.32, based on 389 degrees of freedom.

The standard errors of the means in Table 3 are simply the square root of the Mean Square Error dvided by the sample size. E.g. the standard error of the mean text comprehension in the group with little knowledge and seductive details absent equals √(3.32/94) = .1879.

The Margin of Error needed to obtain the confidence interval is the critical t-value with 389 degrees of freedom (the df of the estimate of Mean Square Error) multiplied by the standard error of the mean. E.g. the MOE of the first mean is t.975(389)*.1879 = 1.966*.1879 = 0.3694.

The 95%-confidence interval for the first mean is therefore 3.67 +/- 0.3694 = [3.30, 4.04].

#### Contrast estimate

We want to know the extent to which the effect of seductive details differs between readers with little and much prior knowledge. This means that we want to know the difference between the differences. Thus, the difference between the means of the Present (P) and Absent (A) of readers with Much (M) knowledge is subtracted from the difference between the means of the  readers with Little (L) knowledge: (ML+P – ML+A) – (MM+P – MM+A) = ML+P – ML+A – MM+P + MM+A = 4.210 – 3.670 – 4.980 + 5.206 =  0.766.

Our point estimate of the difference between the effect of seductive details for little knowledge readers and for much knowledge is  that the effect is 0.77 points larger in the group with little knowledge.

For the interval estimate we need the estimated standard error of the contrast estimate and a critical value for the central t-statistic. To begin with the latter: the degrees of freedom are the degrees of freedom used to estimate Mean Square Error (df = 389; see Table 1).

The standard error of the contrasts estimate can be obtained by using the variance sum law. That is,  the variance of the sum of two variables is the sum of their variances plus twice the covariance. And the variance of the difference between two variables is the sum of the variances minus twice the covariance. In the independent design, all the covariances between the means are zero, so the variance of the interaction contrast is simply the sum of the variances over the means. The standard error is the square root of this figure. Thus, var(interaction contrast) = 0.1882 + 0.1822 + 0.1852 + 0.1812 = 0.1354, and the standard error of the contrast is the square root of  0.1354 = .3680.

Note that the we have squared the standard errors of the mean. These squared standard error are the same as the relative error variances of the means. (Actually, in a participant nested under treatment condition (a between-subject design) the relative error variance of the mean equals the absolute error variance). More information about the error variance of the mean can be found here: https://the-small-s-scientist.blogspot.nl/2017/05/PFP-variance-components.html.

The Margin of Error of the contrast estimate is therefore t.975(389)*.3680 = 1.966*.3680 = 0.7235. The 95% confidence interval for the contrast estimate is [0.04, 1.49].

Thus, the answer to the research question is that the estimated difference in effect of seductive details between readers with little prior knowledge and readers with much prior knowledge about the text topic equals .77, 95% CI [.04, 1.49].  The 95% confidence interval shows that the estimate is very imprecise, since the limits of the interval are .04, which suggests that the effect of seductive details is essentially similar for the different groups of readers, and 1.49, which shows that the effect of seductive details may be much larger for little knowledge readers than for much knowledge readers.

#### Analysis with SPSS

I think it is easiest to obtain the contrast estimate by modeling the data with one-way ANOVA by including a factor I’ve called ‘independent’. (Note: In this simple case, the parameter estimates output of the independent factorial ANOVA also gives the interaction contrast (including the 95% confidence interval), so there is no actual need to specify contrasts, but I like to have the flexibility of being able to get an estimate that directly expresses what I want to know). This factor has 4 levels: one for each of the combinations of the factors prior knowledge and presence of seductive details: Little-Absent (LA), Little-Present (LP),  Much-Absent (MA), and Much-Present (MP).

The interaction we’re after is the difference between the mean difference between Present and Absent for participants with little knowledge (MLP – MLA) and the mean difference between Present and Absent in the much knowledge group (MMP – MMA).  Thus, the estimate of the interaction (difference between differences) is (MLP – MLA) – (MMP – MMA) = MLP – MLA – MMP + MMA. This can be rewritten as 1*MLP + -1*MLA + -1*MMP + 1*MMA).

The 1’s and -1’s are of course the contrast weights we have to specify in SPSS in order to get the estimate we want. We will have to make sure that the weights correspond to the way in which the order of the means is represented internally in SPSS. That order is LA, LP, MA, MP.  Thus, the contrast weights need to be specified in that order to get the estimate to express what we want in terms of the difference between differences. See the second line in the following SPSS-syntax.

UNIANOVA comprehension BY independent
/CONTRAST(independent)=SPECIAL ( -1 1 1 -1)
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/EMMEANS=TABLES(independent)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/DESIGN=independent.

The relevant output is presented in Table 4. Note that the results are the same as the ‘hand calculations’ described above (I find this very satisfying).

 Table 4. Interaction contrast estimate

#### Comment on the analysis

First note that the answer to the research question has been obtained with a single analysis. The analysis gives us a point estimate of the difference between the differences and a 95% confidence interval. The analysis is to the point to the extent that it gives the quantitative information we seek.
However, although the estimate of the difference between the differences is all the quantitative information we need to answer the how-much-research question, the estimate itself obscures the pattern in the results, in the sense that the estimate itself does not tell us what may be important for theoretical or practical reasons, namely the direction of the effect.  That is, a positive interaction contrast may indicate the difference between an estimated positive effect for one group and an estimated negative effect  in the other group (which is actually the situation in the present example: 0.54 – (-0.23) = 0.77) in the other group).
Of course, we could argue that if you want to know the extent to which the size and direction differ between the groups, then that should be reflected in your research question, for instance, by asking about and estimating the simple effects themselves in stead of focusing on the size of the difference alone, as we have done here.

On the other hand, we could argue that no result  of a statistical analysis should be interpreted in isolation. Thus, there is no problem with interpreting the estimate of 0.77 while referring to the simple effects: the estimated difference between the effects is .77,  95% CI [.04, 1.49], reflecting the difference between an estimated effect of 0.54 in the little knowledge group and an estimated negative effect of -0.23 for much knowledge readers.

But, if the research question is how large is the effect of seductive details for little knowledge readers and high knowledge readers and how much do the effect differ, than that would call for three point estimates and interval estimates. Like: the estimated effect for the little knowledge group equals 0.54. 95% CI [0.03, 1.06], whereas the estimated effect for the much knowledge groups is negative -0.23, 95% CI [-0.73, 0.28]. The difference in effect is therefore 0.77,  95% CI [.04, 1.49].

In all cases, of course, the intervals are so wide that no firm conclusions can be drawn. Consistent with the point estimates are negligibly small positive effects to large positive effects of seductive details for the little knowledge group,  small positive effects to negative effects of seductive details for the much knowledge group and an interaction effect that ranges from negligibly small to very large. In other words, the picture is not at all clear.  (Interpretations of the effect sizes are based on rules of thumb for Cohen’s d. A (biased) estimate of Cohen’s d can be obtained by dividing the point estimate by the square root of Mean Square Error. An approximate confidence interval can be obtained by dividing the end-points of the non-standardized confidence intervals by the square root of Mean Square Error). Of course, we have to keep in mind that 5% of the 95% confidence intervals do not contain the true value of the parameter or contrast we are estimating.

Compare this to the firm (but unwarranted) NHST conclusion that there is a positive effect of seductive details for little knowledge readers (we don’t know whether there is a positive effect, because we can make a type I error if we reject the null) and no effect for much knowledge readers. (Yes, I know that the NHST thought-police forbids interpreting non-significant results as “no effect”, but we are talking about NHST as practiced and empirical research shows that researchers interpret non-significance as no effect).

In any case, the wide confidence intervals show that we could do some work for a replication study in terms of optimizing the precision of our estimates. In a next post, I will show you how we can use our estimate of precision for planning that replication study.

### Summary of the procedure

In (one of the next) posts, I will show that in the case of mixed models ANOVA’s we frequently need to estimate the degrees of freedom in order to be able to obtain MOE for a contrast. But the basic logic remains the same as what we have done in estimating the confidence interval for the interaction contrast.  Please keep in mind the following.
Looking at the ANOVA source table and the traditional ANOVA approach we notice that the interaction effect is tested against Mean Square Error: the F-ratio we use to test the null-hypothesis that both Mean Squares (the interaction MS an Mean Square Error) estimate the common error variance. The F-ratio is formed by dividing the Mean Square associated with the interaction by Mean Square Error.  The probability distribution of that ratio is an F-distribution with 1 (numerator) and 389 (denominator) degrees of freedom.
Mean Square Error is also used to obtain the estimated standard error for the interaction contrast estimate. In the calculation of MOE, the critical value of t was determined on the basis of the degrees of freedom of Mean Square Error.
This is the case in general: the standard error of a contrast is based on the Mean Square Error that is also used to test the corresponding Effect (main or interaction) in an F-test. In a simple two-way ANOVA the same Mean Square Error is used to test all the effects (main an interaction), but that is not generally the case for more complex designs. Also, the degrees of freedom used to obtain a critical t-value for the calculation of MOE are the degrees of freedom of the Mean Square Error used to test an effect.
In the case of a mixed model ANOVA, it is often the case that there is no Mean Square Error available  to directly test an effect. The consequence of this is that we work with linear combinations of Mean Squares to obtain a suitable Mean Square Error for an effect and that we need to estimate the degrees of freedom. But the general logic is the same: the Mean Square Error that is obtained by a linear combination of Mean Squares is also used to obtain the standard error for the contrast estimate and the estimated degrees of freedom are the degrees of freedom used to obtain a critical value for t in the calculation of the Margin of Error.
I will try to write about all of that soon.

## Planning for precision with samples of participants and items

Many experiments involve the (quasi-)random selection of both participants and items. Westfall et al. (2014) provide a Shiny-app for power-calculations for five different experimental designs with selections of participants and items. Here I want to present my own Shiny-app for planning for precision of contrast estimates (for the comparison of up to four groups) in these experimental designs.  The app can be found here: https://gmulder.shinyapps.io/precision/

(Note: I have taken the code of Westfall’s app and added code or modified existing code to get precision estimates in stead of power; so, without Westfall’s app, my own modified version would never have existed).

The plan for this post is as follows. I will present the general theoretical background (mixed model ANOVA combined with ideas from Generalizability Theory) by considering comparing three groups in a counter balanced design.
Note 1: This post uses mathjax, so it’s probably unreadable on mobile devices. Note: a (tidied up) version (pdf) of this post can be downloaded here: download the pdf
Note 2: For simulation studies testing the procedure go here: https://the-small-s-scientist.blogspot.nl/2017/05/planning-for-precision-simulation.html
Note 3: I use the terms stimulus and item interchangeably; have to correct this to make things more readable and comparable to Westfall et al. (2014).
Note 4: If you do not like the technical details you can skip to an illustration of the app at the end of the post.

## The general idea

The focus of planning for precision is to try to minimize the half-width of a 95%-confidence interval for a comparison of means (in our case). Following Cumming’s (2012) terminology I will call this half-width the Margin of Error (MOE). The actual purpose of the app is to find required sample sizes for participants and items that have a high probability (‘assurance’) of obtaining a MOE of some pre-specified value.

## Expected MOE for a contrast

For a contrast estimate   we have the following expression for the expected MOE.

where is the standard error of the contrast estimate. Of course, both the standard error and the df are functions of the sample sizes.

For the standard error of a contrast with contrast weights through , where a is the number of treatment conditions,  we use the following general expression.

where n is the per treatment sample size (i.e. the number of participants per treatment condition times the number of items per treatment condition) and the within treatment variance (we assume homogeneity of variance).

For a simple example take an independent samples design with n = 20 participants responding to 1 item in one of two possible treatment conditions (this is basically the set up for the independent t-test). Suppose we have contrast weights and , and , the standard error for this contrast equals .  (Note that this is simply the standard error of the difference between two means as used in the independent samples t-test).

In this simple example, df is the total sample size (N = n*a) minus the number of treatment conditions (a), thus . The expected MOE for this design is therefore, . Note that using these figures entails that 95% of the contrast estimates will take values between the true contrast value plus and minus the expected MOE: .

For the three groups case, and contrast weights {}, the same sample sizes and within treatment variance gives .

(If you like, I’ve written a little document with derivation of the variance of selected contrast estimates in the fully crossed design for the comparison of two and three group means. That document can be found here: https://drive.google.com/open?id=0B4k88F8PMfAhaEw2blBveE96VlU)

The focus of planning for precision is to try to find sample sizes that minimize expected MOE to a certain target MOE.  The app uses an optimization function that minimizes the squared difference between expected MOE and target MOE to find the optimal (minimal) sample sizes required.

### Planning with assurance

If the expected MOE is equal to target MOE,  the sample estimate of MOE will be larger than your target MOE in 50% of replication experiments. This is why we plan with assurance (terminology from Cumming, 2012).  For instance, we may want to have a 95% probability (95% assurance) that the estimated MOE will not exceed our target MOE.

In order to plan with assurance, we need (an approximation of) the sampling distribution of MOE. In the ANOVA approach that underlies the app, this boils down to the distribution of estimates of

thus

In terms of the two-groups independent samples design above: the expected MOE equals 2.8629. But, with df = 38, there is an 80% probability (assurance) that the estimated MOE will be no larger than:

Note that the 45.07628 is the quantile in the chi-squared (df = 38) distribution. That is .

The app let’s  you specify a target MOE and a value for the desired assurance () and will find the combination of number of participants and items that will give an estimated MOE no larger than target MOE in % of the replication experiments.

## The mixed model ANOVA approach

Basically, what we need to plan for precision is to able to specify and the degrees of freedom. We will specify as a function of variance components and use the Satterthwaite procedure to approximate the degrees of freedom by means of a linear combination of expected mean squares. I will illustrate the approach with a three-treatment conditions counterbalanced design.

### A description of the design

Suppose we are interested in estimating the differences between three group means. We formulate two contrasts: one contrast estimates the mean difference between the first group and the average of the means of the second and third groups. The weights of the contrasts are respectively {1, -1/2, -1/2}, and {0, 1, -1}.

We are planning to use a counterbalanced design with a number of participants equal to p and a sample of items of size q. In the design we randomly assign participants to a groups, where a is the number of conditions, and randomly assign items to a lists (see Westfall et al., 2014 for more details about this design). All the groups are exposed to all lists of stimuli, but the groups are exposed to different lists in each condition. The number of group by list combinations equals , and the number of observations in each group by list combination equals . The condition means are estimated by combining a group by list combinations each of which composed of different participants and stimuli. The total number of observations per condition is therefore, .

### The ANOVA model

The ANOVA model for this design is

where the effect is a constant treatment effect (it’s a fixed effect), and the other effect are random effects with zero mean and variances (participants), (items), (person by treatment interaction), (item by treatment interaction) and (error variance confounded with the person by item interaction). Note: in Table 1 below, is (for technical reasons not important for this blogpost) presented as this confounding .

We make use of the following restrictions (Sahai & Ageel, 2000): , and . The latter two restrictions make the interaction-effects correlated across conditions (i,e. the effects of person and treatment are correlated across condition for the same person, likewise the interaction effects of item and treatment are correlated across conditons for the same item. Interaction effects of different participants and items are uncorrelated). The covariances between the random effects are assumed to be zero.

Under this model (and restrictions) , and . Furthermore, the covariance of the interactions between treatment and participant or between treatment and item for the same participant or item are for participants and for items.

### Within treatment variance

In order to obtain an expectation for MOE, we take the expected mean squares to get an expression or the expected within treatment variance . These expected means squares are presented in Table 1.

The expected within treatment variance can be found in the Treatment row in Table 1. It is comprised of all the components to the right of the component associated with the treatment effect (). Thus, . Note that the latter equals the sum of the expected mean squares of the Treatment by Participant () and the Treatment by Item () interactions, minus the expected mean square associated with Error ().

### Degrees of freedom

The second ingredient we need in order to obtain expected MOE are the degrees of freedom that are used to estimate the within treatment variance. In the ANOVA approach the within treatment variance is estimated by a linear combination of mean squares (as described in the last sentence of the previous section. This linear combination is also used to obtain approximate degrees of freedom using the Satterthwaite procedure:

1.

### Expected MOE

(Note: I can’t seem to get mathjax to generate align environments or equation arrays, so the following is ugly; Note to self: next time use R-studio or Lyx to generate R-html or an equivalent format).

The expected value of MOE for the contrasts in the counter balanced design is:

### Finally an example

Suppose we the scores in three conditions are normally distributed with (total) variances . Suppose furthermore, that 10% of the variance can be attributed to treatment by participant interaction, 10% of the variance to the treatment by item interaction and 40% of the variance to the error confounded with the participant by item interaction. (which leaves 40% of the total variance attributable to participant and item variance.

Thus, we have , , and . Our target MOE is .25, and we plan to use the counterbalanced design with p = 30 participants, and q = 15 items (stimuli).

Due to the model restrictions presented above we have , , and .

The value of is therefore, , and the approximate df equal .

For the first contrast, with weights {1, -1/2. -1/2}, then, the Expected value for the Margin of Error is .

For the second contrast, with weights {0, 1, -1}, the Expected value of the Margin of Error is

Thus, using p = 30 participants, and q = 15 items (stimuli) will not lead to an expected MOE larger than the target MOE of .25.

We can use the app to find the required number of participants and items for a given target MOE. If the number of groups is larger than two, the app uses the contrast estimate with the largest expected MOE to calculate the sample sizes (in the default setting the one comparing only two group means). The reasoning is that if the least precise estimate (in terms of MOE) meets our target precision, the other ones meet our target precision as well.

## Using the app

I’ve included lot’ of comments in the app itself, but please ignore references to a manual (does not exist, yet, except in Dutch) or an article (no idea whether or not I’ll be able to finish the write-up anytime soon). I hope the app is pretty straightforward. Just take a look at  https://gmulder.shinyapps.io/precision/, but the basic idea is:
– Choose one of five designs
– Supply the number of treatment conditions
– Specify contrast(weights) (or use the default ones)
– Supply target MOE and assurance
– Supply values of variance components (read (e,g,) Westfall, et al, 2014, for more details).
– Supply a number of participants and items
– Choose run precision analysis with current values or
– Choose get sample sizes. (The app gives two solutions: one minimizes the number of participants and the other minimizes the number of stimuli/items). NOTE: the number of stimuli is always greater than or equal to 10 and the number of participants is always greater than or equal to 20.

### An illustration

Take the example above. Out target MOE equals .25, and we want insurance of .80 to get an estimated MOE of no larger than .25. We use a counter-balanced design with three conditions, and want to estimate two contrasts: one comparing the first mean with the average of means two and three, and the other contrast compares the second mean with the third mean. We can use the default contrasts.
For the variance components, we use the default values provided by Westfall et al. (2014) for the variance components. These are also the default values in the app (so we don’t need to change anything now).
Let’s see what happens when we propose to use p = 30 participants and q = 15 items/stimuli.
Here is part of a screenshot from the app:
These results show that the expected MOE for the first contrast (comparing the first mean with the average of the other means) equals 0.3290, and assurance MOE for the same contrasts equals 0.3576. Remember that we specified the assurance as .80. So, this means that 80% of the replication experiments give estimated MOE as large as or smaller than 0.3576. But we want that to be at most 0.2500.  Thus, 30 participants and 15 items do suffice for our purposes.
Let’s use to app to get sample sizes. The results are as follows.

The app promises that using 25 stimuli combined with 290 participants or 25 participants and 290 items will do the trick (the symmetry of these results are due to the fact that the interaction components are equal; both the treatment by participant and the treatment by stimulus interaction component equal .10).  Since we have 3 treatment conditions using 290 participants or stimuli is a little awkward, so I suggest to use 291 (equals 97 participants per group or 97 items per list). (300 is a much nicer figure of course). Likewise, as it is hard to equally divide 25 stimuli or participants over three lists or groups, use a multiple of three (say: 27).

If we input the suggest sample sizes in the app, we see the following results if we choose the run precision analysis  with current values.

As you can see: Assurance MOE is close to 0.25 (.24) for the second contrast (the least precise one), so 80% of replication experiments will get estimated MOE of 0.25 (.24) or smaller. The expected precision is 0.22. The first contrast (which can be estimated with more precision) has assurance MOE of 0.21 and expected MOE of approximately 0.19.  Thus, the sample sizes lead to the results we want.

### References

Cumming, G. (2012). Understanding the New Statistics. New York/London: Routledge.

Sahai, H., & Ageel, M. I. (2000). The analysis of variance. Fixed, Random, and Mixed Models. Boston/Basel/Berlin: Birkhäuser.

Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020-2045.