Planning for a precise slope estimate in simple regression

In this post, I will show you a way of determining a sample size for obtaining a precise estimate of the slope $\beta_1$ of the simple linear regression equation $\hat{Y_i} = \beta_0 + \beta_1X_i$ . The basic ingredients we need for sample size planning are a measure of the precision, a way to determine the quantiles of the sampling distribution of our measure of precision, and a way to calculate sample sizes.

As our measure of precision we choose the Margin of Error (MOE), which is the half-width of the 95% confidence interval of our estimate (see: Cumming, 2012; Cumming & Calin-Jageman, 2017; see also www.thenewstatistics.com).

The distribution of the margin of error of the regression slope

In the case of simple linear regression, assuming normality and homogeneity of variance, MOE is $t_{.975}\sigma_{\hat{\beta_1}}$ , where $t_{.975}$ , is the .975 quantile of the central t-distribution with $N - 2$ degrees of freedom, and $\sigma_{\hat{\beta_1}}$ is the standard error of the estimate of $\beta_1$ .

An expression of the squared standard error of the estimate of $\beta_1$ is $\frac{\sigma^2_{Y|X}}{\sum{(X_i - \bar{X})}^2}$ (Wilcox, 2017): the variance of Y given X divided by the sum of squared errors of X. The variance $\sigma^2_{Y|X}$ equals $\sigma^2_y(1 - \rho^2_{YX})$ , the variance of Y multiplied by 1 minus the squared population correlation between Y and X, and it is estimated with the residual variance $\frac{\sum{(Y -\hat{Y})^2}}{df_e}$ , where $df_e = N - 2$ .

The estimated squared standard error is given in (1)

(1) $\hat{\sigma}_{\hat{\beta_{1}}}^{2}=\frac{\sum(Y-\hat{Y})^{2}/df_{e}}{\sum(X-\bar{X})^{2}}.$

With respect to the sampling distribution of MOE, we first note the following. The distribution of estimates of the residual variance in the numerator of (1) is a scaled $\chi^2$ -distribution:

$\frac{\sum(Y-\hat{Y})^{2}}{\sigma_{y}^{2}(1-\rho^{2})}\sim\chi^{2}(df_{e}),$

thus

$\frac{\sum(Y-\hat{Y})^{2}}{df_{e}}\sim\frac{\sigma_{y}^{2}(1-\rho^{2})\chi^{2}(df_{e})}{df_{e}}.$

Second, we note that

$\frac{\sum(X-\bar{X})^{2}}{\sigma_{X}^{2}}\sim\chi^{2}(df),$

where $df = N - 1$ , therefore

$\sum(X-\bar{X})^{2}\sim\sigma_{X}^{2}\chi^{2}(df).$

Alternatively, since $\sum{(X - \bar{X})^2} = df\sigma^2_X$ , and multiplying by 1 ( $\frac{df}{df}$ ).

$df\sigma_{X}^{2}\sim df\sigma_{X}^{2}\chi^{2}(df)/df.$

In terms of the sampling distribution of (1), then, we have the ratio of two (scaled) $\chi^2$ distributions, one with $df_e = N - 2$ degrees of freedom, and one with $df = N - 1$ degrees of freedom. Or something like:

$\hat{\sigma}_{\hat{\beta_{1}}}^{2}\sim\frac{\sigma_{y}^{2}(1-\rho^{2})\chi^{2}(df_{e})/df_{e}}{df\sigma_{X}^{2}\chi^{2}(df)/df}=\frac{\sigma_{y}^{2}(1-\rho^{2})}{df\sigma_{X}^{2}}\frac{\chi^{2}(df_{e})/df_{e}}{\chi^{2}(df)/df}=\frac{\sigma_{y}^{2}(1-\rho^{2})F(df_{e,}df)}{df\sigma_{X}^{2}},$

which means that the sampling distribution of MOE is:

(2) $\hat{MOE}\sim t_{.975}(N-2)\sqrt{\frac{\sigma_{y}^{2}(1-\rho^{2})F(N-2,N-1)}{(N-1)\sigma_{X}^{2}}}.$

This last equation, that is (2), can be used to obtain quantiles of the sampling distribution of MOE, which enables us to determine assurance MOE, that is the value of MOE that under repeated sampling will not exceed a target value with a given probability. For instance, if we want to know the .80 quantile of estimates of MOE, that is, assurance is .80, we determine the .80 quantile of the (central) F-distribution with N – 2 and N – 1 degrees of freedom and fill in (2) to obtain a value of MOE that will not be exceeded in 80% of replication experiments.

For instance, suppose $\sigma^2_Y = 1$ , $\sigma^2_X = 1$ , $\rho = .50$ , $N = 100$ , and assurance is .80, then according to (2), 80% of estimated MOEs will not exceed the value given by:

vary = 1
varx = 1
rho = .5
N = 100 
dfe = N - 2
dfx - N - 1
assu = .80
t = qt(.975, dfe)
MOE.80 = t*sqrt(vary*(1 - rho^2)*qf(.80, dfe, dfx)/(dfx*varx))
MOE.80

## [1] 0.1880535

What does a quick simulation study tell us?

A quick simulation study may be used to check whether this is at all accurate. And, yes, the estimated quantile from the simulation study is pretty close to what we would expect based on (2). If you run the code below, the estimate equals 0.1878628.

library(MASS)
set.seed(355)
m = c(0, 0)

# note: s below is the variance-covariance matrix. In this case,
# rho and the cov(y, x) have the same values
# otherwise: rho = cov(x, y)/sqrt(varY*VarX) (to be used in the 
# functions that calculate MOE)
# equivalently, cov(x, y) = rho*sqrt(varY*varX) (to be used
# in the specification of the variance-covariance matrix for 
#generating bivariate normal variates)

s = matrix(c(1, .5, .5, 1), 2, 2)
se <- rep(10000, 0)
for (i in 1:10000) {
theData <- mvrnorm(100, m, s)
mod <- lm(theData[,1] ~ theData[,2])
se[i] <- summary(mod)$coefficients[4]
}
MOE = qt(.975, 98)*se
quantile(MOE, .80)

##       80% 
## 0.1878628

Planning for precision

If we want to plan for precision we can do the following. We start by making a function that calculates the assurance quantile of the sampling distribution of MOE described in (2). Then we formulate a squared cost function, which we will optimize for the sample sizeusing the optimize function in R.

Suppose we want to plan for a target MOE of .10 with 80% assurance.We may do the following.

vary = 1
varx = 1
rho = .5
assu = .80
tMOE = .10

MOE.assu = function(n, vary, varx, rho, assu) {
        varY.X = vary*(1 - rho^2)
        dfe = n - 2
        dfx = n - 1
        t = qt(.975, dfe)
        q.assu = qf(assu, dfe, dfx)
        MOE = t*sqrt(varY.X*q.assu/(dfx * varx))
        return(MOE)
}

cost = function(x, tMOE) { 
cost = (MOE.assu(x, vary=vary, varx=varx, rho=rho, assu=assu) 
- tMOE)^2
}

#note samplesize is at least 40, at most 5000. 
#note that since we already know that N = 100 is not enough
#in stead of 40 we might just as well set N = 100 at the lower
#limit of the interval
(samplesize = ceiling(optimize(cost, interval=c(40, 5000), 
tMOE = tMOE)$minimum))

## [1] 321

#check the result: 
MOE.assu(samplesize, vary, varx, rho, assu)

## [1] 0.09984381

Let’s simulate with the proposed sample size

Let’s check it with a simulation study. The value of estimated .80 of estimates of MOE is 0.1007269 (if you run the below code with random seed 335), which is pretty close to what we would expect based on (2).

set.seed(355)
m = c(0, 0)

# note: s below is the variance-covariance matrix. In this case,
# rho and the cov(y, x) have the same values
# otherwise: rho = cov(x, y)/sqrt(varY*VarX) (to be used in the 
# functions that calculate MOE)
# equivalently, cov(x, y) = rho*sqrt(varY*varX) (to be used
# in the specification of the variance-covariance matrix for 
# generating bivariate normal variates)

s = matrix(c(1, .5, .5, 1), 2, 2)
se <- rep(10000, 0)
samplesize = 321
for (i in 1:10000) {
theData <- mvrnorm(samplesize, m, s)
mod <- lm(theData[,1] ~ theData[,2])
se[i] <- summary(mod)$coefficients[4]
}
MOE = qt(.975, 98)*se
quantile(MOE, .80)

##       80% 
## 0.1007269

References

Cumming, G. (2012). Understanding the New Statistics. Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge
Cumming, G., & Calin-Jageman, R. (2017). Introduction to the New Statistics: Estimation, Open Science, and Beyond. New York: Routledge.
Wilcox, R. (2017). Understanding and Applying Basic Statistical Methods using R. Hoboken, New Jersey: John Wiley and Sons.