Planning for precision with samples of participants and items

Many experiments involve the (quasi-)random selection of both participants and items. Westfall et al. (2014) provide a Shiny-app for power-calculations for five different experimental designs with selections of participants and items. Here I want to present my own Shiny-app for planning for precision of contrast estimates (for the comparison of up to four groups) in these experimental designs.  The app can be found here: https://gmulder.shinyapps.io/precision/

(Note: I have taken the code of Westfall’s app and added code or modified existing code to get precision estimates instead of power; so, without Westfall’s app, my own modified version would never have existed).

The plan for this post is as follows. I will present the general theoretical background (mixed model ANOVA combined with ideas from Generalizability Theory) by considering the comparison of three groups in a counterbalanced design.
Note 1: This post uses MathJax, so it’s probably unreadable on mobile devices. A (tidied up) pdf version of this post is available for download.
Note 2: For simulation studies testing the procedure go here: https://the-small-s-scientist.blogspot.nl/2017/05/planning-for-precision-simulation.html
Note 3: I use the terms stimulus and item interchangeably; have to correct this to make things more readable and comparable to Westfall et al. (2014).
Note 4: If you do not like the technical details you can skip to an illustration of the app at the end of the post.

The general idea

The focus of planning for precision is to try to minimize the half-width of a 95%-confidence interval for a comparison of means (in our case). Following Cumming’s (2012) terminology I will call this half-width the Margin of Error (MOE). The actual purpose of the app is to find required sample sizes for participants and items that have a high probability (‘assurance’) of obtaining a MOE of some pre-specified value.

Expected MOE for a contrast

For a contrast estimate $\hat{\psi}$ we have the following expression for the expected MOE.

$$E(MOE) = t_{.975}(df)\,\sigma_{\hat{\psi}},$$

where $\sigma_{\hat{\psi}}$ is the standard error of the contrast estimate. Of course, both the standard error and the df are functions of the sample sizes.

For the standard error of a contrast with contrast weights $c_1$ through $c_a$, where a is the number of treatment conditions, we use the following general expression.

$$\sigma_{\hat{\psi}} = \sqrt{\sum_{i=1}^{a} c_i^2 \frac{\sigma^2_{within}}{n}},$$

where n is the per treatment sample size (i.e. the number of participants per treatment condition times the number of items per treatment condition) and $\sigma^2_{within}$ the within treatment variance (we assume homogeneity of variance).

For a simple example take an independent samples design with n = 20 participants responding to 1 item in one of two possible treatment conditions (this is basically the set up for the independent t-test). Suppose we have contrast weights $c_1 = 1$ and $c_2 = -1$, and $\sigma^2_{within} = 20$, the standard error for this contrast equals $\sigma_{\hat{\psi}} = \sqrt{(1^2 + (-1)^2)\frac{20}{20}} = \sqrt{2} = 1.4142$. (Note that this is simply the standard error of the difference between two means as used in the independent samples t-test).

In this simple example, df is the total sample size (N = n*a) minus the number of treatment conditions (a), thus $df = 40 - 2 = 38$. The expected MOE for this design is therefore $E(MOE) = t_{.975}(38) \times 1.4142 = 2.0244 \times 1.4142 = 2.8629$. Note that using these figures entails that 95% of the contrast estimates will take values between the true contrast value plus and minus the expected MOE: $\psi \pm 2.8629$.

For the three groups case, and contrast weights {1, -1/2, -1/2}, the same sample sizes and within treatment variance gives $E(MOE) = t_{.975}(57)\sqrt{1.5 \times \frac{20}{20}} = 2.0025 \times 1.2247 = 2.4525$.
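The arithmetic of these small examples can be checked with a short script. This is a minimal sketch in Python using only the standard library, not the code behind the app: the helper names (`t_quantile`, `expected_moe`) are made up for illustration, and the t quantile is obtained by plain numerical integration and bisection. The within treatment variance of 20 is an assumption (it is the value that reproduces the expected MOE of 2.8629 for the two-group design), and so are the three-group weights {1, -1/2, -1/2}.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """CDF for x >= 0: 0.5 plus Simpson's rule over [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(prob, df):
    """Quantile for prob > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def expected_moe(weights, var_within, n, df):
    """Expected MOE = t_.975(df) * sqrt(sum(c_i^2) * var_within / n)."""
    se = math.sqrt(sum(c * c for c in weights) * var_within / n)
    return t_quantile(0.975, df) * se


# Two groups, n = 20 per group, df = 2 * 20 - 2 = 38; the variance of 20 is assumed.
moe_two = expected_moe([1, -1], var_within=20, n=20, df=38)
# Three groups with ASSUMED weights {1, -1/2, -1/2}, df = 3 * 20 - 3 = 57.
moe_three = expected_moe([1, -0.5, -0.5], var_within=20, n=20, df=57)
print(round(moe_two, 4), round(moe_three, 4))
```

The first printed value should agree with the expected MOE of 2.8629 for the two-group design. In practice a statistics library would supply the t quantile directly; the hand-rolled version is only there to keep the sketch free of dependencies.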

(If you like, I’ve written a little document with derivation of the variance of selected contrast estimates in the fully crossed design for the comparison of two and three group means. That document can be found here: https://drive.google.com/open?id=0B4k88F8PMfAhaEw2blBveE96VlU)

The aim of planning for precision is to find the smallest sample sizes for which the expected MOE equals a certain target MOE. The app uses an optimization function that minimizes the squared difference between expected MOE and target MOE to find the optimal (minimal) sample sizes required.

Planning with assurance

If the expected MOE is equal to target MOE,  the sample estimate of MOE will be larger than your target MOE in 50% of replication experiments. This is why we plan with assurance (terminology from Cumming, 2012).  For instance, we may want to have a 95% probability (95% assurance) that the estimated MOE will not exceed our target MOE.

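In order to plan with assurance, we need (an approximation of) the sampling distribution of MOE. In the ANOVA approach that underlies the app, this boils down to the distribution of estimates of $\sigma^2_{within}$:

$$\frac{df\,\hat{\sigma}^2_{within}}{\sigma^2_{within}} \sim \chi^2(df),$$

thus

$$P\left(\widehat{MOE} \leq t_{.975}(df)\sqrt{\sum_{i=1}^{a} c_i^2 \frac{\sigma^2_{within}}{n} \cdot \frac{\chi^2_{\gamma}(df)}{df}}\right) = \gamma,$$

where $\chi^2_{\gamma}(df)$ is the $\gamma$ quantile of the chi-squared distribution with df degrees of freedom.

In terms of the two-groups independent samples design above: the expected MOE equals 2.8629. But, with df = 38, there is an 80% probability (assurance) that the estimated MOE will be no larger than:

$$t_{.975}(38)\sqrt{2 \times \frac{20}{20} \times \frac{45.07628}{38}} = 2.0244 \times 1.5402 = 3.1181.$$

Note that the 45.07628 is the .80 quantile in the chi-squared (df = 38) distribution. That is, $P(\chi^2_{38} \leq 45.07628) = .80$.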
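The step from expected MOE to MOE with assurance can be sketched in a few lines. This is a rough Python illustration with standard-library code only, not the app’s implementation: the function names are hypothetical, and the chi-squared quantile is found by bisection on a series expansion of the CDF. It takes the expected MOE of 2.8629 and df = 38 from the example as given.

```python
import math


def chi2_cdf(x, df):
    """Chi-squared CDF via the power series of the regularized incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(prob, df):
    """Quantile found by bisection on the CDF."""
    lo, hi = 0.0, df + 10 * math.sqrt(2 * df) + 10
    for _ in range(80):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def assurance_moe(expected_moe, df, assurance):
    """Scale the expected MOE by sqrt(chi-squared quantile / df)."""
    return expected_moe * math.sqrt(chi2_quantile(assurance, df) / df)


quantile_80 = chi2_quantile(0.80, 38)
moe_80 = assurance_moe(2.8629, df=38, assurance=0.80)
print(round(quantile_80, 5), round(moe_80, 4))
```

The scaling factor depends only on the df and the desired assurance, so raising the assurance (say from .80 to .95) simply moves to a higher chi-squared quantile and therefore to a larger MOE to plan for.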

The app lets you specify a target MOE and a value for the desired assurance ($\gamma$) and will find the combination of number of participants and items that will give an estimated MOE no larger than target MOE in $100\gamma$% of the replication experiments.

The mixed model ANOVA approach

Basically, what we need to plan for precision is to be able to specify $\sigma^2_{within}$ and the degrees of freedom. We will specify $\sigma^2_{within}$ as a function of variance components and use the Satterthwaite procedure to approximate the degrees of freedom by means of a linear combination of expected mean squares. I will illustrate the approach with a three-treatment conditions counterbalanced design.

A description of the design

Suppose we are interested in estimating the differences between three group means. We formulate two contrasts: one contrast estimates the mean difference between the first group and the average of the means of the second and third groups, and the other estimates the mean difference between the second and the third group. The weights of the contrasts are respectively {1, -1/2, -1/2}, and {0, 1, -1}.

We are planning to use a counterbalanced design with a number of participants equal to p and a sample of items of size q. In the design we randomly assign participants to a groups, where a is the number of conditions, and randomly assign items to a lists (see Westfall et al., 2014 for more details about this design). All the groups are exposed to all lists of stimuli, but the groups are exposed to different lists in each condition. The number of group by list combinations equals $a^2$, and the number of observations in each group by list combination equals $\frac{p}{a} \cdot \frac{q}{a} = \frac{pq}{a^2}$. The condition means are estimated by combining a group by list combinations, each of which is composed of different participants and stimuli. The total number of observations per condition is therefore $a \frac{pq}{a^2} = \frac{pq}{a}$.

The ANOVA model

The ANOVA model for this design is

$$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + e_{ijk},$$

where the effect $\alpha_i$ is a constant treatment effect (it’s a fixed effect), and the other effects are random effects with zero mean and variances $\sigma^2_{\beta}$ (participants), $\sigma^2_{\gamma}$ (items), $\sigma^2_{\alpha\beta}$ (person by treatment interaction), $\sigma^2_{\alpha\gamma}$ (item by treatment interaction) and $\sigma^2_e$ (error variance confounded with the person by item interaction). Note: in Table 1 below, $\sigma^2_e$ is (for technical reasons not important for this blogpost) presented as this confounding.

We make use of the following restrictions (Sahai & Ageel, 2000): $\sum_i \alpha_i = 0$, $\sum_i (\alpha\beta)_{ij} = 0$ for every participant j, and $\sum_i (\alpha\gamma)_{ik} = 0$ for every item k. The latter two restrictions make the interaction-effects correlated across conditions (i.e. the interaction effects of person and treatment are correlated across conditions for the same person; likewise, the interaction effects of item and treatment are correlated across conditions for the same item. Interaction effects of different participants and items are uncorrelated). The covariances between the random effects are assumed to be zero.

Under this model (and restrictions) $var((\alpha\beta)_{ij}) = \frac{a-1}{a}\sigma^2_{\alpha\beta}$, and $var((\alpha\gamma)_{ik}) = \frac{a-1}{a}\sigma^2_{\alpha\gamma}$. Furthermore, the covariances of the interactions between treatment and participant or between treatment and item for the same participant or item are $-\frac{1}{a}\sigma^2_{\alpha\beta}$ for participants and $-\frac{1}{a}\sigma^2_{\alpha\gamma}$ for items.

Within treatment variance

In order to obtain an expectation for MOE, we take the expected mean squares to get an expression for the expected within treatment variance $\sigma^2_{within}$. These expected mean squares are presented in Table 1.

The expected within treatment variance can be found in the Treatment row in Table 1. It is comprised of all the components to the right of the component associated with the treatment effect. Thus, $\sigma^2_{within} = \sigma^2_e + \frac{q}{a}\sigma^2_{\alpha\beta} + \frac{p}{a}\sigma^2_{\alpha\gamma}$. Note that the latter equals the sum of the expected mean squares of the Treatment by Participant ($\sigma^2_e + \frac{q}{a}\sigma^2_{\alpha\beta}$) and the Treatment by Item ($\sigma^2_e + \frac{p}{a}\sigma^2_{\alpha\gamma}$) interactions, minus the expected mean square associated with Error ($\sigma^2_e$).

Degrees of freedom

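The second ingredient we need in order to obtain expected MOE are the degrees of freedom that are used to estimate the within treatment variance. In the ANOVA approach the within treatment variance is estimated by a linear combination of mean squares (as described in the last sentence of the previous section). This linear combination is also used to obtain approximate degrees of freedom using the Satterthwaite procedure:

$$df = \frac{\left(E(MS_{\alpha\beta}) + E(MS_{\alpha\gamma}) - E(MS_e)\right)^2}{\frac{E(MS_{\alpha\beta})^2}{df_{\alpha\beta}} + \frac{E(MS_{\alpha\gamma})^2}{df_{\alpha\gamma}} + \frac{E(MS_e)^2}{df_e}},$$

where $df_{\alpha\beta}$, $df_{\alpha\gamma}$ and $df_e$ are the degrees of freedom associated with the Treatment by Participant, Treatment by Item and Error mean squares.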

Expected MOE

(Note: I can’t seem to get mathjax to generate align environments or equation arrays, so the following is ugly; Note to self: next time use R-studio or Lyx to generate R-html or an equivalent format).

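The expected value of MOE for the contrasts in the counterbalanced design is:

$$E(MOE) = t_{.975}(df)\sqrt{\sum_{i=1}^{a} c_i^2 \, \frac{a}{pq}\left(\sigma^2_e + \frac{q}{a}\sigma^2_{\alpha\beta} + \frac{p}{a}\sigma^2_{\alpha\gamma}\right)},$$

which is the general expression for the expected MOE with the per condition sample size $n = \frac{pq}{a}$ and the expected within treatment variance of this design filled in.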

Finally an example

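Suppose the scores in the three conditions are normally distributed with (total) variances equal to 1. Suppose furthermore, that 10% of the variance can be attributed to the treatment by participant interaction, 10% of the variance to the treatment by item interaction and 40% of the variance to the error confounded with the participant by item interaction (which leaves 40% of the total variance attributable to participant and item variance).

Thus, we have $var((\alpha\beta)_{ij}) = .10$, $var((\alpha\gamma)_{ik}) = .10$, and $\sigma^2_e = .40$. Our target MOE is .25, and we plan to use the counterbalanced design with p = 30 participants, and q = 15 items (stimuli).

Due to the model restrictions presented above we have $\sigma^2_{\alpha\beta} = \frac{a}{a-1}(.10) = .15$, $\sigma^2_{\alpha\gamma} = \frac{a}{a-1}(.10) = .15$, and $\sigma^2_e = .40$.

The value of $\sigma^2_{within}$ is therefore $.40 + \frac{15}{3}(.15) + \frac{30}{3}(.15) = 2.65$, and the approximate df equal $df \approx 40$.

For the first contrast, with weights {1, -1/2, -1/2}, then, the expected value for the Margin of Error is $t_{.975}(40)\sqrt{1.5 \times \frac{2.65}{150}} = 2.021 \times 0.1628 = 0.3290$, where $150 = \frac{30 \times 15}{3}$ is the number of observations per condition.

For the second contrast, with weights {0, 1, -1}, the expected value of the Margin of Error is $t_{.975}(40)\sqrt{2 \times \frac{2.65}{150}} = 2.021 \times 0.1880 \approx 0.3799$.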

Thus, using p = 30 participants and q = 15 items (stimuli) leads to an expected MOE that is larger than the target MOE of .25 for both contrasts.
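The worked example can be retraced with a small script. This is a sketch in standard-library Python under stated assumptions, not the app’s code: the function names are hypothetical, the interaction components are set to .15 (the .10 shares rescaled by a/(a - 1)), and the degrees of freedom of the three mean squares, (a - 1)(p - a), (a - 1)(q - a) and (p - a)(q - a), are my assumption; they were chosen because they reproduce the expected MOE of 0.3290 that the app reports for the first contrast with p = 30 and q = 15.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_quantile(prob, df, steps=1000):
    """Quantile for prob > 0.5: bisection on a Simpson's-rule CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        h = mid / steps
        total = t_pdf(0.0, df) + t_pdf(mid, df)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * t_pdf(i * h, df)
        if 0.5 + total * h / 3 < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def counterbalanced_expected_moe(weights, p, q, var_e, var_pt, var_it):
    """Return (expected MOE, Satterthwaite df) for the counterbalanced design."""
    a = len(weights)
    ms_pt = var_e + q / a * var_pt  # expected MS, treatment by participant
    ms_it = var_e + p / a * var_it  # expected MS, treatment by item
    var_within = ms_pt + ms_it - var_e
    # ASSUMPTION: df of the three mean squares (Table 1 is not reproduced here).
    df_pt, df_it, df_e = (a - 1) * (p - a), (a - 1) * (q - a), (p - a) * (q - a)
    df = var_within ** 2 / (ms_pt ** 2 / df_pt + ms_it ** 2 / df_it
                            + var_e ** 2 / df_e)
    n = p * q / a  # observations per condition
    se = math.sqrt(sum(c * c for c in weights) * var_within / n)
    return t_quantile(0.975, df) * se, df


components = dict(var_e=0.40, var_pt=0.15, var_it=0.15)
moe_1, df_1 = counterbalanced_expected_moe([1, -0.5, -0.5], 30, 15, **components)
moe_2, df_2 = counterbalanced_expected_moe([0, 1, -1], 30, 15, **components)
print(round(moe_1, 4), round(moe_2, 4), round(df_1, 2))
```

Both contrasts share the same within treatment variance and df; only the sum of squared weights differs, which is why the contrast comparing just two means is the less precise one.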

We can use the app to find the required number of participants and items for a given target MOE. If the number of groups is larger than two, the app uses the contrast estimate with the largest expected MOE to calculate the sample sizes (in the default setting the one comparing only two group means). The reasoning is that if the least precise estimate (in terms of MOE) meets our target precision, the other ones meet our target precision as well.

Using the app

I’ve included lots of comments in the app itself, but please ignore references to a manual (does not exist, yet, except in Dutch) or an article (no idea whether or not I’ll be able to finish the write-up anytime soon). I hope the app is pretty straightforward. Just take a look at https://gmulder.shinyapps.io/precision/, but the basic idea is:
– Choose one of five designs
– Supply the number of treatment conditions
– Specify contrast(weights) (or use the default ones)
– Supply target MOE and assurance
– Supply values of variance components (read, e.g., Westfall et al., 2014, for more details).
– Supply a number of participants and items
– Choose run precision analysis with current values or
– Choose get sample sizes. (The app gives two solutions: one minimizes the number of participants and the other minimizes the number of stimuli/items). NOTE: the number of stimuli is always greater than or equal to 10 and the number of participants is always greater than or equal to 20.

An illustration

Take the example above. Our target MOE equals .25, and we want an assurance of .80 of getting an estimated MOE no larger than .25. We use a counterbalanced design with three conditions, and want to estimate two contrasts: one comparing the first mean with the average of means two and three, and the other comparing the second mean with the third mean. We can use the default contrasts.
For the variance components, we use the default values provided by Westfall et al. (2014). These are also the default values in the app (so we don’t need to change anything now).
Let’s see what happens when we propose to use p = 30 participants and q = 15 items/stimuli.
Here is part of a screenshot from the app:
These results show that the expected MOE for the first contrast (comparing the first mean with the average of the other means) equals 0.3290, and assurance MOE for the same contrast equals 0.3576. Remember that we specified the assurance as .80. So, this means that 80% of the replication experiments give an estimated MOE as large as or smaller than 0.3576. But we want that to be at most 0.2500. Thus, 30 participants and 15 items do not suffice for our purposes.
Let’s use the app to get sample sizes. The results are as follows.

The app promises that using 25 stimuli combined with 290 participants, or 25 participants and 290 items, will do the trick (the symmetry of these results is due to the fact that the interaction components are equal; both the treatment by participant and the treatment by stimulus interaction components equal .10). Since we have 3 treatment conditions, using 290 participants or stimuli is a little awkward, so I suggest using 291 (which equals 97 participants per group or 97 items per list). (300 is a much nicer figure, of course). Likewise, as it is hard to divide 25 stimuli or participants equally over three lists or groups, use a multiple of three (say: 27).

If we input the suggested sample sizes in the app, we see the following results if we choose to run the precision analysis with current values.

As you can see: Assurance MOE is close to 0.25 (.24) for the second contrast (the least precise one), so 80% of replication experiments will get estimated MOE of 0.25 (.24) or smaller. The expected precision is 0.22. The first contrast (which can be estimated with more precision) has assurance MOE of 0.21 and expected MOE of approximately 0.19.  Thus, the sample sizes lead to the results we want.
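The figures for the suggested sample sizes can be approximated outside the app. The following is a sketch in standard-library Python, not the app’s code. All function names are hypothetical. The variance components (.40 for error and .15 for each interaction, i.e. the .10 shares rescaled by a/(a - 1)) and the degrees of freedom of the mean squares are my assumptions about what the app does, and the quantiles come from simple numerical routines rather than from a statistics library.

```python
import math


def t_cdf(x, df, steps=1000):
    """Student t CDF for x >= 0: 0.5 plus Simpson's rule over [0, x]."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3


def chi2_cdf(x, df):
    """Chi-squared CDF via the series for the regularized incomplete gamma."""
    a, z = df / 2, x / 2
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def quantile(cdf, prob, upper):
    """Invert an increasing CDF on [0, upper] by bisection."""
    lo, hi = 0.0, upper
    for _ in range(80):
        mid = (lo + hi) / 2
        if cdf(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def counterbalanced_moe(weights, p, q, assurance=0.80,
                        var_e=0.40, var_pt=0.15, var_it=0.15):
    """Return (expected MOE, assurance MOE) for the counterbalanced design."""
    a = len(weights)
    ms_pt = var_e + q / a * var_pt
    ms_it = var_e + p / a * var_it
    var_within = ms_pt + ms_it - var_e
    # ASSUMED df of the treatment-by-participant, treatment-by-item and error MS.
    df = var_within ** 2 / (ms_pt ** 2 / ((a - 1) * (p - a))
                            + ms_it ** 2 / ((a - 1) * (q - a))
                            + var_e ** 2 / ((p - a) * (q - a)))
    se = math.sqrt(sum(c * c for c in weights) * var_within / (p * q / a))
    t_crit = quantile(lambda x: t_cdf(x, df), 0.975, 100.0)
    chi2_q = quantile(lambda x: chi2_cdf(x, df), assurance,
                      df + 10 * math.sqrt(2 * df) + 10)
    expected = t_crit * se
    return expected, expected * math.sqrt(chi2_q / df)


for p, q in [(290, 25), (291, 27)]:
    for weights in ([1, -0.5, -0.5], [0, 1, -1]):
        expected, assured = counterbalanced_moe(weights, p, q)
        print(p, q, weights, round(expected, 3), round(assured, 3))
```

Under these assumptions, 290 participants with 25 stimuli put the assurance MOE of the least precise contrast right at the target, and rounding up to 291 and 27 gives a little extra margin, in line with the results reported for the app.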

References

Cumming, G. (2012). Understanding the New Statistics. New York/London: Routledge.

Sahai, H., & Ageel, M. I. (2000). The analysis of variance. Fixed, Random, and Mixed Models. Boston/Basel/Berlin: Birkhäuser.

Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020-2045.

What is NHST, anyway?

I am not a fan of NHST (Null Hypothesis Significance Testing). Or maybe I should say, I am no longer a fan. I used to believe that rejecting null-hypotheses of zero differences based on the  p-value was the proper way of gathering evidence for my substantive hypotheses. And the evidential nature of the p-value seemed so obvious to me, that I frequently got angry when encountering what I believed were incorrect p-values, reasoning that if the p-value is incorrect, so must be the evidence in support of the substantive hypothesis.
For this reason, I refused to use the significance tests that were most frequently used in my field, i.e. performing a by-subjects analysis and a by-item analysis and concluding the existence of an effect if both are significant,  because the by-subjects analyses in particular regularly leads to p-values that are too low, which leads to believing you have evidence while you really don’t.  And so I spent a huge amount of time, coming from almost no statistical background – I followed no more than a few introductory statistics courses – , mastering mixed model ANOVA and hierarchical linear modelling (up to a reasonable degree; i.e. being able to get p-values for several experimental designs).  Because these techniques, so I believed, gave me correct p-values. At the moment, this all seems rather silly to me.
I still have some NHST unlearning to do. For example, I frequently catch myself looking at a 95% confidence interval to see whether zero is inside or outside the interval, and actually feeling happy when zero lies outside it (this happens when the result is statistically significant). Apparently, traces of NHST are strongly embedded in my thinking. I still have to tell myself not to be silly, so to say.
One reason for writing this blog is to sharpen my thinking about NHST and to figure out new and comprehensible ways of explaining to students and researchers why they should be very careful in considering NHST as the sine qua non of research. Of course, if you really want to make your reasoning clear, one of the first things you should do is define the concepts you’re reasoning about. The purpose of this post is therefore to make clear what my “definition” of NHST is.
My view of NHST  is very much based on how Gigerenzer et al. (1989) describe it:
“Fisher’s theory of significance testing, which was historically first, was merged with concepts from the Neyman-Pearson theory and taught as “statistics” per se. We call this compromise the “hybrid theory” of statistical inference, and it goes without saying that neither Fisher nor Neyman and Pearson would have looked with favor on this offspring of their forced marriage.” (p. 123, italics in original).
Actually, Fisher’s significance testing and Neyman-Pearson’s hypothesis testing are fundamentally incompatible (I will come back to this later), but almost no texts explaining statistics to psychologists “presented Neyman and Pearson’s theory as an alternative to Fisher’s, still less as a competing theory. The great mass of texts tried to fuse the controversial ideas into some hybrid statistical theory, as described in section 3.4. Of course, this meant doing the impossible.” (p. 219, italics in original).
So, NHST is an impossible, as in logically incoherent, “statistical theory”, because it (con)fuses concepts from incompatible statistical theories. If this is true, which I think it is, doing science with a small s, which involves logical thinking, disqualifies NHST as a main means of statistical inference. But let me write a little bit more about Fisher’s ideas and those of Neyman and Pearson, to explain the illogic of NHST.

I will try to describe the main characteristics of  the two approaches that got hybridized in NHST at a conceptual level. I will have to simplify a lot and I hope these simplifications do little harm. Let’s start with Fisher’s significance testing.

Fisher’s significance testing

The main purpose of Fisher’s significance testing is gathering evidence about parameters in a statistical model on the basis of a sample of data. So, the nature of the approach is evidential. Crucially, the evidence the data provides can only be evidence against a statistical model, but it can not be evidence in favour of the model, much in line with Popper’s idea  of progress in science by means of falsification. The statistical model to be nullified, i.e. the model one tries to obtain evidence against, is called the null-hypothesis.

Conceptually, the statistical model is a descriptive model of a population of possible values. An important part of Fisher’s approach is therefore to judge what kind of model provides an appropriate model of the population. For instance, this process of formulating the model (which, of course, involves a lot of thought and judgement) may lead one to assume that the random variable has a normal distribution, which is characterized by only two parameters: μ, the expected value or mean of the distribution, and σ, its standard deviation (the square root of the variance of the distribution).

The values of μ and σ (or σ²) are generally unknown, but we may assume (again as a result of thinking and judging) that they have particular values. For reasons of exposition, I will now assume that the value of σ is known, say σ = 15, so that we only have to take the unknown value of μ into account. Let’s suppose that our thinking and judging has led us to assume that the unknown value of μ = 100. The null-hypothesis is therefore that the variable has a normal distribution with μ = 100, and σ = 15.

We can obtain evidence against this null-hypothesis by determining a p-value. We first gather data, say we take a random sample of N = 225 participants, which enables us to obtain observed values of the variable. Next, we calculate a test statistic, for example by estimating the value of μ (on the basis of our data), subtracting the hypothesized value and dividing the difference by the standard error of the estimate. Our estimated value may for example be 103, and the standard error equals 15 / √225 = 1.0, so the value of the test statistic equals (103 – 100) / 1 = 3. And now we are ready to calculate the p-value.

The p-value is the probability of obtaining (when sampling repeatedly) a value of the test statistic as large as or larger than the one obtained in the study, provided that the null-hypothesis is true. This probability can be calculated because the distribution of the test statistic can be deduced from the specification of the null-hypothesis. In our example, the test statistic is approximately normally distributed with μ = 0, and σ = 1.0. (Because the distribution is only approximately normal, even when the null-hypothesis is true, the p-value in our example is not exact). The p-value equals 0.003. (This is the so-called two-sided p-value: it is the probability of obtaining a value equal to or larger than 3 or equal to or smaller than -3, but we will ignore the technicalities of two-sided tests).

The p-value tells us that if the null-hypothesis is true, and we repeatedly take random samples from the population (as described by the null-hypothesis), we will find a value of our test statistic at least as extreme as the one we obtained in about 0.3% of these samples. Thus, the probability of obtaining a value this far away from zero is very small.
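The calculation of the test statistic and the p-value in this example can be reproduced as follows. This is a minimal sketch in Python (standard library only); the variable names are mine, and it uses the normal approximation described above.

```python
import math
from statistics import NormalDist

mu_null = 100.0      # hypothesized mean under the null-hypothesis
sigma = 15.0         # standard deviation, assumed known
n = 225              # sample size
sample_mean = 103.0  # estimate of the mean obtained from the data

standard_error = sigma / math.sqrt(n)
z = (sample_mean - mu_null) / standard_error

# Two-sided p-value: probability of a statistic at least as extreme as |z|
# in either direction, if the null-hypothesis is true.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(standard_error, z, round(p_two_sided, 4))
```

Halving `p_two_sided` gives the one-sided probability of a value of 3 or larger.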

Following Fisher, this low p-value can be interpreted as that something “improbable” occurred (assuming the null-hypothesis is true) or as inductive evidence against the null-hypothesis, i.e. the null-hypothesis is not true.

In his early writings Fisher proposed a p-value smaller than .05 as inductive evidence against the null-hypothesis (keeping in mind the possibility that the null is true, but that something improbable happened), but later he thought using the fixed criterion of .05 to be non-scientific.  If the p-value is smaller than the criterion (say .05), the result is statistically significant.
In sum, the approach by Fisher, significance testing, involves specifying a statistical model, and using the p-value to test the assumptions of the model, such as specific values for μ or σ. If the p-value is smaller than the criterion value, either something improbable occurred or the null-hypothesis is not true. Crucially, the p-value may provide inductive evidence against the assumptions of the null-hypothesis, but a large p-value (larger than the criterion value) is not inductive support for the null-hypothesis.

Neyman-Pearson hypothesis testing

In contrast to Fisher’s evidential approach, Neyman and Pearson’s hypothesis testing is non-evidential. Its primary goal is to choose on the basis of repeated random sampling between two hypotheses (or more; but I will only consider two) in order to make behavioral decisions (so to speak) that will minimize decision errors and their associated costs (loss) in the long run. Instead of trying to figure out which of the two hypotheses is true, one decides to accept one of the two hypotheses (and reject the other) as if it were true, without actually having to believe it, and acts accordingly.
As with Fisher, Neyman-Pearson hypothesis testing starts with formulating descriptive models of the population. We may for instance propose (after thinking and judging) one model (hypothesis H1) that assumes that the variable has a normal distribution with μ = 100 and one model (hypothesis H2) that assumes that the variable has a normal distribution with μ = 106. We will assume the value of σ is known, say it equals 15. We will have to choose one of the two hypotheses, by rejecting one (and accepting the other).

Let’s suppose that only one of the models is true and that they cannot both be false. This means that we can incorrectly decide to reject or accept each of the two hypotheses.  That is, if we incorrectly reject H1, we incorrectly accept H2. So, there are two types of errors we can make. A type I error occurs when we incorrectly reject a true hypothesis and a type II error occurs when we incorrectly accept a false hypothesis.

In a previous post, I used the following conceptual descriptions of these errors: the type I error is the error of excessive skepticism, and the type II error is the error of extreme gullibility. From the perspective of Neyman-Pearson hypothesis testing, however, these descriptions may not make much sense, because they imply a relation between the decision about a hypothesis and belief in that hypothesis. In the Neyman-Pearson approach a rejection or non-rejection does not commit one to believing or not believing the hypothesis, although the hypotheses themselves are based on beliefs (and judging and reasoning) that the descriptive model is suitable for the population at hand.
The crucial point is that the goal of Neyman-Pearson hypothesis testing is to base courses of action on the decision to reject or not-reject a statistical hypothesis. This entails minimizing the costs (loss) associated with type I and type II errors. In particular, the approach minimizes the probability (β) of a type II error bounded by the probability (α) of a type I error. We may also say that we want to maximize the probability (1 – β), the probability of rejecting a false hypothesis, the so called power of the test, while keeping α at a maximum (usually low) value.
Suppose that our considerations of the loss associated with type I and type II errors have led us to the insight that false rejection of H2 is the most costly error. And suppose that we have agreed/determined/reasoned/judged that the probability of falsely rejecting it should be at most .05. So, α = .05. Of course, we also “know” the loss associated with falsely accepting it, and we have determined that the probability β should not exceed .10. Now, suppose that we repeatedly sample N = 225 observations from the (unknown) population. We do not know whether H1 or H2 provides the correct description of the population, but we assume that one of them must be true if we select a particular sample, and they cannot both be false.

We will reject H2 (Normal distribution with μ = 106, and σ = 15) if the sample mean in our random sample equals 104.35 or less (this corresponds to a test statistic with value -1.65). Why? Because the probability of obtaining a sample mean equal to or smaller than 104.35 is approximately .05 when H2 is true. Thus, if we repeatedly sample from the population when H2 is true, we will incorrectly reject it in 5% of the cases, which is the probability of a type I error that we want.

We have arranged things so that when H2 is false, H1 is by definition true. If H1 is true (H2 is false), there is a probability of more than .99 of obtaining a sample mean of 104.35 or smaller (with a standard error of 1.0, the value 104.35 lies 4.35 standard errors above μ = 100). Thus, the probability of rejecting H2 when it is false is larger than .99, this is the power of the test, and the probability of incorrectly not rejecting H2 when it is false is smaller than .01. The latter probability is the probability of a type II error, which we did not want to be larger than .10.
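The decision criterion and the two error probabilities can be checked numerically. The snippet below is a sketch in Python (standard library only) with variable names of my own choosing; it treats the sample mean as normally distributed with standard error 15 / √225 = 1.0 under both hypotheses.

```python
import math
from statistics import NormalDist

sigma, n = 15.0, 225
standard_error = sigma / math.sqrt(n)

# Sampling distributions of the sample mean under the two hypotheses.
under_h1 = NormalDist(mu=100.0, sigma=standard_error)
under_h2 = NormalDist(mu=106.0, sigma=standard_error)

# Reject H2 if the sample mean falls at or below the 5% point under H2.
critical_mean = under_h2.inv_cdf(0.05)

cutoff = 104.35                 # rounded criterion used in the text
alpha = under_h2.cdf(cutoff)    # P(reject H2 | H2 true): type I error
power = under_h1.cdf(cutoff)    # P(reject H2 | H1 true)
beta = 1 - power                # P(do not reject H2 | H1 true): type II error
print(round(critical_mean, 3), round(alpha, 4), power, beta)
```

With these numbers the power comes out well above .99 and β well below the .10 that was set as the maximum, so the design more than satisfies the requirement on the type II error.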

Now suppose the result is that the sample mean equals 103 (the value of the test statistic equals -3). According to the decision criterion we reject H2 (with α = .05) and accept H1 and act as if μ = 100 is true. Crucially, we do not have to believe it is actually true, nor do we consider the test statistic with value -3 as inductive evidence against H2. So, the test result provides neither support for H1 nor evidence against H2, but we know from the specification of the models and the assumptions about sampling that repeatedly using this procedure leads to 5% type I errors and less than 1% type II errors in the long run, depending on which of the two hypotheses is true (which is unknown to us). Given that we know the loss associated with each error, we are able to minimize the expected loss associated with acting upon the decisions we make about the hypotheses.

Note that Fisher’s significance testing would consider the p-value associated with the test statistic of -3, i.e. p < .01 either as inductive evidence against H2 or as an indication that something unusual (improbable) happened assuming H2 is true. Note also that in Fisher’s approach, it is not possible to reason from the inferred untruth of H2 to the truth of H1, because H1 does not exist in that approach.

It should be noted further that in the Neyman-Pearson approach, the importance of the value of the test statistic is restricted to whether or not the value exceeds a critical value (i.e. whether or not the value of the statistic is in the rejection region). That means that it is of no concern how much the test statistic exceeds the critical value, since all values larger than the critical value lead to the same decision: reject the hypothesis. In other words, because the approach is non-evidential, the magnitude of the test statistic is inconsequential as far as the truth of the hypothesis is concerned. Compare this to the Fisher approach, where the larger the test statistic is (the smaller the p-value), the stronger the inductive evidence is against the null-hypothesis.

Null-hypothesis significance testing (NHST)

NHST combines Fisher’s significance testing with Neyman-Pearson hypothesis testing, without regard for the logical incompatibilities of the two approaches. Fisher’s p-value is used both as a measure of inductive evidence against the null-hypothesis, with smaller p-values considered to be stronger evidence against the null than larger p-values, and as a test statistic. In its latter use, the null-hypothesis is (usually) rejected if the p-value is smaller than .05.

Contrary to significance testing, NHST uses the p-value to decide between the null-hypothesis and an alternative hypothesis. But contrary to the Neyman-Pearson approach, α, the probability of a type I error is not based on judgement and careful consideration of loss-functions, but is mechanically set at .05 (or .01). And, contrary to the Neyman-Pearson approach, the probability of a type II error (β) is usually not considered.

One reason for the latter may be that specification of the null-hypothesis is also mechanized.  In the case of differences between means or testing correlations or regression coefficients, etc, the standard null-hypothesis is that the difference, the correlation or the coefficient equals 0. This is also called the nil-hypothesis. As the alternative excludes the null, the standard alternative hypothesis is that the parameter in question is not equal to zero, which makes it hard to say something about the type II error, because determining the probability of a type II error requires thinking about a minimal consequential effect size (consequential in terms of decisions and associated loss) that can serve as the alternative hypothesis.

Specifying a non-nil alternative hypothesis, i.e. that the parameter value is not equal to zero, implies that results arbitrarily close to nil, but not equal to nil, are as consequential as effect sizes that are far away from the null-value, both in acting upon the value as in not-acting upon it. Crucially, not specifying a minimal consequential effect size, rules out determining  β. So, even though NHST uses the concept of an alternative hypothesis (contrary to Fisher), the nil-hypothesis is such that the procedure of Neyman and Pearson can no longer work: it is impossible to strike a balance between loss associated with type I and type II errors, and so NHST is not a hypothesis testing procedure.

For these reasons I am very much inclined to characterize NHST as fixed-α significance testing. But using fixed-α in combination with an evidential interpretation of p-values leads to logical inconsistencies. (As always, I assume that being logically consistent is one of the characteristics of doing science, but maybe you disagree). Note, by the way, that I am talking about the p-value as measure of evidence against the nil-hypothesis, and not about the p-value as test statistic. (But remember that proper use of the p-value as test statistic requires being able to specify a non-nil alternative hypothesis).
One of the logical inconsistencies is that α and the p-value-as-evidence involve contradictory conceptualisations of probability. In terms of p-values, α is simply the probability that the p-value is smaller than .05 (the usual criterion) assuming the nil-hypothesis is true. That probability follows deductively from the specification of the null-hypothesis (including, of course, the statistical model underlying it). Note that α is completely independent of actually realized results: it is an assertion about the p-value assuming repeated sampling from the null-population; α is about the test-procedure and not about actual data.
But the p-value-as-evidence against the null is not the result of deductive reasoning, but of inductive reasoning. The p-value is not a probability associated with the test-procedure. It is a random variable the value of which depends on the actual data, the null-value and the statistical model. Crucially, from a single realized result (a p-value) an inference is made about a probability distribution. But this is inconsistent with the frequency interpretation of probability that underlies the conceptualisation of α, because under this interpretation no probability statement can be made about realized single results (except that the probability is 100% that it happened) or about an unrealized single result (that probability is 0 if it does not happen or 1.0 if it happens).  To make the point: using p-value-as-evidence and (fixed)-α requires both believing that probability statements can be made on the basis of a single result and believing that that is impossible.  So, it boils down to believing that both A and not-A are true.
To me, logical inconsistencies like these disqualify NHST as a scientific means of statistical inference. I repeat that this is because I believe that doing science entails being logically consistent. Assuming or believing that A and not-A are both true, is not an example of logical consistency.
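
The claim above that α is a property of the test-procedure, and not of any realized data set, can be checked with a small simulation. What follows is only a sketch under my own assumptions, not something taken from the argument itself: two groups of 20 observations each, both drawn from the same standard normal population (so the nil-hypothesis is true by construction), and a z-test with known standard deviation to keep the code free of external libraries. The function names and all numbers are illustrative.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value of a z-test for equal means (known, common sigma)."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (fmean(sample_a) - fmean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(alpha=0.05, n_per_group=20, n_experiments=10_000, seed=1):
    """Proportion of simulated experiments with p < alpha when the nil-hypothesis is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Both groups come from the same population: any difference is sampling error.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(group_a, group_b) < alpha:
            rejections += 1
    return rejections / n_experiments


print(rejection_rate())  # should be close to 0.05
```

The proportion of rejections settles near .05 whatever any single simulated experiment looks like, which is the sense in which α describes the procedure under repeated sampling and says nothing about one realized result.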

Lazy Larry’s argument and the Mechanical Mind’s reply

Meet Lazy Larry, the non-critically thinking reviewer of your latest experimental result. (The story also applies to Lazy Larry’s reviews of non-experimental results). Lazy Larry does not believe your results signify anything “real”. Never mind your excellent experimental procedures and controls, and forget about your highly reliable instruments, Lazy Larry refuses to think about your results and by default dismisses them as “due to chance”.

“Due to chance” is simply shorthand for the following. Say your experimental group seems to outperform the control group on average. According to Lazy Larry, that is not due to your experimental manipulation, but due to sampling error: you just happened to have randomly assigned better performing participants to the experimental group than to the control group.

Enter the Mechanical Mind. Its sole purpose is to persuade Lazy Larry that the results are not “due to chance”. Mechanical Mind has learned that Lazy Larry is quite easily persuaded (remember that Larry doesn’t think), so Mechanical Mind always does the following:

1. He pretends to have randomly assigned a random sample of participants to either the experimental or the control group. (Note that the pretending is about having drawn a random sample. Since we assume an excellent experiment, we may just as well assume that the sample is in fact a random sample; the point is that the Mechanical Mind always assumes a random sample as part of its test procedure, even if the sample is a convenience sample.)
2. He formulates a null-hypothesis that the mean population values are exactly equal, to the millionth decimal or beyond.
3. He calculates a test statistic, say a t-value.
4. He determines a p-value: the probability of obtaining a t-value as large as or larger than the one obtained in the experiment, under the pretense of repeated sampling from the population, assuming the null-hypothesis is true.
5. He rejects the null-hypothesis if the p-value is smaller than .05 and calls that result significant.
6. He concludes that the results are not “due to chance” and automatically takes that conclusion to mean that the effect of the experimental manipulation is “real.”
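
To show how little thinking the six steps above require, here is a minimal sketch of the ritual in code. It is my own illustration, not a recommended analysis: it assumes an equal-variance two-sample t-test, a two-sided p-value, and it computes the p-value by numerically integrating the t density (Simpson's rule) so that only the standard library is needed. The function names and the scores in the usage example are hypothetical.

```python
import math
from statistics import fmean, variance


def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|) assuming the null-hypothesis, via Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|), using symmetry
    return 1 - central_mass


def mechanical_mind(experimental, control, alpha=0.05):
    """Steps 3 to 5 of the ritual: t-value, p-value, verdict. No thinking involved."""
    n1, n2 = len(experimental), len(control)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(experimental) + (n2 - 1) * variance(control)) / df
    t = (fmean(experimental) - fmean(control)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    p = two_sided_p_value(t, df)
    verdict = "significant" if p < alpha else "not significant"
    return t, p, verdict


# Hypothetical scores, made up for this illustration only.
t, p, verdict = mechanical_mind([5, 6, 7, 8, 9], [1, 2, 3, 4, 5])
print(f"t = {t:.2f}, p = {p:.4f}, verdict: {verdict}")
```

Note what the function never asks: whether the sample was random, whether the nil-hypothesis is plausible, whether the t-test fits the design, or where the .05 comes from. Step 6 does not appear in the code at all, because nothing in the computation licenses it.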

Being a non-thinker, Lazy Larry immediately agrees: if the p-value is smaller than .05, the effect is not “due to chance”, it is a real effect.

Enter a Small s Scientist. The Small s Scientist notices something peculiar. She notices that both Lazy Larry and the Mechanical Mind do not really think, which strikes her as odd. Doesn’t science involve thinking? Here we have Larry who has only one standard argument against any experimental result, and here we have the Mechanical Mind who has only one standard reply: a mindlessly performed ritual of churning out a p-value. Yes, it may shut up Lazy Larry, if the p-value happens to be smaller than .05, but the Small s Scientist is not lazy, she really thinks about experimental results.

She wonders about Lazy Larry’s argument. We have an experiment with excellent experimental procedures and controls, with highly reliable instruments, so although sampling error always has some role to play, it doesn’t immediately come to mind as a plausible explanation for the obtained effect. Again, simply assuming this by-default, is the mark of an unthinking mind.

She thinks about the Mechanical Mind’s procedure. The Mechanical Mind assumes that the mean population values are completely equal up to the millionth decimal or more. Why does the Mechanical Mind assume this? Is it really plausible that it is true? To the millionth decimal? Furthermore, she realizes that she has just read the introduction section of your paper in which you very intelligently and convincingly argue that your independent variable must have a major role to play in explaining the variation in the dependent variable. But now we have to assume that the population means are exactly the same? Reading your introduction section makes this assumption highly implausible.

She recognizes that the Mechanical Mind made you do a t-test. But is the t-test appropriate in the particular circumstances of your experiment? The assumptions of the test are that you have sampled from normally distributed populations with equal variances. Do these assumptions apply? The Mechanical Mind doesn’t seem to be bothered much about these assumptions at all. How could it? It cannot think.

She notices the definition of the p-value. The probability of obtaining a value of, in this case, the t-statistic as large as or larger than the one obtained in the experiment, assuming repeated random sampling from a population in which the null-hypothesis is true. But wait a minute, now we are assigning a probability statement to an individual event (i.e. the obtained t-statistic). Can we do that? Doesn’t a frequentist conception of probability rule out assigning probabilities to single events? Isn’t the frequentist view of probability restricted to the possibly infinite collection of single events and the frequency of occurrence of the possible values of the dependent variable? Is it logically defensible to assign probabilities to single events and at the same time make use of a frequentist conception of probability? It strikes the Small s Scientist as silly to think it is.

She understands why the Mechanical Mind focuses on the probability of obtaining results (under repeated sampling from the null-population) as extreme as or more extreme than the one obtained. It is simply that any obtained result has a very low probability (if not 0; e.g. if the dependent variable is continuous), no matter the hypothesis. So, the probability of a single obtained t-statistic is so low as to be inconsistent with every hypothesis. But why, she wonders, do we need to consider all the results that were not obtained (i.e. the more extreme results) in determining whether a “due to chance” explanation has some plausibility (remember that the “due to chance” argument does not seem to be very plausible to begin with)? Why, she wonders, do we not restrict ourselves to the data that were actually obtained?

The Small s Scientist gets a little frustrated when thinking about why a null-hypothesis can be rejected if p < .05 and not when p > .05. What is the scientific justification of using this criterion? She has read a lot about statistics but never found a justification for using .05, apart from Fisher claiming that .05 is convenient, which is not really a justification. It doesn’t seem to be very scientific to justify a critical value simply by saying that Fisher said so. Of course, the Small s Scientist knows about decision procedures a la Neyman and Pearson’s hypothesis testing in which setting α can be done on a rational basis by considering loss functions, but considering loss functions is not part of the Mechanical Mind’s procedure. Besides, is the purpose of the Mechanical Mind’s procedure not to counter the “due to chance” explanation, by providing evidence against it, instead of deciding whether or not the result is due to chance? In any case, the 5% criterion is an unjustified criterion, and using 5% by default is, let’s repeat it again, the mark of an unthinking mind.

The final part of the Mechanical Mind’s procedure strikes the Small s Scientist as embarrassingly silly. Here we see a major logical error. The Mechanical Mind assumes, and Lazy Larry seems to believe, that a low p-value (according to an unjustified convention of .05) entails that results are not “due to chance” whereas a high p-value means that the results are “due to chance”, and therefore not real. Maybe it should not surprise us that unthinking minds, mechanical, lazy, or both, show signs of illogical reasoning, but it seems to the Small s Scientist that illogical thinking has no part to play in doing science.

The logical error is the error of the transposed conditional. The conditional is: if the null-hypothesis (and all other assumptions, including repeated random sampling) is true, the probability of obtaining a t-statistic as large as or larger than the one obtained in the experiment is p. That is, if all of the obtained t-statistics in repeated samples are “due to chance”, the probability of obtaining one as large as or larger than the one obtained in the experiment equals p. Its incorrect transpose is: if the p-value is small, then the null-hypothesis is not true (i.e. the results are not “due to chance”). This is very close to going from “If the null-hypothesis is true, these results (or more extreme results) do not happen very often” to “If these results happen, the null-hypothesis is not true”. More abstractly, the Mechanical Mind goes from “If H, then probably not R” to “If R, then probably not H”, where R stands for the results and H for the null-hypothesis.
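
A small counting exercise shows that the two conditionals really can come apart. This is a sketch with entirely made-up numbers, not an estimate of anything: I assume a hypothetical collection of 1000 tested hypotheses of which 90% have a true null, a test with size .05 and a power of .50 against the false nulls. Treating hypotheses as repeated draws from such a collection keeps the example within a frequency reading of probability.

```python
def share_of_true_nulls_among_significant(n_hypotheses, share_null_true, alpha, power):
    """Among all significant results, the proportion for which H is in fact true."""
    n_null_true = n_hypotheses * share_null_true
    n_null_false = n_hypotheses - n_null_true
    false_positives = alpha * n_null_true   # "If H, then R" happens rarely...
    true_positives = power * n_null_false
    # ...but "If R, then H" depends on how many true nulls were tested to begin with.
    return false_positives / (false_positives + true_positives)


# Hypothetical values, chosen for illustration only.
share = share_of_true_nulls_among_significant(
    n_hypotheses=1000, share_null_true=0.9, alpha=0.05, power=0.5
)
print(f"P(R | H) = 0.05, but the share of true nulls among significant results = {share:.2f}")
```

With these numbers there are 45 significant results for true nulls and 50 for false nulls, so almost half of the significant results belong to a true null-hypothesis, even though such results occur in only 5% of the cases where the null is true. “If H, then probably not R” holds, and “If R, then probably not H” does not follow.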

To sum up. The Small s Scientist believes that science involves thinking. The Mechanical Mind’s procedure is an unthinking reply to Lazy Larry’s standard argument that experimental results are “due to chance”. The Small s Scientist tries to think beyond that standard argument and finds many troubling aspects of the Mechanical Mind’s procedure. Here are the main points.

1. The plausibility of the null-hypothesis of exactly equal population means cannot be taken for granted. Like every hypothesis it requires justification.
2. The choice for a test statistic cannot be automatically determined. Like every methodological choice it requires justification.
3. The interpretation of the p-value as a measure of evidence against the “due to chance” argument requires assigning a probability statement to a single event. This is not possible under a frequentist conception of probability. So, doing so, and simultaneously holding a frequentist conception of probability, means that the procedure is logically inconsistent. The Small s Scientist does not like logical inconsistency in scientific work.
4. The p-value as a measure of evidence includes “evidence” not actually obtained. How can a “due to chance” explanation (as implausible as it often is) be discredited on the basis of evidence that was not obtained?
5. The use of a criterion of .05 is unjustified, so even if we allow logical inconsistency in the interpretation of the p-value (i.e. assigning a probability statement to a single event), which a Small s Scientist does not, we still need a scientific justification of that criterion. The Mechanical Mind’s procedure does not provide such a justification.
6. A large p-value does not entail that the results “are due to chance”.  A p-value cannot be used to distinguish “chance” results from “non-chance” results. The underlying reasoning is invalid, and a Small s Scientist does not like invalid reasoning in scientific work.

Type I error probability does not destroy the evidence in your data

Have you heard about that experimental psychologist? He decided that his participants did not exist, because the probability of selecting them, assuming they exist, was very small indeed (p < .001). Fortunately, his colleagues were quick to reply that he was mistaken. He should decide that they do exist, because the probability of selecting them, assuming they do not exist, is very remote (p < .001). Yes, even unfunny jokes can be telling about the silliness of significance testing.

But sometimes the silliness is more subtle, for instance in a recent blog post by Daniel Lakens, the 20% Statistician, with the title “Why Type I errors are more important than Type 2 errors (if you care about evidence).” The logic of his post is so confused that I really do not know where to begin. So, I will aim at his main conclusion that type I error inflation quickly destroys the evidence in your data.

(Note: this post uses mathjax and I’ve found out that this does not really work well on a (well, my) mobile device. It’s pretty much unreadable).

Lakens seems to believe that the long term error probabilities associated with decision procedures have something to do with the actual evidence in your data. What he basically does is define evidence as the ratio of power to size (i.e. the probability of a type I error); it is basically a form of the positive likelihood ratio (PLR)

$$\mathrm{PLR} = \frac{1 - \beta}{\alpha},$$

which makes it plainly obvious that manipulating $\alpha$ (for instance by multiplying it with some constant c) influences the PLR more than manipulating $\beta$ by the same amount. So, his definition of “evidence” makes part of his conclusion true by definition: $\alpha$ has more influence on the PLR than $\beta$. But it is silly to reason on the basis of this that the type I error rate destroys the evidence in your data.
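
To see the arithmetic behind this, here is a minimal sketch that computes the ratio of power to size, (1 − β) / α, before and after multiplying each error probability by the same constant. The starting values (α = .05, β = .20) and the constant c = 2 are my own hypothetical choices, picked only because they are conventional; the function name is illustrative too.

```python
def positive_likelihood_ratio(alpha, beta):
    """Ratio of power to size: (1 - beta) / alpha."""
    return (1 - beta) / alpha


alpha, beta, c = 0.05, 0.20, 2  # hypothetical values, chosen for illustration

baseline = positive_likelihood_ratio(alpha, beta)
alpha_inflated = positive_likelihood_ratio(c * alpha, beta)  # type I error rate doubled
beta_inflated = positive_likelihood_ratio(alpha, c * beta)   # type II error rate doubled

print(f"baseline PLR:        {baseline:.1f}")
print(f"PLR with 2 x alpha:  {alpha_inflated:.1f}")
print(f"PLR with 2 x beta:   {beta_inflated:.1f}")
```

For these values, doubling the type I error rate halves the ratio, whereas doubling the type II error rate lowers it by a quarter. Note that no data enter the calculation anywhere: the ratio is computed from properties of the decision procedure alone.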

The point is that $\alpha$ and $\beta$ (or the probabilities of type I errors and type II errors) have nothing to say about the actual evidence in your data. To be sure, if you commit one of these errors, it is the data (in NHST combined with arbitrary, i.e. unjustified, cut-offs) that lead you to these errors. Thus, even small values of $\alpha$ and $\beta$ do not guarantee that actual data lead to a correct decision.

Part of the problem is that Lakens confuses evidence and decisions, which is a very common confusion in NHST practice. But deciding to reject a null-hypothesis is not the same as having evidence against it (there is this thing called type I error). It seems that NHST-ers and NHST apologists find this very, very hard to understand. As my grandmother used to say: deciding that something is true does not make it true.

I will try to make plausible that decisions are not evidence (see also my previous post here). This should be enough to show you that error probabilities associated with the decision procedure tell you nothing about the actual evidence in your data. In other words, this should be enough to convince you that Type I error rate inflation does not destroy the evidence in your data, contrary to the 20% Statistician’s conclusion.

Let us consider whether the frequency of correct (or false) decisions is related to the evidence in the data. Suppose I tell you that I have a Baloney Detection Kit (based for example on the baloney detection kit at skeptic.com) and suppose I tell you that according to my Baloney Detection Kit the 20% Statistician’s post is, well, Baloney. Indeed, the quantitative measure (amount of Baloneyness) I use to make the decision is well above the critical value. I am pretty confident about my decision to categorize the post as Baloney as well, because my decision procedure rarely leads to incorrect decisions. The probability that I decide that something is Baloney when it is not is only 1% ($\alpha = .01$) and the probability that I decide that something is not-Baloney when it is in fact Baloney is only 1% as well ($\beta = .01$).

Now, the 20% Statistician’s conclusion states that manipulating $\alpha$, for instance by setting it to a larger value, destroys the evidence in my data. Let’s see. The evidence in my data is of course the amount of Baloneyness of the post. (Suppose my evidence is that the post contains 8 dubious claims). How does changing $\alpha$ have any influence on the amount of Baloneyness? The only thing changing $\alpha$ does is influence the frequency of incorrect decisions to call something Baloney when it is not. No matter what value of $\alpha$ (or $\beta$, for that matter) we use, the amount of Baloneyness in this particular post (i.e. the evidence in the data) is 8 dubious claims.

To be sure, if you tell the 20% Statistician that his post is Baloney, he will almost certainly not ask you how many times you are right and wrong in the long run (characteristics of the decision procedure); he will want to see your evidence. Likewise, he will probably not argue that your decision procedure is inadequate for the task at hand (maybe it is applicable to science only and not to non-scientific blog posts), but he will argue about the evidence (maybe by simply deciding (!) that what you are saying is wrong; or by claiming that the post does not contain 8 dubious claims, but only 7).

The point is, of course, this: the long term error probabilities $\alpha$ and $\beta$ associated with the decision procedure have no influence on the actual evidence in your data. The conclusion of the 20% Statistician is simply wrong. Type I error inflation does not destroy the evidence in your data, nor does type II error inflation.