*per se*. We call this compromise the “hybrid theory” of statistical inference, and it goes without saying the neither Fisher nor Neyman and Pearson would have looked with favor on this offspring of their forced marriage.” (p. 123, italics in original).

*hybrid*statistical theory, as described in section 3.4. Of course, this meant doing the impossible.” (p. 219, italics in original).

### Fisher’s significance testing

**evidential**. Crucially, the evidence the data provides can only be evidence against a statistical model, but it can not be evidence in favour of the model, much in line with Popper’s idea of progress in science by means of falsification. The statistical model to be nullified, i.e. the model one tries to obtain evidence against, is called the null-hypothesis.

Conceptually, the statistical model is a descriptive model of a population of possible values. An important part of Fisher’s approach is therefore to judge what kind of model provides an appropriate model of the population. For instance, this process of formulating the model (which, of course, involves a lot of thought and judgement) may lead one to assume that the random variable has a normal distribution, which is characterized by only two parameters, μ the expected value or mean of the distribution and σ, the square root of the variance of the distribution, which in the case of the normal distribution is it’s standard deviation (the standard deviation is the square root of the variance).

The values of μ and σ (or σ^{2}) are generally unknown, but we may assume (again as a result of thinking and judging) that they have particular values. For reasons of exposition, I will now assume that the value of σ is known, say σ = 15, so that we only have to take the unknown value of μ into account. Let’s suppose that our thinking and judging has led us to assume that the unknown value of μ = 100. The null-hypothesis is therefore that the variable has a normal distribution with μ = 100, and σ = 15.

We can obtain evidence against this null-hypothesis, by determining a p-value. We first gather data, say we take a random sample of N = 225 participants, which enables us to obtain observed values of the variable. Next, we calculate a test statistic, for example by estimating the value of μ (on the basis of our data) subtracting the hypothesized value and dividing the estimate by it’s standard error. Our estimated value may for example be 103, and the standard error equals 15 / √225 = 1.0, so the value of the test statistic equals (103 – 100) / 1 = 3. And now we are ready to calculate the p-value.

The p-value is the probability of obtaining (when sampling repeatedly) a value of the test statistic as large as or larger than the one obtained in the study, provided that the null-hypothesis is true. This probability can be calculated because the exact distribution of the test statistic can be deduced from the specification of the null-hypothesis. In our example, the test statistic is approximately normally distributed with μ = 0, and σ = 1.0. (The distribution is approximately normal, assuming the null-hypothesis is true, so the p-value in our example not exact). The p-value equals 0.003. (This is the so-called two-sided p-value, it is the probability of obtaining a value equal or larger than 3 or equal of smaller than -3, but we will ignore the technicalities of two-sided tests).

Following Fisher, this low p-value can be interpreted as that something “improbable” occurred (assuming the null-hypothesis is true) or as inductive evidence against the null-hypothesis, i.e. the null-hypothesis is not true.

### Neyman-Pearson hypothesis testing

**non-evidential**. Its primary goal is to choose on the basis of repeated random sampling between two hypotheses (or more; but I will only consider two) in order to make behavioral decisions (so to speak) that will minimize decision errors and their associated costs (loss) in the long run. In stead of trying to figure out which of the two hypotheses is true, one decides to accept one (and reject the other) of the two hypothesis

*as if it were true, without actually having to believe it*, and act accordingly.

_{1}) assumes that the variable has a normal distribution with μ = 100 and one model (hypothesis H

_{2}) that assumes that the variable has a normal distribution with μ = 106. We will assume the value of σ is known, say it equals 15. We will have to choose one of the two hypothesis, by rejecting one (and accepting the other).

Let’s suppose that __only one of the models is true and that they cannot both be false.__ This means that we can incorrectly decide to reject or accept each of the two hypotheses. That is, if we incorrectly reject H_{1}, we incorrectly accept H_{2}. So, there are two types of errors we can make. A type I error occurs when we incorrectly reject a true hypothesis and a type II error occurs when we incorrectly accept a false hypothesis.

_{2}is the most costly error. And suppose that we have agreed/determined/reasoned/judged that the probability of falsely rejecting it should be at most .05. So, α = .05. Of course, we also “know” the loss associated with falsely accepting it, and we have determined that the probability β should not exceed .10. Now, suppose that we repeatedly sample N = 225 observations from the (unknown) population. We do not know whether H

_{1}or H

_{2}provides the correct description of the population, but we assume that one of them must be true if we select a particular sample, and they cannot both be false.

We will reject H_{2} (Normal distribution with μ = 106, and σ = 15) if the sample mean in our random sample equals 104.35 or less (this corresponds to a test statistic with value -1.65). Why, because the probability of obtaining a sample mean equal or smaller than 104.35 is approximately .05 when H_{2} is true. Thus, if we repeatedly sample from the population when H_{2} is true, we will incorrectly reject it in 5% of the cases. Which is the probability of a type I error that we want.

_{2}is false, H

_{1}is per definition true. If H

_{1}is true (H

_{2}is false), there is a probability of approximately .99 to obtain a sample mean of 104.35 or smaller. Thus, the probability to reject H

_{2}when it it false is .99, this is the power of the test, and the probability is approximately .01 of incorrectly not rejecting H

_{2}when it is false. The latter probability is the probability of a type II error, which we did not want to be larger than .10.

Now suppose the results is that the sample mean equals 103 (the value of the test statistic equals -3). According to the decision criterion we reject H_{2} (with α = .05) and accept H_{1} and act as if μ = 100 is true. Crucially, we do not have to believe it is actually true, nor do we consider the test statistic with value -3 as inductive evidence against H_{2}. So, the test result provides neither support for H_{1} nor evidence against H_{2}, but we know from the specification of the models and the assumptions about sampling that repeatedly using this procedure leads to 5% type I errors and 1% type II errors in the long run, depending on which of the two hypotheses is true (which is unknown to us). Given that we know the loss associated with each error, we are able to minimize the expected loss associated with acting upon the decisions we make about the hypotheses.

Note that Fisher’s significance testing would consider the p-value associated with the test statistic of -3, i.e. p < .01 either as inductive evidence against H_{2} or as an indication that something unusual (improbable) happened assuming H_{2} is true. Note also that in Fisher’s approach, it is not possible to reason from the inferred untruth of H_{2} to the truth of H_{1}, because H_{1} does not exist in that approach.

It should be noted further that in the Neyman-Pearson approach, the importance of the value of the test statistic is restricted to whether or not the value exceeds a critical value (i.e. whether or not the value of the statistic is in the rejection region). That means that it is of no concern how much the test statistic exceeds the critical value, since all values larger than the critical value lead to the same decision: reject the hypothesis. In other words, because the approach is non-evidential, the magnitude of the test statistic is inconsequential as far as the truth of the hypothesis is concerned. Compare this to the Fisher approach, where the larger the test statistic is (the smaller the p-value), the stronger the inductive evidence is against the null-hypothesis.

### Null-hypothesis significance testing (NHST)

NHST combines Fisher’s significance testing with Neyman-Pearson hypothesis testing, without regard for the logical incompatibilities of the two approaches. Fisher’s p-value is used both as a measure of inductive evidence against the null-hypothesis, with smaller p-values considered to be stronger evidence against the null than larger p-values, and as a test statistic. In its latter use, the null-hypothesis is (usually) rejected if the p-value is smaller than .05.

Contrary to significance testing, NHST uses the p-value to decide between the null-hypothesis and an alternative hypothesis. But contrary to the Neyman-Pearson approach, α, the probability of a type I error is not based on judgement and careful consideration of loss-functions, but is mechanically set at .05 (or .01). And, contrary to the Neyman-Pearson approach, the probability of a type II error (β) is usually not considered.

One reason for the latter may be that specification of the null-hypothesis is also mechanized. In the case of differences between means or testing correlations or regression coefficients, etc, the standard null-hypothesis is that the difference, the correlation or the coefficient equals 0. This is also called the nil-hypothesis. As the alternative excludes the null, the standard alternative hypothesis is that the parameter in question is not equal to zero, which makes it hard to say something about the type II error, because determining the probability of a type II error requires thinking about a minimal consequential effect size (consequential in terms of decisions and associated loss) that can serve as the alternative hypothesis.

Specifying a non-nil alternative hypothesis, i.e. that the parameter value is not equal to zero, implies that results arbitrarily close to nil, but not equal to nil, are as consequential as effect sizes that are far away from the null-value, both in acting upon the value as in not-acting upon it. Crucially, not specifying a minimal consequential effect size, rules out determining β. So, even though NHST uses the concept of an alternative hypothesis (contrary to Fisher), the nil-hypothesis is such that the procedure of Neyman and Pearson can no longer work: it is impossible to strike a balance between loss associated with type I and type II errors, and so NHST is not a hypothesis testing procedure.