Meet Lazy Larry, the non-critically thinking reviewer of your latest experimental result. (The story also applies to Lazy Larry’s reviews of non-experimental results.) Lazy Larry does not believe your results signify anything “real”. Never mind your excellent experimental procedures and controls, and forget about your highly reliable instruments: Lazy Larry refuses to think about your results and by default dismisses them as “due to chance”.
“Due to chance” is simply shorthand for the following kind of claim: your experimental group seems to outperform the control group on average, yet this is not due to your experimental manipulation but to sampling error. You just happened to have randomly assigned better-performing participants to the experimental group than to the control group.
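To get a feel for what a “due to chance” difference looks like, here is a minimal Python sketch. It assumes hypothetical scores drawn from a single normal population (mean 100, SD 15), assigns them at random to two groups, and applies no manipulation at all. The numbers, the group size, and the function name `chance_difference` are illustrative assumptions, not taken from any real experiment.

```python
import random
import statistics


def chance_difference(n_per_group=20, seed=1):
    """Split one pool of participants at random into two groups; no treatment is applied."""
    rng = random.Random(seed)
    # Hypothetical scores: every participant comes from the same population.
    pool = [rng.gauss(100, 15) for _ in range(2 * n_per_group)]
    rng.shuffle(pool)  # random assignment to groups
    experimental, control = pool[:n_per_group], pool[n_per_group:]
    return statistics.mean(experimental) - statistics.mean(control)


# Repeat the "null experiment" many times and look at the spread of differences.
diffs = [chance_difference(seed=s) for s in range(1000)]
print(f"mean difference over 1000 null experiments: {statistics.mean(diffs):.2f}")
print(f"largest absolute difference: {max(abs(d) for d in diffs):.2f}")
```

On average the difference hovers around zero, but individual experiments can show sizeable differences even though nothing was manipulated. That is all Lazy Larry's argument amounts to.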
Enter the Mechanical Mind. Its sole purpose is to persuade Lazy Larry that the results are not “due to chance”. Mechanical Mind has learned that Lazy Larry is quite easily persuaded (remember that Larry doesn’t think), so Mechanical Mind always does the following:
- He pretends to have randomly assigned a random sample of participants to either the experimental or the control group. (Note that the pretending concerns having drawn a random sample. Since we assume an excellent experiment, we may just as well assume that the sample is in fact random, but the Mechanical Mind always assumes a random sample as part of its test procedure, even if the sample is a convenience sample.)
- He formulates a null-hypothesis that the population means are exactly equal, to the millionth decimal place or beyond.
- He calculates a test statistic, say a t-value.
- He determines a p-value: the probability of obtaining a t-value as large as or larger than the one obtained in the experiment, under the pretense of repeated sampling from the population, assuming the null-hypothesis is true.
- He rejects the null-hypothesis if the p-value is smaller than .05 and calls that result significant.
- He concludes that the results are not “due to chance” and automatically takes that conclusion to mean that the effect of the experimental manipulation is “real.”
Being a non-thinker, Lazy Larry immediately agrees: if the p-value is smaller than .05, the effect is not “due to chance”, it is a real effect.
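The Mechanical Mind's steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are made up, the function names `t_statistic` and `mechanical_mind` are hypothetical, and the p-value is approximated by literally simulating "repeated sampling" from two normal populations with exactly equal means and variances, rather than by looking it up in a t-distribution.

```python
import math
import random
import statistics


def t_statistic(x, y):
    """Student's two-sample t-statistic with a pooled variance estimate."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(
        pooled_var * (1 / nx + 1 / ny))


def mechanical_mind(experimental, control, alpha=0.05, n_sim=10000, seed=0):
    """Run the ritual: t-value, one-sided p-value, verdict."""
    rng = random.Random(seed)
    t_obs = t_statistic(experimental, control)
    nx, ny = len(experimental), len(control)
    # Pretend to sample repeatedly from populations in which the
    # null-hypothesis is exactly true (assumed normal, equal variances).
    hits = 0
    for _ in range(n_sim):
        x = [rng.gauss(0, 1) for _ in range(nx)]
        y = [rng.gauss(0, 1) for _ in range(ny)]
        if t_statistic(x, y) >= t_obs:
            hits += 1
    p = hits / n_sim  # proportion of t-values as large as or larger than t_obs
    # The final step is the inference the rest of this post questions.
    verdict = 'significant: not "due to chance"' if p < alpha else "not significant"
    return t_obs, p, verdict


# Hypothetical scores, invented for illustration only.
experimental = [24, 27, 31, 29, 26, 30, 28, 33]
control = [22, 25, 21, 27, 24, 23, 26, 20]
t_obs, p, verdict = mechanical_mind(experimental, control)
print(f"t = {t_obs:.3f}, simulated p = {p:.4f}, verdict: {verdict}")
```

Note that nothing in this procedure requires any thought about the experiment itself: the same function is applied to every data set.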
Enter a Small s Scientist. The Small s Scientist notices something peculiar. She notices that both Lazy Larry and the Mechanical Mind do not really think, which strikes her as odd. Doesn’t science involve thinking? Here we have Larry who has only one standard argument against any experimental result, and here we have the Mechanical Mind who has only one standard reply: a mindlessly performed ritual of churning out a p-value. Yes, it may shut up Lazy Larry, if the p-value happens to be smaller than .05, but the Small s Scientist is not lazy: she really thinks about experimental results.
She wonders about Lazy Larry’s argument. We have an experiment with excellent experimental procedures and controls, with highly reliable instruments, so although sampling error always has some role to play, it doesn’t immediately come to mind as a plausible explanation for the obtained effect. Again, simply assuming this by default is the mark of an unthinking mind.
She thinks about the Mechanical Mind’s procedure. The Mechanical Mind assumes that the population means are completely equal, up to the millionth decimal or more. Why does the Mechanical Mind assume this? Is it really plausible that it is true? To the millionth decimal? Furthermore, she realizes that she has just read the introduction section of your paper in which you very intelligently and convincingly argue that your independent variable must have a major role to play in explaining the variation in the dependent variable. But now we have to assume that the population means are exactly the same? Reading your introduction section makes this assumption highly implausible.
She recognizes that the Mechanical Mind made you do a t-test. But is the t-test appropriate in the particular circumstances of your experiment? The assumptions of the test are that you have sampled from normally distributed populations with equal variances. Do these assumptions apply? The Mechanical Mind doesn’t seem to be bothered much about these assumptions at all. How could it? It cannot think.
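As a rough illustration of what "being bothered about the assumptions" could look like, the sketch below computes two simple descriptive indicators: the ratio of the sample variances and the sample skewness of each group. These are crude descriptive checks rather than formal tests, the data are hypothetical reaction times invented for this example, and the helper names `variance_ratio` and `sample_skewness` are assumptions of this sketch.

```python
import statistics


def variance_ratio(x, y):
    """Larger sample variance divided by the smaller one (1.0 means equal)."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return max(vx, vy) / min(vx, vy)


def sample_skewness(values):
    """Average cubed z-score; roughly 0 for symmetric data."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical reaction times in milliseconds, invented for illustration.
experimental = [310, 325, 298, 340, 450, 305, 620, 315]
control = [300, 310, 295, 320, 305, 315, 290, 325]

print(f"variance ratio: {variance_ratio(experimental, control):.1f}")
print(f"skewness (experimental): {sample_skewness(experimental):.2f}")
print(f"skewness (control): {sample_skewness(control):.2f}")
```

With data like these, where one group is strongly right-skewed and far more variable than the other, a thinking analyst would at least pause before applying a test that assumes normality and equal variances.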
She notices the definition of the p-value. The probability of obtaining a value of, in this case, the t-statistic as large as or larger than the one obtained in the experiment, assuming repeated random sampling from a population in which the null-hypothesis is true. But wait a minute, now we are assigning a probability statement to an individual event (i.e. the obtained t-statistic). Can we do that? Doesn’t a frequentist conception of probability rule out assigning probabilities to single events? Isn’t the frequentist view of probability restricted to the possibly infinite collection of single events and the frequency of occurrence of the possible values of the dependent variable? Is it logically defensible to assign probabilities to single events and at the same time make use of a frequentist conception of probability? It strikes the Small s Scientist as silly to think it is.
She understands why the Mechanical Mind focuses on the probability of obtaining results (under repeated sampling from the null-population) as extreme as or more extreme than the one obtained. It is simply that any single obtained result has a very low probability (if not 0, e.g. if the dependent variable is continuous), no matter the hypothesis. So, the probability of a single obtained t-statistic is so low as to be inconsistent with every hypothesis. But why, she wonders, do we need to consider all the results that were not obtained (i.e. the more extreme results) in determining whether a “due to chance” explanation has some plausibility (remember that the “due to chance” argument does not seem to be very plausible to begin with)? Why, she wonders, do we not restrict ourselves to the data that were actually obtained?
The Small s Scientist gets a little frustrated when thinking about why a null-hypothesis can be rejected if p < .05 and not when p > .05. What is the scientific justification for using this criterion? She has read a lot about statistics but never found a justification for using .05, apart from Fisher claiming that .05 is convenient, which is not really a justification. It doesn’t seem to be very scientific to justify a critical value simply by saying that Fisher said so. Of course, the Small s Scientist knows about decision procedures à la Neyman and Pearson’s hypothesis testing, in which setting α can be done on a rational basis by considering loss functions, but considering loss functions is not part of the Mechanical Mind’s procedure. Besides, is the purpose of the Mechanical Mind’s procedure not to counter the “due to chance” explanation by providing evidence against it, instead of deciding whether or not the result is due to chance? In any case, the 5% criterion is an unjustified criterion, and using 5% by default is, to repeat, the mark of an unthinking mind.
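The point that a single result of a continuous statistic has essentially zero probability, while the tail area does not, can be made concrete with a small sketch. It uses a standard normal distribution as a stand-in for the null distribution of a test statistic (an assumption made here for convenience; it is not the t-distribution itself), and the observed value of 2.0 is hypothetical.

```python
from statistics import NormalDist

# Stand-in null distribution for a continuous test statistic (assumption).
null_dist = NormalDist(mu=0, sigma=1)
observed = 2.0  # hypothetical observed value of the statistic

# Probability of landing in an ever narrower interval around the observed value.
point_probs = []
for width in (0.1, 0.001, 0.00001):
    prob = null_dist.cdf(observed + width / 2) - null_dist.cdf(observed - width / 2)
    point_probs.append(prob)
    print(f"P(result within {width / 2:g} of {observed}) = {prob:.8f}")

# Probability of a result as large as or larger than the observed one.
tail_area = 1 - null_dist.cdf(observed)
print(f"P(result >= {observed}) = {tail_area:.4f}")
```

The interval probabilities shrink towards zero as the interval narrows, whereas the tail area stays put. The price of that stability is that the tail area consists almost entirely of results that were never observed.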
The final part of the Mechanical Mind’s procedure strikes the Small s Scientist as embarrassingly silly. Here we see a major logical error. The Mechanical Mind assumes, and Lazy Larry seems to believe, that a low p-value (according to an unjustified convention of .05) entails that results are not “due to chance” whereas a high p-value means that the results are “due to chance”, and therefore not real. Maybe it should not surprise us that unthinking minds, mechanical, lazy, or both, show signs of illogical reasoning, but it seems to the Small s Scientist that illogical thinking has no part to play in doing science.
The logical error is the error of the transposed conditional. The conditional is: if the null-hypothesis (and all other assumptions, including repeated random sampling) is true, the probability of obtaining a t-statistic as large as or larger than the one obtained in the experiment is p. That is, if all of the obtained t-statistics in repeated samples are “due to chance”, the probability of obtaining one as large as or larger than the one obtained in the experiment equals p. Its incorrect transpose is: if the p-value is small, then the null-hypothesis is not true (i.e. the results are not “due to chance”). This is very close to going from “If the null-hypothesis is true, these results (or more extreme results) do not happen very often” to “If these results happen, the null-hypothesis is not true”. More abstractly, the Mechanical Mind goes from “If H, then probably not R” to “If R, then probably not H”, where R stands for the results and H for the null-hypothesis.
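A small numerical sketch shows why the transposition fails. It assumes, purely for the sake of illustration, that one is willing to assign a prior probability to H, which goes beyond the frequentist framework discussed above. All numbers and the function name `posterior_null` are hypothetical. The sketch applies Bayes' rule to show that the probability of R given H and the probability of H given R can be very different things.

```python
def posterior_null(p_r_given_h, p_r_given_alt, prior_h):
    """P(H | R) by Bayes' rule, for a null H and a single alternative."""
    joint_h = p_r_given_h * prior_h            # P(R and H)
    joint_alt = p_r_given_alt * (1 - prior_h)  # P(R and not H)
    return joint_h / (joint_h + joint_alt)


# In all three cases P(R | H) = .05, i.e. "If H, then probably not R".
scenarios = [
    # (P(R | H), P(R | not H), prior P(H)) -- all values are hypothetical
    (0.05, 0.10, 0.5),
    (0.05, 0.80, 0.5),
    (0.05, 0.80, 0.9),
]
for p_r_h, p_r_alt, prior in scenarios:
    post = posterior_null(p_r_h, p_r_alt, prior)
    print(f"P(R|H)={p_r_h}, P(R|not H)={p_r_alt}, P(H)={prior} -> P(H|R)={post:.3f}")
```

The same value of P(R | H) is compatible with quite different values of P(H | R), depending on quantities the Mechanical Mind never considers. So “If R, then probably not H” does not follow from “If H, then probably not R”.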
To sum up. The Small s Scientist believes that science involves thinking. The Mechanical Mind’s procedure is an unthinking reply to Lazy Larry’s standard argument that experimental results are “due to chance”. The Small s Scientist tries to think beyond that standard argument and finds many troubling aspects of the Mechanical Mind’s procedure. Here are the main points.
- The plausibility of the null-hypothesis of exactly equal population means cannot be taken for granted. Like every hypothesis it requires justification.
- The choice of a test statistic cannot be made automatically. Like every methodological choice it requires justification.
- The interpretation of the p-value as a measure of evidence against the “due to chance” argument requires assigning a probability statement to a single event. This is not possible under a frequentist conception of probability. Doing so while simultaneously holding a frequentist conception of probability means that the procedure is logically inconsistent. The Small s Scientist does not like logical inconsistency in scientific work.
- The p-value as a measure of evidence includes “evidence” not actually obtained. How can a “due to chance” explanation (as implausible as it often is) be discredited on the basis of evidence that was not obtained?
- The use of a criterion of .05 is unjustified, so even if we allow logical inconsistency in the interpretation of the p-value (i.e. assigning a probability statement to a single event), which a Small s Scientist does not, we still need a scientific justification of that criterion. The Mechanical Mind’s procedure does not provide such a justification.
- A large p-value does not entail that the results “are due to chance”. A p-value cannot be used to distinguish “chance” results from “non-chance” results. The underlying reasoning is invalid, and a Small s Scientist does not like invalid reasoning in scientific work.