"p-values are dangerous,
especially large, small, and in-between ones."
- Frank E. Harrell Jr., Prof. of Biostatistics and Department Chair, Vanderbilt University
No topic in Engineering Statistics is as poorly understood as the "p-value."
It is even more widely misused in other scientific endeavors, such as
psychology and even medicine. And the practitioners (we engineers,
psychologists and physicians) are not to blame - the
fault lies entirely with statistical educators who insist on teaching the
comparatively useless and outmoded concept of Hypothesis Testing, rather than
the utilitarian, and much more easily understood, concepts of parameter
estimation with confidence intervals.
The sad thing is that many scientific journals (even some well respected
ones) won't accept an article unless it has the obligatory p-value
to demonstrate that the result is "significant." Often what the result
actually is (the size of the improvement in pharmacological efficacy, for
example) is subsumed by all the folderol over its associated p-value. And
when the value of interest is reported, seldom is a confidence
interval provided that would tell the reader how strongly to believe what is
reported. We are letting the tail wag the dog here, and it is time to
re-order our intellectual priorities from being obsessed with meaningless
hypothesis tests to concentrating on the magnitude and believability of the
thing we are actually interested in.
Every engineer has suffered through Engineering Statistics 101, and learned that:
- statistics is inhumanly boring,
- statistics is memorizing stuff you could easily look up
since it's all cook-book anyway,
- a smaller p-value is better because a
p-value is the probability that the null hypothesis is true.
I hope to convince you that all of these are untrue. Two
observations of W. Edwards Deming come to mind:
- "Under the usual teaching, the trusting
student, to pass the course, must forsake all the scientific sense that he
has accumulated so far, and learn the book, mistakes and all."
- "Small wonder that students have trouble [with
statistical hypothesis testing]. They may be trying to think."
An interesting analogy to p-value misunderstanding comes from R.P. Carver:
"What is the probability of obtaining a dead person (D) given that the
person was hanged (H); that is, in symbol form, what is p(D|H)?
Obviously, it will be very high, perhaps .97 or higher.
"Now, let us reverse the question: What is the probability that a person has
been hanged (H) given that the person is dead (D); that is, what is p(H|D)?
This time the probability will undoubtedly be very low, perhaps .01 or lower.
"No one would be likely to make the mistake of substituting the first
estimate (.97) for the second (.01); that is, to accept .97 as the
probability that a person has been hanged given that the person is dead.
"Even though this seems to be an unlikely mistake, it is exactly the kind of
mistake that is made with the interpretation of statistical significance
testing - by analogy, calculated estimates of p(H|D) are interpreted as if
they were estimates of p(D|H), when they are clearly not the same." - R.P. Carver (1978)
Said differently: It is wrong to interpret a p-value as the
probability that the null hypothesis is true.
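
The gap between the two conditional probabilities in Carver's example can be made concrete with Bayes' rule, p(H|D) = p(D|H) p(H) / p(D). The sketch below is only an illustration: the value 0.97 comes from the quotation, while the two base rates (`P_H` and `P_D`) are made-up numbers chosen for the example, and the function name is hypothetical.

```python
def prob_h_given_d(p_d_given_h: float, p_h: float, p_d: float) -> float:
    """Bayes' rule: p(H|D) = p(D|H) * p(H) / p(D)."""
    if not 0.0 < p_d <= 1.0:
        raise ValueError("p_d must be in (0, 1]")
    return p_d_given_h * p_h / p_d


P_D_GIVEN_H = 0.97  # probability of death given hanging (from the quotation)
P_H = 0.0001        # ASSUMED base rate: fraction of people who are hanged
P_D = 0.01          # ASSUMED base rate: fraction of people who are dead

p_h_given_d = prob_h_given_d(P_D_GIVEN_H, P_H, P_D)
print(f"p(D|H) = {P_D_GIVEN_H}, p(H|D) = {p_h_given_d:.4f}")
```

With these assumed base rates the reversed probability is roughly a hundred times smaller than the original one. Nothing in p(D|H) alone tells you p(H|D); you also need the base rates, and a p-value does not supply them.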
Well, if it doesn't mean that, what does it mean? Answer: Not much.
In classical hypothesis testing you must declare beforehand what
"significance level" you require. By convention (and not by celestial
edict) that is 0.05, which is to say we would expect only 5 in 100 outcomes
to be as extreme as what is observed, IF the null hypothesis is true.
(The null hypothesis, H0, is what you don't want to be true. Is it any
wonder engineers hate statistics?) Suppose you observe a p-value of 0.001.
Since that is smaller than the pre-declared level of 0.05, you reject H0 at
the 5% level: data this extreme would be unlikely if H0 were true.
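
To see what the number actually measures, here is a minimal sketch of a two-sided z-test. It assumes the simplest textbook setting (normally distributed measurements with a known standard deviation); the function name and the data values are hypothetical, chosen so that the p-value lands near the 0.001 used in the text.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(sample_mean, mu0, sigma, n):
    """p-value of a two-sided z-test: P(|Z| >= |z_obs|), computed assuming H0 is true."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # significance level, declared before looking at the data

# Hypothetical data: 25 measurements averaging 10.66; H0 says the mean is 10.0;
# the standard deviation is ASSUMED known and equal to 1.0.
p = two_sided_p_value(sample_mean=10.66, mu0=10.0, sigma=1.0, n=25)
print(f"p = {p:.4f}, reject H0 at the {ALPHA:.0%} level: {p < ALPHA}")
```

Note that every step of the calculation assumes H0 is true. The result is a statement about the data given H0, not about H0 given the data.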
You cannot conclude that the probability that H0 is
true is 0.001, which is why you must pre-declare your desired significance
level. Wait! How can it possibly matter when I decide
on significance? Ah, but it does. If you want to
determine the probability that your alternative hypothesis, Ha, is true, then you must
become a Bayesian. Frequentist hypothesis testing is only useful for assessing
the behavior of what you don't think is true to begin with.
Now, why fool around with that kind of statistical double-talk?
Forget about p-values and hypothesis testing. Instead, estimate the
most likely value of what you are interested in (a parameter, or a
difference, say) and then compute its confidence interval.
If the interval includes zero, you would infer that the results could
have reasonably happened by chance, but in any event you will have a value
for something you are interested in and you will have a handle on how
seriously you should believe it.
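
As a rough illustration of the estimate-plus-interval approach, the sketch below computes a difference in means and an approximate 95% confidence interval. It is a sketch under stated assumptions, not a definitive recipe: it uses a normal approximation (a t-based interval would be somewhat wider for samples this small), and the function name and the measurements are invented for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(a, b, level=0.95):
    """Return (estimate, low, high) for mean(a) - mean(b).

    Uses a normal approximation; for small samples a t-based interval
    would be somewhat wider.
    """
    estimate = mean(a) - mean(b)
    std_err = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level=0.95
    return estimate, estimate - z * std_err, estimate + z * std_err


# Hypothetical strength measurements for a new and an old process
new = [52.1, 49.8, 53.4, 51.0, 50.7, 52.9, 51.6, 50.2]
old = [49.5, 48.9, 50.8, 49.1, 50.3, 48.4, 49.9, 49.7]

est, low, high = mean_difference_ci(new, old)
print(f"difference = {est:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
print("interval includes zero:", low <= 0.0 <= high)
```

The output reports the size of the improvement and how far it might plausibly be off, which is what the reader wanted to know in the first place.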
- William Edwards Deming (1900 - 1993) was an American statistician,
professor, author, lecturer, and consultant, perhaps best known for his work
helping rebuild Japanese industry after WWII through the application of
statistical methods, and thus changing the image of "made in Japan" from
meaning "cheap" to meaning "high quality."
- Deming, W.E. (1975) "On probability as a basis for action." American Statistician
- Carver, R.P. (1978) "The case against statistical testing." Harvard
Educational Review 48: 378-399.
- Ziliak and McCloskey (2008) The Cult of Statistical Significance:
How the Standard Error Costs Us Jobs, Justice, and Lives,
University of Michigan Press.
While I do not agree with the authors in their vilification of R.A. Fisher
(whom I believe they misunderstand) I do concur with their fundamental
thesis that hypothesis testing has become a cult.