
p-values

"p-values are dangerous, especially large, small, and in-between ones."

- Frank E. Harrell Jr., Prof. of Biostatistics and Department Chair, Vanderbilt University

 

No topic in Engineering Statistics is as poorly understood as "p-values."  They are even more widely misused in other scientific endeavors, such as psychology and even medicine.  And the practitioners (we engineers, psychologists, and physicians) are not to blame - the fault lies entirely with statistical educators who insist on teaching the comparatively useless and outmoded concept of Hypothesis Testing, rather than the utilitarian, and much more easily understood, concepts of parameter estimation with confidence intervals.

The sad thing is that many scientific journals (even some well respected ones) won't accept an article unless it has the obligatory p-value to demonstrate that the result is "significant."  Often what the result actually is (the size of the improvement in pharmacological efficacy, for example) is subsumed by all the folderol over its associated p-value.  And when the value of interest is reported, seldom is a confidence interval provided that would tell the reader how strongly to believe what is reported. 

We are letting the tail wag the dog here, and it is time to re-order our intellectual priorities from being obsessed with meaningless hypothesis tests to concentrating on the magnitude and believability of the thing we are actually interested in.

 

Every engineer has suffered through Engineering Statistics 101, and learned several things:

I hope to convince you that all of these are untrue.  Two observations of W. Edwards Deming come to mind:

An interesting analog to p-value misunderstanding is R.P. Carver's 1978 parable:

"What is the probability of obtaining a dead person (D) given that the person was hanged (H); that is, in symbol form, what is p(D|H)?  Obviously, it will be very high, perhaps .97 or higher.

"Now, let us reverse the question: What is the probability that a person has been hanged (H) given that the person is dead (D); that is, what is p(H|D)?  This time the probability will undoubtedly be very low, perhaps .01 or lower. 

"No one would be likely to make the mistake of substituting the first estimate (.97) for the second (.01); that is, to accept .97 as the probability that a person has been hanged given that the person is dead.

"Even thought this seems to be an unlikely mistake, it is exactly the kind of mistake that is made with the interpretation of statistical significance testing - by analogy, calculated estimates of p(H|D) are interpreted as if they were estimates of p(D|H), when they are clearly not the same." - (Carver 1978)

Said differently: It is wrong to interpret a p-value as the probability that the null hypothesis is true. 
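
The two probabilities in Carver's parable are connected by Bayes' theorem, and a few lines of Python make the asymmetry concrete.  The base rates below are invented for illustration (the parable supplies only the 0.97), so treat this as a sketch of the arithmetic, not as data.

    # Carver's parable via Bayes' theorem.  Base rates are made-up illustrative numbers.
    p_dead_given_hanged = 0.97   # P(D|H), as given in the parable
    p_hanged = 0.0001            # hypothetical base rate of being hanged
    p_dead = 0.01                # hypothetical base rate of being dead

    # Bayes' theorem: P(H|D) = P(D|H) * P(H) / P(D)
    p_hanged_given_dead = p_dead_given_hanged * p_hanged / p_dead

    print(f"P(D|H) = {p_dead_given_hanged:.2f}")     # 0.97
    print(f"P(H|D) = {p_hanged_given_dead:.4f}")     # about 0.01 - not remotely 0.97

The two conditional probabilities differ by a factor of a hundred because they answer different questions, which is exactly why a p-value (an analog of P(D|H)) cannot be recycled as the probability that the hypothesis is true.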

Well, if it doesn't mean that, what does it mean?  Answer: Not much.  In classical hypothesis testing you must declare beforehand what "significance level" you require.  By convention (and not by celestial edict) that is 0.05, which is to say you will call a result "significant" only if it falls among the 5 in 100 most extreme outcomes that would be expected IF the null hypothesis were true.  (The null hypothesis, H0, is what you don't want to be true. - Is it any wonder engineers hate statistics?)  So you observe a p-value of 0.001.  Since that is smaller than your declared 0.05, you reject H0 at the 5% level - which says only that data this extreme would be rare IF H0 were true.
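
To see what the 5% convention does (and does not) control, here is a minimal simulation sketch, assuming NumPy and SciPy are available.  Both samples are drawn from the same population, so the null hypothesis is true by construction - and still about 5 in 100 t-tests come out "significant."

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_experiments = 10_000
    p_values = np.empty(n_experiments)

    for i in range(n_experiments):
        # Two samples from the SAME population, so H0 is true by construction.
        a = rng.normal(loc=10.0, scale=2.0, size=30)
        b = rng.normal(loc=10.0, scale=2.0, size=30)
        p_values[i] = stats.ttest_ind(a, b).pvalue

    print(f"Fraction of p-values below 0.05: {np.mean(p_values < 0.05):.3f}")  # about 0.05

That 5% is a long-run false-alarm rate when H0 is true; it says nothing about how probable H0 is in your particular experiment.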

You cannot conclude that the probability that H0 is true is 0.001, which is why you must pre-declare your desired significance level.  WAIT!  How can it possibly matter when I decide on significance?!  Ah!  But it does.  If you want to determine the probability that your alternative hypothesis, Ha, is true, then you must become a Bayesian.  Frequentist hypothesis testing is only useful for assessing the behavior of what you don't think is true to begin with.
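
Here is a rough illustration of why the prior matters, with admittedly made-up numbers (a 90% prior that H0 is true, and a modest real effect when it is false).  Among the experiments that clear the p < 0.05 bar, the fraction in which H0 was actually true is nowhere near 0.05:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_experiments = 20_000
    prior_h0 = 0.9      # assumed: 9 of 10 hypotheses tested are truly null
    true_effect = 0.5   # assumed shift (in standard deviations) when Ha holds
    n = 30              # observations per group

    h0_true = rng.random(n_experiments) < prior_h0
    significant = np.empty(n_experiments, dtype=bool)

    for i in range(n_experiments):
        shift = 0.0 if h0_true[i] else true_effect
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(shift, 1.0, n)
        significant[i] = stats.ttest_ind(a, b).pvalue < 0.05

    # Among the "significant" results, how often was H0 actually true?
    print(f"P(H0 true | p < 0.05) is roughly {h0_true[significant].mean():.2f}")

With these assumptions the answer comes out at roughly one in two - change the prior and the answer changes with it, which is the Bayesian's point.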

Now, why fool around with that kind of statistical double-talk?  Forget about p-values and hypothesis testing.  Instead, estimate the most likely value of what you are interested in (a parameter, or a difference, say) and then compute its confidence interval.

If the interval includes zero, you would infer that the results could have reasonably happened by chance, but in any event you will have a value for something you are interested in and you will have a handle on how seriously you should believe it.
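
As a sketch of that estimate-plus-interval approach (with hypothetical measurements and Welch's approximation for the degrees of freedom - substitute your own data):

    import numpy as np
    from scipy import stats

    # Hypothetical measurements from an old and a new process.
    old = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7])
    new = np.array([10.6, 10.4, 10.9, 10.5, 10.3, 10.8, 10.7, 10.2])

    diff = new.mean() - old.mean()          # the quantity we actually care about
    var_new = new.var(ddof=1) / len(new)
    var_old = old.var(ddof=1) / len(old)
    se = np.sqrt(var_new + var_old)         # standard error of the difference

    # Welch's approximation for the degrees of freedom
    df = (var_new + var_old) ** 2 / (
        var_new ** 2 / (len(new) - 1) + var_old ** 2 / (len(old) - 1))

    lo, hi = stats.t.interval(0.95, df, loc=diff, scale=se)
    print(f"Estimated improvement: {diff:.2f}  (95% confidence interval: {lo:.2f} to {hi:.2f})")

The reader sees the size of the improvement and how precisely it is known, which is far more informative than a bare "p < 0.05."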

Notes: