Noah Smith, a popular economics blogger, recently posted a rebuttal to criticism of the use of p-values in hypothesis testing. While he makes a few good points about why p-values and significance testing have value, I think his post fails to address a couple of major issues.

First, he states that good science involves replication of results. This is absolutely true, and is, in my opinion, the best antidote for many of the issues related to significance testing. But from my experience in academia (I was an engineering grad student from 2003 to 2008), the real problem isn’t a lack of good scientists; it’s the poor system of incentives. This extends from the management of journals to the peer review process to the tenure system itself.

Because journals are reluctant to publish negative (non-significant) results, if dozens of independent groups perform similar studies but only one of them shows significance, that single positive may be the only one published. In fact, the groups that found non-significance will probably not even attempt to publish their work, and no one will have any reason to suspect that the lone positive result is false. In this case, no individual researcher has to do anything wrong for the field as a whole to reach a bad conclusion.
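To put a rough number on this, here is a minimal Python sketch. It assumes the studies are independent, that there is truly no effect, and that each uses a significance threshold of 0.05. The function name and the study counts are just illustrative choices, not taken from any real field.

```python
def prob_at_least_one_false_positive(n_studies, alpha=0.05):
    """Chance that at least one of n independent studies of a
    nonexistent effect comes out "significant" at level alpha."""
    # Each study avoids a false positive with probability (1 - alpha),
    # so all n avoid one with probability (1 - alpha) ** n.
    return 1.0 - (1.0 - alpha) ** n_studies


# Hypothetical numbers of groups running similar studies.
for n in (1, 5, 20, 50):
    p = prob_at_least_one_false_positive(n)
    print(f"{n:3d} studies: P(at least one significant result) = {p:.3f}")
```

Under these assumptions, with 20 groups there is roughly a 64% chance that someone finds a publishable "effect" where none exists.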

Also, the tenure system requires that professors continually publish papers in respected journals, which requires doing original work and finding interesting, previously unknown effects. Replicating studies that others have already accepted as legitimate (whether your own or not) gets you no closer to tenure.

The other major problem with p-values is the way they’re interpreted. The common perception is that a p-value of 0.05 means there’s a 95% chance the effect is real (non-random). But the p-value actually represents p(x|h0), where x is the data (strictly, data at least as extreme as what was observed) and h0 is the null hypothesis. What the researcher wants to know is p(h0|x). The first value (what you have) tells you the probability of observing data like yours, assuming that the null hypothesis is true. But you want to know the probability of the null hypothesis, given the data.

Bayes’ theorem could be used to convert from the term you have to the one you want if you knew p(h0), the prior probability of the null hypothesis, and p(x), the overall probability of the data. Unfortunately, there’s generally no objective way to find these values. However, this paper does a nice job of setting bounds on the value of p(h0|x), depending on the form of the distribution on the data. An interesting result from this work is that for many types of tests, simply subtracting 1 from the t-stat will give you a decent approximation.
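To see how much the answer depends on what you assume, here is a small Python sketch of the Bayes’ theorem conversion. It is a simplification: it conditions on the event "the test came out significant" rather than on the exact data, and the prior probability of the null and the statistical power are made-up inputs, not quantities you could actually look up. It is also not the bounding method from the paper mentioned above.

```python
def posterior_null_given_significant(prior_null, alpha=0.05, power=0.8):
    """P(h0 | significant result) via Bayes' theorem.

    prior_null: assumed prior probability that the null is true.
    alpha:      P(significant | h0), the false positive rate.
    power:      assumed P(significant | h0 is false).
    """
    # Joint probability of "null true" and "significant".
    false_pos = alpha * prior_null
    # Joint probability of "null false" and "significant".
    true_pos = power * (1.0 - prior_null)
    return false_pos / (false_pos + true_pos)


# Hypothetical priors: a coin-flip hypothesis vs. a long shot.
for prior in (0.5, 0.9):
    post = posterior_null_given_significant(prior)
    print(f"prior P(h0) = {prior:.1f} -> P(h0 | significant) = {post:.3f}")
```

With a 50/50 prior, the chance the null is still true after a significant result is about 6%, close to the naive reading. With a 90% prior on the null, it is 36%, which is nowhere near "a 95% chance the effect is real."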

Kevin, saw this blogger link from your twitter page. Have you seen this example: http://andrewgelman.com/wp-content/uploads/2015/07/morris_example.pdf

I saw it on this post: http://goo.gl/rX77ch

-Barret

Thanks for posting, I hadn’t seen that. Good example of a situation where an unexpected result (based on the prior) gives you a misleading p-value, especially with limited observations.