Who says statistics aren’t funny? Watch top scientists try to define ‘P Value.’

This video was taken by Christie Aschwanden, the science editor for the statistical analysis site fivethirtyeight.com, while attending the METRICS conference at Stanford University. The conference brought together many of the world’s top experts on meta-science, or the study of studies.
She asked these top statisticians to define ‘P-Value’—arguably the single most powerful statistic in all of published research. Each time she asked the question, the brains of these folks started to heat up, the gears started clanging, and more often than not, they broke down laughing.
They could not do it. It’s hilarious to watch.
Steve Goodman, M.D., MHS, Ph.D., Associate Dean of Clinical and Translational Research and Professor of Medicine and Health Research and Policy at Stanford University, said on camera: “Well, I’ve actually spent my entire career on the definition of ‘P-Value,’ but I cannot tell you what it means, and almost nobody can.”
If these top scientists struggle to define ‘P-Value,’ what hope does the rank-and-file surgeon have?
Ticket to Publish
For the record, the definition of ‘P-Value’ is: “The probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed.”
In plain English: the P-Value is your answer to the devil’s advocate. When the devil’s advocate says your treatment is no better than the control, the P-Value tells you how often chance alone would produce a difference as large as the one you saw.
But in research’s back alleys ‘P-Value’ is something else. It’s your ticket to be published in a peer-reviewed journal.
‘P-Value’ and Clinically Relevant Information: Two Ships Passing in the Night
So you want to be published in a peer-reviewed journal? OK. Your ‘P-Value’ needs to be <0.05. If it is higher, you’re probably toast.
A ‘P-Value’ of <0.05 means that if your treatment truly had no effect, random sampling error alone would produce the observed difference, or a larger one, in less than 5% of studies.
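That definition is easier to feel than to recite, so here is a toy illustration in Python (all outcome numbers are invented for the example): a permutation test asks how often randomly shuffled group labels reproduce a difference at least as large as the observed one, and that frequency is the p-value.

```python
import random

random.seed(42)

# Hypothetical outcome scores for a treatment group and a control group.
treatment = [14.2, 15.1, 13.8, 16.0, 15.5, 14.9, 15.8, 16.2]
control   = [13.1, 14.0, 13.5, 14.2, 13.8, 14.5, 13.2, 14.1]

observed_diff = sum(treatment) / len(treatment) - sum(control) / len(control)

# Permutation test: if the treatment truly had no effect (the null
# hypothesis), the group labels are arbitrary, so we shuffle them and
# count how often random relabeling produces a difference this large.
pooled = treatment + control
n_treat = len(treatment)
n_permutations = 10_000
at_least_as_extreme = 0
for _ in range(n_permutations):
    random.shuffle(pooled)
    diff = (sum(pooled[:n_treat]) / n_treat
            - sum(pooled[n_treat:]) / (len(pooled) - n_treat))
    if diff >= observed_diff:
        at_least_as_extreme += 1

p_value = at_least_as_extreme / n_permutations
print(f"observed difference: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

With these made-up numbers the difference is large relative to the scatter, so the p-value comes out well under 0.05: chance alone almost never reproduces it.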
But don’t worry. There are many techniques which can ensure your ‘P-Value’ passes muster. Here’s a game you can play which illustrates this point.
In this game, you are asked to pick which political party—Republican or Democrat—is better for the economy. You have ten variables to choose from. Choose at least two.
There are 1,800 possible combinations of the ten data sets to choose from in this game, and 1,078 of them will give your study a ‘P-Value’ of less than 0.05, and therefore your study is PUBLISHABLE.
If at first you don’t succeed (your ‘P-Value’ is over 0.05 and is, therefore, UN-PUBLISHABLE), try a different combination of data sets. In no time at all, you’ll get the ‘P-Value’ of your dreams—and therefore can publish, get tenure, and so forth.
All you have to do is find the right data sets.
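The game’s trick is just the arithmetic of multiple comparisons: test enough variables and some will clear p < 0.05 by luck alone. A small simulation sketch (pure random noise with no real effects anywhere; the counts are illustrative, not the game’s actual data):

```python
import random
import statistics

random.seed(0)

def permutation_p_value(a, b, n_perm=500):
    """Two-sided permutation p-value for a difference in means."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)])
                   - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# 200 comparisons where the null hypothesis is TRUE by construction:
# both groups are drawn from the same distribution, so any "significant"
# result is a false positive.
n_tests = 200
false_positives = 0
for _ in range(n_tests):
    group_a = [random.gauss(0, 1) for _ in range(20)]
    group_b = [random.gauss(0, 1) for _ in range(20)]
    if permutation_p_value(group_a, group_b) < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} pure-noise tests reached p < 0.05")
```

Roughly 5% of these noise-only tests come up “significant,” which is why correcting for the number of comparisons (e.g., a Bonferroni adjustment) is the standard defense—and exactly what selective reporting sidesteps.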
Answers vs. Results
If ‘P-Value’ can be dialed in by selecting which data sets to use in a study, are we not essentially seeking results rather than clinically relevant answers?
One peer-reviewed journal has apparently pondered that question and decided to ban all Null Hypothesis Significance Testing Procedure measures—like ‘P-Value’—from its pages.
In February 2015 the editors of Basic and Applied Social Psychology announced that they would no longer publish ‘P-Values.’
The editors, David Trafimow, Ph.D. and Michael Marks, Ph.D. (Dr. Trafimow is Professor of Psychology and Dr. Marks is Associate Professor of Psychology at New Mexico State University) wrote:
“Confidence intervals suffer from an inverse inference problem that is not very different from that suffered by the Null Hypothesis Significance Testing Procedure (NHSTP—such as p-values, t-values or F-values, statements about “significant” differences or lack thereof, and so on). In the NHSTP, the problem is in traversing the distance from the probability of the finding, given the null hypothesis, to the probability of the null hypothesis, given the finding. Regarding confidence intervals, the problem is that, for example, a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval. Rather, it means merely that if an infinite number of samples were taken and confidence intervals computed, 95% of the confidence intervals would capture the population parameter. Analogous to how the NHSTP fails to provide the probability of the null hypothesis, which is needed to provide a strong case for rejecting it, confidence intervals do not provide a strong case for concluding that the population parameter of interest is likely to be within the stated interval. Therefore, confidence intervals also are banned from BASP (Basic and Applied Social Psychology).”
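The editors’ point about coverage can be checked numerically. In this sketch (with a made-up population), roughly 95% of repeatedly computed intervals capture the true mean; any single interval either contains the parameter or it doesn’t, which is not the same as a 95% probability statement about that one interval.

```python
import random
import statistics

random.seed(1)

TRUE_MEAN = 10.0  # the population parameter, known only to the simulation
SAMPLE_SIZE, TRIALS = 30, 1000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(SAMPLE_SIZE)]
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / SAMPLE_SIZE ** 0.5
    # Normal-approximation 95% interval: mean +/- 1.96 standard errors.
    low, high = mean - 1.96 * std_err, mean + 1.96 * std_err
    if low <= TRUE_MEAN <= high:
        covered += 1

coverage = covered / TRIALS
print(f"coverage over {TRIALS} repeated samples: {coverage:.1%}")
```

The 95% figure describes the long-run behavior of the procedure, not the odds for any single published interval—the distinction the BASP editors were pressing.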
Garbage In, Garbage Out
What the BASP journal editors were saying, in effect, is ‘garbage in, garbage out.’ If the null hypothesis is garbage, a low ‘P-Value’ does not keep your study from being garbage. ‘P-Value’ is being used to give context and legitimacy to studies, but the context it offers is not helpful if the goal is to deliver new information and answers to clinical problems.
What works better? According to the editors of BASP, “Instead of p-values, the journal will require strong descriptive statistics, including effect sizes.”
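Effect sizes are straightforward to compute. A minimal sketch of Cohen’s d, the most common standardized effect size, using invented outcome scores for two hypothetical treatment groups:

```python
import statistics

# Hypothetical fusion-outcome scores for two treatment groups.
group_a = [72, 75, 78, 74, 80, 77, 73, 79]
group_b = [70, 71, 74, 69, 73, 72, 68, 74]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)

# Pooled standard deviation (equal group sizes).
pooled_sd = ((statistics.variance(group_a)
              + statistics.variance(group_b)) / 2) ** 0.5

# Cohen's d: the difference in means expressed in standard-deviation
# units -- a descriptive measure of HOW BIG the effect is, which a
# p-value alone never tells you.
cohens_d = (mean_a - mean_b) / pooled_sd
print(f"mean difference: {mean_a - mean_b:.2f}, Cohen's d: {cohens_d:.2f}")
```

Unlike a p-value, d reports the size of the difference (here well over one standard deviation), which is closer to the clinically relevant question.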
Brian Nosek, co-founder and executive director of the Center for Open Science said, in Christie Aschwanden’s excellent article titled ‘Science Isn’t Broken’: “Science operates as a procedure of uncertainty reduction. The goal is to get less wrong over time.”
So every clinical study has in it a temporary truth which itself is built on earlier research. This is an iterative process. Science works when everyone is standing on the shoulders of those who came before.
And those who do try to stand on the shoulders of those who came before learn that science is difficult. It takes time and many iterative steps by many investigators to get to ‘truth.’ (See the Fingerprint of God article from September 25, 2015.)
The InFuse Research Example
We think we saw this in action when we analyzed early InFuse BMP utilization data.
In the early years—2005 to 2009—the pace of clinical studies regarding the use of BMP2 accelerated dramatically. While The Spine Journal in its infamous June 2011 BMP issue focused on just 13 early studies, the reality is that there were hundreds of studies presented at various venues in those five years (see the table at the end of this article for a selection).
When we look at data from the 582,135 spine fusion cases during those years that had been coded as having used BMP2 and compare the complication rates to the 619,106 spine fusion cases that harvested the patient’s own bone for grafting instead of using BMP2, we think we can detect a change in surgeon behavior over those five years.
The following table presents U.S. procedure volumes for insertion of bone morphogenetic protein (ICD-9 code 84.52 and CPT code 22851) and the associated complication rates, first for patients under 65 and then for patients over 65. These data were processed by PearlDiver and collected from the Medicare Standard Analytical File, the Medicare Carrier File and the National Inpatient Sample.

Here’s the same analysis for bone graft harvesting (ICD-9 code 77.59).

What this data shows is that complication rates declined consistently over those early BMP years. For this analysis, PearlDiver’s analyst, Scott Ellison, looked at a wide range of complications.

Did complications decline because spine surgeons were changing their use of InFuse based on both personal experience and clinical papers? Or was the decline due to other factors?
We think the surgeons were walking up a learning curve by reading the studies, listening to podium talks and learning from their own experience.
The scientific truth of using BMP in patients was revealed through a pattern of iterative experiences and studies.
This is science as it is meant to work, and, if peer-review journal editors will allow it, it can work.
Dump the ‘P-Value’
Criticizing ‘P-Value’ is not really a very courageous stand to take these days. As the video at the top of this article illustrated, it’s already under serious challenge from many corners of the scientific community. The obvious question, really, is what to put in its place?
There are lots of suggestions in the statistics blogosphere—more descriptive statistics or Bayesian approaches, for example.
But it would be very interesting to see how orthopedic journal editors would tackle the shortcomings of the all-powerful ‘P-Value’ and what kinds of fresh ideas they would bring to represent and synthesize clinical evidence in their publications.
Answers, in other words, not just results.
Early BMP2 Studies:


