P Value: Is This Gold Standard for Statistical Validity Really Valid?
Majid Ali, M.D.
The P value is
considered the gold standard for establishing the
statistical validity of data presented in scientific
papers. From the 1970s through the 1990s, we used P
values to establish the statistical significance of
our data published in prestigious journals, including
JAMA (The Journal of The American Medical Association),
Lancet, American Heart Journal, American Journal of
Clinical Pathology, and others (with me as the lead
author). During these years, I often wondered why we
had to bother to calculate P values when the data
spoke elegantly for itself. However, the reviewers
and editors had to be satisfied, and P values did
just that.
Lies, Damn Lies, and Statistics
I do not know if
Samuel Clemens (Mark Twain) knew about P values when he
weighed in on the subject of statistics. But if he
did, the Old Man Mark was clearly on the mark. The P
value has been stripped of its exalted stature,
though it continues to be widely used to sell drugs
of dubious value.
In the 1920s, the
British statistician Ronald Fisher introduced the P
value as a preliminary way to see if evidence was
significant in the commonsense way, not as a
definitive test for statistical validity (as it is
now considered). Fisher’s approach was to run an
experiment and see if the results were consistent
with what random chance might produce, the simple
notion now designated as the null hypothesis. Next
came playing the devil's advocate: assuming that the
null hypothesis was in fact true, and calculating the
probability (the P value) of getting results at least
as extreme as those actually observed. The smaller
that probability, the harder it is to attribute the
results to mere chance.
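The null-hypothesis reasoning described above can be
made concrete with a small calculation. What follows
is only a minimal sketch, not Fisher's own procedure:
it assumes a hypothetical experiment of yes/no trials
(say, 60 improved patients out of 100) and a null
hypothesis that each trial succeeds by pure chance
with probability 0.5. The function name and the
numbers are my own illustrative choices.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, chance: float = 0.5) -> float:
    """Two-sided exact P value for a series of yes/no trials.

    Returns the probability, assuming the null hypothesis is true (each
    trial succeeds with probability `chance`), of seeing an outcome at
    least as unlikely as the one actually observed.
    """
    def prob(k: int) -> float:
        # Binomial probability of exactly k successes under the null.
        return comb(trials, k) * chance**k * (1 - chance) ** (trials - k)

    observed = prob(successes)
    # Add up every outcome that is no more likely than the observed one.
    # The tiny tolerance guards against floating-point ties.
    return sum(
        prob(k) for k in range(trials + 1) if prob(k) <= observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Hypothetical data: 60 successes in 100 trials, chance level 0.5.
    print(round(binomial_p_value(60, 100), 4))
```

With these made-up numbers the P value comes out just
above the conventional 0.05 cutoff, so sixty successes
in a hundred would narrowly miss "significance" even
though the data may look persuasive to the eye.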
The 'Gold Standard' of Statistical Validity Not Valid Anymore
On February 12,
2014, the journal Nature published a searing
indictment of P value methods that must be read by
all doctors who claim to practice evidence-based
medicine. Nature, most scientists in the world agree,
is the most highly esteemed science journal in the
world. Please brace yourself as you read the
following quotes from Nature’s article entitled
"Statistical Errors: P values, the 'gold standard'
of statistical validity, are not as reliable as many
scientists assume."
* In 2005,
epidemiologist John Ioannidis of Stanford University
in California suggested that most published findings
are false; since then, a string of high-profile
replication problems has forced scientists to
rethink how they evaluate results.
* "P values are
not doing their job, because they can't," says
Stephen Ziliak, an economist at Roosevelt University
in Chicago, Illinois, and a frequent critic of the
way statistics are used.
* "Change your
statistical philosophy and all of a sudden different
things become important," says Steven Goodman, a
physician and statistician at Stanford. "Then 'laws'
handed down from God are no longer handed down from
God. They're actually handed down to us by
ourselves, through the methodology we adopt."
* "What does it [P
value] all mean? One result is an abundance of
confusion about what the P value means."
* "Critics also
bemoan the way that P values can encourage muddled
thinking."
* "The P value was
never meant to be used the way it's used today."
* "Perhaps the
worst fallacy is the kind of self-deception for
which psychologist Uri Simonsohn of the University
of Pennsylvania and his colleagues have popularized
the term P-hacking; it is also known as
data-dredging, snooping, fishing,
significance-chasing and double-dipping.
"P-hacking," says Simonsohn, "is trying multiple
things until you get the desired result," even
unconsciously."
* "there are three
questions a scientist might want to ask after a
study: 'What is the evidence?' 'What should I
believe?' and 'What should I do?' One method cannot
answer all these questions, Goodman says: "The
numbers are where the scientific discussion should
start, not end."
* "These are
sticky concepts, but some statisticians have tried
to provide general rule-of-thumb conversions (see
'Probable cause'). According to one widely used
calculation, a P value of 0.01 corresponds to a
false-alarm probability of at least 11%, depending
on the underlying probability that there is a true
effect; a P value of 0.05 raises that chance to at
least 29%. So Motyl's finding [a P value of 0.01,
from the psychology study that opens the Nature
article] had a greater than one in ten chance of
being a false alarm. Likewise, the probability of
replicating his original result was not 99%, as most
would assume, but something closer to 73%, or only
50% if he wanted another 'very significant' result.
In other words, his inability to replicate the
result was about as surprising as if he had called
heads on a coin toss and it had come up tails."
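The percentages in the last quote can be reproduced
with a few lines of arithmetic. The sketch below
rests on assumptions of mine, not on anything the
Nature article spells out: that the "widely used
calculation" is the minimum Bayes factor bound
-e * p * ln(p) (due to Sellke, Bayarri and Berger),
that a true effect and no effect are equally likely
beforehand (50:50), and that the replication figures
assume the true effect is exactly as large as the one
first observed. The function names are illustrative.

```python
from math import e, log
from statistics import NormalDist


def min_false_alarm_probability(p: float, prior_true: float = 0.5) -> float:
    """Lower bound on the chance that a 'significant' finding is a false alarm.

    Uses the minimum Bayes factor -e * p * ln(p), which is valid only
    for P values below 1/e. `prior_true` is the assumed probability,
    before the study, that a real effect exists.
    """
    if not 0 < p < 1 / e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    # Most favourable weight of evidence against the null that p allows.
    min_bayes_factor = -e * p * log(p)
    prior_odds_null = (1 - prior_true) / prior_true
    posterior_odds_null = min_bayes_factor * prior_odds_null
    return posterior_odds_null / (1 + posterior_odds_null)


def replication_probability(p_original: float, alpha: float = 0.05) -> float:
    """Chance that an identical repeat study reaches two-sided P < alpha.

    Assumes the true effect equals the originally observed effect and
    counts only repeats that point in the same direction.
    """
    z = NormalDist()
    z_observed = z.inv_cdf(1 - p_original / 2)  # z-score behind the first result
    z_needed = z.inv_cdf(1 - alpha / 2)         # z-score the repeat must exceed
    return z.cdf(z_observed - z_needed)


if __name__ == "__main__":
    for p in (0.01, 0.05):
        print(p, round(min_false_alarm_probability(p), 3))
    print(round(replication_probability(0.01, alpha=0.05), 2))
    print(round(replication_probability(0.01, alpha=0.01), 2))
```

Under these assumptions the bounds come out near 11%
for P = 0.01 and 29% for P = 0.05, and the
replication chances near 73% and 50%, in line with
the figures quoted. A less generous prior (a true
effect being unlikely to begin with) pushes the
false-alarm probability higher still.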
Evidence-Based Medicine and Toxicities of Foods, Environments, and Thought
Next time you hear
about claims of "evidence-based medicine" by doctors
who limit their work to prescribing drugs, please do
not engage them in a philosophical discourse. Simply
suggest that they read this article or the Nature
article cited here.
