Discussions about statistical significance are not usually found in newspapers, but the Associated Press recently ran one about the results of a clinical trial involving a heart drug. Statistical significance concerns whether an effect measured in a study is plausibly “real” or could easily be a result of chance. For example, in the case of the heart drug study, the authors attempt to measure whether the drug reduces patients’ mortality by comparing the mortality of patients on the drug to that of people not on the drug. Statistical significance gauges how unlikely the difference they find (reduced mortality) would be if the drug had no effect at all, that is, if the difference were a fluke. As the article correctly states, “Significance is reflected in a calculation that produces something called a p‑value. Usually, if this produces a p‑value of less than 0.05, the study findings are considered significant. If not, the study has failed the test [i.e., the findings cannot be differentiated from random chance].”
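The heart drug study had a p‑value of .059, meaning that if the drug had no effect on mortality, a difference at least as large as the one the authors observed would still arise by chance about 5.9 percent of the time. Strictly speaking, that is not the same as being 94.1 percent confident that the lower mortality found for patients on the drug is real, although p‑values are often read that way. By the standard .05 p‑value criterion (often described as a 95 percent confidence level), the study’s findings are not considered statistically significant.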
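To make the calculation concrete, here is a minimal Python sketch of one common way to compute a p‑value for a difference in mortality rates, a pooled two-proportion z-test. The article does not say which test the trial used, so the method, the function name, and the patient counts below are all assumptions chosen for illustration, not the trial's data.

```python
import math


def two_proportion_p_value(deaths_a, n_a, deaths_b, n_b):
    """Two-sided p-value for a difference in death rates (pooled z-test).

    Illustrative helper, not the method used in the trial discussed above.
    """
    rate_a = deaths_a / n_a
    rate_b = deaths_b / n_b
    # Under the "no effect" hypothesis both groups share one death rate.
    pooled = (deaths_a + deaths_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_a - rate_b) / std_err
    # Probability that a standard normal variable is at least |z| from zero.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 10.0% mortality on the drug vs. 12.6% without it.
p_small = two_proportion_p_value(100, 1000, 126, 1000)
# The same death rates observed in a trial twice as large.
p_large = two_proportion_p_value(200, 2000, 252, 2000)

print(f"smaller trial p-value: {p_small:.4f}")
print(f"larger trial p-value:  {p_large:.4f}")
```

With these made-up numbers the smaller trial lands just above the .05 cutoff and the larger one falls below it, even though the measured death rates are identical. The cutoff responds to sample size as well as to the size of the effect.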
In all scientific research (including clinical trials), scientists must make a tradeoff between two kinds of error: accepting a false finding because it reaches statistical significance even though it is a statistical fluke (a false positive or Type I error), and rejecting a real finding because a study doesn’t reach statistical significance (a false negative or Type II error). For a given study design and sample size, reducing the probability of Type I errors increases the probability of Type II errors, and vice versa. In medicine, this tradeoff is between the possibility of ignoring a beneficial medical finding that does not reach statistical significance and endorsing a fruitless or harmful medical finding that does reach statistical significance.
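The tradeoff can be shown numerically. The sketch below assumes the simplest textbook setting, a one-sided z-test with a known standard deviation, and a hypothetical true effect of 0.2 standard deviations measured on 100 patients. None of these figures come from the heart drug study.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def type_ii_error_rate(alpha, effect_size, n):
    """Chance of missing a real effect (Type II error) in a one-sided z-test.

    alpha       -- accepted Type I error rate (the significance cutoff)
    effect_size -- assumed true effect, in standard deviation units
    n           -- number of observations
    """
    # The test statistic must exceed this value to count as significant.
    critical_z = STANDARD_NORMAL.inv_cdf(1 - alpha)
    # If the effect is real, the statistic is centered on effect_size * sqrt(n).
    return STANDARD_NORMAL.cdf(critical_z - effect_size * sqrt(n))


# Tightening the cutoff lowers Type I risk but raises Type II risk.
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error_rate(alpha, effect_size=0.2, n=100)
    print(f"Type I rate {alpha:.2f} -> Type II rate {beta:.2f}")
```

Holding the sample size fixed, each stricter cutoff makes it more likely that a real effect goes undetected. In this setup the only way to lower both error rates at once is to collect more data.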
The lead investigator of the heart drug study believes the study falls into the former category, a real benefit dismissed for missing the .05 threshold: “the drug in fact produced a real benefit and…a larger or longer‐lasting study could have reached statistical significance.” But the article also describes the opposite error, a study of another heart drug that “found a significant treatment effect for patients born in August but not July, obviously just a random fluctuation.”
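The birth-month result is less surprising than it sounds once one counts how many subgroups were available to test. As a rough sketch, assume the drug has no effect at all and that each of 12 birth-month subgroups is tested independently at the .05 level. Both the count of 12 and the independence are simplifying assumptions for illustration, not details reported in the article.

```python
def chance_of_spurious_finding(num_tests, alpha=0.05):
    """Probability that at least one of several independent tests comes out
    'significant' purely by chance when no real effect exists."""
    # Each test avoids a false positive with probability (1 - alpha).
    return 1 - (1 - alpha) ** num_tests


# One test per birth month (hypothetical setup).
print(f"{chance_of_spurious_finding(12):.2f}")  # about 0.46
```

Under these assumptions, a researcher who checks every birth month has nearly even odds of finding at least one "significant" month for a drug that does nothing.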
What should be done about these tradeoffs? The article correctly states that the traditional cutoff of .05 is arbitrary and should be abolished. Instead, studies should report the p‑value and other accompanying evidence to allow the reader to decide how to use the results of scientific and medical studies.
These are old issues. Van Doren discussed them more than 10 years ago in a review of The Cult of Statistical Significance by Stephen Ziliak and Deirdre McCloskey in the Cato Journal. The AP article echoes many of Ziliak and McCloskey’s points, and statistical significance and other seemingly scientific questions remain relevant to public policy and decision‐making. Many health and safety decisions are delegated to bureaucracies that allegedly use scientific methods to decide what products and practices to allow on the market. In reality, values enter into these decisions in a variety of ways, including questions about how large sample sizes should be, how the costs and benefits of decisions are weighed, and what level of statistical significance is accepted.
Policy debates often ignore the value questions inherent even in scientific research and fail to recognize that people with different values will come to their own conclusions based on the information available to them. Regulators and researchers should gather and disseminate this information without injecting their own values or rejecting findings based on an arbitrary level of statistical significance.