The Most Important Scientific Calculation You Never Do

In August of 2005, John Ioannidis published the essay “Why Most Published Research Findings Are False.”[1] He highlighted many fundamental causes of false findings. One of the most widespread is multiple testing error.

Consider this simple observation: if you test a single experimental hypothesis every working week for a year, there is a 91% chance that one or more of your findings over the course of the year is false.

How can that be true? Let’s assume you only accept an experimental finding if it has a 95% chance of being true (P<0.05†). Then the chance of finding a true result in every one of the 48 working weeks of the year is (95%)^48 = 8.526%. Conversely, the chance that you do not find a true result in every week (i.e., you accept one or more false findings) is 100% - 8.526% = 91.474%.
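If you want to check that arithmetic yourself, here is a minimal sketch in Python; the 48 working weeks and the 95% per-test confidence are taken from the example above.

```python
# Chance of at least one false finding over a year of weekly hypothesis tests.
# Assumes one test per week, 48 working weeks, and the simplified reading of
# P < 0.05 as "95% chance the accepted finding is true" (see the footnote).

confidence_per_test = 0.95   # probability a single accepted finding is true
weeks = 48                   # working weeks in a year

p_all_true = confidence_per_test ** weeks   # every weekly finding is true
p_any_false = 1 - p_all_true                # one or more false findings

print(f"P(all {weeks} findings true):    {p_all_true:.3%}")    # ~8.5%
print(f"P(at least one false finding): {p_any_false:.3%}")     # ~91.5%
```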

As a full-time scientist you test a lot of hypotheses—a lot more than one per week. So your chances of false findings are probably much higher than 91% each year. Every sample or condition you compare in an experiment is a hypothesis test. If you test tens or hundreds or even thousands of samples in a single experiment, it’s all but guaranteed that you’ll find many false positive results, not just every year, but in every experiment.
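To see how fast this scales, here is a small illustrative calculation. It assumes independent tests, a 5% per-test false positive rate, and, as a worst case, that none of the tested samples is truly different; the test counts are arbitrary examples.

```python
# How false positives accumulate as the number of tests grows.
# Illustrative assumptions: independent tests, 5% false positive rate per test,
# and no truly-improved samples at all (worst case for false discoveries).

alpha = 0.05

for n_tests in (10, 96, 1_000, 10_000):
    expected_fp = n_tests * alpha                    # expected false positives
    p_at_least_one = 1 - (1 - alpha) ** n_tests      # chance of one or more
    print(f"{n_tests:>6} tests: ~{expected_fp:,.0f} expected false positives, "
          f"P(at least one) = {p_at_least_one:.1%}")
```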

This issue of multiple testing error is well known in statistics. When you do something enough times, even low-probability outcomes will occur. This can have drastic consequences for your interpretation of scientific results. If you don't account for it, you could waste months of your own and other people's time.

For example, suppose you are the head of cell line development. Your company needs to begin clinical trials of a therapeutic antibody in 3 months. Your job is to develop a new cell line that meets the required titer and quality for clinical manufacturing. The head of manufacturing comes to you and says: “I can take 6 candidate cell lines into process development. I need your assurance that at least 4 of those cell lines are truly improved in titer, because 3 out of 4 will fail for other reasons like quality and stability. If we do not meet our production titer in 3 months, it will mean a 3-6 month delay to clinical trials and cost the company an additional $10M in operating expenses. So if you can’t assure me 4 improved cell lines, I’m going to commit my team to process improvements on the existing cell line instead of developing your new cell lines.”

You just ran a screen of 96 candidate cell lines on your microtiter plate assay. The results are in Figure 1 below. The top 6 cell lines in your screen passed with P-values less than 0.027, so you’re feeling confident. You decide to give her the 6 cell lines and assure her that 5, if not all 6, are truly improved. Did you make the right decision?


Figure 1: Cell line selection based on P-value. The blue circles represent the titers measured for 96 cell lines in a screening assay. Unimproved cell lines would be expected to perform at an assay value of 0. Open circles were rejected as unimproved cell lines. Solid circles are potentially improved cell lines, based on passing the P<0.027 acceptance threshold. Data for this plot comes from this spreadsheet.


No, you didn’t. As you can see from Figure 2 below, half of your cell lines are actually false positives, meaning you only gave your colleague 3 improved ones. You did not meet her requirement of 4 improved cell lines.


Figure 2: Cell line selection based on FDR. The data are the same as in Figure 1, but now the actual cell line characteristics are revealed. Red circles represent cell lines that are truly improved. At the P<0.027 threshold, 50% of the accepted cell lines are false positives; thus the actual FDR is 50%. The Q-value of 49% is calculated only from P-values, using no knowledge of the actual true or false positives. It suggests that 49% of the accepted cell lines are false positives. Thus Q-values provide an excellent estimate of the FDR. Data for this plot comes from this spreadsheet.
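The arithmetic behind those numbers is simple. Here is a minimal sketch using the counts from the scenario above (6 accepted lines, 3 of them false positives, and the head of manufacturing's requirement of 4 truly improved lines):

```python
# Realized false discovery rate for the 6 accepted cell lines in the scenario,
# using the hidden truth revealed in Figure 2 (3 of the 6 are false positives).

accepted = 6                 # cell lines passing the P < 0.027 threshold
false_positives = 3          # accepted lines that are not truly improved
required_true = 4            # the head of manufacturing's requirement

fdr = false_positives / accepted                # 0.5, i.e. 50%
truly_improved = accepted - false_positives     # 3

print(f"Realized FDR: {fdr:.0%}")
print(f"Truly improved lines delivered: {truly_improved} "
      f"({'meets' if truly_improved >= required_true else 'misses'} the requirement of {required_true})")
```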


How could you have known you were wrong? What could you have done better? You could have calculated the False Discovery Rate (FDR), which tells you what fraction of your accepted tests are false. That’s really valuable because it gives you a numerical estimate of how enriched your accepted discoveries are for true findings. Better yet, you can use the FDR value to set a “prior probability” of true findings in follow-on confirmation experiments. This allows you to design those experiments more efficiently.
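For example, treating 1 − FDR as the prior probability that any one accepted candidate is truly improved (using the 49% estimate from Figure 2), you can ask how many candidates you would need to carry forward to have good odds of delivering the 4 true positives the scenario requires. A minimal sketch; the candidate counts below are illustrative assumptions.

```python
from math import comb

# Probability of getting at least `needed` truly-improved lines out of
# `carried` candidates, when each candidate is truly improved with
# probability `prior` (taken here as 1 - FDR).

def p_at_least(needed: int, carried: int, prior: float) -> float:
    return sum(
        comb(carried, k) * prior**k * (1 - prior)**(carried - k)
        for k in range(needed, carried + 1)
    )

prior = 1 - 0.49   # 1 - Q-value estimated from the screen
needed = 4         # truly improved lines required downstream

for carried in (6, 8, 10, 12):   # illustrative candidate counts
    print(f"carry {carried:2d} candidates -> "
          f"P(>= {needed} truly improved) = {p_at_least(needed, carried, prior):.0%}")
```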

One way to calculate FDR is the Benjamini-Hochberg (BH) method.[2] The BH method estimates the FDR for your experiment using only the P-values you calculated—this estimate is called a “Q-value.” As you can see from Figure 2, the Q-value provides an excellent approximation of the actual FDR. Had you calculated a Q-value, you would have told the head of manufacturing that your cell line candidates were not good enough to meet her needs.
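If you want to compute Q-values yourself, here is a minimal sketch of the BH procedure in Python (assuming NumPy and SciPy are available), applied to a simulated screen in the same spirit as Figures 1 and 2. The number of truly improved lines, the effect size, and the noise level are all illustrative assumptions, not the data behind the figures.

```python
import numpy as np
from scipy.stats import norm

def bh_qvalues(p_values) -> np.ndarray:
    """Benjamini-Hochberg Q-values: the estimated FDR if the acceptance
    threshold were set at each P-value."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)            # p_(i) * m / i
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

# --- Simulated screen (illustrative assumptions only) ---
rng = np.random.default_rng(1)
n_lines, n_improved = 96, 48
truth = np.arange(n_lines) < n_improved            # which lines are truly improved
signal = np.where(truth, 2.5, 0.0)                 # assumed mean shift of improved lines
measured = signal + rng.normal(0.0, 1.0, n_lines)  # assumed unit assay noise

p_values = norm.sf(measured)                       # one-sided z-test P-values vs. 0
q_values = bh_qvalues(p_values)

threshold = 0.027
accepted = p_values < threshold
actual_fdr = (~truth & accepted).sum() / accepted.sum()   # needs the hidden truth
estimated_fdr = q_values[accepted].max()                  # Q-value at the threshold

print(f"accepted {accepted.sum()} lines | actual FDR {actual_fdr:.0%} | "
      f"BH Q-value estimate {estimated_fdr:.0%}")
```

The Q-value of the last accepted test estimates the fraction of false positives in the whole accepted set, using nothing but the P-values, which is exactly the comparison shown in Figure 2.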

False Discovery Rate estimates are an effective solution to the multiple testing problem. There is no downside to using them; they can be very simple to calculate (we show you how to do that here[3]); and the FDR is dramatically more informative than a P-value. Moreover, FDR is much more aligned with your intuition. If someone gives you 10 apples and says “I think 90% of these are good,” then you’d expect only one bad apple in the lot. That’s exactly what FDR tells you.

Should you want to get more sophisticated with your FDR calculations, there are some additional techniques you can apply beyond Benjamini-Hochberg.[4] One is a correction for when your tests have dependencies between them.[5,6] Another is a powerful weighting algorithm[7] that uses additional information from your experiments to minimize FDR while gaining statistical power (fewer false negatives). You might also apply a different approach to multiple testing called Family-Wise Error Rate; we discuss that here.[8] But honestly, if all you ever use is the Benjamini-Hochberg estimate of FDR, you’ll still be much better off than making decisions based on raw P-values.
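As a practical note, you don't have to implement these corrections by hand. Here is a minimal sketch assuming the statsmodels Python package is available: its multipletests function implements both the standard BH procedure and the more conservative Benjamini-Yekutieli variant for dependent tests. The P-values below are placeholders.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.008, 0.020, 0.041, 0.12, 0.47])  # placeholder P-values

# Standard Benjamini-Hochberg (assumes independent or positively dependent tests)
reject_bh, q_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

# Benjamini-Yekutieli: more conservative, valid under arbitrary dependency
reject_by, q_by, _, _ = multipletests(p_values, alpha=0.05, method="fdr_by")

for p, qb, qy in zip(p_values, q_bh, q_by):
    print(f"P = {p:.3f} -> BH Q = {qb:.3f}, BY Q = {qy:.3f}")
```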

Finally, we note that FDR estimates are only a means of adjusting acceptance thresholds to minimize false discoveries. Moving those thresholds will always increase your false negatives in order to lower false discoveries, or vice versa. If you want both fewer false discoveries and fewer false negatives at the same time, you need to improve your experimental design. One good approach is tiered experiments (a form of replication). Another is to perform variance analysis and reduce your experimental error.  We’ll talk about both of them in future posts.

Notes

†Accepting results at a P-value threshold of P<0.05 technically means you have a 5% probability of accepting a false positive result (when the null hypothesis is true) in that one test. You therefore have a 95% probability of rejecting a false result, and you conclude that any result that does pass the threshold is true with 95% probability.
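To see where that 5% comes from, here is a small simulation (assuming SciPy is available; the sample size and number of repetitions are arbitrary): when there is truly no difference between two groups, about 5% of tests land below P<0.05 by chance alone.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_experiments = 10_000
false_positives = 0

# Repeatedly compare two samples drawn from the *same* distribution (null is true).
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, 12)   # illustrative sample size
    b = rng.normal(0.0, 1.0, 12)
    _, p = ttest_ind(a, b)
    false_positives += p < 0.05

print(f"Fraction of null tests with P < 0.05: {false_positives / n_experiments:.1%}")  # ~5%
```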

Citations

[1] Ioannidis, J.P.A. (2005) Why Most Published Research Findings Are False. PLoS Medicine. 2(8): e124. https://doi.org/10.1371/journal.pmed.0020124

[2] Benjamini, Y. & Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological). 57(1): 289-300.

[3] Gardner, T.S. (2017 Oct) False Discovery Rate. Retrieved from: https://riffyn.com/riffyn-blog/2017/10/29/false-discovery-rate

[4] Devlin, B., et al. (2003) Statistical Genetics: False discovery or missed discovery? Heredity. 91: 537–538. https://doi.org/10.1038/sj.hdy.6800370

[5] Winkler, A.M. (2011 Sep) False Discovery Rate: Corrected & Adjusted P-value. Retrieved from: https://brainder.org/2011/09/05/fdr-corrected-fdr-adjusted-p-values/

[6] Benjamini, Y. & Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics. 29(4): 1165-1188.

[7] Ignatiadis, N., et al. (2016) Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods. 13: 577–580. https://doi.org/10.1038/nmeth.3885

[8] Gardner, T.S. (2017 Oct) Family-Wise Error Rate. Retrieved from: https://riffyn.com/riffyn-blog/2017/10/29/family-wise-error-rate

Timothy Gardner