We don’t talk enough about careful scientific process when it comes to setting up experiments (which in startup land often means launching a new feature for a web product). In particular, we rarely discuss our prior belief about how likely a given feature is to drive a certain behavior.

There’s a provocatively titled paper by John Ioannidis claiming the majority of published scientific results are actually false. It hinges on (among other issues) the prior likelihood of a tested hypothesis being true. I think the math there is instructive to think about.

If a p-value indicates the probability of seeing an experimental outcome at least as extreme as ours given a false hypothesis, then by accepting only results with a p-value of .05 or less we should expect to make mistakes no more than 5% of the time.
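As a quick sanity check on that 5% figure, here's a small simulation (my own illustration, using only Python's standard library and a simple z-test; the setup is an assumption, not part of the original argument) that runs many experiments where the hypothesis is false and counts how often we'd still see p < .05:

```python
import math
import random
import statistics

def p_value(sample):
    """Two-sided p-value for a z-test of 'mean != 0', assuming the data
    comes from a distribution with known standard deviation 1."""
    z = statistics.mean(sample) * math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(42)

# Simulate 20,000 experiments where the hypothesis is false:
# the data really is pure noise centered on zero.
trials = 20_000
false_positives = 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(30)]
    if p_value(sample) < 0.05:
        false_positives += 1

# With a .05 cutoff, roughly 5% of these null experiments still "succeed".
print(false_positives / trials)
```

Even though every one of these experiments is testing a false hypothesis, about one in twenty clears the p < .05 bar anyway.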

However, when we zoom out and look at many tests like this as a whole, we have to ask ourselves about the overall probability of a given hypothesis being true. If we say scientists (or product managers!) are only generating/testing correct hypotheses 10% of the time, our p-value means something different. We know from Bayes’ Theorem:

$$P(F \mid +) = \frac{P(+ \mid F)\,P(F)}{P(+ \mid F)\,P(F) + \left(1 - P(- \mid T)\right)P(T)}$$

Where $P(F \mid +)$ is the probability of a false hypothesis given a positive test result (p-value < .05).

Since:

  • $P(+ \mid F)$ is the probability of getting a positive result given a false hypothesis (the type I error rate)
  • $P(F)$ is the overall probability of a false hypothesis
  • $P(T)$ is the overall probability of a true hypothesis ($1 - P(F)$, the inverse probability of a false hypothesis)
  • $P(- \mid T)$ is the probability of getting a negative result where there is a real relationship (one minus the power).

And since we often set our type I error rate to .05 and our power to .80, if we claim 10% of theories that get tested are true, we get:

$$P(F \mid +) = \frac{.05 \times .9}{.05 \times .9 + .8 \times .1} = \frac{.045}{.125} = .36$$
Which tells us that if the outside chance of a hypothesis being right is 10%, the probability of a result being false even though the p-value is less than .05 is actually 36%. This is way higher than the type I error rate we claim to test at, and indicates we may make mistakes interpreting the results of more than 1 in 3 experiments.
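To play with that relationship, here's a short Python sketch of the formula above (the function name and default parameters are my own choices) that shows how the false discovery rate moves with the prior:

```python
def false_discovery_rate(p_true, alpha=0.05, power=0.80):
    """Probability that a 'significant' result (p < alpha) is actually false,
    given the prior probability p_true that a tested hypothesis is true."""
    p_false = 1 - p_true
    positives_from_false = alpha * p_false   # type I errors among false hypotheses
    positives_from_true = power * p_true     # real effects we successfully detect
    return positives_from_false / (positives_from_false + positives_from_true)

# If only 10% of tested hypotheses are true, 36% of "wins" are false:
print(round(false_discovery_rate(0.10), 3))  # → 0.36

# At even odds (50% true), we get close to the advertised 5% error rate:
print(round(false_discovery_rate(0.50), 3))  # → 0.059
```

The takeaway is that the quality of the hypotheses we feed into our experiments matters as much as the significance threshold we test them at.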