The Multiple Comparisons Fallacy
Taxonomy: Logical Fallacy > Formal Fallacy > Probabilistic Fallacy > The Multiple Comparisons Fallacy
The name "multiple comparisons fallacy" appears to come from the science of epidemiology, where comparisons may be made between a diseased group and a healthy group in order to find a difference between the two that might point to the cause of an epidemic. For instance, if every member of the diseased group drank from a particular well and no member of the healthy group did so, that would suggest that the pathogen might be present in the well water. In order to find the source of an epidemic, multiple comparisons between the groups may be drawn.
…[I]n 1992, a landmark study appeared from Sweden. A huge investigation, it enrolled everyone living within 300 meters of Sweden's high-voltage transmission line system over a 25-year period. They went far beyond all previous studies in their efforts to measure magnetic fields, calculating the fields that the children were exposed to at the time of their cancer diagnosis and before. This study reported an apparently clear association between magnetic field exposure and childhood leukemia, with a risk ratio for the most highly exposed of nearly 4.
…Surely, here was the proof that power lines were dangerous, the proof that even the physicists and biological naysayers would have to accept. But three years after the study was published, the Swedish research no longer looks so unassailable. …[T]he original contractor's report…reveals the remarkable thoroughness of the Swedish team. Unlike the published article, which just summarizes part of the data, the report shows everything they did in great detail, all the things they measured and all the comparisons they made. …[N]early 800 risk ratios are in the report….
Though recognition of the multiple comparisons fallacy seems to have come from epidemiology, it's a danger in any statistical study that compares groups of things. The kind of reasoning used to draw conclusions from such studies is called "inductive". In inductive reasoning, there is always some chance that the conclusion will be false even if the evidence is true. In other words, the connection between the premisses and conclusion is never 100%; that's only for deductive reasoning. So, the question arises: what level of probability, called a "confidence level", are we willing to accept in our reasoning? In scientific contexts, the confidence level is usually set at 95%, and a result is said to be "statistically significant" at the 95% confidence level when the probability of getting a result at least that extreme by chance alone is less than or equal to 5%.
When the confidence level is set at 95%, there is a probability of one in twenty (that is, 5%) that a misleading result will occur simply by chance. This has an important consequence that, when overlooked, leads to the multiple comparisons fallacy. For instance, when comparisons are done in epidemiology, there is a one in twenty chance that any given comparison will show a statistically significant difference even if there is no real difference between the groups. So, if twenty or more comparisons are made in a single study, the study will likely produce at least one statistically significant result just by chance. Thus, it's necessary to use a higher confidence level in cases of multiple comparisons.
Actually, the situation is worse still: if the things being compared are statistically independent, then only fourteen comparisons are needed for it to be more likely than not that at least one of them will be statistically significant by chance. This is a result of the multiplication rule of probability theory; see the entry for Probabilistic Fallacy: the multiplication rule is a consequence of Axiom 4.
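The figure of fourteen can be checked with a short calculation. The following is only a sketch, assuming independent comparisons, no real differences between the groups, and a 5% chance of a false positive per comparison. The function names are my own, and the Bonferroni adjustment at the end is a standard correction that I am supplying as one example of using a "higher confidence level"; it is not named in the entry above.

```python
import math


def familywise_error_rate(n_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n independent comparisons.

    By the multiplication rule, the chance that all n comparisons avoid a
    false positive is (1 - alpha) ** n, so the chance of at least one is
    the complement of that.
    """
    return 1.0 - (1.0 - alpha) ** n_comparisons


def smallest_n_more_likely_than_not(alpha: float = 0.05) -> int:
    """Smallest n for which at least one false positive is more likely than not."""
    # Solve (1 - alpha) ** n < 0.5 for the smallest whole number n.
    return math.floor(math.log(0.5) / math.log(1.0 - alpha)) + 1


def bonferroni_alpha(n_comparisons: int, alpha: float = 0.05) -> float:
    """Per-comparison threshold under the Bonferroni correction (an assumed choice)."""
    return alpha / n_comparisons


if __name__ == "__main__":
    for n in (1, 13, 14, 20):
        print(n, round(familywise_error_rate(n), 3))
    print("threshold:", smallest_n_more_likely_than_not())
    # With the stricter per-comparison threshold, twenty comparisons
    # together stay at or below the original 5% overall error rate.
    print(round(familywise_error_rate(20, bonferroni_alpha(20)), 4))
```

Under these assumptions, the chance of at least one false positive is about 49% for thirteen comparisons, 51% for fourteen, and 64% for twenty. Dividing the 5% threshold by the number of comparisons brings the overall chance for twenty comparisons back under 5%.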
While this fallacy is a danger within individual studies that compare groups, it is also a danger across studies of the same subject. That is, rather than a single study comparing two groups on twenty different variables, there might be twenty different studies, each of which compares the two groups on a single, different variable. It would be likely that at least one such study would find a statistically significant result just by chance.
Thus, in order to avoid the fallacy, it's not sufficient simply to avoid committing it within individual studies. Instead, individual studies need to be evaluated within the context of all the research done on the same topic. Unfortunately, this may be impossible due to what's called "the file drawer effect", or "publication bias"―see the Resource, below―which is the fact that scientific journals prefer to publish studies that get statistically significant results rather than those that do not. As a result, studies that fail to reach statistical significance may simply be filed away in a drawer, unpublished, which makes it difficult to tell how many such studies have been performed.
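A small simulation may help to show why the unpublished studies matter. This is a sketch under stated assumptions rather than a model of any real body of research: every study tests an effect that does not exist, the studies are independent, and each p-value is therefore drawn uniformly from the interval from 0 to 1. The number of studies per topic, the number of simulated topics, and all of the names are hypothetical.

```python
import random

ALPHA = 0.05  # significance threshold for the 95% confidence level


def simulate_topic(n_studies: int, rng: random.Random) -> list:
    """Return p-values for n_studies independent studies of a nonexistent effect.

    Assumption: when there is no real effect, a p-value is equally likely
    to fall anywhere between 0 and 1.
    """
    return [rng.random() for _ in range(n_studies)]


def published_record(p_values: list) -> list:
    """Hypothetical file drawer: only significant studies reach print."""
    return [p for p in p_values if p <= ALPHA]


def fraction_with_false_positive(n_topics: int = 20000,
                                 n_studies: int = 20,
                                 seed: int = 0) -> float:
    """Fraction of simulated topics in which at least one study is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_topics):
        if published_record(simulate_topic(n_studies, rng)):
            hits += 1
    return hits / n_topics


if __name__ == "__main__":
    # Expect a value near 1 - 0.95 ** 20, which is roughly 0.64.
    print(fraction_with_false_positive())
```

In roughly two out of three simulated topics, at least one of the twenty studies reaches significance by chance. A reader who sees only the published record has no way of telling how many non-significant studies were left in the drawer.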
The multiple comparisons fallacy is occasionally referred to as "the Texas sharpshooter's fallacy", but I use this name for a different type of mistake. The anecdote that gives rise to the name is that a Texan shoots randomly at the side of a barn, then draws a bullseye around a cluster of bullet holes and claims to be a sharpshooter. This story is a better fit for the mistake of jumping to the conclusion that a random cluster of data must be causally related than it is for the multiple comparisons fallacy. A better anecdote for the latter would be a shooter who first draws the bullseye, then randomly shoots twenty times at the barn. Having made one bullseye, the shooter then proceeds to conceal the nineteen misses and claims to be a sharpshooter.
Source: Po Bronson, "A Prayer Before Dying", Wired, 2002
When scientists saw how many things they had measured…they began accusing the Swedes of falling into one of the most fundamental errors in epidemiology, sometimes called the multiple comparisons fallacy.
John Moulder: The problem is, when you do as they did, hundreds and hundreds of comparisons, something in the neighborhood of 800 different comparisons, by the standard way we do statistics, we would expect 5 percent of those to be statistically elevated and 5 percent to be statistically decreased. And now you have a problem. If you find, by one measure of exposure, that leukemia is up in a group of kids, is that real, or is that the result of just random noise in the system?
Narrator: … Even if nothing is going on due to power lines, if you measure hundreds of risk ratios, they will scatter by random chance around a mean of one. Some will be above, and some below. Risk ratios below one suggest that EMFs protect against cancer, above one, that they increase the cancer rate. But the published article focused only on the strongest positive risk ratios. The summary highlights a nearly fourfold increase in risk of childhood leukemia. This is what the press picks up and the public hears.
John Moulder: It is not scientifically reasonable to do all the measurements, but then only pick out the ones that give you the answer you want for publication. If I dredge through their original report, I can find situations which, looked at in isolation, without looking at the rest of the report, that if that was the only data I gave you, I could claim that that proved that power lines protected children against childhood leukemia.
Moulder appears to have misspoken in the second quote here: probably what he meant to say is that, given a 95% confidence level, there is a 5% probability of a "statistically significant" result by chance, which breaks down to a 2.5% probability of a significantly low result and a 2.5% probability of a significantly high result, rather than a 5% probability in each direction. Or, perhaps something was edited out of his interview that indicated that the 90% confidence level was being discussed, in which case his percentages would be correct.
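The arithmetic behind this point can be sketched as follows. The sketch assumes exactly 800 comparisons (the report is said to contain "nearly 800"), no real effect of power lines anywhere, and a symmetric two-sided test in which chance results are split evenly between the high and low directions. The function name is my own.

```python
def expected_chance_results(n_comparisons: int, alpha: float) -> dict:
    """Expected number of chance 'significant' results, split by direction.

    alpha is the total probability of a chance result, that is, one minus
    the confidence level. A symmetric two-sided test is assumed, so half
    of alpha falls in each direction.
    """
    total = n_comparisons * alpha
    return {"total": total, "high": total / 2, "low": total / 2}


if __name__ == "__main__":
    # 95% confidence level: 5% in total, 2.5% in each direction.
    print(expected_chance_results(800, 0.05))
    # 90% confidence level: 10% in total, 5% in each direction,
    # which matches the percentages in Moulder's remark.
    print(expected_chance_results(800, 0.10))
```

On these assumptions, about 40 of 800 comparisons would be significant by chance at the 95% level, about 20 high and 20 low. At the 90% level the figures double to about 80 in total, 40 in each direction.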
Acknowledgments: Thanks to David Nichols for a couple of corrections in the Exposition, and to Olof Ericson for the correction to the analysis section.