The Base Rate Fallacy
Alias: Neglecting Base Rates[1]
Suppose that the rate of disease D is three times higher among homosexuals than among heterosexuals, that is, the percentage of homosexuals who have D is three times the percentage of heterosexuals who have it. Suppose, further, that Pat is diagnosed with the disease, and this is all that you know about Pat. In particular, you don't know anything else about Pat's sexual orientation; in fact, you don't even know whether Pat is male or female. What is the likelihood that Pat is homosexual?
When judging the probability of an event―for instance, diagnosing a patient's disease―there are two types of information that may be available:
- Generic information about the frequency of events of that type. In the case of diagnosing a disease, this would be information about the prevalence of the disease.
- Specific information about the case in question. In the case of diagnosis, this would be information about the patient revealed by an examination or tests.
When contrasted with information of type 2, type 1 information is called "base rate" information. For example, if a doctor is considering whether a patient has a certain rare disease, the rarity of the disease is its base rate. In other words, the base rate is the frequency of a generic type of event, leaving aside any information about the specific case at hand.
People who have only generic information tend to use it to judge probabilities, which is the rational thing to do since that's all that they have to go on. In contrast, when people have both types of information, they tend to make judgments of probability based entirely upon specific information, leaving out the base rate. This is the base rate fallacy.
When you have both generic and specific information, it might seem reasonable to ignore the generic information in favor of the more specific. This would indeed be the right thing to do if you could use only one type of information, but you should use all of the information you have. This is an application of a principle of inductive logic called "the requirement of total evidence", which requires that all relevant evidence be used in inductive reasoning[2].
It's easy to see that failing to use relevant information can lead to false conclusions: for instance, suppose that every swan you have seen was white except for a few black ones from Australia. It would be a mistake to ignore the black swans and conclude that all swans are white. The base rate fallacy is a specific mistake of this type, that is, a failure to use all relevant information in an inductive inference.
The exact answer to this problem depends upon what percentage of the population is homosexual. We don't know that exactly, but let's suppose that it is 10%. We don't need to be precise, since this is a "back of the envelope" calculation designed to check that our intuitive judgments are in the ballpark. So, suppose that we have a population of 100 people, 10 of whom are homosexual. Suppose, further, that three of the homosexuals have disease D, which means that the rate of the disease among homosexuals is 3 out of 10, or 30%. Since we are given that the rate of the disease among heterosexuals is one-third of that among homosexuals, we must suppose that 10% of the heterosexuals have D, which means that 9 of the 90 heterosexuals have it. So, the total number of people with the disease in our population is 12, three of whom are homosexual. Now, all that we know about Pat is that he or she has D, so Pat is one of these unlucky twelve. Therefore, the chance that Pat is homosexual is 3 in 12, or 25%.
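As a sanity check, the headcount above can be sketched in a few lines of Python. The 100-person population, the 10% base rate of homosexuality, and the 30% disease rate in that subgroup are the article's illustrative assumptions, not real figures:

```python
# Back-of-the-envelope check of the Pat example, using the article's
# assumed figures: a population of 100, a 10% base rate, and a disease
# rate three times higher among homosexuals than heterosexuals.
population = 100
homosexuals = round(population * 0.10)       # 10 people
heterosexuals = population - homosexuals     # 90 people

sick_homosexuals = round(homosexuals * 0.30)      # 3 people (30% rate)
sick_heterosexuals = round(heterosexuals * 0.10)  # 9 people (10% rate, one-third of 30%)

total_sick = sick_homosexuals + sick_heterosexuals  # 12 people with D
probability = sick_homosexuals / total_sick
print(probability)  # 0.25
```

Even though the disease rate is three times higher among homosexuals, only 3 of the 12 sick people are homosexual, because heterosexuals so heavily outnumber them.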
If you're like most people, you probably estimated that the likelihood that Pat is homosexual is much higher than 25%. If you thought it was 75%, then you were probably basing your estimate on the fact that the rate of the disease is three times higher among homosexuals. In doing this, you neglected to take into consideration the base rate of homosexuality in the population. You might not have had any precise information on this rate, but it is common knowledge that homosexuals are a small minority. For this reason, even though the rate of the disease is three times higher in the homosexual sub-population, it is still more likely that a randomly chosen person with the disease is a heterosexual, simply because heterosexuals are the vast majority of the population.[3]
To confirm that the answer above is correct, we can use Bayes' Theorem from probability theory: Let "h" represent the proposition that Pat is homosexual and "d" the proposition that Pat has disease D. We assumed that the base rate of homosexuality is 10%, so P(h) = .1. Therefore, the probability of Pat not being homosexual is 90%, that is, P(not-h) = .9. We don't know the exact rates of disease D among homosexuals or non-homosexuals, but we are given that the rate among the former is three times the rate among the latter, so P(d | h) = 3P(d | not-h). If we plug this information into Bayes' Theorem, we get the following equation:
P(h | d) = [3 × P(d | not-h) × .1] / [3 × P(d | not-h) × .1 + .9 × P(d | not-h)]
After multiplying and adding, we get:
P(h | d) = [.3 × P(d | not-h)] / [1.2 × P(d | not-h)]
The "P(d | not-h)"s in both the numerator and denominator cancel out, giving us the answer:
P(h | d) = 3/12 = .25, that is, the probability that Pat is homosexual given that he/she has disease D is 25%.
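The cancellation in the derivation above can also be verified numerically: since P(d | not-h) drops out, the answer comes to 25% no matter what value we pick for it. A minimal sketch, with the 10% base rate and the trial values of P(d | not-h) chosen purely for illustration:

```python
def p_h_given_d(base_rate, p_d_given_not_h, rate_ratio=3):
    """Bayes' Theorem for the Pat example: returns P(h | d),
    given P(h), P(d | not-h), and the ratio P(d | h) / P(d | not-h)."""
    p_d_given_h = rate_ratio * p_d_given_not_h
    numerator = p_d_given_h * base_rate
    denominator = numerator + p_d_given_not_h * (1 - base_rate)
    return numerator / denominator

# P(d | not-h) cancels out, so every trial value yields (approximately,
# up to floating-point rounding) the same answer: 0.25.
for rate in (0.01, 0.1, 0.2):
    print(p_h_given_d(0.1, rate))
```

This makes the structure of the fallacy vivid: the diagnostic information (the 3-to-1 rate ratio) alone cannot settle the question; the base rate P(h) is what pulls the answer down to 25%.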
1. Amos Tversky & Daniel Kahneman, "Evidential Impact of Base Rates", in Judgment Under Uncertainty: Heuristics and Biases, Kahneman, Paul Slovic, and Tversky, editors (1985), pp. 153-160.
2. Patrick J. Hurley, "1.4 Validity, Truth, Soundness, Strength, Cogency", A Concise Introduction to Logic (10th edition).
3. Maya Bar-Hillel, "The Base Rate Fallacy in Probability Judgments", Acta Psychologica 44, pp. 211-233. The thought experiment is a variation of the "suicide problem" discussed on pages 221-223.