One of the traumas of graduate school for me was that my training in statistics re-oriented me away from big Why questions (“Why is there disorder/poverty/dictatorship in some places and not in others?”) toward smaller questions like “Does winning office enrich politicians?” The most influential advice I received (and which I now give) was to focus on the effects of causes rather than the causes of effects, or put differently to ask “What if” rather than “Why”. (I recall Jim Snyder saying in his seminar that your paper title really shouldn’t begin with the word “Why”; I am glad to say that I managed to point out one of his best-known papers is called “Why is there so little money in U.S. politics?”) There are important “what if” questions, but when we think big we usually start with a puzzling phenomenon and ask why it occurs: Why are there lots of political parties/coups/civil wars in some countries and not in others? In fact, before I fell under the influence of people like Don Rubin, Guido Imbens, and Jim Snyder, the main advice I had received for empirical work was to identify a puzzle (an anomaly in a dependent variable) and then explain it in my paper, and many people continue to adhere to that approach. What I and others struggle with is the question of how the “what if” approach relates to the “why” approach. If we want to explain the “why” of a phenomenon (e.g. polarization, conflict, trust), do we do it by cobbling together a bunch of results from “what if” studies? Or should we stay away from “why” questions altogether?
Gelman and Imbens have taken on these issues in a short paper that puts “Why” questions in a common statistical framework with “What if” questions; in the title, the “Why” questions are “reverse causal questions” while the “What if” approach is covered by “forward causal inference”. I think their main contribution is to clarify what we mean when we ask a “Why” question: we mean that there is a relationship between two variables that we would not expect given our (perhaps implicit) statistical model of a particular phenomenon. Thus when we ask a “Why” question, we are pointing out a problem with our statistical model, which should motivate us to improve the model. Using the example of cancer clusters, in which the anomaly is that some geographic areas have unusually high cancer rates, Gelman and Imbens highlight one way to improve the model: add a variable. When we add that variable we might think of it as a cause that we could potentially manipulate (e.g. the presence of a carcinogenic agent) or as a predictor (e.g. the genetic background of the people in the area), but the idea is that we have explained the anomaly (and thus provided an answer to the “why?” question) when the data stops telling us that there is an association we don’t expect.
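To make the add-a-variable idea concrete for myself, here is a minimal sketch in Python. The counts are invented purely for illustration and come from neither the paper nor any real data, and the names `counts`, `rate`, `raw_gap`, and `adjusted_gap` are my own hypothetical choices. The sketch compares cancer rates between a "cluster" area and everywhere else, first ignoring exposure to a carcinogenic agent and then comparing within exposure strata. Stratification here stands in for "adding a variable to the model"; a regression with an exposure term would be the more usual tool.

```python
# Hypothetical counts, invented for illustration only:
# (area, exposed_to_agent) -> (number of people, number of cancer cases)
counts = {
    ("cluster", True): (800, 80),
    ("cluster", False): (200, 4),
    ("other", True): (200, 20),
    ("other", False): (800, 16),
}


def rate(area, exposed=None):
    """Cancer rate in an area, optionally restricted to one exposure stratum."""
    cells = [
        value
        for (a, e), value in counts.items()
        if a == area and (exposed is None or e == exposed)
    ]
    people = sum(n for n, _ in cells)
    cases = sum(c for _, c in cells)
    return cases / people


def raw_gap():
    """The 'anomaly': difference in rates when the model ignores exposure."""
    return rate("cluster") - rate("other")


def adjusted_gap():
    """Difference in rates within exposure strata, averaged with weights
    proportional to the number of people in each stratum."""
    total = sum(n for n, _ in counts.values())
    gap = 0.0
    for exposed in (True, False):
        stratum_n = sum(n for (a, e), (n, _) in counts.items() if e == exposed)
        within = rate("cluster", exposed) - rate("other", exposed)
        gap += (stratum_n / total) * within
    return gap


if __name__ == "__main__":
    print(f"gap ignoring exposure:      {raw_gap():.3f}")
    print(f"gap within exposure strata: {adjusted_gap():.3f}")
```

In these made-up numbers the cluster area has a visibly higher cancer rate until we compare like with like on exposure, at which point the gap disappears. That is the sense in which adding the variable answers the "why?" question. With real data the gap would not vanish exactly, and we would be asking whether the data still reject the null of no association.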
One of the key points the authors make is that there may be multiple answers to the same “why?” question. What do they mean? My reading is this: continuing with the cancer cluster example, the puzzling association might go away both when we control for the presence of a carcinogenic agent and when we control for genetic background; this particular indeterminacy is an issue with statistical power, because the anomaly “goes away” when the data no longer reject the null hypothesis for the particular variable under consideration. There are thus two explanations for the cancer clusters, which may be unsatisfactory but is correct under their interpretation of “Why” questions and how they are resolved.
A related point is that there are multiple ways to improve the model. The authors emphasize the addition of a variable, I think because they want to relate to the causal inference literature (and so the question is whether the variable you add to explain an anomaly can be thought of as a “cause”), but elsewhere in the paper they mention statistical corrections for multiple comparisons (particularly relevant for the cancer cluster example) and the introduction of a new paradigm. I wondered why they don’t discuss the option of accepting that the anomalous variable is a cause (or at least a predictor) of the outcome. Using an example from the paper, this would be like looking at the height and earnings data and concluding that height actually does influence (or at least predict) earnings, which means changing the model to include height (in which case there is no longer an anomaly). I guess the attractiveness of this solution depends on the context, and particularly on how strong one’s a priori scientific reasons for ruling out the explanation are; in the case of cancer clusters, you might be correct in saying that there is no good reason to think that fine-grained location on the earth’s surface actually would affect cancer, and thus that there must be some other environmental cause, even if that cause is something highly related to position on the earth’s surface, such as magnetic fields or soil deposits.
A lingering question here is about the distinction between a predictor and a cause.