Covariate balance, Mill’s methods, and falsificationism

I presented my paper on polarization and corruption at the recent EPSA conference and encountered what was, to me, a surprising criticism. Having thought and read about the issues raised, I want to jot down some points I wish I had been able to make at the time.

First, some background: In the paper, I use variation in political polarization across English constituencies to try to measure the effect of polarization on legislative corruption (in the form of implication in the 2009 expenses scandal). One of the points I make in the paper is that although others have looked at this relationship in cross-country studies, my paper has the advantage that the units being compared are more similar on other dimensions than in the cross-country studies, which means that my study should yield more credible causal inferences.

The criticism I encountered was that in seeking out comparisons where the units are as similar as possible, I was doing something like Mill’s Method of Difference, which has been shown to be valid only under a long list of unattractive assumptions, including that the process under consideration be deterministic, monocausal, and free of interactions.

Now, in seeking out a setting where the units being compared are as similar as possible in dimensions other than the “treatment,” I thought I was following very standard and basic practice. No one wants omitted variable bias, and it seems very straightforward to me that the way to reduce the possibility of omitted variable bias when you can’t run an experiment is to seek out a setting where covariate balance is high before any adjustment is done. I think of the search for a setting with reasonable covariate balance as a very intuitive and basic part of the “design-based” approach to causal inference I learned about from Don Rubin and Guido Imbens, and also as a longstanding part of scientific inference across fields. In response to the criticism I received, I said something along these lines, pointing out that the critic had also raised the possibility of omitted variable bias and thus should agree with me about the importance of restricting the scope for confounding.
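To make “balance before adjustment” concrete, here is a minimal sketch (my own, not from the paper; the variable names and numbers are hypothetical) of the standard diagnostic: standardized mean differences between “treatment” groups, computed before any modeling. Within-country comparisons are attractive precisely because they tend to start with small imbalances of this kind.

```python
import numpy as np

def standardized_mean_diff(x, treated):
    """Difference in covariate means across groups, scaled by the pooled SD."""
    x1, x0 = x[treated], x[~treated]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

rng = np.random.default_rng(0)
n = 500
treated = rng.random(n) < 0.5  # hypothetical "treatment" (e.g., high polarization)

# Covariate drawn from a common population: balanced by construction.
covariate_similar = rng.normal(size=n)
# Covariate that tracks treatment status, as in a cross-country comparison.
covariate_dissimilar = rng.normal(size=n) + 0.8 * treated

for label, x in [("similar units", covariate_similar),
                 ("dissimilar units", covariate_dissimilar)]:
    print(f"{label}: SMD = {standardized_mean_diff(x, treated):+.2f}")
```

In the first setting the imbalance is near zero before any adjustment; in the second, a large imbalance is baked in, and everything then rides on how the researcher adjusts for it.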

I didn’t know at the time how to respond directly to the claim that I had sinned by partaking of Mill’s methods, but in the course of reviewing a comparative politics textbook (Principles of Comparative Politics, 1st edition (2009), by Clark, Golder, and Golder) I have reacquainted myself with Mill’s methods, and I think I see where my critic was coming from, although I still think the criticism was off the mark.

What would it mean to use Mill’s method of difference in my setting? I would start with the observation that MPs in some constituencies were punished more heavily than others for being implicated in the scandal. I would then seek to locate the unique feature that is true of all of the constituencies where MPs were heavily punished and not true of the constituencies where they were lightly punished. To arrive at the conclusion of my paper (which is that greater ideological distance between the locally competitive candidates, i.e. platform polarization, reduces the degree to which voters punish incumbents for corruption), I would have to establish that all of the places where MPs were heavily punished were less polarized than the places where MPs were lightly punished, and that no other factor systematically varied between the two types of constituencies.

This would clearly be kind of nuts. Electoral punishment is not deterministically affected by polarization, and it is certainly affected by other factors, so we don’t expect all of the more-polarized places to see less punishment than all of the less-polarized places. Also, given the countless things you can measure about an electoral constituency, there is probably some other difference that appears to be related to electoral punishment, and Mill’s method doesn’t tell you which features to focus on and which to ignore. Mill’s method is essentially inductive: you start with the difference you want to explain, and then you consider all of the possible (deterministic, monocausal) explanations until you’re left with just one. This process seems likely to yield an answer only when you have binary outcomes and causes, a small dataset, and a willingness to greatly constrain the set of causes you consider. Even then, the answer the method yields would be suspect for all of the reasons rehearsed in the Clark, Golder, and Golder book and the sources they cite.
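A toy rendering of the method (my own construction, for intuition only; the factors are made up) makes the fragility obvious: with deterministic, monocausal binary data the elimination procedure singles out one factor, but a single probabilistic exception eliminates everything.

```python
def mill_method_of_difference(cases):
    """cases: list of (features, outcome) pairs, all values 0/1.
    A factor survives only if it perfectly separates the outcomes."""
    factors = cases[0][0].keys()
    return [f for f in factors
            if all(feats[f] == outcome for feats, outcome in cases)]

# Idealized data (deterministic, monocausal): the method "works".
ideal = [({"low_polarization": 1, "urban": 1}, 1),
         ({"low_polarization": 0, "urban": 1}, 0),
         ({"low_polarization": 1, "urban": 0}, 1),
         ({"low_polarization": 0, "urban": 0}, 0)]
print(mill_method_of_difference(ideal))  # ['low_polarization']

# One probabilistic exception (a low-polarization constituency whose MP
# escaped heavy punishment) and every candidate factor is eliminated.
noisy = ideal + [({"low_polarization": 1, "urban": 1}, 0)]
print(mill_method_of_difference(noisy))  # []
```

And this is before the deeper problem: nothing in the procedure tells you which of the countless measurable features belong in the feature set in the first place.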

I am not using Mill’s method of difference. I have a postulated relationship between polarization and electoral punishment, and I am attempting to measure that relationship using observational data. I am choosing to focus on units that are similar in other respects, but I am not doing this in order to inductively arrive at the one difference that must explain a given difference in outcomes; rather, I am focusing on these units because doing so reduces the scope for unmeasured confounding.

Clark, Golder, and Golder contrast Mill’s methods with the “scientific method” (a great example of a mainstream political science textbook extolling falsificationism and what Clarke and Primo criticize as the “hypothetico-deductive model”), which they argue is the right way to proceed. The virtue of the scientific method in their presentation is that you can make statements of the kind, “If my model/theory/explanation relating X and Y is correct, we will observe a correlation between X and Y,” and then, if we don’t observe a correlation between X and Y, we know we have falsified the model/theory/explanation. The point of limiting the possibility of unobserved confounding is that the true logical statement we want to evaluate is, “If my model/theory/explanation is correct and I have correctly controlled for all other factors affecting X and Y, we will observe a correlation between X and Y.” To the extent that we remain unsure about the second part of that antecedent, i.e. to the extent that there remains the possibility of unmeasured confounding, we are unable to falsify the theoretical claim: when we don’t observe the predicted correlation between X and Y, we are unsure whether the model is falsified or the researcher has not correctly controlled for other factors. By seeking out settings in which the possibility of unmeasured confounding is restricted, we try to render our test as powerful as possible with respect to our theoretical claim.
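A quick simulation (illustrative only; the coefficients are made up) shows what this failure mode looks like in practice: X genuinely affects Y, but an unmeasured confounder U pushes the observed correlation to zero, so a null result cannot distinguish a false theory from a confounded design.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# The theory is true: X raises Y one-for-one. But U raises X and lowers Y.
u = rng.normal(size=n)                        # unmeasured confounder
x = u + rng.normal(size=n)
y = 1.0 * x - 2.0 * u + rng.normal(size=n)
print(f"corr(X, Y), confounded design: {np.corrcoef(x, y)[0, 1]:+.2f}")  # ~0.00

# The same true theory in a design with no confounding.
x2 = rng.normal(size=n)
y2 = 1.0 * x2 + rng.normal(size=n)
print(f"corr(X, Y), clean design:      {np.corrcoef(x2, y2)[0, 1]:+.2f}")  # ~+0.71
```

In the confounded design the predicted correlation fails to appear even though the theory is correct; only in the clean design does a missing correlation actually count against the theory.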

I think this is an important point for two audiences.

First, I think it is important with respect to the comparative politics mainstream, or more broadly the part of social science that is not too concerned with causal inference. Clark, Golder, and Golder is a very impressive book in many respects, but it does not trouble its undergraduate audience much with the kind of hyper-sensitivity to identification that we see in recent work in comparative politics and elsewhere in the social sciences. The falsificationist approach they take emphasizes the implications we should observe if a theory is correct, without emphasizing that these implications should be observed only if the theory is correct _and_ the setting matches the assumptions underlying the theory (at least once the researcher is done torturing the data). The scientific method they extol is weak indeed unless we take these assumptions seriously, because no theory will ever be falsified if we can so easily imagine that the consequent was denied because of confounding rather than because of shortcomings of the theory.

Second, I think it is important with respect to Clarke and Primo’s critique of falsificationism and the role of empirical work in their suggested mode of research. I agree with much of their critique of the way political scientists talk about falsifiable theories and hypothesis tests, and especially with their bottom-line message that models can be useful without being tested and empirical work can be useful without testing models. But their critique of falsificationism as practiced in political science (if I recall correctly – I don’t have the book with me) rests largely on the argument that you can’t test an implication of a model with another model, i.e. that the modeling choices we make in empirical analysis are so extensive that if we deny the consequent we don’t know whether to reject the theoretical model or the empirical model. My point is that the credibility of empirical work varies, and this affects how much you can learn from a hypothesis test. If someone has a model that predicts an effect of X on Y, we learn more about the usefulness of the model from a high-quality RCT measuring the effect of X on Y (assuming everyone agrees that X and Y have been operationalized as the theory specifies, etc.) than from an observational study; similarly, we learn more from an observational study with high covariate balance than from one with low covariate balance. In short, I suspect Clarke and Primo insufficiently consider the extent to which the nature of the empirical model affects how much we can learn about the usefulness of a theoretical model by testing an implication of it. This suggests a more substantial role for empirical work than Clarke and Primo seem to envision, but also a continued emphasis on credibility through, e.g., designing observational studies to reduce the possibility of unmeasured confounding.
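To put the same point in simulation form (a sketch under made-up assumptions, not anything from Clarke and Primo or from my paper): the same true effect, estimated once with randomized treatment and once with confounded treatment, teaches us very different amounts about the theory.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
true_effect = 1.0

def ols_slope(x, y):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

u = rng.normal(size=n)            # background factor that also affects Y
x_rct = rng.normal(size=n)        # randomized: independent of U by design
x_obs = u + rng.normal(size=n)    # observational: tracks U (low balance)

y_rct = true_effect * x_rct + u + rng.normal(size=n)
y_obs = true_effect * x_obs + u + rng.normal(size=n)

print(f"RCT estimate:           {ols_slope(x_rct, y_rct):.2f}")  # ~1.00
print(f"Observational estimate: {ols_slope(x_obs, y_obs):.2f}")  # ~1.50, biased
```

A null result from the first design would genuinely count against the theory; an estimate of any size from the second is hard to interpret, which is the sense in which the design determines how much a test can teach.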