We’re going to be presenting our paper, “Efficient Identification in Linear Structural Causal Models with Instrumental Cutsets” (with Bryant Chen and Elias Bareinboim), at NeurIPS 2019. The work is quite technical, requiring specialized background knowledge, so I’m offering this post as a lightweight introduction to one of our core results.
Let’s start simple. What do the words in the title even mean?
Suppose you’re trying to determine the effectiveness of a new drug by observing all the patients who took it, and seeing whether they were cured. You compare them with patients who took an older drug, to make sure that the new medicine is actually superior. We can draw a “causal graph”, which basically says that the medicine can cause patients to be cured, and being cured does not cause patients to have taken the medicine. This is denoted with an arrow from the medicine to the patient’s status:
However, that’s not the full picture. You see, this new drug is very expensive, so only rich people can afford it (this story is based in the USA). It turns out that wealthy people are also more likely to be cured even without treatment, simply because they do not need to keep working while sick, and can relax. This is called latent confounding - the amount of money a patient has, which was not gathered as part of the dataset, affects both whether the patient gets medicine, and whether the patient is cured. We denote such an effect with a dashed bidirected arrow between the two nodes in our graph:
The problem of identification is therefore the question: Given this causal graph, and the dataset, is it possible to find out the causal effect of taking the drug?
In the graph with latent confounding, the answer is no [1] - it is not possible to find out how much of the effect on health was due to taking the drug, and how much was due to the fact that those who were taking it were rich, and therefore more likely to get better anyways!
The goal of randomized clinical trials is therefore to get rid of this confounding - if people get the drug through the flip of a coin rather than the contents of their wallet, we have the situation in the first graph, which is identifiable, meaning that we can compute the effect of the drug on cure rates.
Unfortunately, a clinical trial is not always possible, whether due to ethical, monetary, or even feasibility issues. By creating algorithms that determine identifiability in arbitrary graphs, we allow researchers to simply gather as much data as they can, as well as the hypothesized causal relations between the data, and then stuff it all in a computer. The computer will either compute the causal effect, or tell the researchers that they’re out of luck [2]!
As an example, the graph above represents the situation where a doctor makes a treatment recommendation to the patient, sometimes recommending the new drug, and other times the old drug, independently of the patient’s financial status. It turns out that this model is not identifiable without further assumptions - if all the variables are binary, the effect can only be bounded, but cannot be uniquely determined [3]. That is, we might be able to say that the actual causal effect lies somewhere between a 50% increase in cure rate and a 10% decrease, meaning that we can’t even conclude that the drug isn’t harmful!
The model shown above is called the “Instrumental Variable”, or IV, and becomes identifiable once we add a few assumptions about the mechanisms underlying the graph [4]. To demonstrate, let’s modify the above example:
Previously, the doctor recommended one drug or the other, the patient likewise took a binary dose (drug 1 or 2), and was either cured or not cured. Instead, suppose the doctor’s recommendation is continuous: the doctor can advocate the drug either lightly, strongly, or anywhere in between. Let’s call the strength of this advocacy $z$.
Similarly, the patient now chooses how much of the treatment regimen to take; call the amount of medicine taken $x$.
Next, instead of “cured” and “not cured”, let’s make the result be the count of a biomarker in the patient’s blood, which we’ll call $y$.
Finally, let’s assume that the effects are linear. That is, the amount of medicine the patient takes is linearly related to how strongly the doctor believed the medicine would work. Similarly, the biomarkers in the patient’s blood are linearly related to the amount of medicine taken. Our job, then, is to solve for the amount that the blood marker will decrease per day of the treatment regimen. This is represented by the structural equations

$$x = \lambda_{zx} z + \epsilon_x \qquad y = \lambda_{xy} x + \epsilon_y$$
Here, $\lambda_{zx}$ and $\lambda_{xy}$ are the unknown causal coefficients, and $\epsilon_x$, $\epsilon_y$ are error terms that bundle up every other influence on $x$ and $y$. The latent confounding by wealth shows up as a correlation between $\epsilon_x$ and $\epsilon_y$, which is exactly why a simple regression of $y$ on $x$ gives the wrong answer.
To simplify the math (without loss of generality), let’s assume that every variable has been standardized to zero mean and unit variance.
In effect, we have turned a dataset of samples of $(z, x, y)$ into a handful of covariances, $\sigma_{zx}$, $\sigma_{zy}$, and $\sigma_{xy}$ - the quantities we can estimate directly from the observed data.
We can use the fact that $z$ (the doctor’s recommendation) is independent of both error terms to write

$$\sigma_{zx} = \lambda_{zx} \qquad \sigma_{zy} = \lambda_{zx}\lambda_{xy}$$

and dividing the second equation by the first gives us the causal effect we were after:

$$\lambda_{xy} = \frac{\sigma_{zy}}{\sigma_{zx}}$$
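If you’d like to see this in action, here is a quick simulation - a minimal sketch with made-up coefficients, not code from the paper - showing that the ratio $\sigma_{zy}/\sigma_{zx}$ recovers the true effect while a naive regression does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

lambda_zx, lambda_xy = 0.8, -1.5      # true causal effects (unknown to the analyst)

wealth = rng.normal(size=n)           # latent confounder: never observed
z = rng.normal(size=n)                # doctor's recommendation (the instrument)
x = lambda_zx * z + 2.0 * wealth + rng.normal(size=n)   # dose taken, pushed around by wealth
y = lambda_xy * x + 3.0 * wealth + rng.normal(size=n)   # biomarker, also pushed around by wealth

# A naive regression of y on x is biased by the confounder...
naive = np.cov(x, y)[0, 1] / np.var(x)
# ...but the instrumental-variable ratio sigma_zy / sigma_zx recovers lambda_xy.
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"true: {lambda_xy:.3f}   naive: {naive:.3f}   IV: {iv:.3f}")
```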
As you can see, despite the patients’ choice of drug and their biomarkers being confounded by wealth, we were able to extract the drug’s causal effect anyways. This property has made instrumental variables a cornerstone of observational studies since their introduction, all the way back in 1928 [5].
In general, when you want to find the causal effect of a treatment $x$ on an outcome $y$, a variable $z$ qualifies as an instrument for the edge $x \rightarrow y$ if it is correlated with $x$, but is d-separated from $y$ in the graph where the edge $x \rightarrow y$ has been removed. Whenever such a $z$ exists, the edge’s coefficient is simply $\lambda_{xy} = \sigma_{zy}/\sigma_{zx}$.
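This graphical condition is easy to check by machine. Here is a small sketch using networkx (version 3.3 or newer, where the d-separation test is called `nx.is_d_separator`); the bidirected arc between $x$ and $y$ is represented by an explicit latent node `U`:

```python
import networkx as nx

# The instrumental-variable graph: z -> x -> y, with latent confounding x <-> y
# modelled as an explicit unobserved parent U -> x, U -> y.
G = nx.DiGraph([("z", "x"), ("x", "y"), ("U", "x"), ("U", "y")])

# Delete the edge whose coefficient we want to identify...
G_cut = G.copy()
G_cut.remove_edge("x", "y")

# ...then z must be d-separated from y in the cut graph, while remaining
# d-connected to (i.e. correlated with) x in the original graph.
print(nx.is_d_separator(G_cut, {"z"}, {"y"}, set()))   # True: z has no other route to y
print(nx.is_d_separator(G, {"z"}, {"x"}, set()))       # False: z is d-connected to x
```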
Of course, if a researcher does not find an instrument in their causal graph, it does not mean that their target effect is not identifiable! An example of this is the instrumental set [6]:
Notice that, in this graph, neither $z_1$ nor $z_2$ qualifies as an instrument on its own. Taken together, however, their covariances with $y$ give us a system of two equations in the two unknown edge coefficients $\lambda_{x_1 y}$ and $\lambda_{x_2 y}$:

$$\sigma_{z_1 y} = \lambda_{x_1 y}\sigma_{z_1 x_1} + \lambda_{x_2 y}\sigma_{z_1 x_2}$$
$$\sigma_{z_2 y} = \lambda_{x_1 y}\sigma_{z_2 x_1} + \lambda_{x_2 y}\sigma_{z_2 x_2}$$
This system of equations is solvable for $\lambda_{x_1 y}$ and $\lambda_{x_2 y}$ whenever its determinant is nonzero - and it is nonzero for almost every choice of the model’s parameters, so we say the coefficients are “generically” identifiable.
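Here is a quick numerical sanity check. The graph structure and numbers below are hypothetical (a typical instrumental-set configuration, not necessarily the exact figure above), but they show the two covariance equations being solved even though $x_1$ and $x_2$ are confounded with $y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
cov = lambda a, b: np.cov(a, b)[0, 1]

l_x1y, l_x2y = 2.0, -0.7                     # true edge coefficients into y

u = rng.normal(size=n)                       # latent confounder of x1, x2 and y
z1, z2 = rng.normal(size=n), rng.normal(size=n)
x1 = 1.0 * z1 + 0.5 * z2 + u + rng.normal(size=n)
x2 = 0.3 * z1 + 1.2 * z2 + u + rng.normal(size=n)
y  = l_x1y * x1 + l_x2y * x2 + 2.0 * u + rng.normal(size=n)

# Build the 2x2 system  A @ [l_x1y, l_x2y] = b  from observable covariances only.
A = np.array([[cov(z1, x1), cov(z1, x2)],
              [cov(z2, x1), cov(z2, x2)]])
b = np.array([cov(z1, y), cov(z2, y)])

print(np.linalg.solve(A, b))                 # approximately [ 2.0, -0.7 ]
```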
In this example, we can create an identical system of equations, but this system will now be degenerate, meaning that the two equations are no longer linearly independent: there are infinitely many values of $\lambda_{x_1 y}$ and $\lambda_{x_2 y}$ consistent with the observed covariances, so the edges cannot be uniquely determined.
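To see what “degenerate” means concretely, here is one hypothetical way it can happen (not necessarily the structure in the figure above): if $z_1$ and $z_2$ reach $x_1$ and $x_2$ only through a single shared mediator, the two equations carry exactly the same information, and the determinant of the system vanishes:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
cov = lambda a, b: np.cov(a, b)[0, 1]

u = rng.normal(size=n)                       # latent confounder of x1, x2 and y
z1, z2 = rng.normal(size=n), rng.normal(size=n)
w = z1 + z2 + rng.normal(size=n)             # the ONLY route from z1, z2 to the x's
x1 = 0.8 * w + u + rng.normal(size=n)
x2 = 1.3 * w + u + rng.normal(size=n)
y  = 2.0 * x1 - 0.7 * x2 + 2.0 * u + rng.normal(size=n)

A = np.array([[cov(z1, x1), cov(z1, x2)],
              [cov(z2, x1), cov(z2, x2)]])
print(np.linalg.det(A))                      # ~0: the two rows are identical in the population
```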
Thankfully, there is a simple condition that is sufficient to determine whether the system is generically solvable with unconditioned instrumental sets. We choose two sets of the same size: one of candidate instruments, and one of the parents of $y$ whose edges we want to solve for. Every candidate instrument must be d-separated from $y$ in the graph where the chosen parents’ edges into $y$ have been removed, and it must be possible to connect the instruments to the chosen parents with paths that do not share any nodes.
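The first half of this condition is the same kind of d-separation check as the sketch above. The second half - finding node-disjoint paths - is a textbook max-flow problem, which we can sketch with networkx using the standard node-splitting trick (all node names below are hypothetical, and the directed-path formulation is a simplification of the full criterion):

```python
import networkx as nx

def vertex_disjoint_path_count(G, sources, sinks):
    """Largest number of vertex-disjoint directed paths from `sources` to `sinks`."""
    H = nx.DiGraph()
    for v in G.nodes:                                # split v into v_in -> v_out, capacity 1
        H.add_edge((v, "in"), (v, "out"), capacity=1)
    for u, v in G.edges:
        H.add_edge((u, "out"), (v, "in"), capacity=1)
    for s in sources:                                # super-source and super-sink
        H.add_edge("S", (s, "in"), capacity=1)
    for t in sinks:
        H.add_edge((t, "out"), "T", capacity=1)
    return nx.maximum_flow_value(H, "S", "T")

# Candidate instruments z1, z2 and parents x1, x2 of y:
G = nx.DiGraph([("z1", "x1"), ("z1", "x2"), ("z2", "x2"), ("x1", "y"), ("x2", "y")])
Z, X = ["z1", "z2"], ["x1", "x2"]
print(vertex_disjoint_path_count(G, Z, X) == len(Z))   # True: a "full" flow exists
```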
In the first example,
In the second example,
In the final example, both
This is where our paper starts: it isn’t clear how to efficiently find an unconditioned instrumental set. If we know which parents of the target node ($y$) to include in the set, checking the condition above is easy - the hard part is choosing that subset of parents, and the candidate instruments to pair with them, in the first place.
In this graph, only
One way of approaching this is enumerating all subsets of edges incoming into the target node (i.e. all subsets of $y$’s parents) and checking each one - but the number of subsets grows exponentially with the number of parents, so this brute-force approach quickly becomes intractable.
We developed a very simple algorithm that quickly finds the set without needing to enumerate all possibilities. Here’s how it works. First, we find a max-flow from all the candidate instruments to the parents of the target node $y$.
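As a rough illustration of that first step (a sketch with hypothetical node names, not the paper’s implementation), here is the same node-splitting construction as before, this time using the full flow to read off which candidate instrument got matched to which parent; the rest of the algorithm builds on top of a flow like this one:

```python
import networkx as nx

def instrument_matching(G, candidates, parents):
    """Match candidate instruments to parents of the target via vertex-disjoint paths."""
    H = nx.DiGraph()
    for v in G.nodes:                                    # unit vertex capacities
        H.add_edge((v, "in"), (v, "out"), capacity=1)
    for u, v in G.edges:
        H.add_edge((u, "out"), (v, "in"), capacity=1)
    for z in candidates:
        H.add_edge("S", (z, "in"), capacity=1)           # super-source
    for x in parents:
        H.add_edge((x, "out"), "T", capacity=1)          # super-sink
    _, flow = nx.maximum_flow(H, "S", "T")

    pairs = []
    for z in candidates:
        if flow["S"][(z, "in")] == 0:                    # this candidate carries no flow
            continue
        node, last = (z, "in"), None
        while node != "T":                               # follow the unit of flow (graph is a DAG)
            last = node
            node = next(v for v, f in flow[node].items() if f > 0)
        pairs.append((z, last[0]))                       # the node feeding "T" is the matched parent
    return pairs

# Hypothetical example: three candidate instruments, two parents of y.
G = nx.DiGraph([("z1", "x1"), ("z2", "x1"), ("z2", "x2"), ("z3", "x2"),
                ("x1", "y"), ("x2", "y")])
print(instrument_matching(G, ["z1", "z2", "z3"], ["x1", "x2"]))   # two matched pairs
```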
Now, we notice that
Once again,
All the remaining
It turns out that we can get a lot of mileage out of this simple procedure. You see, most state-of-the-art identification algorithms have a similar structure to the instrumental set, where they need to enumerate all subsets of edges incoming into a target node. We were able to adapt the match-block to these algorithms, and even extend it into a new method, which can solve for certain edges that no other efficient method we know of can solve! We call our method the “Instrumental Cutset”, and you can read about the details in our full paper on arXiv.