We’re going to be presenting our paper, “Efficient Identification in Linear Structural Causal Models with Instrumental Cutsets”, at NeurIPS 2019. The work is quite technical, requiring specialized background knowledge, so I’m offering this post as a lightweight introduction to one of our core results.

Let’s start simple. What do the words in the title even mean?

Suppose you’re trying to decide on the effectiveness of a drug by observing all patients who took it, and seeing if they were cured. You are comparing this with patients who took an older drug, to make sure that the new medicine is actually superior. We can draw a “causal graph”, which basically says that the medicine can *cause* patients to be cured, and being cured does *not* cause patients to have taken the medicine. This is denoted with an arrow from the medicine to the patient’s status:

However, that’s not the full picture. You see, this new drug is very expensive, so only rich people can afford it (this story is based in the USA). It turns out that wealthy people are also more likely to be cured even without treatment, simply because they do not need to keep working while sick, and can relax. This is called *latent confounding* - the amount of money a patient has, which was not gathered as part of the dataset, affects both whether the patient gets medicine, and whether the patient is cured. We denote such an effect with a dashed bidirected arrow between the two nodes in our graph:

The problem of identification is therefore the question: **Given this causal graph, and the dataset, is it possible to find out the causal effect of taking the drug?**

In the graph with latent confounding, the answer is *no* [1] - it is not possible to find out how much of the effect on health was due to taking the drug, and how much was due to the fact that those who were taking it were rich, and therefore more likely to get better anyways!

The goal of randomized clinical trials is therefore to get rid of this confounding - if people get the drug through the flip of a coin rather than the contents of their wallet, we have the situation in the first graph, which is *identifiable*, meaning that we can compute the effect of the drug on cure rates.

Unfortunately, a clinical trial is not always possible, whether due to ethical, monetary, or even feasibility issues. By creating algorithms that determine identifiability in arbitrary graphs, we allow researchers to simply gather as much data as they can, as well as the hypothesized causal relations between the data, and then stuff it all in a computer. The computer will either compute the causal effect, or tell the researchers that they’re out of luck [2]!

As an example, the graph above represents the situation where a doctor makes a treatment recommendation to the patient, sometimes recommending the new drug, and other times the old drug, independently of the patient’s financial status. It turns out that this model is not identifiable without further assumptions - if all the variables are binary, the effect can only be *bounded*, but cannot be uniquely determined [3]. That is, we might be able to say that the actual causal effect lies somewhere between a 50% increase in cure rate and a 10% *decrease*, meaning that we can’t even conclude that the drug isn’t harmful!

The model shown above is called the “Instrumental Variable”, or IV, and becomes identifiable once we add a few assumptions about the mechanisms underlying the graph [4]. To demonstrate, let’s modify the above example:

Previously, the doctor recommended one drug or the other, the patient likewise took a binary dose (drug 1 or 2), and was either cured or not cured. Instead, suppose the doctor’s recommendation is continuous: the doctor can advocate the drug either lightly, strongly, or anywhere in between. Let’s assign a number to the strength of this advocacy:

$$ D = \text{how strongly the doctor advocated for the new drug} $$

Similarly, now the patient chooses how much of the treatment regimen to take:

$$ M = \text{how long the patient took the new medicine} $$

Next, instead of “cured” and “not cured”, let’s make the result be the count of a biomarker in the patient’s blood:

$$ B = \text{count of a target biomarker in the patient's blood} $$

Finally, let’s assume that the effects are linear. That is, the amount of medicine the patient takes is linearly related to how strongly the doctor believed the medicine would work. Similarly, the biomarkers in the patient’s blood are linearly related to the amount of medicine taken. Our job, then, is to solve for the amount that the blood marker will decrease per day of the treatment regimen. This is represented by $\lambda_{mb}$ in the following system of equations:

Here, $\epsilon$ represent latent error terms, and have different values for each patient (they are random variables), and $\lambda$ are the causal effects, which are assumed to be identical for all patients. Of course, the only data we have is samples of $(D,M,B)$ - we don’t actually know what any of the $\epsilon$ or $\lambda$ are! Can we solve for $\lambda_{mb}$? Yes we can! We can use the dataset to find the covariances between the variables:

$$ \sigma_{ab} = \mathbb{E}[(A-\mathbb{E}[A])(B-\mathbb{E}[B])] = \mathbb{E}[AB] - \mathbb{E}[A]\mathbb{E}[B] $$

To simplify the math (without loss of generality), let’s assume that $D,M,B$ are normalized, meaning that the data has mean 0. This makes $\sigma_{ab} = \mathbb{E}[AB]$.

In effect, we have turned a dataset of samples $(D,M,B)$ into known values for the covariances $\sigma_{dm},\sigma_{db},\sigma_{mb}$. Now, we can do a bit of mathematical magic:

$$ \sigma_{db} = \mathbb{E}[DB] = \mathbb{E}[D(\lambda_{mb}M+\epsilon_b)] = \lambda_{mb} \mathbb{E}[DM] + \mathbb{E}[D\epsilon_b] $$

We can use the fact that $D$ is uncorrelated with $\epsilon_b$, and that the variables have mean 0 to conclude:

$$ \sigma_{db} = \lambda_{mb}\sigma_{dm} \hspace{1cm} \Rightarrow \hspace{1cm} \lambda_{mb} = \frac{\sigma_{db}}{\sigma_{dm}} $$

As you can see, despite the patients’ choice of drug and their biomarkers being confounded by wealth, we were able to extract the drug’s causal effect *anyways*. This property has made the instrumental variable a cornerstone of observational studies since their introduction, all the way back in 1928 [5].

In general, when you want to find the causal effect between $M$ and $B$, you look for a node ($D$) called the “instrument”, which has all of its paths to $B$ pass through $M$.

Of course, if a researcher does not find an instrument in their causal graph, it does not mean that their target effect is not identifiable! An example of this is the instrumental set [6]:

Notice that $x_1\rightarrow y$ has no instruments - both $z_1$ and $z_2$ have paths to $y$ through $x_2$. Nevertheless, using the same math as in the previous example, we can get:

$$ \begin{aligned} \sigma_{z_1y} &= \sigma_{z_1x_1}\lambda_{x_1y} + \sigma_{z_1x_2}\lambda_{x_2y}\\ \sigma_{z_2y} &= \sigma_{z_2x_1}\lambda_{x_1y} + \sigma_{z_2x_2}\lambda_{x_2y}\\ \end{aligned} $$

This system of equations is solvable for $\lambda_{x_1y}$ if $\det\begin{pmatrix} \sigma_{z_1x_1} & \sigma_{z_1x_2}\\ \sigma_{z_2x_1} & \sigma_{z_2x_2} \end{pmatrix} \ne 0$. An example where this requirement is not satisfied is:

In this example, we can create an identical system of equations, but this system will now be degenerate, meaning that $(\sigma_{z_1x_1},\sigma_{z_1x_2}) = \lambda_{z_1z_2}(\sigma_{z_2x_1},\sigma_{z_2x_2})$. In fact, $\lambda_{x_1y}$ is not identifiable here.

Thankfully, there is a simple condition that is sufficient to determine whether the system is generically solvable with unconditioned instrumental sets. We choose two sets of the same size: one of candidate instruments, and one of the parents of $Y$. If all paths from each of the instruments to $Y$ cross one of the chosen parents, and there is a way to set up non-intersecting paths from all the chosen instruments to the chosen parents, then we know that all chosen edges incoming to $Y$ are solvable. To demonstrate, let’s look at a couple examples:

In the first example, $z_1$ and $z_2$ have non-intersecting paths ($z_1\rightarrow x_1$ and $z_2\rightarrow x_2$), and $z_1,z_2$ have all their paths to $y$ cross either $x_1$ or $x_2$. This means that $z_1,z_2$ can be used as an instrumental set to solve for $\lambda_{x_1y}$ and $\lambda_{x_2y}$.

In the second example, $z_2$ has a path to $x_1$, but it can still get to $y$ through $x_2$. $z_1$ cannot help here, because all paths from $z_1$ to $x_2$ would cross $z_2$! This means that we don’t have a instrumental set.

In the final example, both $z_1$ and $z_2$ have nonintersecting paths to $x_1$ and $x_2$, but $z_2$ can still get to $y$ by crossing through $x_3$. Once again, there is no instrumental set here.

This is where our paper starts: it isn’t clear how to *efficiently* find an unconditioned instrumental set. If we know *which* parents of the target node ($y$) are part of the set, then it is possible to find associated instruments using a max-flow [7]. Unfortunately, we don’t generally have this knowledge. Look at the graph here:

In this graph, *only* $\{x_1,x_2,x_5\}$ has an instrumental subset $\{z_1,z_2,z_4\}$. How can we make a computer find it?

One way of approaching this is enumerating all subsets of edges incoming into the target node (i.e. all subsets of $\{x_1,x_2,x_3,x_4,x_5\}$), and then checking if an instrumental set exists to those nodes. This enumeration takes $2^N$ time, where $N$ is the number of edges incoming into that node - a node with 50 parents would require checking 1,125,899,906,842,624 different combinations. In other words, this requires *exponential-time*, which is equivalent to “can’t be used for large problems”.

We developed a very simple algorithm that quickly finds the set without needing to enumerate all possibilities. Here’s how it works. First, we find a max-flow from *all* the candidate instruments $z$ to *all* the $x$:

Now, we notice that $x_4$ had no flow to it. We proved that this is sufficient to guarantee that $x_4$ is not part of *any* instrumental set. Furthermore, if $x_4$ isn’t part of an instrumental set, then none of its ancestors can be instruments (all of them would have a path to $x_4$). We therefore remove $x_4$ and $z_3,z_5$ from candidacy, and run another max-flow:

Once again, $x_3$ had no flow to it, so we disable it, and all of its ancestors (here, its ancestors are already disabled). Finally, we do one more max-flow:

All the remaining $x$ values are still active, meaning that we have found the desired instrumental set!

It turns out that we can get a lot of mileage out of this simple procedure. You see, most state-of-the-art identification algorithms have a similar structure to the instrumental set, where they needed to enumerate all subsets of edges incoming into a target node. We were able to adapt the match-block to these algorithms, and even extend it into a new method, which can solve for certain edges that no other efficient method we know of can solve! We called our method the “Instrumental Cutset”, and you can read about the details in our full paper on arXiv.

- P. G. Wright,
*Tariff on Animal and Vegetable Oils*. Macmillan Company, New York, 1928. - J. Pearl,
*Aspects Of Graphical Models Connected With Causality*. 1993. - J. Pearl,
*Causality: Models, Reasoning and Inference*. 2000. - C. Brito and J. Pearl, “Generalized Instrumental Variables,” in
*Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence*, 2002, pp. 85–93. - J. D. Angrist, G. W. Imbens, and D. B. Rubin, “Identification of Causal Effects Using Instrumental Variables,”
*Journal of the American Statistical Association*, vol. 91, no. 434, pp. 444–455, 1996. - R. Foygel, J. Draisma, and M. Drton, “Half-Trek Criterion for Generic Identifiability of Linear Structural Equation Models,”
*The Annals of Statistics*, pp. 1682–1713, 2012. - J. Tian and J. Pearl, “A General Identification Condition for Causal Effects,” p. 7, 2002.