# Auxiliary Cutsets

## Efficient Identification in Linear SCM, Part 2

July 09, 2020

We’re going to present “Efficient Identification in Linear Structural Causal Models with Auxiliary Cutsets”, a joint work with Carlos Cinelli and Elias Bareinboim, at ICML 2020. Like my previous paper on the topic, explained here, this work is quite technical, and requires a relatively strong background in statistics.

Nevertheless, the core ideas underlying our method are quite approachable. This post serves as a stand-alone introduction to the problem of identification in linear models, and gives a taste of our algorithm. It is my goal to make the first section accessible to anyone who can understand a scatter plot and a linear regression, the second section comprehensible to those with relatively strong mathematical background, and the remaining sections focused on those who have dealt with path analysis or linear SCM before.

You can view our presentation here.

## What Problem Are We Solving?

Suppose we’re medical researchers, and our goal is to find out if an existing medicine helps in fighting a new disease. We gather a dataset of people who voluntarily took various quantities of the drug to treat other conditions, and measure the amount of a biomarker in their blood (such as antibodies to the target virus). The more biomarker, the better. We get the following dataset:

The first thing that we do to analyze the data is to find a linear best-fit. By performing a least-squares regression, the slope of the line will tell us whether the amount of drug taken is positively correlated with the amount of biomarker. We fit $Y=\beta X+\epsilon$ to this data, giving $\beta =0.375$:

The slope is clearly positive, meaning that the people who took more of the drug fared better on average than those who didn’t. With this clear result, we happily give the new “miracle drug” to the entire population, which gives us a new dataset:

This new dataset (yellow points) seems to show the opposite result from the original data (blue points). There is a clear negative correlation between the amount of drug taken and the biomarker of interest, meaning that the drug actually hurts people!

What went wrong with the original dataset? Or, more importantly, given only the original data (blue points), could we have found out that this drug is harmful? The question of whether we can find the true causal effect from only observational data is usually called the problem of “identification”.

This problem is impossible to solve without information beyond the data, namely our knowledge of the context in which the original data was gathered. Here we will encode this knowledge through what are known as “Structural Causal Models”.

## Encoding Context: Structural Causal Models

Here, we focus on linear models. A linear Structural Causal Model (SCM) is a system of linear equations that encodes assumptions about the causal relationships between variables. When performing the linear regression, we implicitly assumed that the amount of drug taken, $X$, and the amount of biomarker in blood, $Y$, are causally related as follows:

$\begin{aligned}X&:={ϵ}_{x}\\ Y&:={\lambda }_{xy}X+{ϵ}_{y}\end{aligned}$

In the above, ${\lambda }_{xy}$ represents the direct causal effect of $X$ on $Y$. Here it is the amount that $Y$ changes per unit change in $X$. The $ϵ$ in the equations summarizes the effects of unobserved causes. The assignment operator ($:=$) was used here because the equations are causal, and their effect only goes one way. That means that if we were to change the value of $Y$ directly (perhaps by injecting the person with the desired antibodies), it would not affect the value of $X$ (the amount of medicine the person took), but changing the amount of medicine they take does change their antibody count. Critically, we assumed that ${ϵ}_{x}$ is independent of ${ϵ}_{y}$, meaning that there are no unobserved common causes of both variables.

Let’s see what a linear regression computes here. We set up the regression equation: $Y=\beta X+\epsilon$, with the hope that the solved value is $\beta ={\lambda }_{xy}$. A least-squares regression finds the value of $\beta$ that minimizes $\left(Y-\beta X\right)^{2}$ over the entire dataset. In other words,

$\beta =\underset{\beta }{\mathrm{argmin}}\frac{1}{n}\sum _{i=1}^{n}\left({y}_{i}-\beta {x}_{i}\right)^{2}=\underset{\beta }{\mathrm{argmin}}\mathbb{E}\left[\left(Y-\beta X\right)^{2}\right]$

Recall that the covariance between $X$ and $Y$ is defined as ${\sigma }_{xy}=\mathbb{E}\left[\left(X-\mathbb{E}\left[X\right]\right)\left(Y-\mathbb{E}\left[Y\right]\right)\right]=\mathbb{E}\left[XY\right]-\mathbb{E}\left[X\right]\mathbb{E}\left[Y\right]$. To simplify the math (without loss of generality), let’s assume that $X$ and $Y$ are normalized, meaning that the data has mean 0 and variance 1. This makes ${\sigma }_{xy}=\mathbb{E}\left[XY\right]$. With this, we can derive the solution for $\beta$ with a bit of calculus, which gives:

$\beta ={\sigma }_{xy}$

Start out by expanding out the least-squares equation, exploiting the fact that normalized variables have variance 1:

$\begin{aligned}\mathbb{E}\left[\left(Y-\beta X\right)^{2}\right]&=\mathbb{E}\left[YY-2\beta XY+{\beta }^{2}XX\right]\\ &=\mathbb{E}\left[YY\right]-2\beta \mathbb{E}\left[XY\right]+{\beta }^{2}\mathbb{E}\left[XX\right]\\ &=1+{\beta }^{2}-2\beta \mathbb{E}\left[XY\right]\\ &=1+{\beta }^{2}-2\beta {\sigma }_{xy}\end{aligned}$

Then, to find the value of $\beta$ that minimizes this, take the derivative:

$\begin{aligned}0&=\frac{\partial \mathbb{E}\left[\left(Y-\beta X\right)^{2}\right]}{\partial \beta }=\frac{\partial }{\partial \beta }\left(1+{\beta }^{2}-2\beta {\sigma }_{xy}\right)\\ &=2\beta -2{\sigma }_{xy}\\ \beta &={\sigma }_{xy}\end{aligned}$
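The result of this derivation can be checked numerically. The snippet below is a minimal sketch, not part of the paper: the data-generating line and the helper `normalize` are made up for illustration. It normalizes two variables, fits $Y=\beta X$ by least squares, and compares the slope with the sample version of $\mathbb{E}\left[XY\right]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: any linearly related pair works for this check.
x_raw = rng.normal(size=n)
y_raw = 0.5 * x_raw + rng.normal(size=n)

def normalize(v):
    """Shift to mean 0 and scale to variance 1 (illustrative helper)."""
    return (v - v.mean()) / v.std()

x, y = normalize(x_raw), normalize(y_raw)

# Least-squares slope of Y = beta * X (no intercept is needed: both means are 0).
beta = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

# Sample estimate of sigma_xy = E[XY] for the normalized variables.
sigma_xy = np.mean(x * y)

print(beta, sigma_xy)  # the two numbers agree up to floating-point error
```

Because both variables have variance exactly 1 in the sample, the slope and the covariance coincide exactly rather than only approximately.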

A regression computes the covariance between variables when variables are normalized… What does that have to do with ${\lambda }_{xy}$? It turns out that the covariances and structural coefficients (${\lambda }_{xy}$) in the equations underlying the data are deeply related:

${\sigma }_{xy}=\mathbb{E}\left[XY\right]=\mathbb{E}\left[X\left({\lambda }_{xy}X+{ϵ}_{y}\right)\right]={\lambda }_{xy}\mathbb{E}\left[XX\right]+\mathbb{E}\left[{ϵ}_{x}{ϵ}_{y}\right]={\lambda }_{xy}$

Here, the covariance of $X$ and $Y$ matches the causal effect of $X$ on $Y$. This seems to imply that the analysis done in section 1 was correct, so why did it give the wrong answer?

It turns out that the medicine in question is extremely expensive, so only the rich can afford to take large amounts. Wealthy people are more likely to get better anyway, simply because they don’t need to keep working while they’re sick, and can focus on recovery! The true model was in fact:

Since wealth was not gathered as part of the dataset, it is a latent confounder, represented by a correlation between the $ϵ$ values, and is shown in the causal graph as a bidirected dashed edge. Now, defining ${ϵ}_{xy}\equiv \mathbb{E}\left[{ϵ}_{x}{ϵ}_{y}\right]$, the regression gives us:

$\beta ={\sigma }_{xy}={\lambda }_{xy}\mathbb{E}\left[XX\right]+\mathbb{E}\left[{ϵ}_{x}{ϵ}_{y}\right]={\lambda }_{xy}+{ϵ}_{xy}$

The first term in the result, ${\lambda }_{xy}$, is the causal effect of the drug, which we are after; but the second, ${ϵ}_{xy}$, is the effect of wealth on both taking the drug and getting the biomarker, which is not what we wanted. Unfortunately, without any more information, it is impossible to disambiguate between the two (one equation, two unknowns), meaning that we can’t tell how much of the correlation comes from the causal effect of the drug, and how much from the confounding. In this situation, we call ${\lambda }_{xy}$ not identifiable. In other words, there is no way to find what would happen if the drug were given to everyone, independently of their wealth.
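A small simulation makes the bias concrete. This is a sketch under assumed numbers: the values of `lam_xy` and `eps_xy` are invented, and only $X$ is kept at unit variance, which is enough for the slope to equal ${\lambda }_{xy}+{ϵ}_{xy}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

lam_xy = -0.3  # assumed true causal effect (the drug hurts)
eps_xy = 0.6   # assumed covariance of the error terms (the effect of wealth)

# Draw (eps_x, eps_y) with unit variances and covariance eps_xy.
cov_matrix = [[1.0, eps_xy], [eps_xy, 1.0]]
eps_x, eps_y = rng.multivariate_normal([0.0, 0.0], cov_matrix, size=n).T

# Structural equations of the confounded model.
x = eps_x
y = lam_xy * x + eps_y

def slope(a, b):
    """Least-squares slope from regressing b on a."""
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

beta = slope(x, y)
print(beta)  # should land near lam_xy + eps_xy, with the opposite sign to lam_xy
```

The regression reports a positive slope even though the assumed causal effect is negative, which mirrors the blue and yellow datasets from the first section.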

One fix for this would be to gather the wealth data explicitly from every person in the study. However, some people might not want to share such information. All is not lost, though. Suppose each subject in the study has a doctor who recommended taking the drug lightly, strongly, or anywhere in between. The amount of the drug they took was directly influenced by their doctor’s recommendation. Critically, the decision of the doctor was not based on the wealth of the patients, or other confounders (e.g., the doctor was recommending the drug for another condition, unrelated to the biomarker of interest). If we gather a new dataset which includes this recommendation level, $Z$, we get:

In this situation, regressing $Y$ on $X$ is still biased. Nevertheless, we can combine the data in a different way to obtain the causal effect:

${\sigma }_{zy}=\mathbb{E}\left[ZY\right]=\mathbb{E}\left[Z\left({\lambda }_{xy}X+{ϵ}_{y}\right)\right]={\lambda }_{xy}\mathbb{E}\left[ZX\right]+\mathbb{E}\left[{ϵ}_{z}{ϵ}_{y}\right]$

${\sigma }_{zy}={\lambda }_{xy}{\sigma }_{zx}$

If our model is right, the desired causal effect can be solved as a ratio of two covariances/regressions, ${\lambda }_{xy}={\sigma }_{zy}/{\sigma }_{zx}$! This method is usually called the instrumental variable method, and is extremely common in the literature. Estimation of that ratio is typically achieved using 2-stage least squares.

### 2-Stage Least Squares

As the name suggests, rather than using a ratio of two regressions, a 2SLS uses the result of one regression to adjust the other. In particular, first the regression $X={\beta }_{x}Z+\epsilon$ is performed, which gives ${\beta }_{x}={\sigma }_{xz}$.

Then, the resulting ${\beta }_{x}Z$ is plugged into another regression equation, replacing $X$:

$Y={\beta }_{y}\left({\beta }_{x}Z\right)+\epsilon \phantom{\rule{1cm}{0ex}}$

If we were to perform a regression $Y=\beta Z+\epsilon$, we would get $\beta ={\sigma }_{zy}$. We know from the definition of least-squares that this value minimizes $\mathbb{E}\left[\left(Y-\beta Z\right)^{2}\right]$. So what value would minimize the compounded regression?

$\underset{{\beta }_{y}}{\mathrm{argmin}}\mathbb{E}\left[\left(Y-{\beta }_{y}\left({\sigma }_{xz}Z\right)\right)^{2}\right]=\underset{{\beta }_{y}}{\mathrm{argmin}}\mathbb{E}\left[\left(Y-\left({\beta }_{y}{\sigma }_{xz}\right)Z\right)^{2}\right]$

This is the same minimization as the single regression on $Z$, with ${\beta }_{y}{\sigma }_{xz}$ playing the role of $\beta$, so:

${\sigma }_{zy}={\beta }_{y}{\sigma }_{zx}$

Despite the procedure being different, the end effect is the same: the resulting value corresponds to the same ratio of covariances that was derived above.
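The equivalence between the covariance ratio and 2SLS can be seen in a short simulation. This is a sketch with invented parameter values (`lam_zx`, `lam_xy`, `eps_xy`) and illustrative helpers `cov` and `slope`. It assumes the instrumental variable model described above: $Z$ affects $X$, $X$ affects $Y$, and only ${ϵ}_{x}$ and ${ϵ}_{y}$ are correlated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

lam_zx, lam_xy, eps_xy = 0.8, -0.3, 0.6  # assumed values for illustration

# Z is independent of the confounded error terms (eps_x, eps_y).
z = rng.normal(size=n)
eps_x, eps_y = rng.multivariate_normal(
    [0.0, 0.0], [[1.0, eps_xy], [eps_xy, 1.0]], size=n
).T
x = lam_zx * z + eps_x
y = lam_xy * x + eps_y

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]

def slope(a, b):
    """Least-squares slope from regressing b on a."""
    return cov(a, b) / cov(a, a)

naive = slope(x, y)               # biased by the confounding
iv_ratio = cov(z, y) / cov(z, x)  # ratio of covariances

# 2SLS: first regress X on Z, then regress Y on the fitted values.
beta_x = slope(z, x)
x_hat = beta_x * z
two_sls = slope(x_hat, y)

print(naive, iv_ratio, two_sls)
```

The last two numbers are algebraically identical, and both should sit close to the assumed `lam_xy`, while the naive slope does not.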

In general, no single regression gives the desired answer; it is only through a clever combination of steps based on the underlying causal model that an adjustment formula can be derived, giving the causal effect.

Finding this adjustment formula, given our model of the world (structural equations and causal graph) is the goal of identification. If such a formula exists, an identification algorithm would return it, and if not, the algorithm would say that the desired effect cannot be found.

Most state-of-the-art methods for identification look for patterns in the causal graph which signal that mathematical tricks like the one shown above can be used to solve for a desired parameter. Such graphical methods focus on paths and flows between sets of variables, as will be demonstrated in the next section.

## Auxiliary Variables

Suppose we have the following structural model, and want to find the causal effect of $X$ on $Y$, ${\lambda }_{xy}$.

Expanding out the covariance between $X$ and $Y$ gives ${\sigma }_{xy}={\lambda }_{xy}+{\lambda }_{zx}{ϵ}_{zy}$. This is a combination of the causal effect ${\lambda }_{xy}$ and the back-door path from $X$, through $Z$, to $Y$, namely ${\lambda }_{zx}{ϵ}_{zy}$ (this can be read directly off the causal graph; see path analysis).

Experts in the field might notice that a simple regression of $Y$ on $X$ adjusting for $Z$ gives ${\lambda }_{xy}$ (known as the backdoor adjustment). We will approach the problem from a different angle, which turns out to be easily extensible to complex models where conditioning fails. Our goal is to somehow remove the effect of ${\lambda }_{zx}{ϵ}_{zy}$ from ${\sigma }_{xy}$, which will leave only the desired parameter.

We turn to Auxiliary Variables, which allow previously solved parameters to be used to help solve for others. In this example, ${\sigma }_{zx}={\lambda }_{zx}$, so we have solved the causal effect of $Z$ on $X$. We use this to define a new “Auxiliary” variable ${X}^{\ast }=X-{\lambda }_{zx}Z$, which subtracts the effect of $Z$ on $X$, and has a cool property:

${X}^{\ast }=X-{\lambda }_{zx}Z=\left({\lambda }_{zx}Z+{ϵ}_{x}\right)-{\lambda }_{zx}Z={ϵ}_{x}$

This new variable ${X}^{\ast }$ behaves as if it were not affected by $Z$ at all - as if the edge ${\lambda }_{zx}$ were missing in the causal graph. This means that the covariance of ${X}^{\ast }$ and $Y$ no longer includes the backdoor path, and can be used to compute the desired value similarly to an instrumental variable:

$\frac{{\sigma }_{{x}^{\ast }y}}{{\sigma }_{{x}^{\ast }x}}=\frac{\mathbb{E}\left[{X}^{\ast }Y\right]}{\mathbb{E}\left[{X}^{\ast }X\right]}=\frac{\mathbb{E}\left[{X}^{\ast }\left({\lambda }_{xy}X+{ϵ}_{y}\right)\right]}{\mathbb{E}\left[{X}^{\ast }X\right]}=\frac{{\lambda }_{xy}\mathbb{E}\left[{X}^{\ast }X\right]}{\mathbb{E}\left[{X}^{\ast }X\right]}={\lambda }_{xy}$
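The auxiliary variable construction can be sketched in a few lines. The model here is an assumption reconstructed from the equations above: $Z$ affects $X$, $X$ affects $Y$, and ${ϵ}_{z}$ is correlated with ${ϵ}_{y}$. All numeric values and the helper `cov` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

lam_zx, lam_xy, eps_zy = 0.7, 0.4, 0.5  # assumed values for illustration

# eps_z and eps_y are correlated (latent confounding between Z and Y).
eps_z, eps_y = rng.multivariate_normal(
    [0.0, 0.0], [[1.0, eps_zy], [eps_zy, 1.0]], size=n
).T
eps_x = rng.normal(size=n)

z = eps_z
x = lam_zx * z + eps_x
y = lam_xy * x + eps_y

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]

# Plain regression of Y on X picks up the back-door path through Z.
naive = cov(x, y) / cov(x, x)

# Step 1: solve the edge Z -> X by regression.
lam_zx_hat = cov(z, x) / cov(z, z)

# Step 2: build the auxiliary variable X* = X - lam_zx * Z.
x_star = x - lam_zx_hat * z

# Step 3: use X* like an instrument for the effect of X on Y.
lam_xy_hat = cov(x_star, y) / cov(x_star, x)

print(naive, lam_xy_hat)
```

With these assumed values, the naive slope overshoots, while the ratio built from ${X}^{\ast }$ should recover a value close to `lam_xy`.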

This method misses some simple cases, though.

## Total Effect AV

And here, finally, is where our paper’s contributions begin. In the following example, the edges incoming into $X$ in the causal graph, ${\lambda }_{wx}$ and ${\lambda }_{zx}$, cannot be solved, meaning that no AV ${X}^{\ast }$ can be created:

Nevertheless, the total effect of $W$ on $X$, denoted by ${\delta }_{wx}={\lambda }_{wx}+{\lambda }_{wz}{\lambda }_{zx}$, is simply ${\sigma }_{wx}$, so we can create a new type of AV,

${X}^{†}=X-{\delta }_{wx}W$

Once again, a bit of math shows that this AV can be used to successfully solve for the desired ${\lambda }_{xy}$:

$\begin{aligned}{\sigma }_{{x}^{†}y}&=\mathbb{E}\left[{X}^{†}Y\right]=\mathbb{E}\left[{X}^{†}\left({\lambda }_{xy}X+{ϵ}_{y}\right)\right]={\lambda }_{xy}\mathbb{E}\left[{X}^{†}X\right]+\mathbb{E}\left[{X}^{†}{ϵ}_{y}\right]\\ &={\lambda }_{xy}{\sigma }_{{x}^{†}x}+\mathbb{E}\left[\left(X-{\delta }_{wx}W\right){ϵ}_{y}\right]={\lambda }_{xy}{\sigma }_{{x}^{†}x}+\mathbb{E}\left[X{ϵ}_{y}\right]-{\delta }_{wx}\mathbb{E}\left[W{ϵ}_{y}\right]\\ &={\lambda }_{xy}{\sigma }_{{x}^{†}x}+\mathbb{E}\left[\left({\lambda }_{wx}W+{\lambda }_{zx}\left({\lambda }_{wz}W+{ϵ}_{z}\right)+{ϵ}_{x}\right){ϵ}_{y}\right]-{\delta }_{wx}\mathbb{E}\left[W{ϵ}_{y}\right]\\ &={\lambda }_{xy}{\sigma }_{{x}^{†}x}+\left({\lambda }_{wx}+{\lambda }_{wz}{\lambda }_{zx}\right)\mathbb{E}\left[W{ϵ}_{y}\right]-{\delta }_{wx}\mathbb{E}\left[W{ϵ}_{y}\right]\\ &={\lambda }_{xy}{\sigma }_{{x}^{†}x}\end{aligned}$
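The same check can be run for the total-effect AV. The graph for this example is not reproduced here, so the model below is an assumption pieced together from the derivation: $W$ affects $Z$ and $X$, $Z$ affects $X$, $X$ affects $Y$, ${ϵ}_{w}$ is correlated with ${ϵ}_{y}$, and ${ϵ}_{z}$ is correlated with ${ϵ}_{x}$ (which is what keeps ${\lambda }_{zx}$ and ${\lambda }_{wx}$ from being solved individually). All numeric values are invented.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Assumed structural coefficients and error covariances.
lam_wz, lam_wx, lam_zx, lam_xy = 0.6, 0.5, 0.7, 0.4
eps_wy, eps_zx = 0.5, 0.4

def correlated_pair(c):
    """Two unit-variance normal error terms with covariance c."""
    return rng.multivariate_normal([0.0, 0.0], [[1.0, c], [c, 1.0]], size=n).T

eps_w, eps_y = correlated_pair(eps_wy)  # confounding between W and Y
eps_z, eps_x = correlated_pair(eps_zx)  # confounding between Z and X

w = eps_w
z = lam_wz * w + eps_z
x = lam_wx * w + lam_zx * z + eps_x
y = lam_xy * x + eps_y

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]

naive = cov(x, y) / cov(x, x)

# Total effect of W on X, identified by a plain regression of X on W.
delta_wx_hat = cov(w, x) / cov(w, w)

# Total-effect auxiliary variable: subtract everything W contributes to X.
x_dagger = x - delta_wx_hat * w

lam_xy_hat = cov(x_dagger, y) / cov(x_dagger, x)

print(delta_wx_hat, naive, lam_xy_hat)
```

Under these assumptions, `delta_wx_hat` should be close to `lam_wx + lam_wz * lam_zx`, and the final ratio should be close to `lam_xy` even though neither edge into $X$ was solved on its own.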

Of course, the situation becomes much more complex once there are multiple back-door paths between $X$ and $Y$. The following graph has back-door paths between $X$ and $Y$ passing through $A$ and $D$, but if we try defining a total-effect AV by individually subtracting out total effects, ${X}^{‡}=X-{\delta }_{ax}A-{\delta }_{dx}D$, we get a biased answer:

The key insight needed to solve this problem is that the total effect of $A$ on $X$ has a part passing through $D$ - but that part of the effect is already removed when subtracting ${\delta }_{dx}D$! This means that the correct approach is subtracting out a partial effect - the portion of the effect of $A$ on $X$ that does not pass through $D$, denoted by ${\delta }_{ax.d}$ (and likewise ${\delta }_{dx.a}$ for $D$), which gives the AV ${X}^{†}=X-{\delta }_{ax.d}A-{\delta }_{dx.a}D$.