---
title: "Estimating controlled direct effects"
date: "`r Sys.Date()`"
link-citations: yes
bibliography: DirectEffects.bib
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Estimating controlled direct effects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r loadpkg, echo = FALSE, include = FALSE}
library(DirectEffects)
```

This vignette demonstrates how to estimate the controlled direct effect of some treatment, net of the effect of some mediator, using the sequential g-estimation approach described by @AchBlaSen16. With standard regression and matching approaches, estimating the controlled direct effect is difficult because it often requires adjusting for covariates that causally affect the mediator and are causally affected by the treatment. Including such variables in a regression model can lead to post-treatment bias when estimating the direct effect of treatment. A large class of models and estimation approaches has been developed to avoid this problem, but here we focus on one called sequential g-estimation. This approach uses a standard linear model to estimate the effect of the mediator and then removes that effect from the dependent variable to estimate the direct effect of treatment without post-treatment bias.

## Quantity of interest

Here we briefly introduce some of the statistical theory behind estimating direct effects. The main quantity of interest is the average controlled direct effect, or ACDE. It can be illustrated with the following causal structure [@AchBlaSen16, Figure 3]:

![DAG with direct effects structure](figures/ABS_fig3.png){width=600px}

The average controlled direct effect, defined for a given treatment contrast ($A_i = a$ vs. $A_i = a^\prime$) and a given value of the mediator ($M_i = m$), is
$$ACDE(a, a^\prime, m) = E[Y_i(a, m) - Y_i(a^\prime, m)]$$
and corresponds to the effect transmitted along the dashed lines in the figure above. Thus, we hold the mediator constant when comparing the outcome under each treatment. The ACDE is useful for discriminating between causal mechanisms because the average total effect of treatment, $\tau(a, a^\prime) \equiv E[Y_i(a) - Y_i(a^\prime)]$, can be decomposed as the sum of three quantities: (1) the ACDE, (2) the average natural indirect effect, and (3) a reference interaction that measures how much the direct effect of $A_i$ depends on the particular value $M_i = m$.
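To make this concrete, one way to write the decomposition is the following sketch, where $M_i(a)$ denotes the potential value of the mediator under treatment $a$ (the notation here is ours; see @AchBlaSen16 for the formal statement):

$$
\begin{aligned}
\tau(a, a^\prime) ={} & \underbrace{E[Y_i(a, m) - Y_i(a^\prime, m)]}_{\text{ACDE}} \\
 & + \underbrace{E[Y_i(a, M_i(a)) - Y_i(a, M_i(a^\prime))]}_{\text{natural indirect effect}} \\
 & + \underbrace{E[Y_i(a, M_i(a^\prime)) - Y_i(a^\prime, M_i(a^\prime))] - E[Y_i(a, m) - Y_i(a^\prime, m)]}_{\text{reference interaction}}
\end{aligned}
$$

The three terms sum to $\tau(a, a^\prime)$ because $Y_i(a) = Y_i(a, M_i(a))$.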
A non-zero ACDE is thus evidence that the effect of $A_i$ is not entirely explained by $M_i$. We illustrate this below with an empirical example. For more information, see @AchBlaSen16.

## Implementation

This vignette introduces a new way to design, specify, and implement various estimators for controlled direct effects using a coding style that we refer to as the "assembly line." Users specify the various components of the estimator sequentially and can easily swap out implementation details, such as which estimators are used for each nuisance function or what arguments are passed to those functions. A unified structure and output across different research designs and estimators should make comparing estimators straightforward. Finally, this implementation is built around the idea of cross-fitting for nuisance function estimation. Let's see how this works in practice.

We first load a subset of the `jobcorps` data from the package:

```{r data}
data("jobcorps")
jc <- jobcorps[
  1:200,
  c("treat", "female", "age_cat", "work2year2q",
    "pemplq4", "emplq4", "exhealth30")
]
```

This is the Job Corps experiment data used by @Hub14 to estimate the effect of a randomized job training program on self-assessed health, holding constant the intermediate outcome of employment. We use a subset of the variables from that study; the basic variables we use here are:

* $Y_i$ is an indicator for whether participants reported "very good" health 2.5 years after randomization (`exhealth30`)
* $A_i$ is an indicator for assignment to the job training program (`treat`)
* $M_i$ is an indicator for employment 1 to 1.5 years after assignment (`work2year2q`)
* $Z_i$ are the post-treatment, pre-mediator intermediate confounders (`emplq4`, `pemplq4`)
* $X_i$ are the pre-treatment characteristics (`female`, `age_cat`)

In the context of our CDE estimators, we refer to both $A_i$ (the treatment) and $M_i$ (the mediator) as *treatment variables* because we are interested in causal effects that contrast different combinations of these two variables.

The first step in our assembly line is to choose the type of CDE estimator (or, really, estimation strategy) that we want to pursue. These choices include IPW (`cde_ipw()`), regression imputation (`cde_reg_impute()`), and augmented IPW (`cde_aipw()`), among others. These functions take different arguments to specify different aspects of the estimator. The AIPW estimator is useful to demonstrate because it uses many of the features of the assembly-line approach.

```{r aipw}
my_aipw <- cde_aipw() |>
  set_treatment(treat, ~ female + age_cat) |>
  treat_model(engine = "logit") |>
  outreg_model(engine = "lm") |>
  set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
  treat_model(engine = "logit") |>
  outreg_model(engine = "lm") |>
  estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)
```

The call here consists of several steps connected by R's pipe operator. The first call, `cde_aipw()`, simply initializes the estimator as taking the AIPW form, meaning that it will use both propensity score models and outcome regressions to estimate the causal effects. Following this are two three-line blocks that look similar but differ in a few details. These "treatment blocks" specify the causal variables of interest **in their causal ordering**. In `set_treatment()` we set the name of the treatment variable and give a formula describing the pre-treatment covariates for this treatment. This formula, unless overridden by the next two functions, will be used in fitting the propensity score and outcome regression models. `treat_model()` allows the user to specify how to fit the propensity score model for this treatment, with a choice of engine and optional arguments to pass to that engine. `outreg_model()` does the same for the outcome regression. We repeat this specification for the mediator. The `estimate()` function finally implements the specifications given in the previous commands. Users pass a formula that sets the outcome and the treatment variables for which effects are desired, and indicate the data frame where all of these variables can be found. By default, the nuisance functions (the propensity score models and the outcome regressions) are fit using cross-fitting.
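As a point of comparison, here is a sketch of what an IPW version of the same specification might look like. The block is illustrative and not evaluated: it assumes that `cde_ipw()` accepts the same treatment blocks as above and that, because IPW relies only on the propensity scores, the `outreg_model()` steps can simply be dropped.

```{r ipw_sketch, eval = FALSE}
# Illustrative sketch (not evaluated): an IPW analogue of the AIPW call above.
# IPW uses only the propensity score models, so we assume the
# outcome-regression (outreg_model) steps can be omitted.
my_ipw <- cde_ipw() |>
  set_treatment(treat, ~ female + age_cat) |>
  treat_model(engine = "logit") |>
  set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
  treat_model(engine = "logit") |>
  estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)
```

The rest of the workflow below is shown for the AIPW fit, `my_aipw`.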
We can use either the `summary()` or `tidy()` function to get a summary of the different effects being estimated:

```{r summary}
summary(my_aipw)
tidy(my_aipw)
```

Because the `estimate()` function included both `treat` and `work2year2q` in the formula, the output includes both the controlled direct effects of the treatment and the conditional average treatment effect of the mediator. Furthermore, by default, the output contains an effect estimate for each combination of the other treatment variables. This can be changed by setting the `eval_vals` argument in the relevant `set_treatment()` call.

The modular structure of the call allows users to easily swap out parts of the models. For example, we could alter the above call to use lasso versions of the regression and logit models.

```{r aipw_lasso, eval = FALSE}
my_aipw_lasso <- cde_aipw() |>
  set_treatment(treat, ~ female + age_cat) |>
  treat_model(engine = "rlasso_logit") |>
  outreg_model(engine = "rlasso") |>
  set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
  treat_model(engine = "rlasso_logit") |>
  outreg_model(engine = "rlasso") |>
  estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)
```

Here we use a "rigorous" version of the lasso from the `hdm` package to speed up computation. There are a number of different engines for different outcome and treatment types, including:

- GLMs: `lm` (OLS), `logit` (logistic regression), `multinom` (multinomial logit)
- LASSO estimators (from `glmnet`): `lasso`, `lasso_logit`, `lasso_multinom`
- Rigorous LASSO estimators (from `hdm`): `rlasso`, `rlasso_logit`

## References