3 Selection on Observables (I)

Randomisation is a powerful tool because it means that confounders can be safely ignored by researchers as, in expectation, they will be balanced across treatment and control groups. Sadly, some of the most interesting social science questions cannot be addressed using randomised experiments (Why? First, because experiments are costly, and second, because it would be bad form to randomly assign, for instance, the institutions that govern a country’s electoral system, or whether you get a distinction in your degree). When it is not possible to randomise, how can we make valid causal inferences? In the next two lectures, we discuss methods for non-experimental data which assume that selection into treatment groups is based on observable factors. This week we focus on subclassification and matching.

For a theoretical discussion of the selection on observables identification strategy, the MHE chapter on regression is very good (especially page 51 onwards). There is also a very nice exposition of the conditional independence assumption as it specifically relates to matching in this paper by Jasjeet S Sekhon (that paper also has a fantastic title). For practical advice on matching, the best resource is probably the Elizabeth Stuart paper. This paper gives lots of straightforward recommendations about the different decisions one has to make when implementing different matching estimators. Applied examples are found in Eggers and Hainmueller (2009) and Dehejia and Wahba (1999). There is also a famous example of subclassification in this 1968 paper by Cochran.

Finally, in the lecture we discuss the problem of conditioning on post-treatment variables when estimating causal effects. There are many papers on this subject, but two of the more accessible ones are this one by Montgomery et al, which focuses on post-treatment bias in experimental settings; and this one by Acharya et al, which focuses on post-treatment bias in observational settings. Both are worth reading.


3.1 Seminar

Data

We will use two datasets this week, each drawn from one of the papers on the reading list. Download these datasets now from the links at the top of the page, and store them in the data folder that you created last week.

Installing and loading packages

This week we will be using some additional functions that do not come pre-installed with R. There are many additional packages for R covering many different types of quantitative analysis. Today we will be using the MatchIt package, which provides several helpful functions for implementing various matching strategies (together with the rgenoud package, which is needed for genetic matching). You will also need the foreign package, which allows R to read in data of different formats (such as Stata and SPSS).

To get started, you will need to install these packages on whatever computer you are using. Note: you only need to install a package on a computer once. Do not run this code every time you run your R script!

# For the matching
install.packages("MatchIt")
install.packages("rgenoud") # needed for genetic matching

# For data loading
install.packages("foreign")

Once installed, you need to load the packages using the library() function.

library(MatchIt)
library(foreign)

# For the formatting of some tables we will also use the following
library(sjPlot)
library(kableExtra)
library(magrittr)

Finally, as this problem set will involve some random number generation, we will use the set.seed() function at the top of our scripts to ensure that the results remain the same every time we run them. You specify the function as follows:

set.seed(12345)

where you can have any number in the place of 12345.
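
For instance, if you run the two lines below repeatedly, sample() will return the same three numbers every time, because the seed fixes the state of R's random number generator:

set.seed(12345)
sample(1:10, 3) # identical draws on every run that starts from this seed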

The main function we will use is matchit(), from the MatchIt package. The function takes a number of arguments, the most important of which are listed in the table below. Note that, depending on the matching method, not all of the arguments will be needed.

Arguments to the matchit() function.

Argument   Purpose
formula    A two-sided formula in the form of treatment_variable ~ covariates.
data       A data frame containing the variables in formula.
method     The matching method to be used. Available options include, among others, "exact", "nearest" and "genetic".
distance   If applicable, the distance measure to be used.
estimand   Either "ATT" to calculate the average treatment effect for the treated (the default), "ATE" for the average treatment effect, or "ATC" for the average treatment effect for the controls.
replace    For methods that allow it, should matched control observations be re-used?
ratio      For methods that allow it, how many control units should be matched to each treated unit in k:1 matching?
m.order    For methods that allow it, the order in which matching takes place. Set to "random" when ties should be broken randomly.
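
To see how these arguments fit together, the call below is a minimal illustrative sketch; my_data, treatment, x1 and x2 are placeholder names rather than objects from this week's data.

# Illustrative sketch only: 'my_data', 'treatment', 'x1' and 'x2' are placeholders
m.out <- matchit(treatment ~ x1 + x2,
                 data = my_data,
                 method = "nearest",       # nearest neighbour matching
                 distance = "mahalanobis", # distance measure between units
                 estimand = "ATT",         # target quantity
                 ratio = 1,                # 1:1 matching
                 replace = TRUE)           # allow controls to be re-used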

3.1.1 MPs for Sale? – Eggers and Hainmueller (2009)

What is the monetary value of serving as an elected politician? It is firmly a part of received wisdom that politicians often use their positions in public office for self-serving purposes. There is much evidence that private firms profit from their connections with influential politicians, but evidence that politicians benefit financially because of their political positions is thin. Eggers and Hainmueller (2009) seek to address this question by asking: what are the financial returns to serving in parliament? They study data from the UK political system, and compare the wealth at the time of death for individuals who ran for office and won (MPs) to individuals who ran for office and lost (candidates) to draw causal inferences about the effects of political office on wealth.

The data from this study is in .Rdata format. Provided that you have downloaded the file into your data folder (and you have correctly set your working directory), it can be loaded using the load() function as follows:

load("data/eggers_mps.Rdata")

This data – which is stored in the mps object – includes observations of 425 individuals. There are variables for the main outcome of interest – the (log) wealth at the time of the individual’s death (lnrealgross) – and for the treatment – whether the individual was elected to parliament (treated == 1) or failed to win their election (treated == 0). The data also include information on a rich set of covariates.
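
A quick way to confirm that the data have loaded correctly is to check the dimensions and the treatment/control split:

dim(mps)           # 425 rows (individuals)
table(mps$treated) # 260 control (0) and 165 treated (1) individuals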

The names and descriptions of variables are:

  1. labour – binary (1 if the individual was a member of the Labour party, 0 otherwise)
  2. tory – binary (1 if the individual was a member of the Conservative party, 0 otherwise)
  3. yob – year of birth
  4. yod – year of death
  5. female – binary (1 if the individual was female, 0 otherwise)
  6. aristo – binary (1 if the individual held an aristocratic title, 0 otherwise)
  7. Variables pertaining to the secondary education of the individual, all prefixed scat_ (see the paper for details)
  8. Variables pertaining to the university education of the individual, all prefixed ucat_ (see the paper for details)
  9. Variables pertaining to the pre-treatment occupation of the individual, all prefixed oc_ (see the paper for details)
  1. Estimate the average treatment effect using either a t.test or a bivariate linear regression. What is it? Is it significantly different from zero? Can we interpret this as an unbiased estimate of the causal effect of elected office on wealth?
t.test(mps$lnrealgross[mps$treated ==  1],
       mps$lnrealgross[mps$treated ==  0])
## 
##  Welch Two Sample t-test
## 
## data:  mps$lnrealgross[mps$treated == 1] and mps$lnrealgross[mps$treated == 0]
## t = 4.9215, df = 333.28, p-value = 1.353e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3108102 0.7247022
## sample estimates:
## mean of x mean of y 
##  12.93624  12.41848
summary(lm(lnrealgross ~ treated, data = mps))
## 
## Call:
## lm(formula = lnrealgross ~ treated, data = mps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9853 -0.4925 -0.0192  0.4244  3.5180 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.41848    0.06466  192.07  < 2e-16 ***
## treated      0.51776    0.10377    4.99 8.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.043 on 423 degrees of freedom
## Multiple R-squared:  0.05558,    Adjusted R-squared:  0.05335 
## F-statistic:  24.9 on 1 and 423 DF,  p-value: 8.853e-07

The regression suggests that the log gross wealth of those in the treated group – those individuals who were elected to parliament – was 0.52 points higher than for those in the control group. In other words, elected candidates were worth, on average, about 68% (i.e. (exp(0.518)-1)*100) more at death than candidates who failed to win elections. The difference is significantly different from zero.
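
If you would like to reproduce that percentage calculation directly from the fitted model, you can transform the coefficient as follows:

# Convert the log-point coefficient into an approximate percentage difference
fit <- lm(lnrealgross ~ treated, data = mps)
(exp(coef(fit)["treated"]) - 1) * 100 # roughly 68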

This difference clearly cannot be interpreted as causal, as there are many potentially confounding differences between those who are elected and those who are not. That is, the estimate from this regression is likely to be subject to selection bias.

  2. Using the matchit() function from the MatchIt package, carry out exact 1:1 matching to estimate the average treatment effect on the treated. Match on gender, aristocratic title, and whether the individual received their secondary education at Eton. What is the ATT? How many matched observations are there? Code Hint: Use the command match.out <- matchit([formula], [data], method = "exact", estimand = "ATT", ratio = 1). Then extract the data with match.data() and estimate the ATT with lm().
## matching
match.out <- matchit(treated ~ female + aristo + scat_eto, 
                     data = mps,
                     method = "exact",
                     estimand = "ATT",
                     ratio = 1) # 1:1 

## look at the number of observations after matching
summary(match.out)$nn 
##                Control Treated
## All (ESS)     260.0000     165
## All           260.0000     165
## Matched (ESS) 153.5217     163
## Matched       260.0000     163
## Unmatched       0.0000       2
## Discarded       0.0000       0
## extract data
matched.data <- match.data(match.out) 

## estimate model with matched data, without covariates and with weights
est <- lm(lnrealgross ~ treated, data = matched.data, weights = weights)
coef(est)[2]
##   treated 
## 0.5439035

The ATT (the coefficient associated with the variable treated) is equal to 0.5439. There are 163 matched treated observations. This number follows from the fact that we are matching to the treated units, of which there are 165 in the original data, and that two treated units were dropped because they could not be matched on these covariates. (If you are curious, try to find out who these individuals are in the mps data using mps[which(match.out$weights==0),]. Spoiler alert: they were both male aristocrats who had gone to Eton.)

  3. Evaluate the balance between treated and control observations on gender, aristocratic title, and whether the individual received their secondary education at Eton. Do this for the raw data, and then for the matched data, by regressing each covariate on the treatment variable. Code Hint: For the matched data, you need to specify the weights.
## Pre-match balance
m.female <- lm(female ~ treated, data = mps)
m.aristo <- lm(aristo ~ treated, data = mps)
m.scat_eto <- lm(scat_eto ~ treated, data = mps)

## Post-match balance
m.female.match <- lm(female ~ treated, data = matched.data, weights = weights)
m.aristo.match <- lm(aristo ~ treated, data = matched.data, weights = weights)
m.scat_eto.match <- lm(scat_eto ~ treated, data = matched.data, weights = weights)

# You can look at each model individually with summary(model.name).
# Alternatively, below is some code to put this all in a neat table. 
# PS: the code is horrible to look at, I know. 

female <- get_model_data(m.female,"est",terms = c("treated"))
female.match <- get_model_data(m.female.match,"est",terms = c("treated"))

aristo <- get_model_data(m.aristo,"est",terms = c("treated"))
aristo.match <- get_model_data(m.aristo.match,"est",terms = c("treated"))

scat_eto <- get_model_data(m.scat_eto,"est",terms = c("treated"))
scat_eto.match <- get_model_data(m.scat_eto.match,"est",terms = c("treated"))

out <- data.frame(
  "var" = c("female","aristo","scat_eto"),
  "raw" = c(paste0(female$p.label," (",round(female$std.error,2),")"),
            paste0(aristo$p.label," (",round(aristo$std.error,2),")"),
            paste0(scat_eto$p.label," (",round(scat_eto$std.error,2),")")),
  "match" = c(paste0(female.match$p.label," (",round(female.match$std.error,2),")"),
              paste0(aristo.match$p.label," (",round(aristo.match$std.error,2),")"),
              paste0(scat_eto.match$p.label," (",round(scat_eto.match$std.error,2),")"))
)

colnames(out) <- c("Variable", "Raw data","Matched data")
kable(out, booktabs =T) %>% kable_paper() 
Variable Raw data Matched data
female -0.01 (0.02) -0.00 (0.02)
aristo 0.06 *** (0.02) -0.00 (0.02)
scat_eto 0.10 *** (0.02) -0.00 (0.02)

In the raw data, there is significant imbalance between treatment and control groups with respect to the aristo and scat_eto variables. After matching, there is no imbalance between matched treatment and control groups with respect to any of these variables. This makes sense, as the exact matching process implies that the distribution of these covariates should be identical in the treatment and control groups.

  4. Apply the summary() function to the output from matchit() and create a plot of the standardised mean differences. Code Hint: Have a look at the help file with ?plot.summary.matchit.
# summary(match.out) 
plot(summary(match.out), 
     abs = F,
     var.order = "unmatched")

The output here is more detailed, and easier to read, than the balance regressions we ran above. Of particular interest is the standardised mean difference (Std. Mean Diff.) column of the output, which reports the quantity that we saw in lecture for assessing balance before and after matching. As a reminder, this quantity is equal to:

\[ \text{bias}_{X_i} = \frac{\bar{X}_t - \bar{X}_c}{\sigma_{t}} \]

where \(\bar{X}_t\) is the mean of the covariate in the treatment group, \(\bar{X}_c\) is the covariate mean in the control group, and \(\sigma_{t}\) is the standard deviation of \(X\) in the treated group. Ideally, as is the case here, we should see the standardized difference in means between treated and control groups decrease significantly after matching for all covariates. This implies that the groups are comparable on average, at least with respect to these observable factors, as a result of the matching process.
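
To make the formula concrete, here is how you could compute the pre-matching standardised difference by hand for one covariate, say aristo:

# Standardised mean difference for 'aristo' in the raw data
x_t <- mps$aristo[mps$treated == 1]
x_c <- mps$aristo[mps$treated == 0]
(mean(x_t) - mean(x_c)) / sd(x_t)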

As we can also see in the plot, the standardized difference in means is zero for all covariates in the matched sample, which makes sense given that we are using exact matching!

  5. Rematch the data, this time expanding the list of covariates to include all of the schooling, university and occupation categories. Use exact matching again. What is the ATT? How many observations remain in the matched data?
## matching
match.out2 <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                        scat_nm + scat_sec + oc_teacherall + oc_barrister +
                        oc_solicitor + oc_dr + oc_civil_serv +
                        oc_local_politics + oc_business + oc_white_collar +
                        oc_union_org + oc_journalist + oc_miner + ucat_ox +
                        ucat_deg + ucat_nm, 
                      data = mps,
                      method = "exact",
                      estimand = "ATT",
                      ratio = 1) 
summary(match.out2)$nn 
##                 Control Treated
## All (ESS)     260.00000     165
## All           260.00000     165
## Matched (ESS)  84.27511     110
## Matched       156.00000     110
## Unmatched     104.00000      55
## Discarded       0.00000       0
## extract data
matched.data2 <- match.data(match.out2) 

## estimate model with matched data 
est2 <- lm(lnrealgross ~ treated, data = matched.data2, weights = weights)
coef(est2)[2]
##   treated 
## 0.6775976

The ATT has increased to 0.68, and there are now far fewer matched observations (110).

  6. Eggers and Hainmueller run the matching analysis separately for Labour and Conservative candidates. Do this now and try to replicate the 3\(^{rd}\) and 6\(^{th}\) columns of Table 3 in their paper. As you will see in the paper, they use M=1 genetic matching, but you can get quite close to this result using nearest neighbour 1:1 matching with replacement and the Mahalanobis distance. Adapt the code given below to do this. In their analysis, they do not only match on the categorical variables we have used so far; they also use the continuous variables for year of birth and year of death (yob, yod). Include these in the new matching procedure.
## Template: 1:1 nearest neighbour matching with Mahalanobis distance and
## replacement. Substitute 'treatment', 'covariates' and 'data' as needed.
match.output <- matchit(treatment ~ covariates, 
                        data = data,
                        method = "nearest",
                        distance = "mahalanobis",
                        estimand = "ATT",
                        ratio = 1,
                        replace = T) 
labour_mps <- mps[mps$labour==1,]
tory_mps <- mps[mps$tory==1,]

## matching labour
match.lab <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                       scat_nm + scat_sec + oc_teacherall + oc_barrister +
                       oc_solicitor + oc_dr + oc_civil_serv +
                       oc_local_politics + oc_business + oc_white_collar +
                       oc_union_org + oc_journalist + oc_miner + ucat_ox +
                       ucat_deg + ucat_nm + yob + yod, 
                     data = labour_mps,
                     method = "nearest",
                     estimand = "ATT",
                     distance = "mahalanobis",
                     ratio = 1,
                     replace = T) 
# extract data
matched.lab <- match.data(match.lab) 

# estimate model with matched data
est.lab <- lm(lnrealgross ~ treated, data = matched.lab, weights = weights)

## matching tories
match.tory <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                        scat_nm + scat_sec + oc_teacherall + oc_barrister +
                        oc_solicitor + oc_dr + oc_civil_serv +
                        oc_local_politics + oc_business + oc_white_collar +
                        oc_union_org + oc_journalist + oc_miner + ucat_ox +
                        ucat_deg + ucat_nm + yob + yod, 
                      data = tory_mps,
                      method = "nearest",
                      estimand = "ATT",
                      distance = "mahalanobis",
                      ratio = 1,
                      replace = T) 
# extract data
matched.tory <- match.data(match.tory) 

# estimate model with matched data
est.tory <- lm(lnrealgross ~ treated, data = matched.tory, weights = weights)

## output results
lab <- get_model_data(est.lab,"est",terms = c("treated"))
con <- get_model_data(est.tory,"est",terms = c("treated"))

out <- data.frame(
  "n" = c("ATT","Std. Error"),
  "cp"= c(paste0(round(con$estimate,2),con$p.stars),round(con$std.error,2)),
  "lp"= c(paste0(round(lab$estimate,2),lab$p.stars),round(lab$std.error,2))
)
colnames(out) <- c(" ", "Conservatives","Labour")

kable(out, booktabs = T) %>% kable_paper()
Conservatives Labour
ATT 1.04*** 0.16
Std. Error 0.25 0.15

The ATT for Labour candidates is 0.16, and is not statistically significant. The ATT for Conservative candidates is 1.04, and is significantly different from zero. Both estimates are quite close to the numbers reported in Table 3 of the paper.

For those of you who are interested, you can replicate the genetic matching approach that they use in the paper by setting method = "genetic". You should also provide a large value for pop.size, for instance 1000; note, however, that this means it will take a while to run. You can decrease the number if you want, but the estimates may then be quite imprecise.

## labour
match.gen.lab <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                           scat_nm + scat_sec + oc_teacherall + oc_barrister +
                           oc_solicitor + oc_dr + oc_civil_serv +
                           oc_local_politics + oc_business + oc_white_collar +
                           oc_union_org + oc_journalist + oc_miner + ucat_ox +
                           ucat_deg + ucat_nm + yob + yod, 
                         data = labour_mps,
                         method = "genetic",
                         estimand = "ATT",
                         ratio = 1,
                         replace = T,
                         pop.size = 1000) 
matched.gen.lab <- match.data(match.gen.lab) 
est.gen.lab <- lm(lnrealgross ~ treated, data = matched.gen.lab, weights = weights)

## tories
match.gen.tory <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                            scat_nm + scat_sec + oc_teacherall + oc_barrister +
                            oc_solicitor + oc_dr + oc_civil_serv +
                            oc_local_politics + oc_business + oc_white_collar +
                            oc_union_org + oc_journalist + oc_miner + ucat_ox +
                            ucat_deg + ucat_nm + yob + yod, 
                          data = tory_mps,
                          method = "genetic",
                          estimand = "ATT",
                          ratio = 1,
                          replace = T,
                          pop.size = 1000) 
matched.gen.tory <- match.data(match.gen.tory) 
est.gen.tory <- lm(lnrealgross ~ treated, data = matched.gen.tory, weights = weights)

# If you ran this, the results would be as in the table below 
Conservatives Labour
ATT 0.85 *** 0.15
Std. Error 0.25 0.15

3.1.2 National Supported Work – Dehejia and Wahba (1999)

The National Supported Work programme was a federal programme in the US which provided work experience to individuals who had faced economic problems in the past. Individuals were randomly assigned to participate in the programme from a pool of applicants between 1975 and 1977, and both treatment and control individuals were interviewed in 1978 and asked about the amount of money they were currently earning.

The file NSW.dw.obs.dta (seminar data 2, above) contains data from an observational study constructed by Lalonde (1986), who replaced the randomised control group with a comparison group drawn from two national public surveys. The idea here is to see whether it is possible to use these observational data to construct unbiased estimates of the effect of the NSW treatment by implementing a matching estimator.

Dehejia and Wahba (1999) used these data to evaluate the potential efficacy of matching. In this task, you will replicate some of Dehejia and Wahba’s findings. The following variables are included in the NSW observational dataset (NSW.dw.obs.dta):

  • treat (1 = experimental treatment group; 0 = observational comparison group)
  • age (age in years)
  • educ (years of schooling)
  • black (1 if black; 0 otherwise)
  • hisp (1 if Hispanic; 0 otherwise)
  • married (1 if married; 0 otherwise)
  • nodegree (1 if no high school diploma; 0 otherwise)
  • educcat (1 = did not finish high school, 2 = high school, 3 = some college, 4 = graduated from college)
  • re74, re75, re78 (real earnings in 1974, 1975 and 1978)

You can load the data file, which is in Stata format, using the read.dta() function:

nsw_obs <- read.dta("data/NSW.dw.obs.dta")
  1. Estimate the difference in means between treatment and control groups using a t-test or a bivariate linear regression. What is the difference in means? Is this likely to represent an unbiased estimate of the average treatment effect? Why, or why not?
obs_model <- lm(re78 ~ treat, nsw_obs)
summary(obs_model)
## 
## Call:
## lm(formula = re78 ~ treat, data = nsw_obs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15750  -9267   1291   9814 105423 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15750.30      79.84  197.28   <2e-16 ***
## treat       -9401.16     801.98  -11.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10850 on 18665 degrees of freedom
## Multiple R-squared:  0.007308,   Adjusted R-squared:  0.007255 
## F-statistic: 137.4 on 1 and 18665 DF,  p-value: < 2.2e-16

The naive difference in means is equal to -9401. This means that people in the control group earn $9,401 more on average than those in the treatment group. Eligibility criteria for the NSW programme included conditions regarding previous social and economic problems of the applicants (see p. 1054 in Dehejia and Wahba). This applicant screening necessarily leads to selection bias in comparisons between those selected for the programme and those not selected. The consequence of this bias is that a naive comparison of means between the treatment group and the newly constructed control sample gives the misleading impression that participation in the work programme decreased earnings.
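
You can see the selection problem directly by comparing pre-treatment earnings across the two groups; you should find that the comparison group out-earns the treated group by a wide margin even before the programme begins:

# Mean pre-treatment earnings (1974 and 1975) by treatment status
aggregate(cbind(re74, re75) ~ treat, data = nsw_obs, FUN = mean)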

  2. Estimate the average treatment effect on the treated by applying matching to the observational data. You will need to decide which distance measure to use; which variables to include; which choice of M to use; whether to match with or without replacement; and so on. Try several specifications and, when you are happy, estimate the causal effect.

There are many decisions to make here! This is my attempt at finding the best specification. Note that this takes a while to run (about 45 minutes or more, depending on your processing power). You can speed it up by reducing the pop.size option for the genetic matching, for instance to 50 or even 10, but this will yield less precise estimates and you may get very different results from one run to the next if you do not set.seed(). You can find an excellent explanation of the meaning of the pop.size argument in the answer to this stackoverflow post.2

Essentially, what I am doing here is trying out three matching approaches: nearest neighbour with Euclidean distance, nearest neighbour with Mahalanobis distance, and genetic matching.3 You could also choose to match exactly on those variables where it is possible (all the categorical variables) by adding the exact = argument; a sketch of this option is given below. This allows us to keep the strengths of exact matching without having to limit ourselves to only binary variables. (I have not done so in this example.)
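
As a rough sketch of what that option could look like (not run here; the split of variables between the formula and the exact argument is just one possible choice):

# Sketch: nearest neighbour matching on the continuous variables, with exact
# matching on the categorical variables via the 'exact' argument
match.ex <- matchit(treat ~ age + educ + re74 + re75,
                    exact = ~ black + hisp + married + nodegree,
                    data = nsw_obs,
                    method = "nearest",
                    distance = "mahalanobis",
                    estimand = "ATT",
                    ratio = 1,
                    replace = TRUE)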

The code in the next block may be a bit difficult to read and is certainly beyond what is expected of you in this course. I provide it here so you can have a look if you are interested.

set.seed(12345)
# because we are splitting the ties randomly and the genetic matching, setting the 
# seed ensures that we always get the same results.

## create empty list to store our matches in
matches <- list()

## create empty data frame and vectors with options
out <- data.frame("M"=NA,"Replacement"=NA,"Distance"=NA,"ATT"=NA,"p-value"=NA,"Bias"=NA)
dist <- c(rep("euclidean",4),rep("mahalanobis",4),rep("genetic",4))
replace <- c(rep(c(T,F),6))
ratio <- rep(rep(1:2,each=2),3)


## loop over all options
for (i in 1:12) {
  if(dist[i]=="euclidean"){
    match <- matchit(treat ~ age + educ + black + hisp + married + nodegree +
                       re74 + re75, data = nsw_obs,
                     method = "nearest",
                     distance = "euclidean",
                     estimand = "ATT",
                     ratio = ratio[i],
                     replace = replace[i])
  }
  if(dist[i]=="mahalanobis"){
    match <- matchit(treat ~ age + educ + black + hisp + married + nodegree +
                       re74 + re75, data = nsw_obs,
                     method = "nearest",
                     distance = "mahalanobis",
                     estimand = "ATT",
                     ratio = ratio[i],
                     replace = replace[i])
  }
  if(dist[i]=="genetic"){
    match <- matchit(treat ~ age + educ + black + hisp + married + nodegree +
                       re74 + re75, data = nsw_obs,
                     method = "genetic",
                     estimand = "ATT",
                     ratio = ratio[i],
                     replace = replace[i],
                     pop.size = 200)
  }
  
  # store matching result in list
  matches[[i]] <- match
  
  # extract the absolute standardized bias for each variable and calculate the mean 
  s <- mean(abs(summary(match)$sum.matched[,3]))
  
  # create matched data set and estimate ATT 
  d <- match.data(match)
  mod <- lm(re78 ~ treat, data = d, weights = weights)
  info <- get_model_data(mod, type="est", terms = "treat", ci.lvl = .95)
  
  # store the input values and results in a vector and then append to 'out'
  r <- data.frame("M"=ratio[i],"Replacement"=replace[i],
                  "Distance"=dist[i],"ATT"=info$estimate,
                  "p-value"=info$p.value,"Bias"=s)
  out <- rbind(out,r)
}

## clean up output
out <- out[-1,] 
out$Replacement <- ifelse(out$Replacement,"Yes","No")
rownames(out) <- NULL

## as this took a while to run, I *strongly* recommend saving all this for future use 
save(matches,out, file="data/compare_matches.rda")

# find row with the specification with the lowest bias
index <- which(out$Bias==min(out$Bias))

## create table
kable(out, booktabs = T, linesep="", digits = 3,
      caption = "Bias & ATT's from different matches")%>% kable_paper(font_size = 7) %>%
  row_spec(index, bold =T, color ="red") 
Table 3.1: Bias & ATT's from different matches
M Replacement Distance ATT p.value Bias
1 Yes euclidean 1139.119 0.165 0.276
1 No euclidean 960.777 0.211 0.330
2 Yes euclidean 1250.104 0.077 0.311
2 No euclidean 969.464 0.149 0.428
1 Yes mahalanobis 667.988 0.427 0.020
1 No mahalanobis 610.820 0.416 0.045
2 Yes mahalanobis 1137.653 0.113 0.030
2 No mahalanobis 487.046 0.433 0.102
1 Yes genetic 933.012 0.380 0.002
1 No genetic 1274.031 0.084 0.036
2 Yes genetic 1561.441 0.026 0.023
2 No genetic 652.547 0.305 0.097

It looks like the best (lowest) mean absolute bias is achieved with genetic matching with replacement and a 1:1 ratio (0.002), which yields an ATT estimate of 933. However, 1:1 nearest neighbour matching with the Mahalanobis distance and replacement comes quite close in terms of covariate balance (0.02), with an ATT of 668.

Therefore, and since genetic matching is quite computationally intensive, I will use the latter (nearest neighbour) matching approach.

match.out.nsw1 <- matches[[5]] # the 1:1 Mahalanobis match with replacement is stored here
plot(summary(match.out.nsw1), abs = F, var.order = "unmatched")

## estimating ATT for use in next question
matched.data.nsw1 <- match.data(match.out.nsw1)
est.nsw1 <- lm(re78 ~ treat, data = matched.data.nsw1, weights = weights)
  3. Visit this webpage and submit your answers to the previous question.
  4. Redo the matching exercise, excluding the variables for pre-treatment earnings in 1974 and 1975 (or including those variables if they were not in your original specification). How important do the 1974 and 1975 earnings variables appear to be in terms of satisfying the selection on observables assumption?
## matching without past earnings 
match.out.nsw2 <- matchit(treat ~ age + educ + black + hisp + married + nodegree,
                          data = nsw_obs,
                          method = "nearest",
                          distance = "mahalanobis",
                          estimand = "ATT",
                          ratio = 1,
                          replace = T)
matched.data.nsw2 <- match.data(match.out.nsw2)
est.nsw2 <- lm(re78 ~ treat, data = matched.data.nsw2, weights = weights)

## store model results
est1 <- get_model_data(est.nsw1,"est",terms = c("treat"))
est2 <- get_model_data(est.nsw2,"est",terms = c("treat"))

out <- data.frame(
  "n" = c("ATT","Std. Error"),
  "est1"= c(est1$p.label,round(est1$std.error,2)),
  "est2"= c(est2$p.label,round(est2$std.error,2))
  )

colnames(out) <- c(" ", "Model with past earnings","Model without past earnings")

kable(out, booktabs = T) %>% kable_paper()
Model with past earnings Model without past earnings
ATT 667.99 -3483.14 ***
Std. Error 839.23 913.78

Very important! The estimated effect of the treatment variable in the absence of information about previous earnings levels is strongly negative and statistically significant. As you can see in Dehejia and Wahba (1999), the treatment effects estimated with the experimental part of the study are $886 (in Lalonde's original paper) and $1,794 (in Dehejia and Wahba). The 1974 and 1975 earnings variables therefore appear to be crucial for uncovering the true causal effect. This should serve as a reminder of the fragility of the selection on observables assumption.
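
One hedged check you can run yourself: compare the (weighted) pre-treatment earnings of treated and matched control units in the specification that ignored past earnings. A large remaining gap would suggest that the conditional independence assumption fails without these variables.

# Weighted mean of 1974 earnings by treatment status in the matched sample
# from the specification without past earnings
by(matched.data.nsw2, matched.data.nsw2$treat,
   function(d) weighted.mean(d$re74, d$weights))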


3.2 Quiz

  1. What assumptions must be met for a “selection on observables” design to produce unbiased estimates of a treatment effect?
  1. The treatment must be randomly assigned
  2. The Linearity of Expectations Assumption
  3. The Conditional Independence Assumption and the Common Support Assumption
  4. We don’t need to make any particular assumption
  2. What does the Conditional Independence Assumption (CIA) state?
  1. That the treatment D is independent from the potential outcomes, conditional on variable X
  2. That the treatment D is affected by a variable X, which also causes the outcome variable
  3. That the outcome Y is a condition affecting selection into treatment D
  4. That units self-select into treatment D
  3. Select the only true sentence:
  1. In a selection on observables design ATE, ATT, and ATC are always equal
  2. In a selection on observables design ATE, ATT, and ATC might not be equal even if the CIA is met
  3. In a selection on observables design ATE, ATT, and ATC must be equal if the CIA is met
  4. In a selection on observables design we can never estimate an ATE
  4. Select the only procedure which does not introduce post-treatment bias:
  1. Matching units by variable \(X_1\), which occurs after treatment and is correlated with it
  2. Controlling for variable \(X_2\), which proxies \(X_1\)
  3. Controlling for variable \(X_3\), which occurs before treatment
  4. Dropping observations based on variable \(X_4\), which proxies \(X_1\)
  5. What is the goal of different matching procedures?
  1. To randomise treatment assignment of units in our study
  2. To find, for each treated unit, the closest untreated unit with same outcome
  3. To find, for each treated unit, a (set of) untreated unit(s) as similar as possible with respect to covariates
  4. To control for all possible observable and unobservable covariates

  2. Here is an excerpt and adaptation of the answer: “I’m no expert on the genetic algorithm, but my understanding of what it does is that it makes a bunch of guesses at the optimal values of these scaling factors [the weights that help minimise a distance measure for each covariate.], keeps the ones that ‘do the best’ in the sense of optimizing the criterion. […] [The value given to pop.size] corresponds to the number of guesses at each generation of the algorithm.”↩︎

  3. Note that genetic matching is not directly a distance measure. It “combines [propensity score matching] and [mahalanobis distance matching] and uses optimization to find the distance measure that provides the best balance in the matched dataset” See this post on stackexchange, 2021.↩︎