3 Selection on Observables (I)

Randomisation is a powerful tool because it means that confounders can be safely ignored by researchers as, in expectation, they will be balanced across treatment and control groups. Sadly, some of the most interesting social science questions cannot be addressed using randomised experiments (Why? First, because experiments are costly, and second, because it would be bad form to randomly assign, for instance, the institutions that govern a country’s electoral system, or whether you get a distinction in your degree). When it is not possible to randomise, how can we make valid causal inferences? In the next two lectures, we discuss methods for non-experimental data which assume that selection into treatment groups is based on observable factors. This week we focus on subclassification and matching.

For a theoretical discussion of the selection on observables identification strategy, the MHE chapter on regression is very good (especially page 51 onwards). There is also a very nice exposition of the conditional independence assumption as it specifically relates to matching in this paper by Jasjeet S Sekhon (that paper also has a fantastic title). For practical advice on matching, the best resource is probably the Elizabeth Stuart paper. This paper gives lots of straightforward recommendations about the different decisions one has to make when implementing different matching estimators. Applied examples are found in Eggers and Hainmueller (2009) and Dehejia and Wahba (1999). There is also a famous example of subclassification in this 1968 paper by Cochran.

Finally, in the lecture we discuss the problem of conditioning on post-treatment variables when estimating causal effects. There are many papers on this subject, but two of the more accessible ones are this one by Montgomery et al, which focuses on post-treatment bias in experimental settings; and this one by Acharya et al, which focuses on post-treatment bias in observational settings. Both are worth reading.


3.1 Seminar

Data

We will use two datasets this week, each drawn from one of the papers on the reading list. Download these datasets now from the links at the top of the page, and store them in the data folder that you created last week.

Installing and loading packages

This week we will be using some additional functions that do not come pre-installed with R. There are many additional packages for R covering many different types of quantitative analysis. Today we will be using the MatchIt package, which provides several helpful functions for implementing various matching strategies (together with the rgenoud package, which is needed for genetic matching). You will also need the foreign package, which allows R to read in data of different formats (such as Stata and SPSS).

To get started, you will need to install these packages on whatever computer you are using. Note: you only need to install a package on a computer once. Do not run this code every time you run your R script!

# For the matching
install.packages("MatchIt")
install.packages("rgenoud") # needed for genetic matching

# For data loading
install.packages("foreign")

Once installed, you need to load the packages using the library() function.

library(MatchIt)
library(foreign)

# For the formatting of some tables we will also use the following
library(sjPlot)
library(kableExtra)
library(magrittr)

Finally, as this problem set will involve some random number generation, we will use the set.seed() function at the top of our scripts to ensure that the results remain the same every time we run them. You specify the function as follows:

set.seed(12345)

where you can have any number in the place of 12345.
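
For instance, if you run the two lines below repeatedly, sample() will return the same three numbers every time, because the seed fixes the state of R's random number generator:

set.seed(12345)
sample(1:10, 3) # identical draws on every run that starts from this seed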

The main function we will use is matchit(), from the MatchIt package. The function takes a number of arguments, the most important of which are listed in the table below. Note that, depending on the matching method, not all of the arguments will be needed.

Arguments to the matchit() function.

Argument   Purpose
formula    A two-sided formula in the form of treatment_variable ~ covariates.
data       A data frame containing the variables in formula.
method     The matching method to be used. Available options include, among others, "exact", "nearest" and "genetic".
distance   If applicable, the distance measure to be used.
estimand   Either "ATT" to calculate the average treatment effect for the treated (the default), "ATE" for the average treatment effect, or "ATC" for the average treatment effect for the controls.
replace    For methods that allow it, should matched control observations be re-used?
ratio      For methods that allow it, how many control units should be matched to each treated unit in k:1 matching?
m.order    For methods that allow it, the order in which matching takes place. Set to "random" when ties should be broken randomly.
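
To see how these arguments fit together, the call below is a minimal illustrative sketch; my_data, treatment, x1 and x2 are placeholder names rather than objects from this week's data.

# Illustrative sketch only: 'my_data', 'treatment', 'x1' and 'x2' are placeholders
m.out <- matchit(treatment ~ x1 + x2,
                 data = my_data,
                 method = "nearest",       # nearest neighbour matching
                 distance = "mahalanobis", # distance measure between units
                 estimand = "ATT",         # target quantity
                 ratio = 1,                # 1:1 matching
                 replace = TRUE)           # allow controls to be re-used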

3.1.1 MPs for Sale? – Eggers and Hainmueller (2009)

What is the monetary value of serving as an elected politician? It is firmly a part of received wisdom that politicians often use their positions in public office for self-serving purposes. There is much evidence that private firms profit from their connections with influential politicians, but evidence that politicians benefit financially because of their political positions is thin. Eggers and Hainmueller (2009) seek to address this question by asking: what are the financial returns to serving in parliament? They study data from the UK political system, and compare the wealth at the time of death for individuals who ran for office and won (MPs) to individuals who ran for office and lost (candidates) to draw causal inferences about the effects of political office on wealth.

The data from this study is in .Rdata format. Provided that you have downloaded the file into your data folder (and you have correctly set your working directory), it can be loaded using the load() function as follows:

load("data/eggers_mps.Rdata")

This data – which is stored in the mps object – includes observations of 425 individuals. There are variables for the main outcome of interest – the (log) wealth at the time of the individual’s death (lnrealgross) – and for the treatment – whether the individual was elected to parliament (treated == 1) or failed to win their election (treated == 0). The data also include information on a rich set of covariates.
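
A quick way to confirm that the data have loaded correctly is to check the dimensions and the treatment/control split:

dim(mps)           # 425 rows (individuals)
table(mps$treated) # 260 control (0) and 165 treated (1) individuals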

The names and descriptions of variables are:

  1. labour – binary (1 if the individual was a member of the Labour party, 0 otherwise)
  2. tory – binary (1 if the individual was a member of the Conservative party, 0 otherwise)
  3. yob – year of birth
  4. yod – year of death
  5. female – binary (1 if the individual was female, 0 otherwise)
  6. aristo – binary (1 if the individual held an aristocratic title, 0 otherwise)
  7. Variables pertaining to the secondary education of the individual, all prefixed scat_ (see the paper for details)
  8. Variables pertaining to the university education of the individual, all prefixed ucat_ (see the paper for details)
  9. Variables pertaining to the pre-treatment occupation of the individual, all prefixed oc_ (see the paper for details)
  1. Estimate the average treatment effect using either a t.test or a bivariate linear regression. What is it? Is it significantly different from zero? Can we interpret this as an unbiased estimate of the causal effect of elected office on wealth?
t.test(mps$lnrealgross[mps$treated ==  1],
       mps$lnrealgross[mps$treated ==  0])
## 
##  Welch Two Sample t-test
## 
## data:  mps$lnrealgross[mps$treated == 1] and mps$lnrealgross[mps$treated == 0]
## t = 4.9215, df = 333.28, p-value = 1.353e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3108102 0.7247022
## sample estimates:
## mean of x mean of y 
##  12.93624  12.41848
summary(lm(lnrealgross ~ treated, data = mps))
## 
## Call:
## lm(formula = lnrealgross ~ treated, data = mps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9853 -0.4925 -0.0192  0.4244  3.5180 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.41848    0.06466  192.07  < 2e-16 ***
## treated      0.51776    0.10377    4.99 8.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.043 on 423 degrees of freedom
## Multiple R-squared:  0.05558,    Adjusted R-squared:  0.05335 
## F-statistic:  24.9 on 1 and 423 DF,  p-value: 8.853e-07

The regression suggests that the log gross wealth of those in the treated group – those individuals who were elected to parliament – was 0.52 points higher than for those in the control group. In other words, elected candidates were worth, on average, about 68% (i.e. (exp(0.518)-1)*100) more at death than candidates who failed to win elections. The difference is significantly different from zero.
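
If you would like to reproduce that percentage calculation directly from the fitted model, you can transform the coefficient as follows:

# Convert the log-point coefficient into an approximate percentage difference
fit <- lm(lnrealgross ~ treated, data = mps)
(exp(coef(fit)["treated"]) - 1) * 100 # roughly 68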

This difference clearly cannot be interpreted as causal, as there are many potentially confounding differences between those who are elected and those who are not. That is, the estimate from this regression is likely to be subject to selection bias.

  2. Using the matchit() function from the MatchIt package, carry out exact 1:1 matching to estimate the average treatment effect on the treated. Match on gender, aristocratic title, and whether the individual received their secondary education at Eton. What is the ATT? How many matched observations are there? Code Hint: Use the command match.out <- matchit([formula], [data], method = "exact", estimand = "ATT", ratio = 1). Then extract the data with match.data() and estimate the ATT with lm().
## matching
match.out <- matchit(treated ~ female + aristo + scat_eto, 
                     data = mps,
                     method = "exact",
                     estimand = "ATT",
                     ratio = 1) # 1:1 

## look at the number of observations after matching
summary(match.out)$nn 
##                Control Treated
## All (ESS)     260.0000     165
## All           260.0000     165
## Matched (ESS) 153.5217     163
## Matched       260.0000     163
## Unmatched       0.0000       2
## Discarded       0.0000       0
## extract data
matched.data <- match.data(match.out) 

## estimate model with matched data, without covariates and with weights
est <- lm(lnrealgross ~ treated, data = matched.data, weights = weights)
coef(est)[2]
##   treated 
## 0.5439035

The ATT (the coefficient associated with the variable treated) is equal to 0.5439. There are 163 matched treated observations. This number follows from the fact that we are matching to the treated units, of which there are 165 in the original data, and that two treated units were dropped because they could not be matched on these covariates. (If you are curious, try to find out who these individuals are in the mps data using mps[which(match.out$weights==0),]. Spoiler alert: they were both male aristocrats who had gone to Eton.)

  3. Evaluate the balance between treated and control observations on gender, aristocratic title, and whether the individual received their secondary education at Eton. Do this for the raw data, and then for the matched data, by regressing each covariate on the treatment variable. Code Hint: For the matched data, you need to specify the weights.
## Pre-match balance
m.female <- lm(female ~ treated, data = mps)
m.aristo <- lm(aristo ~ treated, data = mps)
m.scat_eto <- lm(scat_eto ~ treated, data = mps)

## Post-match balance
m.female.match <- lm(female ~ treated, data = matched.data, weights = weights)
m.aristo.match <- lm(aristo ~ treated, data = matched.data, weights = weights)
m.scat_eto.match <- lm(scat_eto ~ treated, data = matched.data, weights = weights)

# You can look at each model individually with summary(model.name).
# Alternatively, below is some code to put this all in a neat table. 
# PS: the code is horrible to look at, I know. 

female <- get_model_data(m.female,"est",terms = c("treated"))
female.match <- get_model_data(m.female.match,"est",terms = c("treated"))

aristo <- get_model_data(m.aristo,"est",terms = c("treated"))
aristo.match <- get_model_data(m.aristo.match,"est",terms = c("treated"))

scat_eto <- get_model_data(m.scat_eto,"est",terms = c("treated"))
scat_eto.match <- get_model_data(m.scat_eto.match,"est",terms = c("treated"))

out <- data.frame(
  "var" = c("female","aristo","scat_eto"),
  "raw" = c(paste0(female$p.label," (",round(female$std.error,2),")"),
            paste0(aristo$p.label," (",round(aristo$std.error,2),")"),
            paste0(scat_eto$p.label," (",round(scat_eto$std.error,2),")")),
  "match" = c(paste0(female.match$p.label," (",round(female.match$std.error,2),")"),
              paste0(aristo.match$p.label," (",round(aristo.match$std.error,2),")"),
              paste0(scat_eto.match$p.label," (",round(scat_eto.match$std.error,2),")"))
)

colnames(out) <- c("Variable", "Raw data","Matched data")
kable(out, booktabs =T) %>% kable_paper() 
Variable Raw data Matched data
female -0.01 (0.02) -0.00 (0.02)
aristo 0.06 *** (0.02) -0.00 (0.02)
scat_eto 0.10 *** (0.02) -0.00 (0.02)

In the raw data, there is significant imbalance between treatment and control groups with respect to the aristo and scat_eto variables. After matching, there is no imbalance between matched treatment and control groups with respect to any of these variables. This makes sense, as the exact matching process implies that the distribution of these covariates should be identical in the treatment and control groups.

  4. Apply the summary() function to the output from matchit() and create a plot of the standardised mean differences. Code Hint: Have a look at the help file with ?plot.summary.matchit.
# summary(match.out) 
plot(summary(match.out), 
     abs = F,
     var.order = "unmatched")

The output here is more detailed, and easier to read, than the balance regressions we ran above. Of particular interest is the standardised mean difference (Std. Mean Diff.) column of the output, which reports the quantity that we saw in lecture for assessing balance before and after matching. As a reminder, this quantity is equal to:

\[ \text{bias}_{X_i} = \frac{\bar{X}_t - \bar{X}_c}{\sigma_{t}} \]

where \(\bar{X}_t\) is the mean of the covariate in the treatment group, \(\bar{X}_c\) is the covariate mean in the control group, and \(\sigma_{t}\) is the standard deviation of \(X\) in the treated group. Ideally, as is the case here, we should see the standardized difference in means between treated and control groups decrease significantly after matching for all covariates. This implies that the groups are comparable on average, at least with respect to these observable factors, as a result of the matching process.
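
To make the formula concrete, here is how you could compute the pre-matching standardised difference by hand for one covariate, say aristo:

# Standardised mean difference for 'aristo' in the raw data
x_t <- mps$aristo[mps$treated == 1]
x_c <- mps$aristo[mps$treated == 0]
(mean(x_t) - mean(x_c)) / sd(x_t)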

As we can also see in the plot, the standardized difference in means is zero for all covariates in the matched sample, which makes sense given that we are using exact matching!

  5. Rematch the data, this time expanding the list of covariates to include all of the schooling, university and occupation categories. Use exact matching again. What is the ATT? How many observations remain in the matched data?
## matching
match.out2 <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                        scat_nm + scat_sec + oc_teacherall + oc_barrister +
                        oc_solicitor + oc_dr + oc_civil_serv +
                        oc_local_politics + oc_business + oc_white_collar +
                        oc_union_org + oc_journalist + oc_miner + ucat_ox +
                        ucat_deg + ucat_nm, 
                      data = mps,
                      method = "exact",
                      estimand = "ATT",
                      ratio = 1) 
summary(match.out2)$nn 
##                 Control Treated
## All (ESS)     260.00000     165
## All           260.00000     165
## Matched (ESS)  84.27511     110
## Matched       156.00000     110
## Unmatched     104.00000      55
## Discarded       0.00000       0
## extract data
matched.data2 <- match.data(match.out2) 

## estimate model with matched data 
est2 <- lm(lnrealgross ~ treated, data = matched.data2, weights = weights)
coef(est2)[2]
##   treated 
## 0.6775976

The ATT has increased to 0.68, and there are now far fewer matched observations (110).

  6. Eggers and Hainmueller run the matching analysis separately for Labour and Conservative candidates. Do this now and try to replicate the 3\(^{rd}\) and 6\(^{th}\) columns of Table 3 in their paper. As you will see in the paper, they use M=1 genetic matching, but you can get quite close to this result using nearest neighbour 1:1 matching with replacement and the Mahalanobis distance. Adapt the code given below to do this. In their analysis, they do not only match on the categorical variables we have used so far; they also use the continuous variables for year of birth and year of death (yob, yod). Include these in the new matching procedure.
## Template: 1:1 nearest neighbour matching with Mahalanobis distance and
## replacement. Substitute 'treatment', 'covariates' and 'data' as needed.
match.output <- matchit(treatment ~ covariates, 
                        data = data,
                        method = "nearest",
                        distance = "mahalanobis",
                        estimand = "ATT",
                        ratio = 1,
                        replace = T) 
labour_mps <- mps[mps$labour==1,]
tory_mps <- mps[mps$tory==1,]

## matching labour
match.lab <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                       scat_nm + scat_sec + oc_teacherall + oc_barrister +
                       oc_solicitor + oc_dr + oc_civil_serv +
                       oc_local_politics + oc_business + oc_white_collar +
                       oc_union_org + oc_journalist + oc_miner + ucat_ox +
                       ucat_deg + ucat_nm + yob + yod, 
                     data = labour_mps,
                     method = "nearest",
                     estimand = "ATT",
                     distance = "mahalanobis",
                     ratio = 1,
                     replace = T) 
# extract data
matched.lab <- match.data(match.lab) 

# estimate model with matched data
est.lab <- lm(lnrealgross ~ treated, data = matched.lab, weights = weights)

## matching tories
match.tory <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                        scat_nm + scat_sec + oc_teacherall + oc_barrister +
                        oc_solicitor + oc_dr + oc_civil_serv +
                        oc_local_politics + oc_business + oc_white_collar +
                        oc_union_org + oc_journalist + oc_miner + ucat_ox +
                        ucat_deg + ucat_nm + yob + yod, 
                      data = tory_mps,
                      method = "nearest",
                      estimand = "ATT",
                      distance = "mahalanobis",
                      ratio = 1,
                      replace = T) 
# extract data
matched.tory <- match.data(match.tory) 

# estimate model with matched data
est.tory <- lm(lnrealgross ~ treated, data = matched.tory, weights = weights)

## output results
lab <- get_model_data(est.lab,"est",terms = c("treated"))
con <- get_model_data(est.tory,"est",terms = c("treated"))

out <- data.frame(
  "n" = c("ATT","Std. Error"),
  "cp"= c(paste0(round(con$estimate,2),con$p.stars),round(con$std.error,2)),
  "lp"= c(paste0(round(lab$estimate,2),lab$p.stars),round(lab$std.error,2))
)
colnames(out) <- c(" ", "Conservatives","Labour")

kable(out, booktabs = T) %>% kable_paper()
Conservatives Labour
ATT 1.04*** 0.16
Std. Error 0.25 0.15

The ATT for Labour candidates is 0.16, and is not statistically significant. The ATT for Conservative candidates is 1.04, and is significantly different from zero. Both estimates are quite close to the numbers reported in Table 3 of the paper.

For those of you who are interested, you can replicate the genetic matching approach that they use in the paper by setting method = "genetic". You should also provide a large value for pop.size, for instance 1000; note, however, that this means it will take a while to run. You can decrease the number if you want, but the estimates may then be quite imprecise.

## labour
match.gen.lab <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                           scat_nm + scat_sec + oc_teacherall + oc_barrister +
                           oc_solicitor + oc_dr + oc_civil_serv +
                           oc_local_politics + oc_business + oc_white_collar +
                           oc_union_org + oc_journalist + oc_miner + ucat_ox +
                           ucat_deg + ucat_nm + yob + yod, 
                         data = labour_mps,
                         method = "genetic",
                         estimand = "ATT",
                         ratio = 1,
                         replace = T,
                         pop.size = 1000) 
matched.gen.lab <- match.data(match.gen.lab) 
est.gen.lab <- lm(lnrealgross ~ treated, data = matched.gen.lab, weights = weights)

## tories
match.gen.tory <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                            scat_nm + scat_sec + oc_teacherall + oc_barrister +
                            oc_solicitor + oc_dr + oc_civil_serv +
                            oc_local_politics + oc_business + oc_white_collar +
                            oc_union_org + oc_journalist + oc_miner + ucat_ox +
                            ucat_deg + ucat_nm + yob + yod, 
                          data = tory_mps,
                          method = "genetic",
                          estimand = "ATT",
                          ratio = 1,
                          replace = T,
                          pop.size = 1000) 
matched.gen.tory <- match.data(match.gen.tory) 
est.gen.tory <- lm(lnrealgross ~ treated, data = matched.gen.tory, weights = weights)

# If you ran this, the results would be as in the table below 
Conservatives Labour
ATT 0.85 *** 0.15
Std. Error 0.25 0.15

3.1.2 National Supported Work – Dehejia and Wahba (1999)

The National Supported Work programme was a federal programme in the US which provided work experience to individuals who had faced economic problems in the past. Individuals were randomly assigned to participate in the programme from a pool of applicants between 1975 and 1977, and both treatment and control individuals were interviewed in 1978 and asked about the amount of money they were currently earning.

The file NSW.dw.obs.dta (seminar data 2, above) contains data from an observational study constructed by Lalonde (1986), who replaced the randomised control group with a comparison group drawn from two national public surveys. The idea here is to see whether it is possible to use these observational data to construct unbiased estimates of the effect of the NSW treatment by implementing a matching estimator.

Dehejia and Wahba (1999) used these data to evaluate the potential efficacy of matching. In this task, you will replicate some of Dehejia and Wahba’s findings. The following variables are included in the NSW observational dataset (NSW.dw.obs.dta):

  • treat (1 = experimental treatment group; 0 = observational comparison group)
  • age (age in years)
  • educ (years of schooling)
  • black (1 if black; 0 otherwise)
  • hisp (1 if Hispanic; 0 otherwise)
  • married (1 if married; 0 otherwise)
  • nodegree (1 if no high school diploma; 0 otherwise)
  • educcat (1 = did not finish high school, 2 = high school, 3 = some college, 4 = graduated from college)
  • re74, re75, re78 (real earnings in 1974, 1975 and 1978)

You can load the data file, which is in Stata format, using the read.dta() function:

nsw_obs <- read.dta("data/NSW.dw.obs.dta")
  1. Estimate the difference in means between treatment and control groups using a t-test or a bivariate linear regression. What is the difference in means? Is this likely to represent an unbiased estimate of the average treatment effect? Why, or why not?
obs_model <- lm(re78 ~ treat, nsw_obs)
summary(obs_model)
## 
## Call:
## lm(formula = re78 ~ treat, data = nsw_obs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15750  -9267   1291   9814 105423 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15750.30      79.84  197.28   <2e-16 ***
## treat       -9401.16     801.98  -11.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10850 on 18665 degrees of freedom
## Multiple R-squared:  0.007308,   Adjusted R-squared:  0.007255 
## F-statistic: 137.4 on 1 and 18665 DF,  p-value: < 2.2e-16

The naive difference in means is equal to -9401. This means that people in the control group earn $9,401 more on average than those in the treatment group. Eligibility criteria for the NSW programme included conditions regarding previous social and economic problems of the applicants (see p. 1054 in Dehejia and Wahba). This applicant screening necessarily leads to selection bias in comparisons between those selected for the programme and those not selected. The consequence of this bias is that a naive comparison of means between the treatment group and the newly constructed control sample gives the misleading impression that participation in the work programme decreased earnings.
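
You can see the selection problem directly by comparing pre-treatment earnings across the two groups; you should find that the comparison group out-earns the treated group by a wide margin even before the programme begins:

# Mean pre-treatment earnings (1974 and 1975) by treatment status
aggregate(cbind(re74, re75) ~ treat, data = nsw_obs, FUN = mean)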

  2. Estimate the average treatment effect on the treated by applying matching to the observational data. You will need to decide which distance measure to use; which variables to include; which choice of M to use; whether to match with or without replacement; and so on. Try several specifications and, when you are happy, estimate the causal effect.

There are many decisions to make here! This is my attempt at finding the best specification. Note that this takes a while to run (about 45 minutes or more, depending on your processing power). You can speed it up by reducing the pop.size option for the genetic matching, for instance to 50 or even 10, but this will yield less precise estimates and you may get very different results from one run to the next if you do not set.seed(). You can find an excellent explanation of the meaning of the pop.size argument in the answer to this stackoverflow post.2

Essentially, what I am doing here is trying out three matching approaches: nearest neighbour with Euclidean distance, nearest neighbour with Mahalanobis distance, and genetic matching.3 You could also choose to match exactly on those variables where it is possible (all the categorical variables) by adding the exact = argument; a sketch of this option is given below. This allows us to keep the strengths of exact matching without having to limit ourselves to only binary variables. (I have not done so in this example.)
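
As a rough sketch of what that option could look like (not run here; the split of variables between the formula and the exact argument is just one possible choice):

# Sketch: nearest neighbour matching on the continuous variables, with exact
# matching on the categorical variables via the 'exact' argument
match.ex <- matchit(treat ~ age + educ + re74 + re75,
                    exact = ~ black + hisp + married + nodegree,
                    data = nsw_obs,
                    method = "nearest",
                    distance = "mahalanobis",
                    estimand = "ATT",
                    ratio = 1,
                    replace = TRUE)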

The code in the next block may be a bit difficult to read and is certainly beyond what is expected of you in this course. I provide it here so you can have a look if you are interested.

set.seed(12345)
# because we are splitting the ties randomly and the genetic matching, setting the 
# seed ensures that we always get the same results.

## create empty list to store our matches in
matches <- list()

## create empty data frame and vectors with options
out <- data.frame("M"=NA,"Replacement"=NA,"Distance"=NA,"ATT"=NA,"p-value"=NA,"Bias"=NA)
dist <- c(rep("euclidean",4),rep("mahalanobis",4),rep("genetic",4))
replace <- c(rep(c(T,F),6))
ratio <- rep(rep(1:2,each=2),3)


## loop over all options
for (i in 1:12) {
  if(dist[i]=="euclidean"){
    match <- matchit(treat ~ age + educ + black + hisp + married + nodegree +
                       re74 + re75, data = nsw_obs,
                     method = "nearest",
                     distance = "euclidean",
                     estimand = "ATT",
                     ratio = ratio[i],
                     replace = replace[i])
  }
  if(dist[i]=="mahalanobis"){
    match <- matchit(treat ~ age + educ + black + hisp + married + nodegree +
                       re74 + re75, data = nsw_obs,
                     method = "nearest",
                     distance = "mahalanobis",
                     estimand = "ATT",
                     ratio = ratio[i],
                     replace = replace[i])
  }
  if(dist[i]=="genetic"){
    match <- matchit(treat ~ age + educ + black + hisp + married + nodegree +
                       re74 + re75, data = nsw_obs,
                     method = "genetic",
                     estimand = "ATT",
                     ratio = ratio[i],
                     replace = replace[i],
                     pop.size = 200)
  }
  
  # store matching result in list
  matches[[i]] <- match
  
  # extract the absolute standardized bias for each variable and calculate the mean 
  s <- mean(abs(summary(match)$sum.matched[,3]))
  
  # create matched data set and estimate ATT 
  d <- match.data(match)
  mod <- lm(re78 ~ treat, data = d, weights = weights)
  info <- get_model_data(mod, type="est", terms = "treat", ci.lvl = .95)
  
  # store the input values and results in a vector and then append to 'out'
  r <- data.frame("M"=ratio[i],"Replacement"=replace[i],
                  "Distance"=dist[i],"ATT"=info$estimate,
                  "p-value"=info$p.value,"Bias"=s)
  out <- rbind(out,r)
}

## clean up output
out <- out[-1,] 
out$Replacement <- ifelse(out$Replacement,"Yes","No")
rownames(out) <- NULL

## as this took a while to run, I *strongly* recommend saving all this for future use 
save(matches,out, file="data/compare_matches.rda")

# find row with the specification with the lowest bias
index <- which(out$Bias==min(out$Bias))

## create table
kable(out, booktabs = T, linesep="", digits = 3,
      caption = "Bias & ATT's from different matches")%>% kable_paper(font_size = 7) %>%
  row_spec(index, bold =T, color ="red") 
Table 3.1: Bias & ATT's from different matches
M Replacement Distance ATT p.value Bias
1 Yes euclidean 1139.119 0.165 0.276
1 No euclidean 960.777 0.211 0.330
2 Yes euclidean 1250.104 0.077 0.311
2 No euclidean 969.464 0.149 0.428
1 Yes mahalanobis 667.988 0.427 0.020
1 No mahalanobis 610.820 0.416 0.045
2 Yes mahalanobis 1137.653 0.113 0.030
2 No mahalanobis 487.046 0.433 0.102
1 Yes genetic 933.012 0.380 0.002
1 No genetic 1274.031 0.084 0.036
2 Yes genetic 1561.441 0.026 0.023
2 No genetic 652.547 0.305 0.097

It looks like the best (lowest) mean absolute bias is achieved with genetic matching with replacement and a 1:1 ratio (0.002), which yields an ATT estimate of 933. However, 1:1 nearest neighbour matching with the Mahalanobis distance and replacement comes quite close in terms of covariate balance (0.02), with an ATT of 668.

Therefore, and since genetic matching is quite computationally intensive, I will use the latter (nearest neighbour) matching approach.

match.out.nsw1 <- matches[[5]] # the 1:1 Mahalanobis match with replacement is stored here
plot(summary(match.out.nsw1), abs = F, var.order = "unmatched")

## estimating ATT for use in next question
matched.data.nsw1 <- match.data(match.out.nsw1)
est.nsw1 <- lm(re78 ~ treat, data = matched.data.nsw1, weights = weights)
  3. Visit this webpage and submit your answers to the previous question.
  4. Redo the matching exercise, excluding the variables for pre-treatment earnings in 1974 and 1975 (or including those variables if they were not in your original specification). How important do the 1974 and 1975 earnings variables appear to be in terms of satisfying the selection on observables assumption?
## matching without past earnings 
match.out.nsw2 <- matchit(treat ~ age + educ + black + hisp + married + nodegree,
                          data = nsw_obs,
                          method = "nearest",
                          distance = "mahalanobis",
                          estimand = "ATT",
                          ratio = 1,
                          replace = T)
matched.data.nsw2 <- match.data(match.out.nsw2)
est.nsw2 <- lm(re78 ~ treat, data = matched.data.nsw2, weights = weights)

## store model results
est1 <- get_model_data(est.nsw1,"est",terms = c("treat"))
est2 <- get_model_data(est.nsw2,"est",terms = c("treat"))

out <- data.frame(
  "n" = c("ATT","Std. Error"),
  "est1"= c(est1$p.label,round(est1$std.error,2)),
  "est2"= c(est2$p.label,round(est2$std.error,2))
  )

colnames(out) <- c(" ", "Model with past earnings","Model without past earnings")

kable(out, booktabs = T) %>% kable_paper()
Model with past earnings Model without past earnings
ATT 667.99 -3483.14 ***
Std. Error 839.23 913.78

Very important! The estimated effect of the treatment variable in the absence of information about previous earnings levels is strongly negative and statistically significant. As you can see in Dehejia and Wahba (1999), the treatment effects estimated with the experimental part of the study are $886 (in Lalonde's original paper) and $1,794 (in Dehejia and Wahba). The 1974 and 1975 earnings variables therefore appear to be crucial for uncovering the true causal effect. This should serve as a reminder of the fragility of the selection on observables assumption.
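
One hedged check you can run yourself: compare the (weighted) pre-treatment earnings of treated and matched control units in the specification that ignored past earnings. A large remaining gap would suggest that the conditional independence assumption fails without these variables.

# Weighted mean of 1974 earnings by treatment status in the matched sample
# from the specification without past earnings
by(matched.data.nsw2, matched.data.nsw2$treat,
   function(d) weighted.mean(d$re74, d$weights))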


3.2 Quiz

  1. What assumptions must be met for a “selection on observables” design to produce unbiased estimates of a treatment effect?
  1. The treatment must be randomly assigned
  2. The Linearity of Expectations Assumption
  3. The Conditional Independence Assumption and the Common Support Assumption
  4. We don’t need to make any particular assumption
  2. What does the Conditional Independence Assumption (CIA) state?
  1. That the treatment D is independent from the potential outcomes, conditional on variable X
  2. That the treatment D is affected by a variable X, which also causes the outcome variable
  3. That the outcome Y is a condition affecting selection into treatment D
  4. That units self-select into treatment D
  3. Select the only true sentence:
  1. In a selection on observables design ATE, ATT, and ATC are always equal
  2. In a selection on observables design ATE, ATT, and ATC might not be equal even if the CIA is met
  3. In a selection on observables design ATE, ATT, and ATC must be equal if the CIA is met
  4. In a selection on observables design we can never estimate an ATE
  4. Select the only procedure which does not introduce post-treatment bias:
  1. Matching units by variable \(X_1\), which occurs after treatment and is correlated with it
  2. Controlling for variable \(X_2\), which proxies \(X_1\)
  3. Controlling for variable \(X_3\), which occurs before treatment
  4. Dropping observations based on variable \(X_4\), which proxies \(X_1\)
  5. What is the goal of different matching procedures?
  1. To randomise treatment assignment of units in our study
  2. To find, for each treated unit, the closest untreated unit with same outcome
  3. To find, for each treated unit, a (set of) untreated unit(s) as similar as possible with respect to covariates
  4. To control for all possible observable and unobservable covariates

  2. Here is an excerpt and adaptation of the answer: “I’m no expert on the genetic algorithm, but my understanding of what it does is that it makes a bunch of guesses at the optimal values of these scaling factors [the weights that help minimise a distance measure for each covariate.], keeps the ones that ‘do the best’ in the sense of optimizing the criterion. […] [The value given to pop.size] corresponds to the number of guesses at each generation of the algorithm.”↩︎

  3. Note that genetic matching is not directly a distance measure. It “combines [propensity score matching] and [mahalanobis distance matching] and uses optimization to find the distance measure that provides the best balance in the matched dataset” See this post on stackexchange, 2021.↩︎