7 Instrumental Variables (I)

Aside from experiments, all of the strategies covered up to this point rely on the researcher being able to control for confounding factors when estimating causal effects. For the next two weeks, we focus on a strategy – instrumental variables – which can be used to address unobserved confounding factors in the context of cross-sectional data (i.e. when we can’t use the panel data methods discussed in previous weeks). This week, we will motivate instrumental variable (IV) methods, by discussing how this strategy can be useful in the context of experimental data where some units fail to comply with the treatment.

This week, we focus on IV as a strategy for dealing with non-compliance (one-sided and two-sided) in randomized experiments. The clearest exposition of the idea of non-compliance is given in two chapters of the Gerber and Green textbook (chapter 5 and chapter 6), although note that they generally avoid any mention of the phrase “instrumental variable” (I don’t really know why; some of the linguistic decisions in that book are a little idiosyncratic). The discussion of IV estimators is also very good (and short) in the Sovey and Green paper. This paper also emphasises the point that most applications in political science do not use IV for addressing non-compliance in randomized experiments (the case we discuss this week), but instead use IV as a method for overcoming selection bias in cross-sectional observational studies (the case we will discuss next week). The “reader’s checklist” they provide at the end of the paper is particularly recommended, as it gives good, straightforward advice to anyone thinking about using IV as an estimation strategy for causal effects.

The chapter on IV estimation in MHE is good, though much of the material is beyond the level you (or anyone else, to be honest) will need to implement a good IV design. The treatment in the Mastering ’Metrics book is somewhat easier, and also includes some very interesting applications of IV when used to address non-compliance.

7.1 Seminar

7.1.1 Children’s Television and Educational Performance

Sesame Street

Can educational television programmes improve children’s learning outcomes? Sesame Street is an American television programme aimed at young children. The creators of Sesame Street decided from the very beginning of the show’s production that a central goal would be to educate as well as entertain its audience. As Malcolm Gladwell argued, “Sesame Street was built around a single, breakthrough insight: that if you can hold the attention of children, you can educate them”. In addition to building the show around a carefully constructed educational curriculum, the show’s producers also worked closely with educational researchers to determine whether the show’s content was effectively improving its young viewers’ numeracy and literacy skills.

The dataset contained in sesame_experiment.dta includes information on 240 children who were randomly assigned to two groups. The treatment of interest here is watching Sesame Street, but clearly it is not possible to force children to watch a TV show or (perhaps even harder) to refrain from watching, and so watching the show cannot be randomized. Instead, in this study, researchers randomized whether children were encouraged to watch the show. More specifically, when the study was run in the 1970s, Sesame Street was on the air each day between 9am and 10am. The parents of children in the treatment group were encouraged to show Sesame Street to their children on a regular basis, while parents of the children in the control group were given no such encouragement. Because it is only encouragement that is randomized here, there is the possibility of non-compliance – i.e. some children will not watch Sesame Street even though they are in the treatment condition, and some children will watch Sesame Street even though they are in the control condition. The data is in .dta format and you can load it as follows:

library(foreign)
sesame <- read.dta("sesame_experiment.dta")

The data includes the following variables:

encouraged – 1 if the child was encouraged to watch Sesame Street, 0 otherwise
watched – 1 if the child watched Sesame Street regularly, 0 otherwise
letters – the score of the child on a literacy test
age – age of the child (in months)
female – 1 if the child is female, 0 otherwise

For this seminar, you will also need the AER and ivdesc packages:

# install.packages(c("AER","remotes"))
# remotes::install_github("sumtxt/ivdesc/R/ivdesc")
library(AER)
library(ivdesc)

1. Compliance and the intention-to-treat

In the context of this specific example, define the following unit types:

Compliers – children who would watch Sesame Street if encouraged to do so, and would not watch if not encouraged
Always-takers – children who would watch Sesame Street regardless of encouragement
Never-takers – children who would not watch Sesame Street regardless of encouragement
Defiers – children who would not watch Sesame Street if encouraged to do so, but would watch if not encouraged

The fourth type, the defiers, are assumed not to exist. This is the monotonicity assumption.

Calculate the proportion of children in the treatment group who did not watch Sesame Street. Calculate the proportion of children in the control group who did watch Sesame Street. What type of non-compliance occurred in this experiment? Hint: You might find the table() and prop.table() functions helpful here.

# counts in each assignment/treatment group
table("encouraged" = sesame$encouraged, "watched" = sesame$watched)

##           watched
## encouraged   0   1
##          0  40  48
##          1  14 138

# proportions in each assignment/treatment group
round(
  prop.table(
    table("encouraged" = sesame$encouraged, "watched" = sesame$watched),1),2)

##           watched
## encouraged    0    1
##          0 0.45 0.55
##          1 0.09 0.91

Of the 88 children assigned to the control condition, 48 actually watched Sesame Street.
Of the 152 children assigned to the treatment condition, 14 did not watch Sesame Street.

As well as showing that Sesame Street was clearly a very popular programme in the 1970s, this analysis tells us that we have two-sided non-compliance in this experiment. Some children assigned to the treatment group did not take the treatment, and some children took the treatment even though they were assigned to the control group.

Calculate the proportion of compliers in this experiment. Which assumptions are required for us to identify this quantity?

We can calculate the proportion of compliers via \(E[D_i|Z_i = 1] - E[D_i|Z_i = 0] = \bar{D}_{Z_i = 1} - \bar{D}_{Z_i = 0}\)

d_z_1 <- mean(sesame$watched[sesame$encouraged == 1])
d_z_0 <- mean(sesame$watched[sesame$encouraged == 0])
proportion_compliers <- d_z_1 - d_z_0
proportion_compliers

## [1] 0.3624402

# with OLS regression (= first stage regression)
coef(lm(watched ~ encouraged, data=sesame))

## (Intercept)  encouraged 
##   0.5454545   0.3624402

Roughly 36% of the children in the sample are compliers. We require two assumptions to identify the proportion of compliers:

No defiers (monotonicity) – we have to rule out any defiers from the sample
Independence of the instrument – we assume that the instrument (\(Z_i\), whether a child was encouraged to watch Sesame Street) is randomly assigned

The independence assumption allows us to infer that the proportion of always-takers in the control group (something that is observable) is equal to the proportion of always-takers in the treatment group (something that is not observable). The resulting difference in means is the first stage: the effect of encouragement on watching.
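The shares of all three unit types can be recovered from the cross-tabulation above. The following is a minimal sketch that re-enters the four cell counts by hand rather than loading the data; the object names (`counts`, `share_at`, `share_nt`, `share_co`) are illustrative and not part of the seminar code.

```r
# Cell counts copied from the table above
# (rows: encouraged = 0/1; columns: watched = 0/1)
counts <- matrix(c(40, 48,
                   14, 138),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(encouraged = c("0", "1"),
                                 watched = c("0", "1")))

# Row proportions: P(watched | encouraged)
row_props <- prop.table(counts, margin = 1)

# Under independence and no defiers:
share_at <- row_props["0", "1"]      # watched without encouragement: always-takers
share_nt <- row_props["1", "0"]      # did not watch despite encouragement: never-takers
share_co <- 1 - share_at - share_nt  # the remainder: compliers

round(c(always_takers = share_at, never_takers = share_nt, compliers = share_co), 3)
# approximately 0.545, 0.092 and 0.362
```

The complier share obtained this way is the same number as the difference in means computed above, because 1 minus the never-taker share is the proportion of encouraged children who watched.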

Calculate the Intention-to-Treat effect (ITT). What is the interpretation of the ITT here?

We can calculate the ITT via \(E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0] = \bar{Y}_{Z_i = 1} - \bar{Y}_{Z_i = 0}\)

# Using the difference in means:
y_z_1 <- mean(sesame$letters[sesame$encouraged == 1])
y_z_0 <- mean(sesame$letters[sesame$encouraged == 0])
itt <- y_z_1 - y_z_0
itt

## [1] 2.875598

# Using OLS regression (= reduced form regression)
coef(lm(letters ~ encouraged, data = sesame))

## (Intercept)  encouraged 
##   24.920455    2.875598

The ITT estimate is equal to 2.876.

The ITT estimates the causal effect of treatment assignment on the outcome of interest. Here, the ITT estimates the causal effect of being encouraged to watch Sesame Street on a child’s score on the literacy test. The estimate of 2.876 implies that this encouragement increases literacy scores by nearly 3 points, on average. Note however that this effect is not very precisely estimated (p = 0.109) and does not represent a substantively large effect (the standard deviation of the outcome variable here is a little over 13).

2. Local Average Treatment Effect (LATE)

What does the LATE estimate?

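The LATE estimates the average effect of the treatment on the outcome for compliers: those units in the sample whose treatment status is determined by the encouragement. Here, these are the children who watch Sesame Street if encouraged and do not watch if not encouraged.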
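The link between the ITT and the LATE can be written out explicitly. The sketch below uses notation that is an assumption of this note rather than of the seminar text: \(Y_i(1)\) and \(Y_i(0)\) are potential outcomes under treatment and control, and \(\pi_{co}\), \(\pi_{at}\), \(\pi_{nt}\) are the population shares of compliers, always-takers and never-takers. It relies on independence, the exclusion restriction, no defiers, and a non-zero first stage.

```latex
\begin{aligned}
\text{ITT} &= E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0] \\
           &= \pi_{co} \, E[Y_i(1) - Y_i(0) \mid \text{complier}]
              + \pi_{at} \cdot 0 + \pi_{nt} \cdot 0 \\[4pt]
\text{LATE} &= E[Y_i(1) - Y_i(0) \mid \text{complier}]
             = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}
                    {E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]}
             = \frac{\text{ITT}}{\pi_{co}}
\end{aligned}
```

The zeros come from the exclusion restriction: encouragement does not change the treatment status of always-takers or never-takers, so it cannot change their outcomes. With the ITT of about 2.876 and a complier share of about 0.362 from the previous section, the ratio is roughly 7.93.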

Estimate the LATE for this example. You should do this in three ways (all of which can be found on the lecture slides!):

Using the Wald estimator.

Using “manual” two-stage least squares (i.e. you need to specify the regressions yourself).

Using ivreg from the AER package.

# Wald estimator
itt/proportion_compliers

## [1] 7.933993

# Equivalently
first_stage <- lm(watched ~ encouraged, data = sesame)
reduced_form <- lm(letters ~ encouraged, data = sesame)
coef(reduced_form)[2]/coef(first_stage)[2]

## encouraged 
##   7.933993

# Two-stage least squares (manual)
first_stage <- lm(watched ~ encouraged, data = sesame)
sesame$fitted_d <- predict(first_stage)
second_stage <- lm(letters ~ fitted_d, data = sesame)
summary(second_stage)$coefficients[2,]

##   Estimate Std. Error    t value   Pr(>|t|) 
##  7.9339934  4.9267758  1.6103825  0.1086398

# Two-stage least squares (ivreg)
library(AER)
tsls_ivreg <- ivreg(formula = letters ~ watched,
                instruments = ~ encouraged,
                data = sesame)

summary(tsls_ivreg)$coefficients[2,]

##   Estimate Std. Error    t value   Pr(>|t|) 
## 7.93399340 4.60580156 1.72260860 0.08625833

The LATE estimate is equal to 7.934.

The estimate tells us that, for those children who complied with the encouragement, watching Sesame Street increases a child’s score on the literacy test by nearly 8 points, on average. This estimate is much larger than the ITT, and (as evidenced by the standard error, t-value, and p-value from the ivreg summary) the effect is significantly different from zero at the 90% confidence level. (Note that the manual two-stage least squares approach does not yield the correct standard errors (here they are too big) and therefore you should not make statistical inferences based on the p-values obtained this way.)
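To see why the manual second stage gives the wrong standard error, it can help to work through a small simulation in base R. The sketch below does not use the Sesame Street data: the sample size, type shares and effect size are all made up for illustration, and every object name is hypothetical. The point is that the second-stage regression computes its residuals from the fitted treatment, whereas the 2SLS residuals should be computed from the observed treatment.

```r
# Simulated encouragement design (NOT the Sesame Street data; all values are made up)
set.seed(42)
n <- 2000
z <- rbinom(n, 1, 0.5)                                     # randomised encouragement
type <- sample(c("co", "at", "nt"), n, replace = TRUE,
               prob = c(0.4, 0.4, 0.2))                    # latent unit types, no defiers
d <- ifelse(type == "at", 1, ifelse(type == "nt", 0, z))   # treatment actually taken
y <- 20 + 8 * d + 5 * (type == "at") + rnorm(n, sd = 10)   # true effect of d is 8

# Manual two-stage least squares
first_stage  <- lm(d ~ z)
d_hat        <- fitted(first_stage)
second_stage <- lm(y ~ d_hat)
beta         <- coef(second_stage)

# The Wald estimator gives the same point estimate
wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
  (mean(d[z == 1]) - mean(d[z == 0]))

# Naive SE: residual variance based on y - b0 - b1 * d_hat (wrong for 2SLS)
se_naive <- summary(second_stage)$coefficients["d_hat", "Std. Error"]

# Corrected SE: same coefficients, but residuals use the observed treatment d
resid_iv   <- y - beta[1] - beta[2] * d
sigma_iv   <- sqrt(sum(resid_iv^2) / (n - 2))
se_correct <- se_naive * sigma_iv / summary(second_stage)$sigma

c(wald = wald, tsls = unname(beta[2]),
  se_naive = se_naive, se_correct = unname(se_correct))
```

The corrected standard error here assumes homoskedastic errors and a residual degrees-of-freedom of n minus 2; it should correspond to what ivreg reports by default, but in practice it is simpler and safer to let ivreg do this calculation.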

You have now estimated two treatment effects: the ITT and LATE. Which is of greater interest to the TV show’s producers?

The LATE seems like a much more valuable quantity of interest to the producers of Sesame Street than the ITT. Because the ITT combines information on both the extent to which the treatment was adhered to by the respondents, and the effect of the treatment itself on the outcome, it obscures clear conclusions about the effectiveness of Sesame Street as an educational programme.

By contrast, the LATE gives a very clear answer: it tells us that, for those children who complied with the encouragement to watch or not watch Sesame Street, the causal effect of watching the programme was to increase their literacy skills by 8 points on average. From the point of view of the TV producers, this is helpful information as it directly informs them about the educational impact of their show on those who watch. Of course, it may be the case that the compliers in this example are very different from the always-takers or never-takers, and so the generalizability of this result cannot be established from this single experiment.

3. Exclusion restriction

What does the assumption of the exclusion restriction mean in this example? Are you convinced that the exclusion restriction holds here?

The exclusion restriction states that the instrument, Z, can only affect the outcome, Y, through its effect on the treatment, D. Here, this implies that for those children whose behaviour would not have been changed by the encouragement (i.e. never-takers and always-takers), there can be no effect of the encouragement on outcomes. In other words, there is no effect of encouragement on learning outcomes aside from when encouragement successfully prompts children to watch Sesame Street.

It seems likely that the exclusion restriction is a reasonable assumption in this setting. If the parents of a child are encouraged to sit their child in front of Sesame Street, it is difficult to think of a way that that assignment might affect their child’s literacy skills other than if they actually comply with the treatment.

4. Characterising the compliers

Use the ivdesc() function from the ivdesc package to evaluate differences between compliers, always-takers, and never-takers in this sample. What is the mean age of compliers? What fraction of compliers are female? Are compliers significantly different from other types of units with respect to these covariates?

ivdesc(X = sesame$age, Z = sesame$encouraged, D = sesame$watched)

group	mu	mu_se	pi	pi_se
sample	51.52500	0.4127389	1.0000000	0.0000000
co	50.11353	1.4263739	0.3624402	0.0583696
nt	52.78571	1.5691672	0.0921053	0.0248412
at	52.25000	0.9113153	0.5454545	0.0520890

Bootstrapped p-values:

group	Pr(T<t)	Pr(T>t)
co_vs_nt	0.9	0.1
co_vs_at	0.868	0.132
at_vs_nt	0.593	0.407

Balance test: H0: E[X|Z=0]=E[X|Z=1] Pr(|T| > |t|) = 0.72

ivdesc(X = sesame$female, Z = sesame$encouraged, D = sesame$watched)

group	mu	mu_se	pi	pi_se
sample	0.5208333	0.0325297	1.0000000	0.0000000
co	0.4898240	0.1039806	0.3624402	0.0584736
nt	0.6428571	0.1331378	0.0921053	0.0243777
at	0.5208333	0.0693442	0.5454545	0.0531375

Bootstrapped p-values:

group	Pr(T<t)	Pr(T>t)
co_vs_nt	0.811	0.189
co_vs_at	0.589	0.411
at_vs_nt	0.783	0.211

Balance test: H0: E[X|Z=0]=E[X|Z=1] Pr(|T| > |t|) = 0.756

Children who comply with the encouragement are somewhat younger (50.1 months) than either never-takers (52.8 months) or always-takers (52.3 months).

A smaller fraction of compliers are female (49%) than is true for always-takers (52%) or never-takers (64%).

However, in neither case is there sufficient evidence to reject the null hypothesis of no difference between these groups of units. The p-values are all relatively large on all of the pairwise comparisons.

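Overall, although the compliers are somewhat different from the always- and never-takers, these differences are small and not statistically significant. At least with respect to age and gender, we therefore have little reason to think that the local average treatment effect is very different from the overall average treatment effect.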
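Compliers cannot be identified individually, so it is worth seeing where the complier mean in the output comes from. Under independence and no defiers, the never-taker mean is observable among encouraged children who did not watch, and the always-taker mean among non-encouraged children who watched. Because the sample mean is a share-weighted average of the three type means, the complier mean can be backed out. The sketch below checks this identity against the numbers reported for age above; it is meant as an illustration of the logic, not as a description of the internals of ivdesc(), and the object names are hypothetical.

```r
# Values copied from the ivdesc output for age above
mu_sample <- 51.52500
mu_nt <- 52.78571; pi_nt <- 0.0921053
mu_at <- 52.25000; pi_at <- 0.5454545
pi_co <- 0.3624402

# mu_sample = pi_co * mu_co + pi_nt * mu_nt + pi_at * mu_at, solved for mu_co
mu_co <- (mu_sample - pi_nt * mu_nt - pi_at * mu_at) / pi_co
round(mu_co, 2)
# approximately 50.11, matching the co row of the output
```

Because the result is divided by the complier share, a small share of compliers makes the estimated complier mean noisy, which is one reason the standard error in the co row is larger than that for the full sample.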

7.1.2 Estimating the Impact of The Hajj

Clingingsmith, Khwaja and Kremer (2009) estimate the impact on pilgrims of performing the Hajj pilgrimage to Mecca using an instrumental variables approach. They compare successful and unsuccessful applicants in a randomized lottery used by Pakistan to allocate Hajj visas and examine the impact of the Hajj pilgrimage on the subsequent beliefs and values of Pakistani Muslims.

You can download the data for this part of the assignment from the top of the page, and load it using the following command:

load("data/hajjdata.Rdata")

You will again need the AER package for this problem:

library(AER)

The data object, hajj, includes the following key variables:

moderacy, an index ranging from 0 to 4 constructed from opinion questions, where higher values indicate more moderate views on Islamic practices, Islamist terrorism, and the status of women
success = 1 if the respondent won the lottery for a Hajj visa, 0 otherwise
hajj2006 = 1 if the respondent went on the Hajj, 0 otherwise
age, measured in years
literate = 1 if respondent is literate, 0 otherwise
urban = 1 if respondent lives in an urban area, 0 otherwise

1. Calculating non-compliance

Calculate (i) the proportion of people who won the lottery and did not go on the Hajj and (ii) the proportion of people who lost the lottery and went on the Hajj. Using these answers, what type of non-compliance occurred in this natural experiment?

# assigned to control, went on hajj
length(hajj$hajj2006[hajj$success==0 & hajj$hajj2006==1])/length(hajj$hajj2006[hajj$success==0])

## [1] 0.1373333

# assigned to treatment, didn’t go on hajj
length(hajj$hajj2006[hajj$success==1 & hajj$hajj2006==0])/length(hajj$hajj2006[hajj$success==1])

## [1] 0.008187135

13.7% of lottery losers went on the Hajj, while just 0.82% of lottery winners did not go. Therefore this is (just) two-sided non-compliance, but non-compliance was very rare amongst those assigned to treatment.

In this study, who are the compliers and who are the always-takers?

The compliers are people who go on the Hajj if they win the visa lottery and do not go if they lose it. The always-takers are people who go on the Hajj regardless of their lottery outcome.

2. Calculating the ITT and the LATE

Calculate the ITT, using moderacy as the outcome variable. What does the ITT represent in this example?

itt <- mean(hajj$moderacy[hajj$success==1]) - 
  mean(hajj$moderacy[hajj$success==0])
itt

## [1] 0.1065497

The intent-to-treat effect is 0.107, meaning that winning the visa lottery caused a 0.107-point increase in moderacy on the 0 to 4 scale.

Calculate the proportion of compliers and the LATE in this example. Interpret your results.

proportion_compliers <- 
  sum(
    hajj$hajj2006[hajj$success==1])/length(hajj$hajj2006[hajj$success==1]) - 
  sum(
    hajj$hajj2006[hajj$success==0])/length(hajj$hajj2006[hajj$success==0])
late <- itt/proportion_compliers

late

## [1] 0.1246954

The proportion of compliers is 0.854, meaning that 85.4% of people in this study are compliers. The LATE is 0.125, meaning that amongst the compliers, going on the Hajj causes an increase in moderacy of 0.125 points.
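The same numbers can be checked by hand from the non-compliance rates calculated earlier. This is a minimal sketch using values copied from the output above rather than the data itself; the object names `share_losers_went` and `share_winners_stayed` are illustrative.

```r
# Values copied from the output above
share_losers_went    <- 0.1373333    # P(hajj2006 = 1 | success = 0): always-takers
share_winners_stayed <- 0.008187135  # P(hajj2006 = 0 | success = 1): never-takers
itt                  <- 0.1065497

# Whoever is neither an always-taker nor a never-taker is a complier (no defiers)
proportion_compliers <- 1 - share_winners_stayed - share_losers_went
late <- itt / proportion_compliers

round(c(proportion_compliers = proportion_compliers, late = late), 3)
# approximately 0.854 and 0.125
```

Because compliance is so high in this lottery, the LATE is only a little larger than the ITT.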

Calculate the local average treatment effect (LATE) using two-stage least squares and verify that your answer is identical to the one you calculated in the previous question. Report its standard error. Is the LATE statistically significant?

iv_out <- ivreg(formula = moderacy ~ hajj2006,
                instruments = ~success, 
                data=hajj)

summary(iv_out)$coefficients[2,]

##    Estimate  Std. Error     t value    Pr(>|t|) 
## 0.124695447 0.040491161 3.079572010 0.002108187

As expected, the result is identical to the LATE calculated above. The standard error is 0.04 and the p-value is 0.002, meaning that the LATE is statistically significant at the 99% confidence level.
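The test statistic and an approximate confidence interval follow directly from the estimate and its standard error. The sketch below uses the two numbers from the ivreg output above and a normal approximation, which is an assumption made here for simplicity (ivreg reports a t-based p-value, so the results will differ very slightly).

```r
# Estimate and standard error copied from the ivreg output above
est <- 0.124695447
se  <- 0.040491161

t_stat <- est / se                                 # about 3.08, as in the t value column
p_norm <- 2 * pnorm(-abs(t_stat))                  # two-sided p-value, normal approximation
ci_95  <- est + c(-1, 1) * qnorm(0.975) * se       # approximate 95% confidence interval

round(c(t = t_stat, p = p_norm, lower = ci_95[1], upper = ci_95[2]), 3)
```

The interval runs from roughly 0.05 to 0.20 and excludes zero, which is consistent with the conclusion drawn from the p-value.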

7.2 Quiz

What is an instrumental variable useful for, in the context of an experiment?

To increase precision of our estimates and compute smaller standard errors
To overcome an issue of selection bias due to non-random treatment intake (non-compliance)
To randomly assign treatment intake to the sample units
To estimate counterfactuals for treated units as weighted averages of untreated ones

What does the intention to treat effect (ITT) represent?

The average effect of assigning units to treatment
The average effect of receiving the treatment
The average effect of receiving the treatment for units in the treatment group
The individual treatment effect for the treated unit

What does the local average treatment effect (LATE) represent?

The average treatment effect for units that are spatially clustered
The average treatment effect regardless of units’ treatment assignment
The average treatment effect for units who comply with their treatment assignment
The average treatment effect for units who defy their treatment assignment

Which of the following assumptions do we not need in order to estimate a LATE with a Wald estimator or 2SLS?

That the instrumental variable affects the outcome variable only through the treatment variable (= Exclusion restriction)
That units would all have been under the control condition, had they not received a treatment assignment
That the assignment to a condition under the instrument is independent of potential outcomes and potential treatment status (= Independence assumption)
That no unit behaves in such a way that its treatment status is the opposite of its treatment assignment because of its treatment assignment (= Monotonicity assumption/No defiers)
That treatment assignment has at least some effect on treatment status (= First stage assumption)

How can we make sure that the LATE is a meaningful quantity in an experiment?

The LATE is most often not a meaningful quantity. We’d better estimate the ITT
We can check whether the distribution of covariates in the treatment group is similar to that in the control group and overall sample
We can check whether the distribution of covariates in the complier group is similar to that in the other groups and overall sample
We can check whether the distribution of covariates in the defier group is similar to that in the other groups and overall sample

Which one of the following is NOT an advantage of a 2SLS estimator, as opposed to a Wald estimator?

2SLS allows us to include covariates in our first and second stages
2SLS allows us to use a non-binary instrumental variable
2SLS allows us to estimate a LATE even when the exclusion restriction assumption is likely not met
2SLS allows us to employ more than just one instrumental variable in our identification strategy