7 Instrumental Variables (I)

Aside from experiments, all of the strategies covered up to this point rely on the researcher being able to control for confounding factors when estimating causal effects. For the next two weeks, we focus on a strategy – instrumental variables – which can be used to address unobserved confounding factors in the context of cross-sectional data (i.e. when we can’t use the panel data methods discussed in previous weeks). This week, we will motivate instrumental variable (IV) methods, by discussing how this strategy can be useful in the context of experimental data where some units fail to comply with the treatment.

This week, we focus on IV as a strategy for dealing with non-compliance (one-sided and two-sided) in randomized experiments. The clearest exposition of the idea of non-compliance is given in two chapters of the Gerber and Green textbook (chapter 5 and chapter 6), although note that they generally avoid any mention of the phrase “instrumental variable” (I don’t really know why; some of the linguistic decisions in that book are a little idiosyncratic). The discussion of IV estimators is also very good (and short) in the Sovey and Green paper. This paper also emphasises the point that most applications in political science do not use IV to address non-compliance in randomized experiments (the case we discuss this week), but instead use IV as a method for overcoming selection bias in cross-sectional observational studies (the case we will discuss next week). The “reader’s checklist” they provide at the end of the paper is particularly recommended, as it gives good, straightforward advice to anyone thinking about using IV as an estimation strategy for causal effects.

The chapter on IV estimation in MHE is good, though much of the material is beyond the level you (or anyone else, to be honest) will need to implement a good IV design. The treatment in the Mastering ’Metrics book is somewhat easier, and also includes some very interesting applications of IV when used to address non-compliance.

7.1 Seminar

7.1.1 Children’s Television and Educational Performance

Sesame Street

Can educational television programmes improve children’s learning outcomes? Sesame Street is an American television programme aimed at young children. The creators of Sesame Street decided from the very beginning of the show’s production that a central goal would be to educate as well as entertain its audience. As Malcolm Gladwell argued, “Sesame Street was built around a single, breakthrough insight: that if you can hold the attention of children, you can educate them”. In addition to building the show around a carefully constructed educational curriculum, the show’s producers also worked closely with educational researchers to determine whether the show’s content was effectively improving its young viewers’ numeracy and literacy skills.

The dataset contained in sesame_experiment.dta includes information on 240 children who were randomly assigned to two groups. The treatment of interest here is watching Sesame Street, but clearly it is not possible to force children to watch a TV show or (perhaps even harder) to refrain from watching, and so watching the show cannot be randomized. Instead, in this study, researchers randomized whether children were encouraged to watch the show. More specifically, when the study was run in the 1970s, Sesame Street was on the air each day between 9am and 10am. The parents of children in the treatment group were encouraged to show Sesame Street to their children on a regular basis, while parents of the children in the control group were given no such encouragement. Because it is only encouragement that is randomized here, there is the possibility of non-compliance – i.e. some children will not watch Sesame Street even though they are in the treatment condition, and some children will watch Sesame Street even though they are in the control condition. The data is in .dta format, and you can load it as follows:

library(foreign)  # provides read.dta()
sesame <- read.dta("sesame_experiment.dta")

The data includes the following variables:

  1. encouraged – 1 if the child was encouraged to watch Sesame Street, 0 otherwise
  2. watched – 1 if the child watched Sesame Street regularly, 0 otherwise
  3. letters – the score of the child on a literacy test
  4. age – age of the child (in months)
  5. female – 1 if the child is female, 0 otherwise

For this seminar, you will also need the AER and ivdesc packages:

# install.packages(c("AER", "ivdesc"))  # run once if not yet installed
library(AER); library(ivdesc)           # provide ivreg() and ivdesc()

1. Compliance and the intention-to-treat

  1. In the context of this specific example, define the following unit types:
    1. Compliers
    2. Always-takers
    3. Never-takers
    4. Defiers

  1. Compliers: children who would watch Sesame Street if encouraged, and would not watch if not encouraged
  2. Always-takers: children who would watch Sesame Street regardless of encouragement
  3. Never-takers: children who would not watch Sesame Street regardless of encouragement
  4. Defiers: children who would not watch Sesame Street if encouraged, and would watch if not encouraged

The fourth type, the defiers, are assumed not to exist. This is the monotonicity assumption.
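To make the four types concrete, here is a minimal simulated sketch (not the Sesame Street data). The type shares, object names and seed are all invented for illustration. The point is only that a child's type is defined by a pair of potential treatments, what they would do without encouragement and what they would do with it, and that we only ever observe one of the two.

```r
# Hypothetical illustration: simulated children, NOT the sesame data.
set.seed(1)
n <- 1000

# Assumed (made-up) type shares; defiers get share 0, so monotonicity holds
type <- sample(c("complier", "always-taker", "never-taker"),
               size = n, replace = TRUE, prob = c(0.4, 0.5, 0.1))

# Potential treatments: would the child watch if NOT encouraged (d0)
# and if encouraged (d1)?
d0 <- as.integer(type == "always-taker")
d1 <- as.integer(type %in% c("always-taker", "complier"))

# Random encouragement reveals only one of the two potential treatments
z <- rbinom(n, size = 1, prob = 0.5)
d <- ifelse(z == 1, d1, d0)

# Monotonicity: nobody watches less because of encouragement (no defiers)
all(d1 >= d0)

# Observed cells mix types: e.g. encouraged watchers (z = 1, d = 1)
# contain both compliers and always-takers
table(type, paste0("z=", z, ", d=", d))
```

In the cross-tabulation, only two cells contain a single type: non-encouraged watchers are all always-takers, and encouraged non-watchers are all never-takers.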

  1. Calculate the proportion of children in the treatment group who did not watch Sesame Street. Calculate the proportion of children in the control group who did watch Sesame Street. What type of non-compliance occurred in this experiment? Hint: You might find the table() and prop.table() functions helpful here.
# counts in each assignment/treatment group
table(sesame$encouraged, sesame$watched)
##       0   1
##   0  40  48
##   1  14 138
# proportions in each assignment/treatment group
prop.table(table(sesame$encouraged, sesame$watched),1)
##              0          1
##   0 0.45454545 0.54545455
##   1 0.09210526 0.90789474
  • Of the 88 children assigned to the control condition, 48 actually watched Sesame Street.
  • Of the 152 children assigned to the treatment condition, 14 did not watch Sesame Street.

In addition to showing that Sesame Street was clearly a very popular programme in the 1970s, this analysis tells us that we have two-sided non-compliance in this experiment. Some children assigned to the treatment group did not watch the show, and some children assigned to the control group watched it anyway.

  1. Calculate the proportion of compliers in this experiment. Which assumptions are required for us to identify this quantity?

We can calculate the proportion of compliers via \(E[D_i|Z_i = 1] - E[D_i|Z_i = 0] = \bar{D}_{Z_i = 1} - \bar{D}_{Z_i = 0}\)

d_z_1 <- mean(sesame$watched[sesame$encouraged == 1])
d_z_0 <- mean(sesame$watched[sesame$encouraged == 0])
proportion_compliers <- d_z_1 - d_z_0
proportion_compliers
## [1] 0.3624402
# with OLS regression (= first stage regression)
coef(lm(watched ~ encouraged, data=sesame))
## (Intercept)  encouraged 
##   0.5454545   0.3624402

Roughly 36% of the children in the sample are compliers. We require two assumptions to identify the proportion of compliers:

  1. No defiers – we have to rule out any defiers from the sample
  2. Independence of the instrument – we assume that the instrument (\(Z_i\), whether a child was encouraged to watch Sesame Street) is randomly assigned.

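The independence assumption allows us to infer that the proportion of always-takers in the control group (which is observable because, with no defiers, anyone who watches without encouragement must be an always-taker) is equal to the proportion of always-takers in the treatment group (which is not observable, because encouraged always-takers cannot be told apart from encouraged compliers).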
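The same logic gives the share of every type, not just the compliers. The sketch below uses only the four cell counts from the cross-tabulation above (the object names are my own). Under independence and no defiers, children who watch without encouragement must be always-takers, children who do not watch despite encouragement must be never-takers, and the compliers are whoever is left.

```r
# Cell counts copied from the table(encouraged, watched) output above
n_z0_d0 <- 40; n_z0_d1 <- 48    # not encouraged: did not watch / watched
n_z1_d0 <- 14; n_z1_d1 <- 138   # encouraged:     did not watch / watched

# Always-takers: watch even without encouragement (observable when Z = 0)
pi_at <- n_z0_d1 / (n_z0_d0 + n_z0_d1)

# Never-takers: do not watch even with encouragement (observable when Z = 1)
pi_nt <- n_z1_d0 / (n_z1_d0 + n_z1_d1)

# Compliers: the remainder, once defiers are ruled out
pi_co <- 1 - pi_at - pi_nt

round(c(always_takers = pi_at, never_takers = pi_nt, compliers = pi_co), 3)
```

The complier share obtained this way is algebraically the same as the first-stage difference in means, so it should agree with the 0.362 reported above.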

  1. Calculate the Intention-to-Treat effect (ITT). What is the interpretation of the ITT here?

We can calculate the ITT via \(E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0] = \bar{Y}_{Z_i = 1} - \bar{Y}_{Z_i = 0}\)

# Using the difference in means:
y_z_1 <- mean(sesame$letters[sesame$encouraged == 1])
y_z_0 <- mean(sesame$letters[sesame$encouraged == 0])
itt <- y_z_1 - y_z_0
itt
## [1] 2.875598
# Using OLS regression (= reduced form regression)
coef(lm(letters ~ encouraged, data = sesame)) 
## (Intercept)  encouraged 
##   24.920455    2.875598

The ITT estimate is equal to 2.876.

The ITT estimates the causal effect of treatment assignment on the outcome of interest. Here, the ITT estimates the causal effect of being encouraged to watch Sesame Street on a child’s score on the literacy test. The estimate of 2.876 implies that this encouragement increases literacy scores by nearly 3 points, on average. Note however that this effect is not very precisely estimated (p = 0.109) and does not represent a substantively large effect (the standard deviation of the outcome variable here is a little over 13).

2. Local Average Treatment Effect (LATE)

  1. What does the LATE estimate?

The LATE estimates the average effect of the treatment on the outcome for the compliers: here, the average effect of watching Sesame Street on literacy scores among those children who watch if encouraged and do not watch otherwise.

  1. Estimate the LATE for this example. You should do this in three ways (all of which can be found on the lecture slides!):
    1. Using the Wald estimator.
    2. Using “manual” two-stage least squares (i.e. you need to specify the regressions yourself).
    3. Using ivreg from the AER package.
## Wald Estimator
itt / proportion_compliers
## [1] 7.933993
# Equivalently
first_stage <- lm(watched ~ encouraged, data = sesame)
reduced_form <- lm(letters ~ encouraged, data = sesame)
coef(reduced_form)[2] / coef(first_stage)[2]
## encouraged 
##   7.933993
## Two stage least squares (manual)
first_stage <- lm(watched ~ encouraged, data = sesame)
sesame$fitted_d <- predict(first_stage)
second_stage <- lm(letters ~ fitted_d, data = sesame)
summary(second_stage)$coefficients[2, ]
##   Estimate Std. Error    t value   Pr(>|t|) 
##  7.9339934  4.9267758  1.6103825  0.1086398
## Two stage least squares (IV reg)
tsls_ivreg <- ivreg(formula = letters ~ watched,
                instruments = ~ encouraged,
                data = sesame)
summary(tsls_ivreg)$coefficients[2, ]
##   Estimate Std. Error    t value   Pr(>|t|) 
## 7.93399340 4.60580156 1.72260860 0.08625833

The LATE estimate is equal to 7.934.

The estimate tells us that, for those children who complied with the encouragement, watching Sesame Street increased the score on the literacy test by nearly 8 points, on average. This estimate is much larger than the ITT, and (as evidenced by the standard error, t-value, and p-value from the ivreg summary) the effect is significantly different from zero at the 10% level (p = 0.086), though not at the conventional 5% level. (Note that the manual two-stage least squares approach does not yield the correct standard errors (here they are too big), and therefore you should not make statistical inferences based on the p-values obtained this way.)
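To see where the manual standard error goes wrong, here is a minimal sketch on simulated data (not the Sesame Street data; every number and object name is made up). The manual second stage computes its residuals from the fitted treatment values, whereas the 2SLS residuals should be computed from the actual treatment. The point estimate is unaffected, so only the residual variance needs correcting.

```r
# Hypothetical simulated encouragement design (NOT the sesame data)
set.seed(42)
n <- 500
z <- rbinom(n, 1, 0.5)                    # randomized encouragement
u <- rnorm(n)                             # unobserved confounder
d <- as.integer(0.2 + 0.4 * z + 0.3 * u + rnorm(n, sd = 0.3) > 0.5)  # take-up
y <- 20 + 8 * d + 5 * u + rnorm(n)        # outcome; z affects y only through d

# Manual two-stage least squares
first_stage  <- lm(d ~ z)
d_hat        <- fitted(first_stage)
second_stage <- lm(y ~ d_hat)

b        <- coef(second_stage)            # the point estimates are fine
se_naive <- summary(second_stage)$coefficients["d_hat", "Std. Error"]

# Naive residuals use d_hat; the correct ones plug in the ACTUAL treatment d
res_naive   <- residuals(second_stage)
res_correct <- y - b[1] - b[2] * d

# Both standard errors share the same design matrix, so the correction is
# just the ratio of the two residual standard deviations
se_correct <- se_naive * sqrt(sum(res_correct^2) / sum(res_naive^2))

c(estimate = unname(b[2]), se_naive = se_naive, se_correct = se_correct)
```

In general the naive standard error can be either too large or too small, depending on the data. In practice there is no need to do this by hand, since ivreg reports the corrected version.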

  1. You have now estimated two treatment effects: the ITT and LATE. Which is of greater interest to the TV show’s producers?

The LATE seems like a much more valuable quantity of interest to the producers of Sesame Street than the ITT. Because the ITT combines information on both the extent to which the treatment was adhered to by the respondents, and the effect of the treatment itself on the outcome, it obscures clear conclusions about the effectiveness of Sesame Street as an educational programme.

By contrast, the LATE gives a very clear answer: it tells us that, for those children whose viewing was determined by the encouragement (the compliers), the causal effect of watching the programme was to increase their literacy scores by about 8 points on average. From the point of view of the TV producers, this is helpful information as it directly informs them about the educational impact of their show on those who watch. Of course, it may be the case that the compliers in this example are very different from the always-takers or never-takers, and so the generalizability of this result cannot be established from this single experiment.
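The caveat about generalizability can be illustrated with a small simulation (again not the Sesame Street data; the type shares and effect sizes are invented). If compliers benefit more from the treatment than always-takers do, the Wald estimator recovers the complier effect rather than the average effect across all units.

```r
# Hypothetical simulation (NOT the sesame data): effects differ by type
set.seed(123)
n <- 100000
type <- sample(c("co", "at", "nt"), n, replace = TRUE, prob = c(0.4, 0.5, 0.1))

# Made-up effects of watching: compliers +8, always-takers +2, never-takers 0
tau <- c(co = 8, at = 2, nt = 0)[type]

z <- rbinom(n, 1, 0.5)                                    # randomized encouragement
d <- as.integer(type == "at" | (type == "co" & z == 1))   # take-up implied by type
y <- 25 + tau * d + rnorm(n, sd = 5)                      # z enters only via d

itt  <- mean(y[z == 1]) - mean(y[z == 0])    # reduced form
p_co <- mean(d[z == 1]) - mean(d[z == 0])    # first stage
wald <- itt / p_co

# Compare the Wald estimate with the true complier and population effects
c(wald = wald, complier_effect = 8, average_effect = mean(tau))
```

With these made-up numbers the Wald estimate should land close to 8 (the complier effect), well away from the population average of roughly 4.2.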

3. Exclusion restriction

  1. What does the assumption of the exclusion restriction mean in this example? Are you convinced that the exclusion restriction holds here?

The exclusion restriction states that the instrument, Z, can only affect the outcome, Y, through its effect on the treatment, D. Here, this implies that for those children whose behaviour would not have been changed by the encouragement (i.e. never-takers and always-takers), there can be no effect of the encouragement on outcomes. In other words, there is no effect of encouragement on learning outcomes aside from when encouragement successfully prompts children to watch Sesame Street.
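To see why this assumption matters, the ITT can be split by unit type. This is a sketch in the notation used above. The symbols \(\pi_{co}\), \(\pi_{at}\) and \(\pi_{nt}\) (the shares of compliers, always-takers and never-takers) and \(ITT_{co}\), \(ITT_{at}\) and \(ITT_{nt}\) (the effect of encouragement on the outcome within each type) are introduced here for convenience. With no defiers and a randomly assigned instrument:

\[
\begin{aligned}
ITT &= E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0] \\
    &= \pi_{co} \, ITT_{co} + \pi_{at} \, ITT_{at} + \pi_{nt} \, ITT_{nt} \\
    &= \pi_{co} \, ITT_{co} + \pi_{at} \cdot 0 + \pi_{nt} \cdot 0 \\
    &= \pi_{co} \cdot LATE
\end{aligned}
\]

The third line is where the exclusion restriction enters: encouragement does not change whether always-takers or never-takers watch, so it cannot change their outcomes. For compliers, encouragement switches \(D_i\) from 0 to 1, so \(ITT_{co}\) is the effect of watching for compliers, i.e. the LATE. Dividing by \(\pi_{co} = E[D_i|Z_i = 1] - E[D_i|Z_i = 0]\) gives the Wald estimator:

\[
LATE = \frac{E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0]}{E[D_i|Z_i = 1] - E[D_i|Z_i = 0]}
\]

If encouragement affected literacy scores directly, \(ITT_{at}\) and \(ITT_{nt}\) would not be zero, and this ratio would wrongly attribute those direct effects to watching the show.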

It seems likely that the exclusion restriction is a reasonable assumption in this setting. If the parents of a child are encouraged to sit their child in front of Sesame Street, it is difficult to think of a way that that assignment might affect their child’s literacy skills other than if they actually comply with the treatment.

4. Characterising the compliers

  1. Use the ivdesc() function from the ivdesc package to evaluate differences between compliers, always-takers, and never-takers in this sample. What is the mean age of compliers? What fraction of compliers are female? Are compliers significantly different from other types of units with respect to these covariates?
ivdesc(X = sesame$age, Z = sesame$encouraged, D = sesame$watched)
##   group       mu     mu_se        pi     pi_se
##  sample 51.52500 0.4059164 1.0000000 0.0000000
##      co 50.11353 1.3784031 0.3624402 0.0576833
##      nt 52.78571 1.5109488 0.0921053 0.0251480
##      at 52.25000 0.8988697 0.5454545 0.0516153
## 
## Bootstrapped p-values:
## 
##     group Pr(T<t) Pr(T>t)
##  co_vs_nt   0.902   0.098
##  co_vs_at   0.867   0.133
##  at_vs_nt   0.604   0.394
## 
## Balance test: H0: E[X|Z=0]=E[X|Z=1] Pr(|T| > |t|) = 0.72

ivdesc(X = sesame$female, Z = sesame$encouraged, D = sesame$watched)
##   group        mu     mu_se        pi     pi_se
##  sample 0.5208333 0.0327572 1.0000000 0.0000000
##      co 0.4898240 0.1038713 0.3624402 0.0580338
##      nt 0.6428571 0.1301049 0.0921053 0.0237566
##      at 0.5208333 0.0706316 0.5454545 0.0531868
## 
## Bootstrapped p-values:
## 
##     group Pr(T<t) Pr(T>t)
##  co_vs_nt   0.813   0.187
##  co_vs_at   0.594   0.406
##  at_vs_nt   0.799   0.197
## 
## Balance test: H0: E[X|Z=0]=E[X|Z=1] Pr(|T| > |t|) = 0.756

Children who comply with the encouragement are somewhat younger (50.1 months) than either never-takers (52.8 months) or always-takers (52.3 months).

A smaller fraction of compliers are female (49%) than is true for always-takers (52%) or never-takers (64%).

However, in neither case is there sufficient evidence to reject the null hypothesis of no difference between these groups of units. The p-values are all relatively large on all of the pairwise comparisons.

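Overall, although the compliers are somewhat different from the always- and never-takers on these two covariates, the differences are small and not statistically distinguishable from zero. This gives some (limited) reassurance that the local average treatment effect is unlikely to be very different from the overall average treatment effect, although similarity on age and sex alone cannot establish this.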
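Since we never observe which individual children are compliers, it may help to see how a complier mean can be backed out at all. The sketch below uses only the numbers printed in the age output above (the object names are mine) and the identity that the full-sample mean is a share-weighted average of the three type means. It illustrates the logic under independence and monotonicity; I am not claiming that it reproduces the internals of ivdesc().

```r
# Numbers copied from the ivdesc() output for age above
mu_sample <- 51.52500                     # mean age, full sample
mu_nt <- 52.78571; pi_nt <- 0.0921053     # never-takers (encouraged non-watchers)
mu_at <- 52.25000; pi_at <- 0.5454545     # always-takers (non-encouraged watchers)
pi_co <- 0.3624402                        # complier share

# mu_sample = pi_co * mu_co + pi_nt * mu_nt + pi_at * mu_at, solved for mu_co
mu_co <- (mu_sample - pi_nt * mu_nt - pi_at * mu_at) / pi_co
round(mu_co, 2)
```

The result agrees with the complier mean age of about 50.1 months in the output. It is a residual, obtained by dividing by a complier share of only 0.36, which is part of why its standard error is larger than those of the other groups.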

7.1.2 Estimating the Impact of The Hajj

Clingingsmith, Khwaja and Kremer (2009) estimate the impact on pilgrims of performing the Hajj pilgrimage to Mecca using an instrumental variables approach. They compare successful and unsuccessful applicants in a randomized lottery used by Pakistan to allocate Hajj visas and examine the impact of the Hajj pilgrimage on the subsequent beliefs and values of Pakistani Muslims.

You can download the data for this part of the assignment from the top of the page, and load it using the following command:


You will again need the AER package for this problem:

library(AER)

The data object, hajj, includes the following key variables:

  1. moderacy, an index ranging from 0 to 4 constructed from opinion questions, where higher values indicate more moderate views on Islamic practices, Islamist terrorism, and the status of women
  2. success = 1 if the respondent won the lottery for a Hajj visa, 0 otherwise
  3. hajj2006 = 1 if the respondent went on the Hajj, 0 otherwise
  4. age, measured in years
  5. literate = 1 if respondent is literate, 0 otherwise
  6. urban = 1 if respondent lives in an urban area, 0 otherwise

1. Calculating non-compliance

  1. Calculate (i) the proportion of people who won the lottery and did not go on the Hajj and (ii) the proportion of people who lost the lottery and went on the Hajj. Using these answers, what type of non-compliance occurred in this natural experiment?
  1. In this study, who are the compliers and who are the always-takers?

2. Calculating the ITT and the LATE

  1. Calculate the ITT, using moderacy as the outcome variable. What does the ITT represent in this example?
  1. Calculate the proportion of compliers and the LATE in this example. Interpret your results.
  1. Calculate the local average treatment effect (LATE) using two-stage least squares and verify that your answer is identical to part (c). Report its standard error. Is the LATE statistically significant?