4 Selection on Observables (II)

This week we continue to consider methods that rely on a selection-on-observables assumption for making causal inferences from non-experimental data. In particular, this week we focus on assessing under which conditions linear regression can be used to make causal statements.

In lecture this week we revisited the linear regression model, and discussed in particular the circumstances under which regression coefficients resemble the average treatment effects that are our core concern on this course. If you need a refresher on linear regression, then chapter 2 in the Mastering ’Metrics book is excellent. In addition to covering the basics of regression analysis, it provides a very accessible treatment of omitted variable bias (pp. 68-74 and pp. 89-90), as well as helpful summaries of regression with logged dependent variables (pp. 93-94) and uncertainty in regression analysis (pp. 95-96). All of these topics are also covered in the MHE book (chapter 3), though at a higher level of mathematical detail.

We spent a reasonable amount of time describing the implicit conditional-variance weighting that OLS imposes when calculating treatment effects while conditioning on observed covariates. The MHE book provides a derivation and explanation of this result on pages 73 to 76, and there is a similar treatment in the Morgan and Winship book (chapter 6). One of the implications of this implicit weighting scheme is that the effective sample on which the results of any observational regression are based may be very different from the full population of interest; a result that is clearly and cleanly articulated in the Aronow and Samii (2016) paper (we strongly recommend this paper!). We also recommend the paper by Ho et al. (2007), which gives an excellent (and intuitive) account of how combining matching with regression analysis can reduce model dependence in causal analyses.
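To build intuition for this weighting result, here is a small simulated sketch (hypothetical data, not from the course datasets): two strata with different treatment probabilities and different treatment effects. The OLS coefficient on the treatment, controlling for the stratum, recovers a conditional-variance-weighted average of the stratum-specific effects rather than the ATE:

```r
set.seed(123)
n <- 200000
x <- rbinom(n, 1, 0.5)              # binary covariate defining two strata
p <- ifelse(x == 1, 0.9, 0.5)       # Var(D|X=1) = 0.09, Var(D|X=0) = 0.25
d <- rbinom(n, 1, p)                # treatment, assigned within strata
y <- d * ifelse(x == 1, 3, 1) + x + rnorm(n)  # effects: 1 if x=0, 3 if x=1

ols_coef <- coef(lm(y ~ d + x))["d"]

# Weight each stratum effect by Var(D|X = x) (stratum shares are equal here)
w <- tapply(d, x, var)
variance_weighted <- sum(w * c(1, 3)) / sum(w)  # about 1.53; the ATE is 2
```

With equal-sized strata, the OLS estimate sits close to the variance-weighted average (about 1.53), well below the ATE of 2, because the x = 1 stratum, where treatment is nearly ubiquitous, contributes little conditional treatment variance.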

This week we will be learning how to use linear regression to estimate causal effects (under the selection on observables assumption) in R. We will also learn how to specify non-linear models for certain explanatory variables by using polynomials, and how to construct heteroskedasticity-robust standard errors. Finally, we will also learn some functions for interpreting the output of our models.

4.1 Seminar


We will continue to use the Eggers and Hainmueller (2009) data and the Dehejia and Wahba (1999) observational data that we introduced last week. If you didn’t complete the previous problem set, or if you are working from a new computer, please download those datasets now using the links from the top of the page.


As with last week, we will need to use the foreign package, so you should begin by loading that at the top of your script. You will also need a number of new packages this week:

  1. texreg, which can be used to easily output nicely formatted regression tables.
  2. estimatr, which can be used to estimate models with various types of robust standard errors.
  3. lmtest, a package which contains further functions to help calculate alternative standard errors.

You will need to install and then load these as well. Finally, and as ever, you should also set your working directory to the correct location.

install.packages(c("texreg", "estimatr", "lmtest")) # Remember, you only need to run this line once and then the code can be deleted!


4.1.1 MPs for Sale? (Revisited) – Eggers and Hainmueller (2009)

Recall that Eggers and Hainmueller are interested in establishing whether there is a causal effect of holding political office on the personal wealth of an individual. Last week we estimated these effects using various forms of matching. This week we will investigate similar claims using regression.

You should begin by loading the data:


As a reminder of some of the material we covered in week 2, linear regression models are estimated in R using the lm() function. Its main arguments are as follows:

Argument Description
formula The formula describes the relationship between the dependent and independent variables, for example dependent.variable ~ independent.variable.1 + independent.variable.2
data This is the name of the dataset that contains the variable of interest

Remember, if you would like more information on how the lm() function works, type help(lm) or ?lm in R.

Fitting a Linear Regression Model

  1. Fit a simple linear regression model for the mps data, where our outcome variable is the logged wealth at death of an individual (lnrealgross), and the predictor variable is whether the individual attended an Oxbridge university (ucat_ox). Interpret the two coefficients that you estimate.
oxbridge_model <- lm(lnrealgross ~ ucat_ox, data = mps)
## Call:
## lm(formula = lnrealgross ~ ucat_ox, data = mps)
## Coefficients:
## (Intercept)      ucat_ox  
##     12.5084       0.3936

The intercept represents the mean level of the (log) wealth at death of individuals who did not attend an Oxbridge university, and the coefficient on the ucat_ox variable represents the estimated difference in means. The mean level of log wealth for individuals who did attend Oxbridge is therefore given by the sum of the two coefficients.
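This equivalence between a bivariate regression with a binary regressor and a simple difference in means is easy to verify. The sketch below uses simulated data (hypothetical numbers, not the mps data):

```r
set.seed(42)
ox  <- rbinom(200, 1, 0.4)                    # hypothetical binary regressor
lnw <- 12.5 + 0.4 * ox + rnorm(200, sd = 0.5) # hypothetical log-wealth outcome
fit <- lm(lnw ~ ox)

# The intercept equals the mean outcome in the ox = 0 group
intercept_check <- mean(lnw[ox == 0])
# The slope equals the difference in group means
slope_check <- mean(lnw[ox == 1]) - mean(lnw[ox == 0])
```

Up to floating-point error, coef(fit)[1] matches intercept_check and coef(fit)[2] matches slope_check.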

  2. Once we have estimated a regression model, we can use that model to produce fitted values. Recall that the formula for fitted values is: \[\hat{Y}_{i} = \hat{\alpha} + \hat{\beta}_1 X_i\] Using this formula and the model we estimated in question 1, calculate the two fitted values by hand: one for individuals who did study at an Oxbridge university and one for individuals who did not study at an Oxbridge university.
  • Did study at Oxbridge: \[\hat{Y}_{i} = 12.5084 + 0.3936 \times 1 = 12.90196\]
  • Did not study at Oxbridge: \[\hat{Y}_{i} = 12.5084 + 0.3936 \times 0 = 12.50836\]
  3. Use the predict() function to calculate the same fitted values we calculated by hand above. Code Hint: As we add more variables to our model, fitted-value calculations become more laborious and so we will use R instead. To calculate fitted values in R, we use the predict() function. The predict() function takes two main arguments.
Argument Description
object The object is the model object that we would like to use to produce fitted values.
newdata This is an optional argument which we use to specify the values of our independent variable(s) for which we would like fitted values. If we leave this argument empty, R will automatically calculate fitted values for all of the observations in the data that we used to estimate the original model. If we include this argument, we need to provide a data.frame whose variables have the same names as the independent variables in our model.
fitted_oxbridge <- predict(oxbridge_model, 
                           newdata = data.frame(ucat_ox = c(0,1)))
##        1        2 
## 12.50836 12.90196

This is the same as the result we obtained when we calculated the fitted value manually. The good thing about the predict() function, however, is that we will be able to use it for all the models we study on this course, and it can be useful for calculating many different fitted values.

  4. Note that the dependent variable here is measured in log wealth at death. Therefore, in order to interpret these fitted values, we might want to convert them into a more meaningful quantity by exponentiating the resulting values. Do this now using the exp() function. How does the coefficient estimate for \(\hat{\beta}\) relate to these fitted values?
##        1        2 
## 270590.9 401096.0

This suggests that individuals who received an Oxbridge education had a total wealth at death of approximately £400,000, while people who did not attend Oxbridge had only approximately £270,000 at the time of their death.

Note also that we can link these exponentiated fitted values to the estimated value of the \(\hat{\beta}\) coefficient from the regression model. We know that the fitted values are calculated according to:

Individuals who did not study at Oxbridge: \[ln(\hat{Y}_{\text{Oxbridge} =0}) = \hat{\alpha}\]

Individuals who did study at Oxbridge: \[ln(\hat{Y}_{\text{Oxbridge} =1}) = \hat{\alpha} + \hat{\beta}\]

The difference in the fitted values is therefore equal to:

\[ln(\hat{Y}_{\text{Oxbridge} =1}) - ln(\hat{Y}_{\text{Oxbridge} =0}) = (\hat{\alpha} + \hat{\beta}) - \hat{\alpha} = \hat{\beta}\]

Recalling that the difference in the logs is the same as the log of the ratio, we have:

\[ln \left(\frac{\hat{Y}_{\text{Oxbridge} =1}}{\hat{Y}_{\text{Oxbridge} =0}}\right) = \hat{\beta}\]

Exponentiating both sides of the equation:

\[\frac{\hat{Y}_{\text{Oxbridge} =1}}{\hat{Y}_{\text{Oxbridge} =0}} = \exp(\hat{\beta})\]

This shows that the exponentiated value of the estimated coefficient will be equal to the ratio of the fitted values. So, while the literal interpretation of the estimated coefficient \(\hat{\beta}\) is that a one-unit increase in X will produce an expected increase in log Y of \(\hat{\beta}\) units, in terms of Y itself, this means that the expected value of Y is multiplied by \(\exp(\hat{\beta})\).

To check that this is correct, notice that if we multiply the exponentiated fitted value for individuals without an Oxbridge education by the exponentiated value of \(\hat{\beta}\), the resulting number is equal to the exponentiated fitted value of individuals with an Oxbridge education:

exp(fitted_oxbridge[1]) * exp(coef(oxbridge_model)[2])
##      1 
## 401096
exp(fitted_oxbridge[2])
##      2 
## 401096

Estimating the ATE with Regression

  5. Recall the matching exercise from question 5 from last week, where we matched on an extensive list of covariates before estimating our treatment effect. Using the same set of covariates, estimate the ATE of the treatment variable (treated) on log wealth at death (lnrealgross) using OLS. In particular, fit the model: \[Y_i = \hat{\alpha} + \hat{\beta}_1 D_i + \hat{\gamma} X_i + \hat{u}_i\] where \(\hat{\alpha}\) is an intercept, \(\hat{\beta}_1\) is the OLS estimator for the ATE, \(\hat{\gamma}\) is a vector of coefficients for the control variables, and \(\hat{u}_i\) is the residual. If we want to interpret the ATE estimate produced from this regression model as the causal effect, what is our identification assumption? Are your results different from those estimated in question 4 from last week? If so, why?
ate_ols_model <- lm(lnrealgross ~ treated + female + aristo +  
                              scat_pub +  scat_eto + scat_nm + scat_sec + 
                              oc_teacherall + oc_barrister + oc_solicitor +  oc_dr +
                              oc_civil_serv + oc_local_politics + oc_business +  
                              oc_white_collar +  oc_union_org + oc_journalist + 
                              oc_miner + ucat_ox + ucat_deg + ucat_nm, 
                data = mps)

# on log scale
coef(ate_ols_model)["treated"]
##   treated 
## 0.4265225
# on linear scale
exp(coef(ate_ols_model)["treated"])
##  treated 
## 1.531921

The identifying assumption is the independence of treatment and potential outcomes conditional on the observed covariates, \(Y_{1i}, Y_{0i} \perp\!\!\!\perp D_i \mid X_i\) (i.e. the conditional independence assumption). We also need the common support assumption, \(0 < Pr(D_i = 1|X_i = x) < 1\). Note that these are the same assumptions as we needed last week for matching – both approaches rely on the same identification assumptions; they are just different estimators.

  6. Estimate the ATE with exact 1:1 matching. Compare the estimates and their standard errors. Are they different and, if so, what are possible reasons for this? Code Hint: In contrast to last week, where we focussed on the ATT, you need to specify estimand = "ATE".
match.out <- matchit(treated ~ female + aristo + scat_pub + scat_eto +
                       scat_nm + scat_sec + oc_teacherall + oc_barrister +
                       oc_solicitor + oc_dr + oc_civil_serv +
                       oc_local_politics + oc_business + oc_white_collar +
                       oc_union_org + oc_journalist + oc_miner + ucat_ox +
                       ucat_deg + ucat_nm, 
                     data = mps,
                     method = "exact",
                     estimand = "ATE",
                     ratio = 1) 
ate_match_model <- lm(lnrealgross ~ treated,
                     data = match.data(match.out),
                     weights = weights)

## output results (get_model_data() is from the sjPlot package;
## kable() and kable_paper() are from the knitr and kableExtra packages)

match <- get_model_data(ate_match_model, "est", terms = c("treated"))
ols <- get_model_data(ate_ols_model, "est", terms = c("treated"))

out <- round(data.frame("Regression" = c(ols$estimate,ols$std.error),
                        "Matching" = c(match$estimate,match$std.error)),2)
out[1,] <- paste0(out[1,],c(ols$p.stars,match$p.stars))
rownames(out) <- c("Estimate","Std. Error")
kable(out, booktabs =T) %>% kable_paper()
Regression Matching
Estimate 0.43*** 0.47***
Std. Error 0.11 0.13

Even though the results are broadly similar in substantive and statistical significance, they are not the same, and this could be for two reasons. First, the ATE could be different because of the way that OLS implicitly weights the units. As explained in detail in the MHE book (pp. 73-76), OLS puts greater weight on the units that have higher conditional variance of treatment given X, whereas matching estimators simply weight by the probability distribution of X in either the entire sample (ATE), the treatment group (ATT), or the control group (ATC). The second reason for the difference is that in the code above we use exact matching, which means that a number of our observations were dropped from the sample before we estimated the treatment effect (in fact, the matched number of observations is 266 vs. 425 in the unmatched data). OLS does not automatically drop units for which there are no perfect matches, and instead will extrapolate to “fill in” the missing potential outcomes for those values.

Nonetheless, the similarity between the two approaches emphasises the point that in many applied situations, subclassification, matching on covariates, and estimating a multiple regression model will give very similar estimates of the average treatment effect. This is because all of these estimation strategies have the same basic idea at heart: we condition on observables to try to make the treatment and control groups more comparable.

Ultimately, it is important to remember that the plausibility of a selection-on-observables design rests mainly on whether the selected observable confounders are enough to satisfy the conditional independence assumption, leaving no remaining omitted variable bias. If important confounding variables are not included in the analysis, then which model specification you use will be of little consequence for causal inference, regardless of the chosen type of selection-on-observables approach.

4.1.2 National Supported Work (Revisited) – Dehejia and Wahba (1999)

In this section we will continue using the observational data from the study by Dehejia and Wahba (1999). Recall that the authors use data relating to the National Supported Work programme to assess the plausibility of reconstructing experimental results using observational data.

We will use the NSW.dw.obs.dta to learn about estimating non-linear relationships between our independent and dependent variables, and also to see how to correct our standard errors for the presence of heteroskedastic errors. Load the data using the read.dta() function from the foreign package:

nsw_obs <- read.dta("data/NSW.dw.obs.dta")

The data includes 18667 observations, and the following variables:

  • treat (1 = experimental treatment group; 0 = observational comparison group)
  • age (age in years)
  • educ (years of schooling)
  • black (1 if black; 0 otherwise)
  • hisp (1 if Hispanic; 0 otherwise)
  • married (1 if married; 0 otherwise)
  • nodegree (1 if no high school diploma; 0 otherwise)
  • educcat (1 = did not finish high school, 2 = high school, 3 = some college, 4 = graduated from college)
  • re74, re75, re78 (real earnings in 1974, 1975 and 1978)

Recall that our main outcome of interest in this data is the self-reported earnings of respondents in 1978, which is stored as re78 in the data. The main explanatory variable of interest is treat which is equal to 1 if the respondent was a participant in the NSW programme, and 0 otherwise.

Non-linearity in Regression Models

  1. Consider the other variables in the data - are there any variables that you would expect to have a non-linear association with earnings? Provide a theoretical account explaining why such non-linear relationships might exist.

There are at least two variables which might plausibly be thought to vary non-linearly with earnings. First, a respondent’s age might be positively associated with yearly earnings at younger ages, but less so at older ages. For instance, most people when they begin their careers are promoted in the first few years, and continue to rise in salary until their middle ages. Later in life, people’s earnings tend to plateau, and sometimes even decline, as they head towards retirement.

Second, it is also plausible that respondents’ years of education is non-linearly related to earnings. While it seems a fairly straightforward expectation that earnings will increase with education, this may not be true amongst the very highly educated. Many people who earn PhDs rather than MSc degrees, for instance, earn less than their contemporaries who leave university without doctorates. This is largely because they select in to less well-paying jobs (such as university lecturers) than they might have selected had they completed less education.

  2. Use the aggregate() function to calculate the average value of the outcome variable for each of the independent variables given in the answer to the previous question. Plot these averages on the y-axis, with the values of the independent variable on the x-axis. Empirically, is it true that these variables appear to have a non-linear relationship with the outcome?

    Code Hint: The aggregate() function takes three main inputs:
Argument Purpose
x The variable that you would like to aggregate.
by The variable or variables that you would like to use to group the aggregation by. Must be included within the list() function.
FUN The function that you would like to use in the aggregation (e.g. mean(), sum(), median(), etc.)
## Aggregate earnings by age
mean_earnings_by_age <- aggregate(nsw_obs$re78,by = list(nsw_obs$age), FUN = mean) 

## Aggregate earnings by years of education
mean_earnings_by_educ <- aggregate(nsw_obs$re78,by = list(nsw_obs$educ), FUN = mean)

If you print the results of the aggregate() function, you will see that R has constructed a data.frame with the mean level of earnings for respondents with each year of age/education. The code below takes these data.frames and uses the plot function to display the information graphically.

# with baseR plotting
par(mfrow = c(1,2)) # The mfrow argument to the par command allows more than one plot to be displayed at once. Here, I am using c(1,2) to tell R that I would like a grid of plots with 1 row and 2 columns

plot(x = mean_earnings_by_age$Group.1,
     y = mean_earnings_by_age$x, 
     type = "b", # This option makes a cool graph with both lines and points!
     xlab = "Age",
     ylab = "Mean earnings",
     main = "Mean earnings by age")

plot(x = mean_earnings_by_educ$Group.1,
     y = mean_earnings_by_educ$x, 
     type = "b", # This option makes a cool graph with both lines and points!
     xlab = "Years of Education",
     ylab = "Mean earnings",
     main = "Mean earnings by years of education")

par(mfrow = c(1,1)) # This disables the previous command par(mfrow = c(1,2))

# with ggplot (requires the ggplot2 package; theme_clean() is from ggthemes)

p_age <- ggplot(nsw_obs, aes(x=age, y=re78)) +
  geom_smooth(method = "loess") +
  # loess stands for "LOcally Estimated Scatterplot Smoothing" 
  # Essentially, it estimates a moving average at every value of X
  facet_wrap(~"Age")  +
  labs(y="Respondent's Earnings in 1978", x="Age") +
  theme_clean() +
  theme(plot.background = element_rect(fill = "transparent", color = NA))

p_educ <- ggplot(nsw_obs, aes(x=educ, y=re78)) +
  geom_smooth(method = "loess") +
  facet_wrap(~"Education")  +
  labs(y="Respondent's Earnings in 1978", x="Years of Education") +
  theme_clean() +
  theme(plot.background = element_rect(fill = "transparent", color = NA))


The left panel of the plot makes clear that age has a very obvious non-linear relationship with income. While average wages increase sharply between 16 and about 30, they then tend to plateau before declining from the late 40s.

The right panel does not suggest that earnings vary non-linearly with years of education, or at least not obviously so. In general, the more years of schooling an individual has, the more she earns (in these data at least).

  3. Run a regression which predicts income as a function of the treatment, age, and education. Run a second regression which predicts income as a function of the treatment, age, education, age\(^2\) and education\(^2\). To include a polynomial in the model, you will need to use the I() function within the lm() function (see the lecture slides from this week for an example). Code Hint: You may also find it useful to be able to compare the regression coefficients from the two models side by side. There are many functions in R that enable us to make side by side comparisons of multiple models, here we will use the screenreg() function from the texreg package that you installed above (if you missed this step, do it now!). The main argument to the screenreg() function is a list which contains the model objects that you would like to compare, e.g. screenreg(list(model_one_name, model_two_name))
linear_model <- lm(re78 ~ treat + age + educ, data = nsw_obs)

quadratic_model <- lm(re78 ~ treat + age + I(age^2) + educ + I(educ^2),
                      data = nsw_obs)

screenreg(list(linear_model, quadratic_model))
## ========================================
##              Model 1       Model 2      
## ----------------------------------------
## (Intercept)   1139.72 **   -17935.68 ***
##               (435.23)      (1108.54)   
## treat        -6859.48 ***   -6890.73 ***
##               (781.78)       (769.30)   
## age            166.38 ***    1383.40 ***
##                 (7.11)        (49.53)   
## educ           751.36 ***     898.61 ***
##                (26.92)       (124.70)   
## age^2                         -17.29 ***
##                                (0.70)   
## educ^2                        -12.92 *  
##                                (5.33)   
## ----------------------------------------
## R^2              0.07           0.10    
## Adj. R^2         0.07           0.10    
## Num. obs.    18667          18667       
## ========================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
  4. Are either of the squared terms significant in the second model? What is the null hypothesis associated with the squared terms? Does the effect of the treatment variable change moving from model 1 to model 2?

Both the coefficients associated with the I(age^2) and the I(educ^2) terms are significant at conventional levels. Given that the null hypothesis associated with each of these terms is that the relationship between age and earnings, or education and earnings, is linear, the implication is that we can reject that null hypothesis and therefore conclude that in both cases there is evidence in the data of non-linearity. It is hard to say much more substantively than that, because the coefficients associated with polynomial terms like these are difficult to interpret directly.

The estimated treatment effect does not change very much between the two models. In model 1, respondents who participated in the NSW programme earned on average $6859 less than those in the control group. The equivalent effect in model 2 is $6891. This suggests that, in this particular case, the estimated treatment coefficient is not particularly sensitive to the specification of the age and educ variables. Though it is worth remembering that this will not always be the case!
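One way to be a little more concrete about the age polynomial is to compute the implied marginal effect, \(\hat{\beta}_{age} + 2\hat{\beta}_{age^2} \times age\), and the turning point at which it changes sign. The sketch below simply plugs in the Model 2 coefficients reported in the table above:

```r
b_age  <-  1383.40                         # coefficient on age (Model 2)
b_age2 <-   -17.29                         # coefficient on I(age^2) (Model 2)

turning_point  <- -b_age / (2 * b_age2)    # vertex of the quadratic, about 40
marginal_at_20 <- b_age + 2 * b_age2 * 20  # positive: earnings rising at 20
marginal_at_50 <- b_age + 2 * b_age2 * 50  # negative: earnings falling at 50
```

The fitted relationship therefore implies that predicted earnings peak at around age 40, with rising earnings before that point and falling earnings after it.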

  5. Interpreting regression coefficients associated with polynomial variables can be difficult. To gain some substantive understanding about the size and shape of the relationships that you have just estimated, calculate fitted values for re78 using the predict() function across the range of both years of education and age. Plot these values on a graph. What do they suggest about the non-linear relationship between these variables and the outcome? Code Hint: Recall that the predict function requires values for all the explanatory variables that are included in your models. For this exercise, calculate fitted values for individuals in the control group. When calculating fitted values across the range of age, set education to the median value in the data. When calculating fitted values across the range of education, set age to the median value in the data.
age_fitted <- predict(quadratic_model, 
                      newdata = data.frame(treat = 0,
                                           educ = median(nsw_obs$educ),
                                           age = 16:55))

educ_fitted <- predict(quadratic_model,
                       newdata = data.frame(treat = 0,
                                            educ = 0:18,
                                            age = median(nsw_obs$age)))

# with baseR
par(mfrow = c(1,2))

plot(x = 16:55,
     y = age_fitted,
     type = "l",
     xlab = "Age",
     ylab = "Fitted value from quadratic regression",
     main = "Fitted values by age")

plot(x = 0:18,
     y = educ_fitted,
     type = "l",
     xlab = "Years of education",
     ylab = "Fitted value from quadratic regression",
     main = "Fitted values by education")

par(mfrow = c(1,1))

## with ggplot and sjPlot

 # median(nsw_obs$educ) = 12
 # median(nsw_obs$age) = 31

fit_age <- plot_model(quadratic_model,
                      type = "pred", # this tells the function that we want predicted values
                      terms = c("age [16:55]", 
                                "treat [0]", "educ [12]")) +
  ggtitle("Fitted Values by Age")  +
  labs(y="Fitted value from quadratic regression", x="Age") +
  theme_clean() +
  theme(plot.background = element_rect(fill = "transparent", color = NA))

fit_educ <- plot_model(quadratic_model,
                       type = "pred", # this tells the function that we want predicted values
                       terms = c("educ [0:18]", 
                                 "treat [0]", "age [31]")) +
  ggtitle("Fitted Values by Education")  +
  labs(y="Fitted value from quadratic regression", x="Years of education") +
  theme_clean() +
  theme(plot.background = element_rect(fill = "transparent", color = NA))


These plots demonstrate that, despite the fact that both of the quadratic terms in our model were significant, there is only really a clear non-linear relationship between age and earnings. The age plot demonstrates a clear “inverted-U” shape, in which earnings increase with age up until the age of about 40, and subsequently decline. Another way of putting this is that a one-unit increase in age has different effects depending on where on the scale of that variable you are looking. A one-year increase from 20 to 21 is clearly associated with an increase in earnings. A one-year increase from 50 to 51 is not, however.

The education plot suggests that the relationship between education and earnings is very close to linear, with only a very slight indication of decreasing returns to education at the top end of the scale. This emphasises the fact that we should not only pay attention to the statistical significance of our results (which here implied the presence of a non-linear relationship), but also to the substantive significance (here we see that any non-linearity is very small relative to the overall upward trend in the data).

Heteroskedasticity and Robust Standard Errors

  6. Run a linear regression which includes the same variables as those you used in question 2 of last week’s matching homework. How close is your estimate of the treatment effect in the regression to the estimate you calculated by using matching?
full_linear_model <- lm(re78 ~ treat + age + educ + black +
                          married + nodegree + re74 + re75 + hisp,
                        data = nsw_obs)
## Call:
## lm(formula = re78 ~ treat + age + educ + black + married + nodegree + 
##     re74 + re75 + hisp, data = nsw_obs)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60320  -3455    719   3536 110490 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4427.01000  449.17758   9.856  < 2e-16 ***
## treat        640.50407  583.24348   1.098 0.272142    
## age         -110.15438    5.88059 -18.732  < 2e-16 ***
## educ         247.01682   28.79738   8.578  < 2e-16 ***
## black       -387.14898  190.57794  -2.031 0.042224 *  
## married      180.24435  144.32611   1.249 0.211729    
## nodegree     645.34708  179.45239   3.596 0.000324 ***
## re74           0.29784    0.01112  26.784  < 2e-16 ***
## re75           0.51173    0.01112  46.038  < 2e-16 ***
## hisp         -50.17502  228.22304  -0.220 0.825990    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 7588 on 18657 degrees of freedom
## Multiple R-squared:  0.5151, Adjusted R-squared:  0.5148 
## F-statistic:  2202 on 9 and 18657 DF,  p-value: < 2.2e-16

This model suggests that, conditional on the other covariates in the model, treated respondents reported earnings that were $641 higher than those in the control group. This compares to a point estimate for the ATT of $831 from the matching exercise using the same covariates last week. Note that the t-statistic here is relatively small (t = 1.1) and the p-value quite large (p = 0.27), implying that we cannot reject the null hypothesis of no effect in this case.

  7. By default, the standard errors calculated by the lm() function in R are based on an assumption that the variance of the outcome residuals is the same for all values of the explanatory variables. To assess the plausibility of this assumption in this example, use the resid() function to extract the residuals from the model you estimated in the question above. Create three plots, each with these residuals on the y-axis, and where the x-axes are the treatment and two other explanatory variables that you included in your model. What do these plots suggest about the validity of this assumption here? What are the potential consequences for our estimated standard errors?
nsw_obs$estimated_residuals <- resid(full_linear_model)

# with baseR
par(mfrow = c(1,3))
plot(x = nsw_obs$treat, 
     y = nsw_obs$estimated_residuals,
     xlab = "Treatment status",
     ylab = "Model residual")

plot(x = nsw_obs$age, 
     y = nsw_obs$estimated_residuals,
     xlab = "Age",
     ylab = "Model residual")

plot(x = nsw_obs$educ, 
     y = nsw_obs$estimated_residuals,
     xlab = "Education",
     ylab = "Model residual")

par(mfrow = c(1,1))

# with ggplot
resid_treat <-ggplot(nsw_obs,aes(x=treat, y=estimated_residuals)) +
  geom_point() +
  geom_smooth(method = "lm") + 
  labs(y="Model residual", x="Treatment status") +
  scale_x_continuous(breaks = c(0,1)) +
  theme_clean() +
  theme(plot.background = element_rect(fill = "transparent", color = NA))
resid_age <-ggplot(nsw_obs,aes(x=age, y=estimated_residuals)) +
  geom_point() +
  geom_smooth() + 
  labs(y="Model residual", x="Age") +
  theme_clean() +
  theme(plot.background = element_rect(fill = "transparent", color = NA))
resid_educ <-ggplot(nsw_obs,aes(x=educ, y=estimated_residuals)) +
  geom_point() +
  geom_smooth() + 
  labs(y="Model residual", x="Years of Education") +
  theme_clean() +
  theme(plot.background = element_rect(fill = "transparent", color = NA))


The default OLS standard errors assume homoskedasticity, meaning that the variance of the error term is equal for all covariate values. These plots suggest that this assumption may be violated in this case, because the variance of the residuals is clearly different for some covariate values than for others. In particular, the control group error variance is much larger than the residual variance of the treatment group. Similarly, there is some evidence of non-constant error variance for both the education and age variables. People with the lowest levels of schooling have noticeably smaller residual variance than those with more years of schooling; a pattern that is also repeated to some extent for young versus old respondents.

When the variance of our model errors differs for respondents with different covariate values – what we call heteroskedastic errors – our estimated standard errors may be incorrect. We cannot know a priori whether heteroskedasticity will lead us to under- or overstate the relevant standard errors, so it is usually best to implement robust standard errors as illustrated below.
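The visual inspection above can be supplemented with a formal diagnostic. A common choice is the Breusch-Pagan test, available as bptest() in the lmtest package (an optional extra; this test is not part of the seminar itself):

```r
library(lmtest)

# Breusch-Pagan test on the model estimated above.
# H0 is homoskedasticity, so a small p-value is evidence that
# the error variance depends on the covariates.
bptest(full_linear_model)
```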

  2. Use the lm_robust() function from the estimatr package to estimate the regression with heteroskedasticity-robust standard errors for your regression model. Present the results of your corrected model beside the results of the original model. What do you conclude? Code Hint: The lm_robust() function works almost exactly like lm(), except that you can specify se_type =. Here, for heteroskedasticity-consistent standard errors, we use "HC3" (see slide 59). Note that if you specify se_type = "classical", you get the same standard errors as with lm().

linear_model_corrected <- lm_robust(re78 ~ treat + age + educ + black +
                                      married + nodegree + re74 + re75 + hisp,
                                    data = nsw_obs, 
                                    se_type = "HC3")

# compare the classical and robust models side by side
# (screenreg() from the texreg package)
screenreg(list(full_linear_model, linear_model_corrected),
          include.ci = FALSE)
## =======================================
##              Model 1       Model 2     
## ---------------------------------------
## (Intercept)   4427.01 ***   4427.01 ***
##               (449.18)      (468.68)   
## treat          640.50        640.50    
##               (583.24)      (623.26)   
## age           -110.15 ***   -110.15 ***
##                 (5.88)        (5.72)   
## educ           247.02 ***    247.02 ***
##                (28.80)       (30.52)   
## black         -387.15 *     -387.15 *  
##               (190.58)      (183.03)   
## married        180.24        180.24    
##               (144.33)      (144.77)   
## nodegree       645.35 ***    645.35 ***
##               (179.45)      (177.44)   
## re74             0.30 ***      0.30 ***
##                 (0.01)        (0.02)   
## re75             0.51 ***      0.51 ***
##                 (0.01)        (0.02)   
## hisp           -50.18        -50.18    
##               (228.22)      (228.73)   
## ---------------------------------------
## R^2              0.52          0.52    
## Adj. R^2         0.51          0.51    
## Num. obs.    18667         18667       
## RMSE                        7587.70    
## =======================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Comparing the two models, we can see that the standard errors in Model 2 differ from those in Model 1, though the point estimates remain identical. In some cases the standard errors are larger after accounting for heteroskedasticity (as with the treatment variable, education, re74 and re75), and in other cases they are smaller. This illustrates that failures of the homoskedasticity assumption can bias our standard error estimates either upward or downward.

In this case, the correction does not make a large difference to the conclusions we draw: for none of the variables would we change the conclusion of our hypothesis tests after correcting for heteroskedasticity. The standard error for our main treatment effect increased from 583 to 623 – a non-negligible change – but given that we were already failing to reject the null hypothesis of no effect using the original standard errors, the correction is not consequential here. However, it is worth noting that this will not always be the case!
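The side-by-side comparison can also be done programmatically, for instance by computing the ratio of robust to classical standard errors (a sketch, assuming the two model objects estimated above; the coefficients appear in the same order because both models use the same formula):

```r
# classical standard errors from the original lm() fit
classical_se <- summary(full_linear_model)$coefficients[, "Std. Error"]

# HC3 robust standard errors from the lm_robust() fit
robust_se <- linear_model_corrected$std.error

# ratio > 1: robust SE is larger than the classical SE; < 1: smaller
round(robust_se / classical_se, 2)
```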


For your homework this week, you should spend some time thinking about the research proposal you will write as a part of the assessment at the end of this course. As we have not covered the full range of identification strategies or inference approaches yet, you should focus on identifying an area of social science research that you would like to write about. Crucially, you need to think hard about which causal research question you would like to answer. This may change over the next few weeks as you begin to think about an appropriate strategy, and about the available data, but it is good to get started on the initial stages of this now.

4.2 Quiz

  1. What are the three steps that the OLS estimator performs in a long regression?
  1. Randomly assign treatment, check balance in covariates, regress outcome on the treatment
  2. Regress treatment on the included covariate, compute residuals, regress outcome on the residuals
  3. Match treatment on the included covariate, compute differences in outcome, regress differences in outcome on treatment assignment
  4. Regress outcome on the treatment, compute residuals, regress residuals on the included covariate
  2. How does OLS weight the conditional average treatment effects?
  1. More weight is given to subclasses of covariates that occur more often
  2. Less weight is given to subclasses with balance in treatment assignment
  3. Less weight is given to subclasses of covariates that occur more often and that have balance in treatment assignment
  4. More weight is given to subclasses of covariates that occur more often and that have balance in treatment assignment
  3. When is omitted variable bias a problem we should worry about when estimating a regression model in observational designs?
  1. Never
  2. When the omitted variable correlates with the outcome variable
  3. When the omitted variable is affected by the outcome variable
  4. When the omitted variable correlates with treatment and outcome variables
  4. What problem is implied by the curse of dimensionality?
  1. That the treatment is not assigned at random
  2. That we need to include non-linear transformations of control variables
  3. That the available number of observations for each combination of control subclasses reduces with inclusion of covariates
  4. That the standard errors are unreliably small
  5. What assumptions on the error terms of a regression must hold for standard errors to be reliable?
  1. That the error term is independent of the included regressors
  2. That the error term is correlated within groups of units
  3. That the error term has larger variance as the included covariate X increases
  4. Constant variance of the error terms, with errors independently and identically distributed
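
The "long regression" logic behind the first quiz question is the Frisch-Waugh-Lovell result, which can be checked numerically with a few lines of simulated data (a sketch; none of these variables come from the seminar data):

```r
set.seed(123)
n <- 1000
x     <- rnorm(n)                   # included covariate
treat <- 0.5 * x + rnorm(n)         # treatment, correlated with x
y     <- 2 * treat + x + rnorm(n)   # outcome

# Step 1: regress the treatment on the included covariate
# Step 2: compute the residuals of that regression
treat_resid <- resid(lm(treat ~ x))

# Step 3: regress the outcome on those residuals; the coefficient
# matches the treatment coefficient from the long regression exactly
coef(lm(y ~ treat_resid))["treat_resid"]
coef(lm(y ~ treat + x))["treat"]
```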