4 Selection on Observables (II)
This week we continue to consider methods that rely on a selection-on-observables assumption for making causal inferences from non-experimental data. In particular, we focus on assessing the conditions under which linear regression can be used to make causal statements.
In lecture this week we revisited the linear regression model, and discussed in particular the circumstances under which regression coefficients resemble the average treatment effects that are our core concern on this course. If you need a refresher on linear regression, then chapter 2 in the Mastering ’Metrics book is excellent. In addition to covering the basics of regression analysis, it provides a very accessible treatment of omitted variable bias (pp. 68-74 and pp. 89-90), as well as helpful summaries of regression with logged dependent variables (pp. 93-94) and uncertainty in regression analysis (pp. 95-96). All of these topics are also covered in the MHE book (chapter 3), though at a higher level of mathematical detail.
We spent a reasonable amount of time describing the implicit conditional-treatment-weighting that is imposed by OLS when calculating treatment effects while conditioning on observed covariates. The MHE book provides a derivation and explanation of this result on pages 73 to 76, and there is a similar treatment in the Morgan and Winship book (chapter 6). One of the implications of this implicit weighting scheme is that the effective sample on which the results of any observational regression are based may be very different from the full population of interest; a result that is clearly and cleanly articulated in the Aronow and Samii (2016) paper (we strongly recommend this paper!). We also recommend the paper by Ho et al. (2007), which gives an excellent (and intuitive) account of how combining matching with regression analysis can reduce model dependence in causal analyses.
This week we will be learning how to use linear regression to estimate causal effects (under the selection on observables assumption) in R. We will also learn how to specify non-linear models for certain explanatory variables by using polynomials, and how to construct heteroskedasticity-robust standard errors. Finally, we will also learn some functions for interpreting the output of our models.
4.1 Seminar
Data
We will continue to use the Eggers and Hainmueller (2009) data and the Dehejia and Wahba (1999) observational data that we introduced last week. If you didn’t complete the previous problem set, or if you are working from a new computer, please download those datasets now using the links from the top of the page.
Packages
As with last week, we will need to use the `foreign` package, so you should begin by loading that at the top of your script. You will also need a number of new packages this week:

- `texreg`, which can be used to easily output nicely formatted regression tables.
- `estimatr`, which can be used to estimate models with various types of robust standard errors.
- `lmtest`, a package which contains further functions to help calculate alternative standard errors.
You will need to install and then load these as well. Finally, and as ever, you should also set your working directory to the correct location.
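A minimal setup block might look like the sketch below; the working-directory path is only a placeholder, so replace it with the folder in which you saved the data:

```r
# Install the new packages (this only needs to be done once)
install.packages(c("texreg", "estimatr", "lmtest"))

# Load all of the packages used this week
library(foreign)
library(texreg)
library(estimatr)
library(lmtest)

# Set the working directory (placeholder path)
setwd("~/causal-inference/week4")
```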
4.1.1 MPs for Sale? (Revisited) – Eggers and Hainmueller (2009)
Recall that Eggers and Hainmueller are interested in establishing whether there is a causal effect of holding political office on the personal wealth of an individual. Last week we estimated these effects using various forms of matching. This week we will investigate similar claims using regression.
You should begin by loading the data:
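For example, assuming the downloaded file is saved in your working directory as `eggers_hainmueller.dta` (adjust the file name to match your own download), the loading step might look like this:

```r
# Read the Eggers and Hainmueller (2009) data into an object called mps
mps <- read.dta("eggers_hainmueller.dta")

# Quick look at the variables in the data
str(mps)
```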
As a reminder of some of the material we covered in week 2, linear regression models are estimated in R using the `lm()` function, the main arguments of which are as follows:
| Argument | Description |
|---|---|
| `formula` | The formula describes the relationship between the dependent and independent variables, for example `dependent.variable ~ independent.variable.1 + independent.variable.2` |
| `data` | The name of the dataset that contains the variables of interest |
Remember, if you would like more information on how the `lm()` function works, type `help(lm)` or `?lm` in R.
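As a quick illustration of this syntax, using one of R's built-in datasets (`mtcars`, which is unrelated to this week's seminar data):

```r
# Regress fuel efficiency (mpg) on weight (wt) and horsepower (hp)
example_model <- lm(mpg ~ wt + hp, data = mtcars)

# Print the estimated coefficients, standard errors, etc.
summary(example_model)
```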
Fitting a Linear Regression Model
- Fit a simple linear regression model for the `mps` data, where our outcome variable is the logged wealth at death of an individual (`lnrealgross`), and the predictor variable is whether the individual attended an Oxbridge university (`ucat_ox`). Interpret the two coefficients that you estimate.
- Once we have estimated a regression model, we can use that model to produce fitted values. Recall that the formula for fitted values is: \[\hat{Y}_{i} = \hat{\alpha} + \hat{\beta}_1 X_i\] Using this formula and the model we estimated in question 1, calculate the two fitted values by hand: one for individuals who did study at an Oxbridge university and one for individuals who did not study at an Oxbridge university.
- Use the `predict()` function to calculate the same fitted values we calculated by hand above. Code Hint: As we add more variables to our model, fitted-value calculations become more laborious, and so we will use R instead. To calculate fitted values in R, we use the `predict()` function, which takes two main arguments (a sketch illustrating the workflow appears after this list of questions):
| Argument | Description |
|---|---|
| `object` | The model object that we would like to use to produce fitted values. |
| `newdata` | An optional argument which we use to specify the values of our independent variable(s) that we would like fitted values for. If we leave this argument empty, R will automatically calculate fitted values for all of the observations in the data that we used to estimate the original model. If we include this argument, we need to provide a `data.frame` which has variables with the same names as the independent variables in our model. |
- Note that the dependent variable here is measured in log wealth at death. Therefore, in order to interpret these fitted values, we might want to convert them into a more meaningful quantity by exponentiating the resulting values. Do this now using the `exp()` function. How does the coefficient estimate for \(\hat{\beta}\) relate to these fitted values?
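The sketch below pulls these steps together for the `mps` data. It assumes that `ucat_ox` is stored as a 0/1 indicator, and it is just one possible way of organising the code rather than a definitive answer to the questions above:

```r
# Question 1: bivariate regression of logged wealth at death on Oxbridge attendance
oxbridge_model <- lm(lnrealgross ~ ucat_ox, data = mps)
summary(oxbridge_model)

# Question 2: fitted values "by hand" from the estimated coefficients
alpha_hat <- coef(oxbridge_model)[1]
beta_hat  <- coef(oxbridge_model)[2]
alpha_hat             # fitted value for non-Oxbridge individuals
alpha_hat + beta_hat  # fitted value for Oxbridge individuals

# Question 3: the same fitted values via predict()
fitted_vals <- predict(oxbridge_model, newdata = data.frame(ucat_ox = c(0, 1)))
fitted_vals

# Question 4: exponentiate to move from log wealth back to wealth
exp(fitted_vals)
```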
Estimating the ATE with Regression
- Recall the matching exercise from question 5 from last week, where we matched on an extensive list of covariates before estimating our treatment effect. Using the same set of covariates, estimate the ATE of the treatment variable (`treated`) on log wealth at death (`lnrealgross`) using OLS. In particular, fit the model: \[Y_i = \hat{\alpha} + \hat{\beta}_1 D_i + \hat{\gamma} X_i + \hat{u}_i\] where \(\hat{\alpha}\) is an intercept, \(\hat{\beta}_1\) is the OLS estimator for the ATE, and \(\hat{\gamma}\) is a vector of coefficients for the control variables \(X_i\). If we want to interpret the ATE estimate produced from this regression model as the causal effect, what is our identification assumption? Are your results different from those estimated in question 4 from last week? If so, why?
- Estimate the ATE with exact 1:1 matching. Compare the estimates and their standard errors. Are they different and, if yes, what are possible reasons for this? Code Hint: In contrast to last week, where we focussed on the ATT, you need to specify `estimand = "ATE"` (one possible setup is sketched after this list of questions).
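One possible way of setting up both estimates is sketched below. The covariate names are placeholders standing in for last week's full list, and the matching step assumes that the `Matching` package is the one you used last week; adapt both pieces to your own earlier code:

```r
library(Matching)

# Regression estimate of the ATE: treatment plus controls
# (covariate_1 and covariate_2 are placeholders for last week's covariates)
ols_ate <- lm(lnrealgross ~ treated + covariate_1 + covariate_2, data = mps)
summary(ols_ate)

# Exact 1:1 matching estimate of the ATE
match_ate <- Match(Y  = mps$lnrealgross,
                   Tr = mps$treated,
                   X  = mps[, c("covariate_1", "covariate_2")],
                   estimand = "ATE",
                   M = 1,
                   exact = TRUE)
summary(match_ate)
```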
4.1.2 National Supported Work (Revisited) – Dehejia and Wahba (1999)
In this section we will continue using the observational data from the study by Dehejia and Wahba (1999). Recall that the authors use data relating to the National Supported Work programme to assess the plausibility of reconstructing experimental results using observational data.
We will use the `NSW.dw.obs.dta` file to learn about estimating non-linear relationships between our independent and dependent variables, and also to see how to correct our standard errors for the presence of heteroskedastic errors. Load the data using the `read.dta()` function from the `foreign` package:
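For example (the object name `nsw` is just a suggestion):

```r
# Read the Dehejia and Wahba (1999) observational data
nsw <- read.dta("NSW.dw.obs.dta")
```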
The data includes 18,667 observations, and the following variables:

- `treat` (1 = experimental treatment group; 0 = observational comparison group)
- `age` (age in years)
- `educ` (years of schooling)
- `black` (1 if black; 0 otherwise)
- `hisp` (1 if Hispanic; 0 otherwise)
- `married` (1 if married; 0 otherwise)
- `nodegree` (1 if no high school diploma; 0 otherwise)
- `educcat` (1 = did not finish high school, 2 = high school, 3 = some college, 4 = graduated from college)
- `re74`, `re75`, `re78` (real earnings in 1974, 1975 and 1978)
Recall that our main outcome of interest in this data is the self-reported earnings of respondents in 1978, which is stored as `re78` in the data. The main explanatory variable of interest is `treat`, which is equal to 1 if the respondent was a participant in the NSW programme, and 0 otherwise.
Non-linearity in Regression Models
- Consider the other variables in the data - are there any variables that you would expect to have a non-linear association with earnings? Provide a theoretical account explaining why such non-linear relationships might exist.
- Use the `aggregate()` function to calculate the average value of the outcome variable for each of the independent variables given in the answer to part a of this question. Plot these averages on the y-axis, with the values of the independent variable on the x-axis. Empirically, is it true that these variables appear to have a non-linear relationship with the outcome? Code Hint: The `aggregate()` function takes three main inputs:
| Argument | Purpose |
|---|---|
| `x` | The variable that you would like to aggregate. |
| `by` | The variable or variables that you would like to group the aggregation by. Must be included within the `list()` function. |
| `FUN` | The function that you would like to use in the aggregation (i.e. `mean()`, `sum()`, `median()`, etc.) |
- Run a regression which predicts income as a function of the treatment, age, and education. Run a second regression which predicts income as a function of the treatment, age, education, age\(^2\) and education\(^2\). To include a polynomial in the model, you will need to use the `I()` function within the `lm()` function (see the lecture slides from this week for an example). Code Hint: You may also find it useful to compare the regression coefficients from the two models side by side. There are many functions in R that enable us to make side-by-side comparisons of multiple models; here we will use the `screenreg()` function from the `texreg` package that you installed above (if you missed this step, do it now!). The main argument to `screenreg()` is a list containing the model objects that you would like to compare, e.g. `screenreg(list(model_one_name, model_two_name))`. A sketch combining these steps appears after this list of questions.
- Are either of the squared terms significant in the second model? What is the null hypothesis associated with the squared terms? Does the effect of the treatment variable change moving from model 1 to model 2?
- Interpreting regression coefficients associated with polynomial variables can be difficult. To gain some substantive understanding about the size and shape of the relationships that you have just estimated, calculate fitted values for `re78` using the `predict()` function across the range of both years of education and age. Plot these values on a graph. What do they suggest about the non-linear relationship between these variables and the outcome? Code Hint: Recall that the predict function requires values for all the explanatory variables that are included in your models. For this exercise, calculate fitted values for individuals in the control group. When calculating fitted values across the range of age, set education to the median value in the data. When calculating fitted values across the range of education, set age to the median value in the data.
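A sketch of how these steps might fit together is shown below, assuming the data were loaded into an object called `nsw` and that `treat` is stored as a 0/1 indicator; object names and plotting choices are illustrative:

```r
library(texreg)

# Average 1978 earnings at each value of age
age_means <- aggregate(x = nsw$re78, by = list(age = nsw$age), FUN = mean)
plot(age_means$age, age_means$x, xlab = "Age", ylab = "Mean earnings in 1978")

# Model without polynomial terms
linear_model <- lm(re78 ~ treat + age + educ, data = nsw)

# Model with squared terms included via I()
quadratic_model <- lm(re78 ~ treat + age + I(age^2) + educ + I(educ^2), data = nsw)

# Compare the two models side by side
screenreg(list(linear_model, quadratic_model))

# Fitted values across the range of age for the control group,
# holding education at its median value
age_grid <- data.frame(treat = 0,
                       age   = seq(min(nsw$age), max(nsw$age)),
                       educ  = median(nsw$educ))
age_fitted <- predict(quadratic_model, newdata = age_grid)
plot(age_grid$age, age_fitted, type = "l",
     xlab = "Age", ylab = "Fitted earnings in 1978")
```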
Heteroskedasticity and Robust Standard Errors
- Run a linear regression which includes the same variables as those you used in question 2 of last week’s matching homework. How close is your estimate of the treatment effect in the regression to the estimate you calculated by using matching?
- By default, the standard errors calculated by the `lm()` function in R are based on an assumption that the variance of the outcome residuals is the same for all values of the explanatory variables. To assess the plausibility of this assumption in this example, use the `resid()` function to extract the residuals from the model you estimated in the question above. Create three plots, each with these residuals on the y-axis, and where the x-axes are the treatment and two other explanatory variables that you included in your model. What do these plots suggest about the validity of this assumption here? What are the potential consequences for our estimated standard errors?
- Use the `lm_robust()` function from the `estimatr` package to re-estimate your regression model with heteroskedasticity-robust standard errors. Present the results of the corrected model beside the results of the original model. What do you conclude? Code Hint: The `lm_robust()` function works almost exactly like `lm()`, except that you can specify `se_type =`. Here, for heteroskedasticity-consistent standard errors, we use `"HC3"` (see slide 59). Note that, if you specify `se_type = "classical"`, you get the same standard errors as with `lm()`. A sketch of this comparison appears after this list.
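A sketch of the residual checks and the robust-standard-error comparison might look like the following; the model formula is a placeholder, so substitute the specification you used in the earlier question:

```r
library(estimatr)
library(texreg)

# Placeholder specification: replace with the covariates from your own model
fit <- lm(re78 ~ treat + age + educ, data = nsw)

# Residual plots against the treatment and two other explanatory variables
model_resid <- resid(fit)
plot(nsw$treat, model_resid, xlab = "Treatment", ylab = "Residuals")
plot(nsw$age,   model_resid, xlab = "Age",       ylab = "Residuals")
plot(nsw$educ,  model_resid, xlab = "Education", ylab = "Residuals")

# Re-estimate the model with heteroskedasticity-consistent (HC3) standard errors
fit_robust <- lm_robust(re78 ~ treat + age + educ, data = nsw, se_type = "HC3")

# Compare the two sets of results side by side
# (if your texreg version cannot handle lm_robust objects, summary(fit_robust) also works)
screenreg(list(fit, fit_robust))
```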
Homework
For your homework this week, you should spend some time thinking about the research proposal you will write as a part of the assessment at the end of this course. As we have not covered the full range of identification strategies or inference approaches yet, you should focus on identifying an area of social science research that you would like to write about. Crucially, you need to think hard about which causal research question you would like to answer. This may change over the next few weeks as you begin to think about an appropriate strategy, and about the available data, but it is good to get started on the initial stages of this now.