4 Selection on Observables (II)


4.1 Lecture review and reading

In lecture this week we revisited the linear regression model, and discussed in particular the circumstances under which regression coefficients resemble the average treatment effects that are our core concern on this course. If you need a refresher on linear regression, then chapter 2 in the Mastering ’Metrics book is excellent. In addition to covering the basics of regression analysis, it provides a very accessible treatment of omitted variable bias (p. 68 - 74 and p. 89 - 90), as well as helpful summaries of regression with logged dependent variables (p. 93 - 94) and uncertainty in regression analysis (p. 95 -96). All of these topics are also covered in the MHE book (chapter 3), though at a higher level of mathematical detail.

We spent a reasonable amount of time describing the implicit conditional-treatment-weighting that is imposed by OLS when calculating treatment effects while conditioning on observed covariates. The MHE book provides a deriviation and explanation of this result on pages 73 to 76, and there is a similar treatment in the Morgan and Winship book (chapter 6). One of the implications of this implicit weighting scheme is that the effective sample on which the results of any observational regression are based may be very different from the full population of interest; a result that is clearly and cleanly articulated in the Aronow and Samii, 2016 paper (I strongly recommend this paper!).

Two other recommendations: First, the 1983 paper by Rosenbaum and Rubin is a canonical piece in the literature on propensity scores. Second, the paper by Ho et. al (2007) gives an excellent (and intuitive) account of how combining matching with regression analysis can reduce model dependence in causal analyses. Both are worth taking a look at.

4.2 Seminar

This week we will be learning how to use linear regression to estimate causal effects (under the selection on observables assumption) in R. We will also focus on methods for estimating the propensity score for a given set of data. Finally, we will also learn some functions for interpreting the output of our models.

4.2.1 Data

We will continue to use the use the Eggers and Hainmueller (2009) data that we introduced last week. If you didn’t complete the previous problem set, or if you are working from a new computer, please download that dataset now using the link in the previous sentence, or from the week 3 seminar page.

Also as with last week, we will need to use the Matching and foreign packages, so you should begin by loading these at the top of your script. As we will again be doing some random number generation again this week, it is also worth using the set.seed() function again to ensure that the results remain the same everytime we run them.

library(Matching)
library(foreign)
set.seed(12345)

4.2.2 MPs for Sale? (Revisited) – Eggers and Hainmueller (2009)

We will begin by revisiting the Eggers and Hainmueller paper from last week. Recall that the authors of this paper are interested in establishing whether there is a causal effect of holding political office on the personal wealth of an individual. Last week we estimated these effects using various forms of matching. This week we will investigate similar claims using regression and propensity scores.

You should begin by loading the data:

load("eggers_mps.Rdata")

As a reminder of some of the material we covered in week 2, linear regression models are estimated in R using the lm function. The main arguments for which are as follows:

Argument Description
formula The formula describes the relationship between the dependent and independent variables, for example dependent.variable ~ independent.variable.1 + independent.variable.2
data This is the name of the dataset that contains the variable of interest

Remember, if you would like more information on how the lm() function works, type help(lm) or ?lm in R.

Question 1. Fitting a linear regression model

a) Fit a simple linear regression model for the mps data, where our outcome variable is the logged wealth at death of an individual (lnrealgross), and the predictor variable is whether the individual attended an Oxbridge university (ucat_ox). What do the two coefficients on this model represent?

Reveal answer

oxbridge_model <- lm(lnrealgross ~ ucat_ox, mps)
oxbridge_model

Call:
lm(formula = lnrealgross ~ ucat_ox, data = mps)

Coefficients:
(Intercept)      ucat_ox  
    12.5084       0.3936  

The intercept represents the mean level of the (log) wealth of individuals who did not go to an Oxbridge university, and the coefficient on the ucat_ox variable represents the estimated difference in means. The mean level of log wealth for individuals from Oxford is therefore given by the sum of the two coefficients.

Question.2 Calculating fitted values (and dealing with logged dependent variables)

Once we have estimated a regression model, we can use that model to produce fitted values. Recall that the forumla for fitted values is:

\[\hat{Y}_{i} = \hat{\beta}_0 + \hat{\beta}_1 * X_i\]

a) Using this formula and the model we estimated in question 1, calculate the 2 fitted values by hand: one for individuals who did study at an Oxbridge university and one for individuals who did not study at an Oxbridge university.

Reveal answer

Did study at Oxbridge: \[\hat{Y}_{i} = 12.5084 + 0.3936 * 1 = 12.90196\] Did not study at Oxbridge: \[\hat{Y}_{i} = 12.5084 + 0.3936 * 0 = 12.50836\]

Of course, in this case, this is a trivial calculation. As we add more variables to our model, the calculation becomes more laborious and so we will use R instead. To calculate fitted values in R, we use the predict() function.

The predict function takes two main arguments.

Argument Description
object The object is the model object that we would like to use to produce fitted values.
newdata This is an optional argument which we use to specify the values of our independent variable(s) that we would like fitted values for. If we leave this argument empty, R will automatically calculate fitted values for all of the observations in the data that we used to estimate the original model. If we include this argument, we need to provide a data.frame which has a variable with the same name as the independent variable in our model

b) Use the predict function to calculate the same fitted values we calculated by hand above.

Reveal answer

fitted_oxbridge <- predict(oxbridge_model, newdata = data.frame(ucat_ox = c(0,1)))
fitted_oxbridge
       1        2 
12.50836 12.90196 

This is the same as the result we obtained when we calculated the fitted value manually. The good thing about the predict() function, however, is that we will be able to use it for all the models we study on this course, and it can be useful for calculating many different fitted values.

c) Note that the dependent variable here is measured in log wealth at death. Therefore, in order to interpret these fitted values, we might want to convert them into a more meaningful quantity by exponentiating the resulting values. Do this now using the exp() function. How does the coefficient estimate for \(\hat{\beta}\) relate to these fitted values?

Reveal answer

exp(fitted_oxbridge)
       1        2 
270590.9 401096.0 

This suggests that individuals who received an Oxbridge education had a total wealth at death of approximately £400,000, while people who did not attend Oxbridge had only approximately £270,000 at the time of their death.

Note also that we can link these exponentiated fitted values to the estimated value of the \(\hat{\beta}\) coefficient from the regression model. We know that the fitted values are calculated according to:

Individuals who did not study at Oxbridge: \[ln(\hat{Y}_{\text{Oxbridge} =0}) = \hat{\alpha}\]

Individuals who did study at Oxbridge: \[ln(\hat{Y}_{\text{Oxbridge} =1}) = \hat{\alpha} + \hat{\beta}\]

The difference in the fitted values is therefore equal to:

\[ln(\hat{Y}_{\text{Oxbridge} =1}) - ln(\hat{Y}_{\text{Oxbridge} =0}) = (\hat{\alpha} + \hat{\beta}) - \hat{\alpha} = \hat{\beta}\]

Recalling that the difference in the logs is the same as the log of the ratio, we have:

\[ln \left(\frac{\hat{Y}_{\text{Oxbridge} =1}}{\hat{Y}_{\text{Oxbridge} =0}}\right) = \hat{\beta}\]

Exponentiating both sides of the equaiton:

\[\frac{\hat{Y}_{\text{Oxbridge} =1}}{\hat{Y}_{\text{Oxbridge} =0}} = exp(\hat{\beta})\]

This shows that the exponentiated value of the estimated coefficient will be equal to the ratio of the fitted values. So, while the literal interpretation of the estimated coefficient \(\hat{\beta}\) is that a one-unit increase in X will produce an expected increase in log Y of \(\hat{\beta}\) units, in terms of Y itself, this means that the expected value of Y is multiplied by \(exp(\hat{\beta})\).

To check that this is correct, notice that if we multiply the exponentiated fitted value for individuals without an Oxbridge education by the exponentiated value of \(\hat{\beta}\), the resulting number is equal to the exponentiated fitted value of individuals with an Oxbridge education:

exp(fitted_oxbridge[1]) * exp(coef(oxbridge_model)[2])
     1 
401096 
exp(fitted_oxbridge[2])
     2 
401096 

Question. 3 ATE with OLS

a) Recall the matching exercise from question 4 from last week, where we matched on an extensive list of covariates before estimating our treatment effect. Using the same set of covariates, estimate the ATE of the treatment variable (treated) on log weatlh at death (lnrealgross) using OLS. In particular, fit the model:

\[Y_i = \hat{\alpha} + \hat{\beta_1}D_i + \hat{\gamma}X + u_i\]

Where \(\hat{\alpha}\) is an intercept, \(\hat{\beta_1}\) is the OLS estimator for the ATE, and \(\hat{\gamma}\) is a vector of coefficients for the control variables.

If we want to interpret the ATE estimate produced from this regression model as the causal effect of abduction on years of education what is our identification assumption? Are your results different from those estimated in question 4 from last week? If so, why?

Reveal answer

ate_ols_model <- lm(lnrealgross ~ treated + female + aristo +  
                                  scat_pub +  scat_eto + scat_nm + scat_sec + 
                                  oc_teacherall + oc_barrister + oc_solicitor +  oc_dr + 
                                  oc_civil_serv + oc_local_politics + oc_business +  
                                  oc_white_collar +  oc_union_org + oc_journalist + 
                                  oc_miner + ucat_ox + ucat_deg + ucat_nm, 
                    data = mps)

summary(ate_ols_model)

Call:
lm(formula = lnrealgross ~ treated + female + aristo + scat_pub + 
    scat_eto + scat_nm + scat_sec + oc_teacherall + oc_barrister + 
    oc_solicitor + oc_dr + oc_civil_serv + oc_local_politics + 
    oc_business + oc_white_collar + oc_union_org + oc_journalist + 
    oc_miner + ucat_ox + ucat_deg + ucat_nm, data = mps)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7762 -0.4272  0.0070  0.4227  3.6947 

Coefficients: (2 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       12.116513   0.126995  95.409  < 2e-16 ***
treated            0.426523   0.105156   4.056 5.99e-05 ***
female             0.185706   0.242709   0.765 0.444634    
aristo             0.326693   0.288986   1.130 0.258942    
scat_pub           0.142353   0.127702   1.115 0.265629    
scat_eto           0.994457   0.225246   4.415 1.30e-05 ***
scat_nm            0.125233   0.129015   0.971 0.332284    
scat_sec                 NA         NA      NA       NA    
oc_teacherall     -0.091781   0.168410  -0.545 0.586065    
oc_barrister       0.595135   0.173096   3.438 0.000646 ***
oc_solicitor       0.227593   0.202862   1.122 0.262565    
oc_dr              0.163262   0.328302   0.497 0.619252    
oc_civil_serv      0.428085   0.465959   0.919 0.358790    
oc_local_politics -0.049309   0.115434  -0.427 0.669488    
oc_business        0.313215   0.144073   2.174 0.030284 *  
oc_white_collar    0.231080   0.167733   1.378 0.169068    
oc_union_org      -0.349147   0.327809  -1.065 0.287468    
oc_journalist      0.167803   0.164697   1.019 0.308881    
oc_miner          -0.330990   0.588727  -0.562 0.574282    
ucat_ox            0.167860   0.136655   1.228 0.220029    
ucat_deg           0.000622   0.123522   0.005 0.995984    
ucat_nm                  NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9979 on 405 degrees of freedom
Multiple R-squared:  0.1716,    Adjusted R-squared:  0.1328 
F-statistic: 4.416 on 19 and 405 DF,  p-value: 4.395e-09

The identifying assumption is the independence of treatment and potential outcomes conditional on the observed covariates, \(Y_{1i}, Y_{0i} || D_i|X_i\) (i.e. conditional independence assumption). We also need the common support assumption, \(0 < Pr(D_i = 1|X_i = x) < 1\). Note that these are the same assumptions as we needed last week for matching – both rely on the same identification assumptions, they are just different estimators.

The results are different, and this could be for two reasons. First, the ATE could be different because of the way that OLS implicitly weights the units. As explained in detail in the MHE book (p. 73-76) OLS puts greater weight on the units that have higher conditional-treatment-variance in X, whereas matching estimators simply weight by the probability distributions of X in either the entire sample (ATE), the treatment group (ATT), or the control group (ATC). The second reason for the difference is that last week we were using exact matching, which meant that a number of our observations were being dropped from the sample before we estimated the treatment effect. OLS does not automatically drop units for which there are no perfect matches, and instead will extrapolate to “fill in” the missing potential outcomes for those values.

Question. 4 Estimating the propensity score

We will now use propensity score matching to generate an alternative measure of the average treatment effect. To do so, we will need to first estimate the propensity score (i.e. the probability of receiving the treatment) for each unit in our sample. To do so, we will use a linear probability model (i.e. a linear regression on a binary outcome variable), the fitted values from which will be our estimates of the propensity score (note that we could alternatively use a logistic regression model to estimate the relationship between the treatment and covariates. The fitted values from both models are likely to be similar).

a) Specify a propensity score model using the lm() function, where the outcome is the treated variable, and you use the same control variables as you did for the previous question. Then use the predict() function to calculate the fitted values for each observation in the data (remember that if we leave the newdata argument empty, R will automatically calculate fitted values for all of the observations in the data that we used to estimate the original model). Assign these fitted values to a new variable in the mps data called p.score.

Reveal answer

linear_pscore_model <- lm(treated ~ female + aristo +  
                                  scat_pub +  scat_eto + scat_nm + scat_sec + 
                                  oc_teacherall + oc_barrister + oc_solicitor +  oc_dr + 
                                  oc_civil_serv + oc_local_politics + oc_business +  
                                  oc_white_collar +  oc_union_org + oc_journalist + 
                                  oc_miner + ucat_ox + ucat_deg + ucat_nm,
                          data = mps)

mps$p.score <- predict(linear_pscore_model)

b) Create a density plot of the propensity scores (i.e. the fitted values) that you have just estimated using the density(), plot() and lines() functions. Plot one line for the treated observations, and one line for the control observations. Are the distributions of the propensity score the same for treatment and control observations?

Reveal answer

plot(density(mps$p.score[mps$treated==1]), main = "Propensity score by treatment status", xlab = "Estimated propensity score")
lines(density(mps$p.score[mps$treated==0]), lty=2)
legend("topright", lty=c(1,2),legend = c("Treated", "Control"))

In the above code, the density function is computing the density of the propensity score, and we are doing this twice: once for treated observations and once for control observations. The plot function will automatically recognise the output of the density function and plot a line. The lines function allows us to add another line to an already existing plot (we set lty = 2 to tell R to draw a dashed line rather than a solid line so that we can distinguish between the two groups).

The plot reveals that the propensity for treatment, at least as predicted by the specification we chose, tends to be higher among treated units (shown by the solid line) that in control units (the dashed line). This is in fact by construction, as we are using the observed treatment indicator to estimate the propensity scores. If, prior to any matching or adjustment, treated and control units were perfectly balanced on the covariates we included in our specification, the densities of the estimated propensities would be near identical. This is what we would expect to occur, for example, if the treatment was randomly assigned.

c) Using one-to-one matching on the estimated propensity scores, estimate the ATT. Report and interpret the ATT. How does it compare the estimates from last week and from the regression-based approach you implemented above?

Reveal answer

match_pscore_out <- Match(Y = mps$lnrealgross,
                          Tr = mps$treated,
                          X = mps$p.score,
                          estimand = "ATT",
                          M = 1)

summary(match_pscore_out)

Estimate...  0.66625 
AI SE......  0.1539 
T-stat.....  4.3292 
p.val......  1.4968e-05 

Original number of observations..............  425 
Original number of treated obs...............  165 
Matched number of observations...............  165 
Matched number of observations  (unweighted).  741 

The estimate here is very close to the estimate from question 4 last week, which resulted from directly matching on the covariates, and to the regression-based estimates from question 3 above. This emphasises the point that in many applied situations, matching on the propensity score, matching directly on covariates, and estimating a multiple regression model will give very similar estimates of the average treatment effect. This is because all three of these estimation strategies has the same basic idea at heart: we condition on observables to try to make the treatment and control groups more comparable.

d) Plot the densities of the propensity scores for the treated and untreated units using the matched data. (Remember that the object created by Match() contains two vectors, match_pscore_out$index.treated and match_pscore_out$index.control which give the indexes of the matched treated and control observations.) Examine also the balance on 3 of the underlying covariates used to calculate the propensity score using the MatchBalance() function. Has balance improved on the propensity score? Has balance improved on the underlying covariates?

Reveal answer

plot(density(mps$p.score[match_pscore_out$index.treated]), main = "Propensity score by treatment status", xlab = "Estimated propensity score")
lines(density(mps$p.score[match_pscore_out$index.control]), lty=2)
legend("topright", lty=c(1,2),legend = c("Treated", "Control"))

MatchBalance(treated ~ female + aristo + ucat_deg,
             data= mps,
             match.out = match_pscore_out)

***** (V1) female *****
                       Before Matching       After Matching
mean treatment........   0.036364          0.036364 
mean control..........       0.05          0.067677 
std mean diff.........    -7.2625           -16.677 

mean raw eQQ diff.....   0.012121         0.0053981 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          1                 1 

mean eCDF diff........  0.0068182         0.0026991 
med  eCDF diff........  0.0068182         0.0026991 
max  eCDF diff........   0.013636         0.0053981 

var ratio (Tr/Co).....    0.73936           0.55536 
T-test p-value........    0.49417           0.16928 


***** (V2) aristo *****
                       Before Matching       After Matching
mean treatment........   0.066667          0.066667 
mean control..........  0.0076923          0.048485 
std mean diff.........     23.571            7.2668 

mean raw eQQ diff.....   0.060606                 0 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          1                 0 

mean eCDF diff........   0.029487                 0 
med  eCDF diff........   0.029487                 0 
max  eCDF diff........   0.058974                 0 

var ratio (Tr/Co).....     8.1698            1.3487 
T-test p-value........  0.0039662           0.31732 


***** (V3) ucat_deg *****
                       Before Matching       After Matching
mean treatment........    0.33939           0.33939 
mean control..........    0.36538           0.30798 
std mean diff.........    -5.4724            6.6143 

mean raw eQQ diff.....   0.024242         0.0053981 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          1                 1 

mean eCDF diff........   0.012995         0.0026991 
med  eCDF diff........   0.012995         0.0026991 
max  eCDF diff........   0.025991         0.0053981 

var ratio (Tr/Co).....    0.96906             1.052 
T-test p-value........    0.58512           0.29313 


Before Matching Minimum p.value: 0.0039662 
Variable Name(s): aristo  Number(s): 2 

After Matching Minimum p.value: 0.16928 
Variable Name(s): female  Number(s): 1 

The plot shows that the density lines for the propensity score from the two groups now almost perfectly overlap. This implies that the propensity scores are now clearly balanced across treated and control (unsurprisingly, given that this is what we were matching on!).

The results of MatchBalance() show that, for these three variables, balance (as measured by the standardized mean difference) has improved for aristo, stayed approximately the same for ucat_deg, and got worse for female. This highlights a major difference between propensity score matching and exact matching on covariates. While the latter guarantees improved in-sample balance on covariates, the former does not. We therefore might want to try to improve the performance of the propensity score matching balance by respecifying the propensity score model (perhaps by including interactions, etc).

4.3 Homework

For your homework this week, you should spend some time thinking about the research paper you will write at the end of this course. As we have not covered the full range of identification strategies or inference approaches yet, you should focus on identifying an area of political science research that you would like to write about. Crucially, you need to think hard about which causal research question you would like to answer. This may change over the next few weeks as you begin to think about an appropriate strategy, and about the available data, but it is good to get started on the initial stages of this now.

For more information about the research project, see this page of the website. In addition to providing guidance on the potential structure of your paper, that page also has three examples of projects from students who took a similar course to this one in previous years. Those papers are somewhat longer than what I expect from you, but they are good examples of the type of project you should be aiming to complete.

If you have an idea that you would like to run past me, please feel free to book into my office hours. Better yet, you could also post a quick summary of your idea (even if it is just a potential research question) as a comment on this post on Piazza. I will try to read and respond to any of these posts over the next few days, and you should also feel free to comment on the ideas of your fellow students.