5 Panel Data and Difference-in-Differences
When we can observe and measure potentially confounding factors, we can recover causal effects by controlling for these factors. Often, however, confounders may be difficult to measure or impossible to observe. If this is the case, we need alternative strategies for estimating causal effects. One approach is to try to obtain data with a time dimension, where one group receives a treatment at a given point in time but the other group does not. Comparing the differences between pre- and post-treatment periods for these two groups allows us to control for unobserved omitted variables that are fixed over time. Under certain assumptions, this can produce valid estimates of causal effects.
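In the simplest case of two groups and two periods, the comparison described above can be written compactly. The notation here is introduced for illustration and is not taken from the readings: $\bar{Y}_{g,t}$ is the mean outcome for group $g$ (treated $T$ or control $C$) in period $t$.

$$
\hat{\delta}_{DiD} = \left(\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}\right) - \left(\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}}\right)
$$

Any fixed difference in levels between the groups cancels inside each bracket, and any shock common to both groups cancels when the second bracket is subtracted from the first.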
Chapter 5 in MHE gives a very good treatment of the main empirical issues associated with difference-in-differences analysis, and with panel data more generally. The relevant chapter in Mastering ’Metrics is especially good for this week, so both are worth consulting.
The papers by Card (1990) and Card and Krueger (1994) are classics in the diff-in-diff literature, and give a very intuitive overview of the basics behind the method. More advanced applications can be found in the paper by Ladd and Lenz (2009), which also provides a useful demonstration of how difference-in-differences analyses can be combined with matching in order to make the parallel trends assumption more plausible, and the recent paper by Dinas et al. (2018), which we will replicate in part below.
An example of using fixed-effect regression to estimate a difference-in-differences type model can be found in the paper by Blumenau (2019).
For those of you who are feeling very committed, an important paper on statistical inference for difference-in-differences models is Bertrand et al. (2004). Be warned, however, that essentially no-one has ever enjoyed time spent reading a paper that is almost entirely about standard errors.
5.1 Seminar
This week we will be learning how to implement a variety of difference-in-differences estimators, using both linear regression and fixed-effects regressions. We will also spend time learning more about R’s plotting functions, as visually inspecting the data is one of the best ways of assessing the plausibility of the “parallel trends” assumption that is at the heart of all difference-in-differences analyses.
Data
We will use two datasets this week. The first is from a paper by Dinas et al. (2018) which examines the relationship between refugee arrivals and support for the far right. The second is from a classic paper by Card and Krueger (1994) which examines the effects of changes in the minimum wage on employment. Both can be downloaded using the links at the top of the page.
5.1.1 Refugees and support for the far right – Dinas et al. (2018)
The recent refugee crisis in Europe has coincided with a period of electoral politics in which right-wing extremist parties have performed well in many European countries. However, despite this aggregate-level correlation, we have surprisingly little causal evidence on the link between influxes of refugees and the attitudes and behaviour of native populations. What is the causal relationship between refugee crises and support for far-right political parties? Dinas et al. (2018) examine evidence from the Greek case. Making use of the fact that some Greek islands (those close to the Turkish border) witnessed sudden and unexpected increases in the number of refugees during the summer of 2015, while other nearby Greek islands saw much more moderate inflows of refugees, the authors use a difference-in-differences analysis to assess whether treated municipalities were more supportive of the far-right Golden Dawn party in the September 2015 general election. We will examine the data from this paper, replicating the main parts of their difference-in-differences analysis.
The `dinas_golden_dawn.Rdata` file contains data on 96 Greek municipalities, and 4 elections (2012, 2013, 2015, and the treatment year 2016). The `muni` data.frame contained within that file includes the following variables:
- `treatment` – This is a binary variable equal to 1 if the observation is in the treatment group (a municipality that received many refugees) and the observation is in the post-treatment period (i.e. in 2016). Untreated units, and treatment units in the pre-treatment periods, are coded as zero.
- `ever_treated` – This is a binary variable equal to `TRUE` in all periods for all treated municipalities, and equal to `FALSE` in all periods for all control municipalities.
- `trarrprop` – continuous (per capita number of refugees arriving in each municipality)
- `gdvote` – the outcome of interest. The Golden Dawn’s share of the vote. (Continuous)
- `year` – the year of the election. (Can take 4 values: 2012, 2013, 2015, and 2016)
Use the `load` function to load the downloaded data into R now.
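As a reminder of how `load` behaves, here is a minimal sketch using a made-up data.frame saved to a temporary file. The object name `muni_toy` and the file name are hypothetical stand-ins, not the seminar data.

```r
# Hypothetical toy data.frame saved to a temporary .Rdata file
muni_toy <- data.frame(municipality = c(1, 1, 2, 2),
                       year         = c(2015, 2016, 2015, 2016),
                       gdvote       = c(4.1, 6.0, 4.5, 4.9))
toy_path <- file.path(tempdir(), "toy_golden_dawn.Rdata")
save(muni_toy, file = toy_path)
rm(muni_toy)

# load() restores the saved objects under their original names and
# (invisibly) returns a character vector containing those names
loaded_names <- load(toy_path)
print(loaded_names)
str(muni_toy)
```

Because `load` restores objects under the names they were saved with, printing its return value is a quick way to find out what a downloaded file actually contains.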
- Using only the observations from the post-treatment period (i.e. 2016), implement a regression which compares the Golden Dawn share of the vote for the treated and untreated municipalities. Does the coefficient on this regression represent the average treatment effect on the treated? If so, why? If not, why not?
- Calculate the sample difference-in-differences between 2015 and 2016. For this question, you should calculate the relevant differences “manually”, in that you should use the `mean` function to construct the appropriate comparisons. What does this calculation imply about the average treatment effect on the treated?
- Use a linear regression with an appropriate interaction term to estimate the difference-in-differences. For this question, you should again focus only on the years 2015 and 2016. Note: To run the appropriate interaction model you will first need to convert the `year` variable into an appropriate dummy variable, where observations in the post-treatment period are coded as 1 and observations in the pre-treatment period are coded as 0.
- All difference-in-difference analyses rely on the “parallel trends” assumption. What does this assumption mean? What does it imply in this particular analysis?
- Assess the parallel trends assumption by plotting the evolution of the outcome variable for both the treatment and control observations over time. Are you convinced that the parallel trends assumption is reasonable in this application? Note: There are a number of ways to calculate the average outcome for treated and control units in each time period, and then to plot them on a graph. One solution is to use the `aggregate()` function which we used last week. Recall that the `aggregate()` function takes the arguments listed below. Here we need to calculate the mean of the `gdvote` variable by both `year` and whether the unit was `ever_treated`. It would also be possible to calculate each of the values manually and then store them in a data.frame for plotting, but this is likely to be very time consuming!
Argument | Purpose |
---|---|
`x` | The variable that you would like to aggregate. |
`by` | The variable or variables that you would like to use to group the aggregation by. Must be included within the `list()` function. |
`FUN` | The function that you would like to use in the aggregation (e.g. `mean()`, `sum()`, `median()`) |
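The arguments in the table can be seen in action on a small simulated panel. This is only a sketch: the `toy` data.frame and its values are invented, with column names chosen to mirror the seminar data, and base R plotting is one of several reasonable choices.

```r
# Hypothetical toy panel (NOT the Dinas et al. data): 4 municipalities, 4 elections
toy <- data.frame(
  municipality = rep(1:4, each = 4),
  year         = rep(c(2012, 2013, 2015, 2016), times = 4),
  ever_treated = rep(c(TRUE, TRUE, FALSE, FALSE), each = 4),
  gdvote       = c(5, 6, 5, 9,  7, 8, 7, 11,
                   4, 5, 4, 5,  6, 7, 6, 7)
)

# Mean outcome by year and treatment group; the aggregated column is named "x"
group_means <- aggregate(x   = toy$gdvote,
                         by  = list(year = toy$year, ever_treated = toy$ever_treated),
                         FUN = mean)

treated_means <- group_means[group_means$ever_treated, ]
control_means <- group_means[!group_means$ever_treated, ]

# One line per group, sharing a common y-axis range
plot(treated_means$year, treated_means$x, type = "b", col = "red",
     ylim = range(group_means$x),
     xlab = "Year", ylab = "Mean Golden Dawn vote share")
lines(control_means$year, control_means$x, type = "b", col = "blue")
legend("topleft", legend = c("Treated", "Control"),
       col = c("red", "blue"), lty = 1)
```

Naming the elements of the `by` list gives the grouping columns readable names in the output, which makes the subsetting step easier to follow.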
- Use a fixed-effects regression to estimate the difference-in-differences. Remember that the fixed-effect estimator for the diff-in-diff model requires “two-way” fixed-effects, i.e. sets of dummy variables for a) units and b) time periods. Code Hint: In R, you do not need to construct such dummy variables manually. It is sufficient to use the `as.factor` function within the `lm` function to tell R to treat a certain variable as a set of dummies. (So, here, try `as.factor(municipality)` and `as.factor(year)`). You will also need to decide which of the two treatment variables (`treatment` or `ever_treated`) is appropriate for this analysis (if you are struggling, look back at the lecture notes!)
- Using the same model that you implemented in the previous question, swap the `treatment` variable for the `trarrprop` variable, which is a continuous treatment variable measuring the number of refugee arrivals per capita. What is the estimated average treatment effect on the treated using this variable?
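The mechanics of the estimators in the questions above can be sketched on simulated data before turning to the real file. Everything in this block is an assumption made for illustration: the `toy` panel and its values are invented, the helper `grp_mean` is hypothetical, and the column names simply mirror those in `muni`.

```r
# Hypothetical toy panel (NOT the Dinas et al. data): 4 municipalities, 4 elections
toy <- data.frame(
  municipality = rep(1:4, each = 4),
  year         = rep(c(2012, 2013, 2015, 2016), times = 4),
  ever_treated = rep(c(TRUE, TRUE, FALSE, FALSE), each = 4),
  gdvote       = c(5, 6, 5, 9.5,  7, 8.5, 7, 11,
                   4, 5, 4.5, 5,  6, 7, 6, 7.5)
)
# Treated-group observations in the post-treatment year only
toy$treatment <- as.numeric(toy$ever_treated & toy$year == 2016)

# 1. "Manual" two-period difference-in-differences from four group means
two <- toy[toy$year %in% c(2015, 2016), ]
grp_mean <- function(treated, yr) {
  mean(two$gdvote[two$ever_treated == treated & two$year == yr])
}
did_manual <- (grp_mean(TRUE, 2016)  - grp_mean(TRUE, 2015)) -
              (grp_mean(FALSE, 2016) - grp_mean(FALSE, 2015))

# 2. The same two periods, using a regression with an interaction term
two$post    <- as.numeric(two$year == 2016)
two$treated <- as.numeric(two$ever_treated)
did_lm <- lm(gdvote ~ treated * post, data = two)

# 3. Two-way fixed effects (unit and year dummies) on all four elections
did_fe <- lm(gdvote ~ treatment + as.factor(municipality) + as.factor(year),
             data = toy)

did_manual
coef(did_lm)["treated:post"]
coef(did_fe)["treatment"]
```

In the two-period case the interaction coefficient reproduces the manual calculation exactly. The fixed-effects model uses every pre-treatment election rather than 2015 alone, so its estimate need not match the two-period number.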
5.1.2 Minimum wages and employment – Card and Krueger (1994)
On April 1, 1992, the minimum wage in New Jersey was raised from $4.25 to $5.05. In the neighboring state of Pennsylvania, however, the minimum wage remained constant at $4.25. David Card and Alan Krueger (1994) analyze the impact of the minimum wage increase on employment in the fast–food industry, since this is a sector which employs many low-wage workers.
The authors collected data on the number of employees in 331 fast–food restaurants in New Jersey and 79 in Pennsylvania. The survey was conducted in February 1992 (before the minimum wage was raised) and in November 1992 (after the minimum wage was raised). The table below shows the average number of employees per restaurant:
State | February 1992 | November 1992 |
---|---|---|
New Jersey | 17.1 | 17.6 |
Pennsylvania | 19.9 | 17.5 |
- Using only the figures given in the table above, explain three possible ways to estimate the causal effect of the minimum wage increase on employment. For each approach, discuss which assumptions have to be made and what could bias the result.
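The arithmetic behind comparisons of this kind can be laid out directly from the four means in the table. The sketch below is one possible reading of the comparisons involved, with illustrative object names; it says nothing about which assumptions would make any of them credible.

```r
# Average employees per restaurant, copied from the table above
emp <- matrix(c(17.1, 17.6,
                19.9, 17.5),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("NJ", "PA"), c("feb", "nov")))

# Cross-sectional comparison after the policy change
post_diff <- emp["NJ", "nov"] - emp["PA", "nov"]

# Before-and-after comparison within New Jersey
before_after <- emp["NJ", "nov"] - emp["NJ", "feb"]

# Difference-in-differences: New Jersey change minus Pennsylvania change
did <- before_after - (emp["PA", "nov"] - emp["PA", "feb"])

round(c(post_diff = post_diff, before_after = before_after, did = did), 1)
```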
Replication exercise
The dataset `m_wage.dta` that you downloaded earlier includes the information necessary to replicate the Card and Krueger analysis. In contrast to the Dinas data, the dataset here is stored in a “wide” format, i.e. there is a single row for each unit (restaurant), and different columns for the outcomes and covariates in different years. The dataset includes the following variables (as well as some others which we will not use):
- `nj` – a dummy variable equal to 1 if the restaurant is located in New Jersey
- `emptot` – the total number of full-time employed people in the pre-treatment period
- `emptot2` – the total number of full-time employed people in the post-treatment period
- `wage_st` – a variable measuring the average starting wage in the restaurant in the pre-treatment period
- `wage_st2` – a variable measuring the average starting wage in the restaurant in the post-treatment period
- `pmeal` – a variable measuring the average price of a meal in the pre-treatment period
- `pmeal2` – a variable measuring the average price of a meal in the post-treatment period
- `co_owned` – a dummy variable equal to 1 if the restaurant was co-owned
- `bk` – a dummy variable equal to 1 if the restaurant was a Burger King
- `kfc` – a dummy variable equal to 1 if the restaurant was a KFC
- `wendys` – a dummy variable equal to 1 if the restaurant was a Wendys
You will need the `read.dta` function from the `foreign` package to access this data (call `library(foreign)` before trying to call `read.dta`).
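For anyone who has not used `read.dta` before, the following sketch writes a small invented data.frame to a temporary Stata file and reads it back. The `toy_wide` values and the file name are hypothetical, and the example assumes the `foreign` package is installed (it ships with most R distributions).

```r
library(foreign)

# Hypothetical toy data in the same wide layout, written to a temporary .dta file
toy_wide <- data.frame(nj      = c(1, 1, 0, 0),
                       emptot  = c(20, 15, 22, NA),
                       emptot2 = c(21, 16, 20, 18))
toy_path <- file.path(tempdir(), "toy_m_wage.dta")
write.dta(toy_wide, toy_path)

# read.dta() returns a data.frame; Stata missing values come back as NA
toy_back <- read.dta(toy_path)
str(toy_back)
```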
- Calculate the difference-in-differences estimate for the average wage in NJ and PA. Noting that the wage is not the outcome of interest in this case, what does this analysis suggest about the effectiveness of the minimum-wage policy? Note that there are some observations with missing data in this exercise (these are coded as `NA` in the data). You can calculate the mean of a vector with missing values by setting the `na.rm` argument to be equal to `TRUE` in the `mean` function.
- Calculate the difference-in-differences estimator for the outcome of interest (the number of full-time employees). Under what conditions does this estimate identify the average treatment effect on the treated? What evidence do you have to support or refute these conditions here?
- Calculate the difference-in-differences estimator for the price of an average meal. Do restaurants that were subject to a wage increase raise their prices for fast–food?
- Convert the dataset from a “wide” format to a “long” format (i.e. where you have two observations for each restaurant, and an indicator for the time period in which the restaurant was observed). Estimate the difference-in-differences using linear regression. You should run two models: one which only includes the relevant variables to estimate the diff-in-diff, and one which additionally includes restaurant-level covariates which do not vary over time. Do your estimates of the treatment effect differ? Note: The easiest way to achieve the data conversion is to notice that you can simply “stack” one data.frame (with information from the pre-treatment period) on top of another data.frame (with information from the post-treatment period). So, first create two data.frames with the relevant variables. Second, bind these two data.frames together using the `rbind()` function (the data.frames must have the same column names before they are joined). Note that you will have to create the relevant treatment period indicator before binding the data.frames together.
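The two mechanical steps in these questions, handling `NA` values in wide-format means and stacking data.frames into long format, can be sketched on invented data. The `toy_wide` values, the `id` column and the names `pre_df` and `post_df` are all hypothetical and are not part of `m_wage.dta`.

```r
# Hypothetical toy data in wide format (NOT the Card and Krueger data);
# `id` is an illustrative restaurant identifier
toy_wide <- data.frame(
  id      = 1:6,
  nj      = c(1, 1, 1, 0, 0, 0),
  emptot  = c(20, 15, NA, 22, 18, 20),
  emptot2 = c(21, 17, 19, 20, NA, 19)
)

# mean() returns NA unless missing values are dropped explicitly
mean(toy_wide$emptot)
nj_change <- mean(toy_wide$emptot2[toy_wide$nj == 1], na.rm = TRUE) -
             mean(toy_wide$emptot[toy_wide$nj == 1],  na.rm = TRUE)
pa_change <- mean(toy_wide$emptot2[toy_wide$nj == 0], na.rm = TRUE) -
             mean(toy_wide$emptot[toy_wide$nj == 0],  na.rm = TRUE)
did_means <- nj_change - pa_change

# Build one data.frame per period with identical column names, then stack them
pre_df  <- data.frame(id = toy_wide$id, nj = toy_wide$nj,
                      emp = toy_wide$emptot,  post = 0)
post_df <- data.frame(id = toy_wide$id, nj = toy_wide$nj,
                      emp = toy_wide$emptot2, post = 1)
toy_long <- rbind(pre_df, post_df)

# lm() drops rows with a missing outcome by default (na.action = na.omit)
did_lm <- lm(emp ~ nj * post, data = toy_long)

did_means
coef(did_lm)["nj:post"]
```

With `na.rm = TRUE`, the pre- and post-period means may be computed over different sets of restaurants, which is worth keeping in mind when interpreting the result.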