2 Randomized Experiments

This week we will review the logic that underpins a research design that has become a mainstay of political science research: randomised experiments. We will focus on why randomisation is such a powerful force for making causal inferences (spoiler: internal validity), and will discuss the trade-offs implicit in experimental research (spoiler: external validity). In learning how to analyse experimental data, we will review the t-test and also cover regression as a tool for analysing experiments.

The main motivation for using randomized experiments is that when treatments are randomly assigned, both observed and unobserved potentially confounding factors will be balanced across treatment and control conditions in expectation. That is, randomization solves the selection bias problem that we outlined last week. The intuition behind this result is nicely described in both the Mostly Harmless book (Chapter 2), and also in the Mastering ’Metrics book (see, in particular, Chapter 1). As Angrist and Pischke put it, “Random assignment works not by eliminating individual differences but rather by ensuring that the mix of individuals being compared is the same.” (MM, p. 16)
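A small simulation can make this balancing property concrete. The sketch below uses entirely simulated data (the names `ideology` and `treated` are hypothetical and do not come from any of the readings): a potential confounder is drawn for 10,000 units and treatment is assigned by a coin flip, so the two groups should end up with nearly identical average values of the confounder.

```r
set.seed(123)                                # make the simulation reproducible
n <- 10000
ideology <- rnorm(n)                         # hypothetical (possibly unobserved) confounder
treated  <- rbinom(n, size = 1, prob = 0.5)  # random assignment by coin flip

# Difference in the confounder's mean between treatment and control groups
imbalance <- mean(ideology[treated == 1]) - mean(ideology[treated == 0])
imbalance
```

Any single randomization will show some small chance imbalance; the claim is only that the imbalance is zero on average across repeated randomizations.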

For a review of statistical inference (sampling distributions, t-tests, standard errors, etc) the Mastering ’Metrics book has a nice appendix on pages 33-46. The Gerber and Green book (chapter 3) is also very useful, and very clear, though note it pays a good deal of attention to randomization inference (which we do not cover on this course), rather than classical statistical inference methods.

Chattopadhyay and Duflo (2004) is an excellent example of using the randomized nature of a real-life policy implementation to draw conclusions about an important political science question. We will be using data from this paper throughout today’s seminar. For the purposes of this course you can ignore the theoretical section of the paper (though it’s worth reading, as it’s an interesting model), but in short it concludes that we should expect there to be differences in policy outcomes between areas that are and are not governed by female Pradhans (village chiefs). Instead you should focus on a) the detailed description of the randomization procedure, and b) what the authors did to ensure that randomization of treatment and control conditions was successful.

The Kalla and Broockman (2017) paper is an instructive example of using (several) field experiments to overcome very tricky selection bias issues in the context of a research question that focuses on the role of persuasion in politics. Another classic field experiment in political science is described in Gerber, Green and Larimer (2008), which in addition to being interesting in its own right also forms the basis of the second part of this week’s seminar task/homework. If it caught your interest in the lecture, you can also read the full paper by Banerjee et al. (2015), which describes the results from 6 important experiments that aim to establish the causal effects of development aid on outcomes for the poor.

Finally, randomized experiments are playing an increasingly important role in policy-making, and it is worth having a look at the Test, Learn, Adapt paper produced by the Behavioural Insights Team and the Cabinet Office, which represents a call-to-arms for experimental methods in developing better public policy. In addition to situating the experimental methods we study in a broader policy-making context, this paper has a nice set of examples of successful public policy experiments that have been conducted over the past 20 years.

2.1 Seminar

The main statistical machinery for analysing randomized experiments should be familiar to you all: t-tests and linear regression. We will also need a number of other functions today, most of which are listed in the table below.

| Function | Purpose |
|----------|---------|
| mean | Calculate the mean of a vector of numbers |
| var | Calculate the variance of a vector of numbers |
| sqrt | Calculate the square-root of a number or vector of numbers |
| length | Calculate how many elements there are in a vector of numbers |
| t.test | Conduct a t-test |
| lm | Estimate a linear regression model |

Some of these functions are explained in more detail below. Remember, if you want to know how to use a particular function you can type ?function_name or help(function_name), or you can Google it!

Setting up the Working Directory

You should start your script each week with code similar to the following:

rm(list = ls())
setwd("path_to_my_PUBL0050_folder")
  • rm(list = ls()) is just telling R to remove everything from your current environment. For instance, if you create an object like we did in last week’s seminar, and then you run rm(list = ls()), that object will disappear from the environment panel in RStudio and you will no longer be able to access it. We normally put this line at the top of each script we work with so that we are beginning our analysis fresh each time.
  • setwd("path_to_my_PUBL0050_folder") tells R that you would like to work from (“set”) the folder (or, “working directory”) of your choice. For example, I am keeping the code for this week in my PUBL0050 folder, which is in my Teaching folder, which is stored in my Dropbox folder. So I would use setwd("~/Dropbox/Teaching/PUBL0050").
  • Set up a subfolder called data within your PUBL0050 folder

As our running example for the seminar, we will use (a simplified version of) the data from Chattopadhyay and Duflo (2004). We will also be using the data from the Gerber, Green and Larimer (2008) study on social pressure and turnout for the homework. Download these datasets from the links at the top of the page, then put them into the data folder that you just created.

2.1.1 Female politicians and policy outcomes – Chattopadhyay and Duflo (2004)

Chattopadhyay and Duflo ask whether there is a causal effect of having female politicians in government on public policy outcomes. That is, they ask whether women promote different policies than men. Cross-sectional comparisons – i.e. comparisons between political authorities with male and female leaders – are unlikely to result in unbiased estimates of the causal effect of interest, because different types of political areas are likely to differ in many ways other than just the gender of the political leader. For example, it is probably the case that more liberal districts will, on average, elect more female politicians, and so any difference in policy outcomes might be attributable to either politicians’ gender, or to district ideology.
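The problem with the cross-sectional comparison can be illustrated with a short simulation. This is only a sketch on made-up data: the variable `liberal`, the assumed true effect of 2, and the outcome equation are all hypothetical and are not taken from the paper. When liberal districts are more likely to elect women and also differ in policy outcomes, the naive difference in means overstates the effect; when leadership is assigned at random, the difference in means lands close to the assumed true effect.

```r
set.seed(42)
n <- 20000
liberal <- rnorm(n)       # hypothetical district ideology (the confounder)
true_effect <- 2          # assumed causal effect of a female leader

# Observational world: more liberal districts are more likely to elect women
female_obs <- rbinom(n, 1, plogis(liberal))
y_obs <- 10 + true_effect * female_obs + 3 * liberal + rnorm(n)
naive <- mean(y_obs[female_obs == 1]) - mean(y_obs[female_obs == 0])

# Experimental world: female leadership is assigned at random
female_rand <- rbinom(n, 1, 0.5)
y_rand <- 10 + true_effect * female_rand + 3 * liberal + rnorm(n)
experimental <- mean(y_rand[female_rand == 1]) - mean(y_rand[female_rand == 0])

c(naive = naive, experimental = experimental)
```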

To overcome this problem, Chattopadhyay and Duflo rely on the fact that in the mid-1990s, one-third of local councils in India (known as Gram Panchayat, or GPs) were randomly assigned to be “reserved” for leadership by female politicians. For each of these councils, the authors selected two villages to measure outcomes about public policy. We will study this data below. Once you have downloaded the data and saved it to your computer, set your working directory to the folder in which that file is stored and then load the women.csv file into R using the read.csv function:

women <- read.csv("data/women.csv")

As you will see, there are 6 variables in this data.frame:

| Variable name | Description |
|---------------|-------------|
| GP | Indicator for “Gram Panchayat”, the level of local government studied |
| village | Indicator for villages within GP |
| reserved | Indicator for whether the GP was “reserved” for a female council head |
| female | Indicator for whether the council head was female |
| irrigation | Number of new or repaired irrigation systems in the village since new leader |
| water | Number of new or repaired drinking water systems in the village since new leader |

For the following questions, try writing the relevant code to answer the question without looking at the solutions.

  1. Check whether or not the reservation policy was effectively implemented by seeing whether those GPs that were reserved did in fact have female politicians elected. Specifically, calculate the proportion of female leaders elected for reserved and unreserved GPs. What do you conclude?
    Code Hint: You will need to use the subsetting operators that we used last week.
## Calculate the mean of female for those observations that were "reserved"
mean(women$female[women$reserved == 1])
## [1] 1
## Calculate the mean of female for those observations that were "unreserved"
mean(women$female[women$reserved == 0])
## [1] 0.07476636
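## An alternative way to look at this is with prop.table and table
## (rows are reserved, columns are female; margin = 1 gives row proportions)
prop.table(table(women$reserved, women$female), margin = 1)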
##              0          1
##   0 0.92523364 0.07476636
##   1 0.00000000 1.00000000

The reservation policy appears to have been followed correctly. All reserved GPs are led by women. This contrasts with only 7.5% of unreserved GPs.

  2. Calculate the estimated average treatment effect of reserved GPs for both irrigation and water.
## ATE drinking-water facilities
water_ate <- mean(women$water[women$reserved == 1]) -
   mean(women$water[women$reserved == 0])
water_ate
## [1] 9.252423

## ATE irrigation facilities
irrigation_ate <- mean(women$irrigation[women$reserved == 1]) -
   mean(women$irrigation[women$reserved == 0])
irrigation_ate
## [1] -0.3693319

On average, there were 9.3 more new or repaired drinking water facilities in reserved villages than in unreserved villages. By contrast, there were 0.4 fewer irrigation facilities in reserved villages.

  3. Calculate the standard error of the difference in means for both irrigation and water.
    Code Hint: You can calculate the variance of a vector by using the var function. Remember also that to subset a vector you can use square brackets: my_vector[1:10]. Finally, the length function will allow you to calculate how many elements there are in any vector, or any subset of a vector.

Recall that \(\widehat{SE}_\text{ATE} = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_0^2}{N_0}}\)

# Calculate the number of observations in the treatment and control groups
n_treat <- length(women$water[women$reserved == 1])
n_control <- length(women$water[women$reserved == 0])

## Calculate the standard error for the drinking-water facilities ATE
water_se <- sqrt(
   (var(women$water[women$reserved == 1])/n_treat) +
      (var(women$water[women$reserved == 0])/n_control)
)
water_se
## [1] 5.100282

## Calculate the standard error for the irrigation facilities ATE
irrigation_se <- sqrt(
   (var(women$irrigation[women$reserved == 1])/n_treat) +
      (var(women$irrigation[women$reserved == 0])/n_control)
)
irrigation_se
## [1] 0.9674094
  4. Using the values you have just computed, calculate the test statistics for the difference in means and conduct a hypothesis test against the null hypothesis that the average treatment effect of a female-led council is zero (again, for both irrigation and water). Assume that the sampling distribution of the test statistic under the null hypothesis is well approximated by the standard normal distribution. Conduct your test at the 95% confidence level.

## Calculate the t-statistics
water_t_stat <- water_ate/water_se
irrigation_t_stat <- irrigation_ate/irrigation_se

water_t_stat
## [1] 1.8141
irrigation_t_stat
## [1] -0.3817742

The absolute values of both test statistics are below 1.96, which is the critical value of the standard normal distribution for a two-sided test at the 95% confidence level (i.e. when \(\alpha = 0.05\)). We therefore fail to reject the null hypothesis of no effect (though it is pretty close for the water outcome variable!).
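The same comparison can be expressed as a p-value. The sketch below simply plugs in the two test statistics reported in the output above and, under the standard normal approximation stated in the question, computes the two-sided p-values with pnorm; the object names are illustrative.

```r
# Test statistics reported in the output above
z_water <- 1.8141
z_irrigation <- -0.3817742

# Critical value for a two-sided test at the 95% confidence level
crit <- qnorm(0.975)

# Two-sided p-value: probability of a statistic at least this extreme under N(0, 1)
p_water <- 2 * pnorm(-abs(z_water))
p_irrigation <- 2 * pnorm(-abs(z_irrigation))

c(critical_value = crit, water = p_water, irrigation = p_irrigation)
```

A p-value above 0.05 corresponds exactly to a test statistic whose absolute value is below the critical value, so the two ways of stating the result always agree.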

  5. Calculate the confidence intervals for these differences in means.
# Calculate the confidence intervals
water_upper_bound <- water_ate + 1.96*water_se
water_lower_bound <- water_ate - 1.96*water_se

irrigation_upper_bound <- irrigation_ate + 1.96*irrigation_se
irrigation_lower_bound <- irrigation_ate - 1.96*irrigation_se

# Present the results in a data.frame
out <- data.frame(outcome = c("Water","Irrigation"),
   ate = c(water_ate,irrigation_ate), 
   upper_ci = c(water_upper_bound,irrigation_upper_bound), 
   lower_ci = c(water_lower_bound, irrigation_lower_bound))

out
##      outcome        ate  upper_ci   lower_ci
## 1      Water  9.2524230 19.248977 -0.7441306
## 2 Irrigation -0.3693319  1.526791 -2.2654545
  6. What do the conclusions of these tests suggest about the effects of female leadership on policy outcomes?

We find no clear evidence that the reservation policy affected the number of irrigation systems in villages: the difference in means is very small relative to its standard error. For drinking water, our best estimate of the average treatment effect suggests that the reservation policy increased the number of new or repaired drinking water facilities by about 9 per village on average. That said, the estimates are sufficiently uncertain that we cannot reject the null hypothesis of no effect at the 95% confidence level for either of the outcome variables.
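The manual steps from the questions above can be collected into one small function. This is a minimal sketch: `diff_in_means` is a hypothetical helper (it is not part of R or of the seminar materials), and the toy vectors are made up purely so that the example runs without the data file.

```r
# Hypothetical helper: difference in means with a normal-approximation 95% CI
diff_in_means <- function(y, d) {
  y1 <- y[d == 1]   # outcomes for treated units
  y0 <- y[d == 0]   # outcomes for control units
  ate <- mean(y1) - mean(y0)
  se  <- sqrt(var(y1) / length(y1) + var(y0) / length(y0))
  c(ate = ate, se = se,
    lower_ci = ate - 1.96 * se,
    upper_ci = ate + 1.96 * se)
}

# Toy data purely for illustration (not the Chattopadhyay and Duflo data)
y <- c(5, 7, 9, 11, 2, 4, 6, 8)
d <- c(1, 1, 1, 1, 0, 0, 0, 0)
diff_in_means(y, d)
```

With the seminar data loaded, a call such as diff_in_means(women$water, women$reserved) should reproduce the estimate, standard error and interval calculated step by step above.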

T-tests in R

It is relatively laborious to go through those steps each time you want to conduct a hypothesis test, so normally we would just use functions built into R that allow us to do this more easily. The syntax for the main arguments for specifying a t-test in R is:

t.test(x, y, alt, mu, conf)

Let's have a look at the arguments.

| Argument | Description |
|----------|-------------|
| x | A vector of values from one group of observations |
| y | A vector of values from a different group of observations |
| mu | The value for the difference in means null hypothesis. The default value is 0, but could take on other values if required |
| alt | Short for alternative (R accepts the abbreviated argument name). There are two alternatives to the null hypothesis that the difference in means is zero. The difference could either be smaller or it could be larger than zero. To test against both alternatives, we set alt = "two.sided". |
| conf | Short for conf.level. Here, we set the level of confidence that we want in rejecting the null hypothesis. Common confidence levels are: 95%, 99%, and 99.9%. |
  7. Using the t.test function, check that your answer to question 4 is correct. That is, use the t.test function to conduct hypothesis tests that the ATE of a female-led council is zero for both irrigation and drinking water investment.
t.test(x = women$water[women$reserved==1], 
       y = women$water[women$reserved==0],
       mu = 0,
       alt = "two.sided",
       conf = 0.95)
##  Welch Two Sample t-test
## data:  women$water[women$reserved == 1] and women$water[women$reserved == 0]
## t = 1.8141, df = 122.05, p-value = 0.07212
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8440572 19.3489031
## sample estimates:
## mean of x mean of y 
##  23.99074  14.73832
t.test(x = women$irrigation[women$reserved==1], 
       y = women$irrigation[women$reserved==0],
       mu = 0,
       alt = "two.sided",
       conf = 0.95)
##  Welch Two Sample t-test
## data:  women$irrigation[women$reserved == 1] and women$irrigation[women$reserved == 0]
## t = -0.38177, df = 306.96, p-value = 0.7029
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.272925  1.534261
## sample estimates:
## mean of x mean of y 
##  3.018519  3.387850

The results from the t.test function confirm the results we calculated manually above. The t.test p-value for the water ATE is 0.07, suggesting that we just fail to reject the null at the 95% level. The t.test p-value for the irrigation ATE is 0.70, which confirms that there is no clear treatment effect for this outcome variable.
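The confidence intervals reported by t.test are slightly wider than the ones calculated by hand because t.test uses a critical value from the t distribution (with the Welch degrees of freedom shown in its output) rather than 1.96. The sketch below plugs in the water estimate, standard error and degrees of freedom from the output above; the object names are illustrative.

```r
# Values taken from the output above
water_ate <- 9.252423
water_se  <- 5.100282
welch_df  <- 122.05

# Critical values: standard normal versus t with the Welch degrees of freedom
crit_normal <- qnorm(0.975)
crit_t      <- qt(0.975, df = welch_df)

# The t-based interval is slightly wider than the normal-approximation interval
c(lower = water_ate - crit_t * water_se,
  upper = water_ate + crit_t * water_se)
```

With large samples the two critical values are almost identical, which is why the normal approximation is usually harmless.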

Linear regression in R

Another approach to analysing experimental data is to specify a linear regression where we model our two outcome variables (irrigation, water) as a function of the treatment variable (reserved). Recall that in this setup, the estimated coefficient on the treatment variable will be equal to the difference in means we calculated above. The standard errors, confidence intervals, and p-values will be similar to, though not necessarily identical to, those above, because lm by default assumes that the outcome has the same variance in the treatment and control groups.

We run linear regressions using the lm() function in R (lm stands for Linear Model). The lm() function needs to know a) the relationship we’re trying to model and b) the dataset for our observations. The two arguments we need to provide to the lm() function are described below.

| Argument | Description |
|----------|-------------|
| formula | The formula describes the relationship between the dependent and independent variables, for example dependent.variable ~ independent.variable |
| data | The name of the dataset that contains the variable of interest. |

For more information on how the lm() function works, type help(lm) in R.

  8. Specify linear models for water and irrigation as a function of reserved. Assign the output of these models to objects with sensible names. Use the summary function on these objects to examine the coefficients, standard errors and p-values.
# Estimate linear models
water_lm <- lm(water ~ reserved, data = women)
irrigation_lm <- lm(irrigation ~ reserved, data = women)

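# Print the fitted models (use summary(water_lm) and summary(irrigation_lm)
# to also see the standard errors and p-values)
water_lm
## Call:
## lm(formula = water ~ reserved, data = women)
## Coefficients:
## (Intercept)     reserved  
##      14.738        9.252
irrigation_lm
## Call:
## lm(formula = irrigation ~ reserved, data = women)
## Coefficients:
## (Intercept)     reserved  
##      3.3879      -0.3693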

The regression estimate of the difference in means (i.e. \(\beta_\text{reserved}\)) is 9.3 for the drinking water outcome and -0.4 for the irrigation outcome, which are the same as the manually calculated differences.
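The equivalence between regression and the difference in means can be checked on simulated data. The sketch below is an illustration under made-up assumptions (the data frame `sim`, its group sizes and its variances are all hypothetical): the slope from lm equals the difference in means exactly, and lm's default t-statistic matches the equal-variance t-test rather than the Welch test that t.test uses by default.

```r
set.seed(2004)
# Simulated stand-in for an experiment (not the women data)
sim <- data.frame(d = rep(c(0, 1), times = c(200, 100)))
sim$y <- 15 + 9 * sim$d + rnorm(300, sd = ifelse(sim$d == 1, 30, 10))

fit <- lm(y ~ d, data = sim)
y1 <- sim$y[sim$d == 1]
y0 <- sim$y[sim$d == 0]

# The slope on d equals the difference in means
slope <- unname(coef(fit)["d"])
diff_means <- mean(y1) - mean(y0)

# Compare lm's t-statistic with the pooled-variance and Welch t-tests
lm_t     <- summary(fit)$coefficients["d", "t value"]
pooled_t <- unname(t.test(y1, y0, var.equal = TRUE)$statistic)
welch_t  <- unname(t.test(y1, y0)$statistic)

c(slope = slope, diff_means = diff_means,
  lm_t = lm_t, pooled_t = pooled_t, welch_t = welch_t)
```

In this simulation the smaller group has the larger variance, so the Welch standard error is larger than the one reported by lm and the Welch t-statistic is correspondingly smaller.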

2.1.2 Reanalysis of Gerber, Green and Larimer (2008)

‘Why do large numbers of people vote, despite the fact that, as Hegel once observed, “the casting of a single vote is of no significance where there is a multitude of electors”?’

This is the question that drives the experimental analysis of Gerber, Green and Larimer (2008). If it is irrational to vote because the costs of doing so (time spent informing oneself, time spent getting to the polling station, etc) are clearly greater than the gains to be made from voting (the probability that any individual voter will be decisive in an election is vanishingly small), then why do we observe millions of people voting in elections? One commonly proposed answer is that voters may have some sense of civic duty which drives them to the polls. Gerber, Green and Larimer investigate this idea empirically by priming voters to think about civic duty while also varying the amount of social pressure voters are subject to.

In a field experiment in advance of the 2006 primary election in Michigan, nearly 350,000 voters were assigned at random to one of four treatment groups, where voters received mailouts which encouraged them to vote, or a control group where voters received no mailout. The treatment and control conditions were as follows:

  • Treatment 1 (“Civic duty”): Voters receive mailout reminding them that voting is a civic duty
  • Treatment 2 (“Hawthorne”): Voters receive mailout telling them that researchers would be studying their turnout based on public records
  • Treatment 3 (“Self”): Voters receive mailout displaying the record of turnout for their household in prior elections.
  • Treatment 4 (“Neighbors”): Voters receive mailout displaying the record of turnout for their household and their neighbours’ households in prior elections.
  • Control: Voters receive no mailout.

Load the replication data for Gerber, Green and Larimer (2008). This data is stored in a .Rdata format, which is the main way to save data in R. Therefore you will not be able to use read.csv but instead should use the function load.


Once you have loaded the data, familiarise yourself with the gerber object, which should be in your current environment. Use the str and summary functions to get an idea of what is in the data. There are 5 variables in this data.frame:

| Variable name | Description |
|---------------|-------------|
| voted | Indicator for whether the voter voted in the 2006 election (1) or did not vote (0) |
| treatment | Factor variable indicating which treatment arm (or control group) the voter was allocated to |
| sex | Sex of the respondent |
| yob | Year of birth of the respondent |
| p2004 | Indicator for whether the voter voted in the 2004 election (Yes) or not (No) |
  1. Calculate the turnout rates for each of the experimental groups (4 treatments, 1 control). Calculate the number of individuals allocated to each group. Recreate table 2 on p. 38 of the paper.

Here is one (somewhat laborious) way of constructing the table:

## Calculate the mean outcome for each condition
y_bar_control <- mean(gerber$voted[gerber$treatment == "Control"])
y_bar_civic <- mean(gerber$voted[gerber$treatment == "Civic Duty"])
y_bar_hawthorne <- mean(gerber$voted[gerber$treatment == "Hawthorne"])
y_bar_self <- mean(gerber$voted[gerber$treatment == "Self"])
y_bar_neighbor <- mean(gerber$voted[gerber$treatment == "Neighbors"])

## Calculate the total number of observations for each condition
n_control <- sum(gerber$treatment == "Control")
n_civic <- sum(gerber$treatment == "Civic Duty")
n_hawthorne <- sum(gerber$treatment == "Hawthorne")
n_self <- sum(gerber$treatment == "Self")
n_neighbor <- sum(gerber$treatment == "Neighbors")

## Concatenate into two vectors (using "round" to round the percentages to one decimal place)
percentages <- round(c(y_bar_control,y_bar_civic,y_bar_hawthorne,
                       y_bar_self, y_bar_neighbor)*100,1)

totals <- c(n_control, n_civic, n_hawthorne, n_self, n_neighbor)

## Combine into a data.frame object
table_two <- data.frame(rbind(percentages, totals))

## Provide the correct names
rownames(table_two) <- c("Percentage voting", "N of individuals")
colnames(table_two) <- c("Control", "Civic Duty", "Hawthorne", "Self", "Neighbors")

table_two
##                    Control Civic Duty Hawthorne    Self Neighbors
## Percentage voting     29.7       31.5      32.2    34.5      37.8
## N of individuals  191243.0    38218.0   38204.0 38218.0   38201.0

Here is an alternative way that is more efficient, but the code may be less readable and take more work to figure out what is going on:

## Calculate the mean outcome for each condition using the aggregate function
y_bars <- aggregate(gerber$voted, list(gerber$treatment),
                    FUN = function(x) round(mean(x)*100,1))

## Calculate the number of observations for each condition using the table function
ns <- table(gerber$treatment)

## Combine into a data.frame object
table_two2 <- data.frame(rbind(t(y_bars),ns)[2:3,])

## Provide the correct names
rownames(table_two2) <- c("Percentage voting", "N of individuals")

# Uncomment this to see that it creates the same table as above
# print(table_two2) 

For those who are motivated, you can use the package kableExtra to faithfully recreate the table. Below is the code for it. You can find a lot of help for the package online.


library(kableExtra)

# add the percentage signs
table_two[1,1:5] <- paste0(table_two[1,1:5],"%") 

kable(table_two) %>% # You may recognise the pipe symbol from the tidyverse 
   kable_paper() %>% # There are a couple of different themes to choose from
   add_header_above(c(" ", "Experimental Group"=5)) %>%
   add_header_above( # A second header row that spans all six columns as the title
     c("TABLE 2. Effects of Four Mail Treatments on Voter Turnout in the August 2006 Primary Election"=6), 
     align="l", bold=T, font_size=20)
TABLE 2. Effects of Four Mail Treatments on Voter Turnout in the August 2006 Primary Election

| Experimental Group: | Control | Civic Duty | Hawthorne | Self | Neighbors |
|---------------------|---------|------------|-----------|------|-----------|
| Percentage voting | 29.7% | 31.5% | 32.2% | 34.5% | 37.8% |
| N of individuals | 191243 | 38218 | 38204 | 38218 | 38201 |
  2. Conduct a series of t-tests between each treatment condition and the control condition. Present the results of the t-tests either as confidence intervals for the difference in means, or as a p-value for the null hypothesis that \(\hat{Y}_c = \hat{Y}_t\).
t.test(x = gerber$voted[gerber$treatment == "Civic Duty"], 
   y = gerber$voted[gerber$treatment == "Control"])$conf.int
## [1] 0.01281368 0.02298501
## attr(,"conf.level")
## [1] 0.95
t.test(x = gerber$voted[gerber$treatment == "Hawthorne"], 
   y = gerber$voted[gerber$treatment == "Control"])$conf.int
## [1] 0.02062181 0.03085081
## attr(,"conf.level")
## [1] 0.95
t.test(x = gerber$voted[gerber$treatment == "Self"], 
   y = gerber$voted[gerber$treatment == "Control"])$conf.int
## [1] 0.04332558 0.05370080
## attr(,"conf.level")
## [1] 0.95
t.test(x = gerber$voted[gerber$treatment == "Neighbors"], 
   y = gerber$voted[gerber$treatment == "Control"])$conf.int
## [1] 0.07603405 0.08658577
## attr(,"conf.level")
## [1] 0.95

In all cases, the difference between the treatment and control condition is statistically significant at the 95% level.
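Repeating the same t.test call four times works, but the comparison can also be written as a loop over treatment arms. The sketch below runs on simulated data so that it is self-contained: the data frame `sim` and its turnout probabilities are hypothetical stand-ins and not the gerber data, although the arm labels are copied from the code above.

```r
set.seed(2006)
# Simulated stand-in for a multi-arm experiment (not the gerber data)
arms <- c("Control", "Civic Duty", "Hawthorne", "Self", "Neighbors")
sim <- data.frame(treatment = sample(arms, 5000, replace = TRUE))
sim$voted <- rbinom(5000, 1, ifelse(sim$treatment == "Control", 0.30, 0.35))

# One t-test per treatment arm, each compared against the control group
control_y <- sim$voted[sim$treatment == "Control"]
cis <- sapply(setdiff(arms, "Control"), function(arm) {
  t.test(x = sim$voted[sim$treatment == arm], y = control_y)$conf.int
})
rownames(cis) <- c("lower_ci", "upper_ci")

# One row per treatment arm, one column per interval bound
t(cis)
```

Replacing sim with gerber should give the same four intervals as the individual calls above.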

  3. Use the following code to create three new variables in the data.frame. First, a variable that is equal to 1 if a respondent is female, and 0 otherwise. Second, a variable that measures the age of each voter in years at the time of the experiment (which was conducted in 2006). Third, a variable that is equal to 1 if the voter voted in the 2004 election.
## Female dummy variable
gerber$female <- ifelse(gerber$sex == "female", 1, 0)

## Age variable
gerber$age <- 2006 - gerber$yob

## 2004 variable
gerber$turnout04 <- ifelse(gerber$p2004 == "Yes", 1, 0)

Using these variables, conduct balance checks to establish whether there are potentially confounding differences between treatment and control groups.

## Balance
m1 <- lm(female ~ treatment, data = gerber)
m2 <- lm(age ~ treatment, data = gerber)
m3 <- lm(turnout04 ~ treatment, data = gerber)

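# Presenting (requires the stargazer package)
library(stargazer)
stargazer(m1, m2, m3,
          type = "text", # assumption: print to the console; the original output type is not shown
          title = "Balance Checks",
          keep.stat = c("n","adj.rsq"),
          dep.var.caption = "",
          dep.var.labels = c("Gender","Age","Turnout 2004"),
          intercept.bottom = F,
          intercept.top = T,
          covariate.labels = levels(as.factor(gerber$treatment)),
          star.cutoffs = c(.05,.01,.001))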
Balance Checks

|              | Gender (1) | Age (2)   | Turnout 2004 (3) |
|--------------|------------|-----------|------------------|
| Control      | 0.499***   | 49.814*** | 0.400***         |
|              | (0.001)    | (0.033)   | (0.001)          |
| Civic Duty   | 0.001      | -0.155    | -0.001           |
|              | (0.003)    | (0.081)   | (0.003)          |
| Hawthorne    | 0.0001     | -0.109    | 0.003            |
|              | (0.003)    | (0.081)   | (0.003)          |
| Self         | 0.001      | -0.021    | 0.002            |
|              | (0.003)    | (0.081)   | (0.003)          |
| Neighbors    | 0.001      | 0.039     | 0.006*           |
|              | (0.003)    | (0.081)   | (0.003)          |
| Observations | 344,084    | 344,084   | 344,084          |
| Adjusted R2  | -0.00001   | 0.00000   | 0.00001          |

Note: *p<0.05; **p<0.01; ***p<0.001

Looking at these three pre-treatment covariates, there is little evidence of imbalance across the treatment and control groups. There are no significant gender or age differences between the control group and any of the treatment groups. There is some evidence a slightly higher proportion of voters turned out to vote in 2004 in the “Neighbors” treatment condition than in the control group (i.e. \(p < 0.05\)), but the difference is very small: turnout was about a half a percentage point higher in the treatment group than the control group (where turnout was about 40%). Overall, these tables do not indicate any failures of randomization.
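When a balance table contains many coefficients, a few will fall below \(p < 0.05\) by chance alone. One complementary check, which is not part of the seminar solutions and is offered only as a hedged sketch, is the joint F-test that lm already computes: it tests the null hypothesis that the covariate has the same mean in every experimental group. The example uses simulated data (`sim` and its `age` variable are hypothetical) so that it runs on its own.

```r
set.seed(1)
arms <- c("Control", "Civic Duty", "Hawthorne", "Self", "Neighbors")
sim <- data.frame(
  treatment = factor(sample(arms, 10000, replace = TRUE), levels = arms),
  age = rnorm(10000, mean = 50, sd = 14)  # pre-treatment covariate, unrelated to assignment
)

balance_fit <- lm(age ~ treatment, data = sim)

# Joint F-test of the null that mean age is the same in every experimental group
f <- summary(balance_fit)$fstatistic
p_joint <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
p_joint
```

The same p-value appears on the last line of the summary output for the model, so for the real data it can be read directly from summary(m1), summary(m2) and summary(m3).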

  4. Estimate the average treatment effects of the different treatment arms whilst controlling for the variables you created for the question above. How do these estimates differ from regression estimates of the treatment effects only (i.e. without controlling for other factors)? Why?
# Estimate a baseline model
baseline_model <- lm(voted ~ treatment, data = gerber)

# Estimate a model with covariates
covariate_model <- lm(voted ~ treatment + female + age + turnout04, data = gerber)

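# Table with only our treatment effects (requires the stargazer package)
library(stargazer)
stargazer(baseline_model, covariate_model,
          type = "text", # assumption: print to the console; the original output type is not shown
          dep.var.caption = "",
          dep.var.labels = "Turnout",
          column.labels = c("Baseline","w/ Covariates"),
          keep = c("Constant","treatment"),
          keep.stat = c("n","adj.rsq"),
          intercept.bottom = F,
          intercept.top = T,
          covariate.labels = levels(as.factor(gerber$treatment)),
          star.cutoffs = c(.05,.01,.001))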
|              | Baseline (1) | w/ Covariates (2) |
|--------------|--------------|-------------------|
| Control      | 0.297***     | 0.044***          |
|              | (0.001)      | (0.003)           |
| Civic Duty   | 0.018***     | 0.019***          |
|              | (0.003)      | (0.003)           |
| Hawthorne    | 0.026***     | 0.026***          |
|              | (0.003)      | (0.003)           |
| Self         | 0.049***     | 0.048***          |
|              | (0.003)      | (0.003)           |
| Neighbors    | 0.081***     | 0.080***          |
|              | (0.003)      | (0.003)           |
| Observations | 344,084      | 344,084           |
| Adjusted R2  | 0.003        | 0.045             |

Note: *p<0.05; **p<0.01; ***p<0.001

As expected from a randomized experiment, controlling for pre-treatment covariates has very little consequence for the estimated treatment effects. Because the covariates are balanced in expectation (and in this exact randomization there is also very little imbalance across the treatment arms), estimating the treatment effects conditional on covariates results in very similar estimates as the baseline estimates.
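Although covariates should not change the treatment estimate much, they can make it more precise when they predict the outcome. The simulation below is a sketch under made-up assumptions (the variable `past_vote`, the effect sizes and the noise level are all hypothetical): the treatment coefficient is nearly the same in both models, but its standard error is smaller once the prognostic covariate is included.

```r
set.seed(99)
n <- 5000
past_vote <- rbinom(n, 1, 0.4)   # hypothetical pre-treatment covariate
treated   <- rbinom(n, 1, 0.5)   # random assignment
y <- 0.2 + 0.05 * treated + 0.4 * past_vote + rnorm(n, sd = 0.3)

fit_base <- lm(y ~ treated)
fit_cov  <- lm(y ~ treated + past_vote)

# Compare the treatment coefficient and its standard error across the two models
rbind(baseline   = summary(fit_base)$coefficients["treated", 1:2],
      covariates = summary(fit_cov)$coefficients["treated", 1:2])
```

How much precision is gained depends on how strongly the covariates predict the outcome, so the gain can be negligible when they explain little of the variation.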

  5. Estimate the treatment effects separately for men and women. Do you note any differences in the impact of the treatment amongst these subgroups?

There are two ways of estimating these effects separately for men and women. First, you could simply estimate the same model on different subsets of the data:

# Estimate regression models on subsets of data
male_model <- lm(voted ~ treatment, data = gerber[gerber$female == 0,])
female_model <- lm(voted ~ treatment, data = gerber[gerber$female == 1,])

# Construct a data.frame with the treatment coefficients from each
coef_compare <- data.frame(male = coef(male_model)[2:5], 
                        female = coef(female_model)[2:5])
coef_compare
##                           male     female
## treatmentCivic Duty 0.01994637 0.01588446
## treatmentHawthorne  0.02468701 0.02679139
## treatmentSelf       0.04575431 0.05129251
## treatmentNeighbors  0.08174818 0.08089951

The treatment effects are in fact very similar between men and women. The largest difference in effect size is for the “Self” treatment condition, but even here the difference is only about one half of a percentage point. Men and women seem about equally likely to respond to appeals to civic duty and social pressure when making the decision to turn out to vote.

An alternative approach is to include an interaction between the treatment variable and the female variable:

interaction_model <- lm(voted ~ treatment*female, data = gerber)
summary(interaction_model)
## Call:
## lm(formula = voted ~ treatment * female, data = gerber)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3845 -0.3172 -0.2905  0.6583  0.7095 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.3027947  0.0014991 201.986  < 2e-16 ***
## treatmentCivic Duty         0.0199464  0.0036770   5.425 5.81e-08 ***
## treatmentHawthorne          0.0246870  0.0036740   6.719 1.83e-11 ***
## treatmentSelf               0.0457543  0.0036752  12.450  < 2e-16 ***
## treatmentNeighbors          0.0817482  0.0036774  22.230  < 2e-16 ***
## female                     -0.0123389  0.0021223  -5.814 6.11e-09 ***
## treatmentCivic Duty:female -0.0040619  0.0052002  -0.781    0.435    
## treatmentHawthorne:female   0.0021044  0.0052010   0.405    0.686    
## treatmentSelf:female        0.0055382  0.0052002   1.065    0.287    
## treatmentNeighbors:female  -0.0008487  0.0052012  -0.163    0.870    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.464 on 344074 degrees of freedom
## Multiple R-squared:  0.003569,   Adjusted R-squared:  0.003542 
## F-statistic: 136.9 on 9 and 344074 DF,  p-value: < 2.2e-16

You can also plot the predicted values depending on treatment and gender with the following code and the package sjPlot. It uses ggplot2-grammar, so it’s easy to customise it.

library(sjPlot)      # provides plot_model
library(ggplot2)     # provides labs, theme_bw and guides
library(wesanderson) # to make colours interesting

# Change the female variable so it has meaningful labels
gerber$female <- as.factor(gerber$female)
levels(gerber$female) <- c("Men","Women")
interaction_model <- lm(voted ~ treatment*female, data = gerber)

# Plot the model's predicted values by treatment and gender
plot_model(interaction_model, type = "pred",
           terms = c("treatment","female"),
           colors = c(wes_palette("GrandBudapest2")[1],
                       wes_palette("GrandBudapest2")[2])) +
   labs(y = "Estimated probability of turnout", x = "", title = "") +
   theme_bw() +
   guides(color = guide_legend(""))

The effect of each treatment condition relative to the control condition for men is simply the coefficient associated with each treatment indicator. So, the effect of the “Civic Duty” treatment on men is that it increased turnout by 0.01995. Note that this is exactly the same as the effect we estimated using the male_model above.

The effect of each treatment condition for women is calculated by taking the sum of the coefficient associated with each treatment indicator and the coefficient associated with the interaction between that indicator and the female variable. So, for example, the effect of the “Civic Duty” treatment for women is 0.01995 - 0.0041 = 0.01585. Again, this is directly equivalent (up to rounding error) to the effect size we calculated for the female_model above.

The advantage of using the interaction model is that we can directly assess, from a statistical perspective, whether the differences in treatment effects between men and women are significant. We can see from the regression output that they are not: none of the interaction effects is significantly different from zero (the t-statistics are very small, and the p-values are large), so we find no evidence that the treatments are more effective for one gender than for the other.
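The arithmetic linking the interaction model to the subgroup models can be verified on simulated data. The sketch below uses a single binary treatment for brevity, and the data frame `sim` and its effect sizes are hypothetical. The treatment coefficient reproduces the estimate for men, and the treatment coefficient plus the interaction coefficient reproduces the estimate for women.

```r
set.seed(7)
n <- 4000
sim <- data.frame(treated = rbinom(n, 1, 0.5), female = rbinom(n, 1, 0.5))
sim$voted <- rbinom(n, 1, 0.30 + 0.05 * sim$treated - 0.01 * sim$female)

inter <- lm(voted ~ treated * female, data = sim)
b <- coef(inter)

# Effect for men: the treatment coefficient on its own
effect_men <- unname(b["treated"])
# Effect for women: treatment coefficient plus the interaction coefficient
effect_women <- unname(b["treated"] + b["treated:female"])

# The same quantities from models fitted separately to each subgroup
sub_men   <- unname(coef(lm(voted ~ treated, data = sim[sim$female == 0, ]))["treated"])
sub_women <- unname(coef(lm(voted ~ treated, data = sim[sim$female == 1, ]))["treated"])

c(effect_men = effect_men, sub_men = sub_men,
  effect_women = effect_women, sub_women = sub_women)
```

The match is exact here because the interaction model has a separate parameter for every combination of treatment and gender, just as the model in the seminar does.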

2.2 Quiz

  1. What does the expression \(E[Y_{1i}] = E[Y_i | D_i = 1]\) mean?
     1. That the observed outcomes of the treatment group are representative of the population of treated potential outcomes
     2. That the potential outcomes of the treatment group are representative of the population of treated observed outcomes
     3. That expectations are linear
     4. That the expected values of the untreated potential outcomes in treatment and control groups are different
  2. What does “unbiasedness” of an estimator mean?
     1. That if I iterated the sampling procedure infinitely, the sampling distribution would converge around the true value
     2. That if I iterated the sampling procedure infinitely, the mean of the sampling distribution would be the true value
     3. That if I iterated the sampling procedure infinitely, the variance of the sampling distribution would be small
     4. That if I used it to estimate a parameter I would guess its true value
  3. When we refer to the sampling distribution of an estimator in the context of an experiment, what is being iteratively sampled?
     1. The units selected in the experiment
     2. The timing of the experiment
     3. The random assignment of units to treatment, in the sample
     4. The random assignment of units to treatment, and the sample
  4. What can we use covariates for, in the context of an experiment?
     1. To reduce bias of our estimated treatment effect and control for confounders
     2. To perform balance checks, increase precision of our estimates, and estimate heterogeneous treatment effects
     3. To perform balance checks and estimate heterogeneous treatment effects
     4. Covariates have no use in an experiment
  5. What is the issue of selection bias in causal inference?
     1. The treatment applied is ideologically biased
     2. The units in our sample are biased
     3. That units selected into treatment are fundamentally different than units selected into control and therefore our estimated treatment effect will be biased (i.e. systematically wrong)
     4. The researcher is biased