2 Causality

2.1 Overview

In the lecture this week, we discuss the concept of causality and particularly focus on distinguishing between observational and experimental strategies for making causal claims from quantitative data. We will introduce the “potential outcomes” framework for thinking about causal inference, and describe the “fundamental problem of causal inference”. After describing why randomized experiments are considered the gold standard for estimating causal effects, we will outline different strategies for using observational data to answer causal questions. The example we will use throughout the lecture will be about the effects of health insurance on self-assessed measures of health.

In seminar this week, we will cover the following topics:

  1. Folders and files
  2. Loading data using read.csv()
  3. Missing data

We will also spend time thinking about the assumptions required to make causal claims from analyses of observational data.

Before coming to the seminar

  1. Please read chapter 2, “Causality”, in Quantitative Social Science: An Introduction

2.2 Seminar

Does a state’s use of indiscriminate violence incite insurgent attacks? In today’s seminar, we will analyse the relationship between indiscriminate violence and insurgent attacks using data about Russian artillery fire in Chechnya from 2000 to 2005. The data in this exercise is based on Lyall, J. 2009. “Does Indiscriminate Violence Incite Insurgent Attacks?: Evidence from Chechnya.” You can download the data by clicking on the link above.

2.2.1 Preliminaries

Files and folders

It is sensible when you start any data analysis project to make sure your computer is set up in an efficient way. Last week, you should have created a script with the name seminar1.R. Hopefully, you will have saved this somewhere sensible! Our suggestion is that you create a folder on your computer with the name PUBL0055 which you can save all your scripts in throughout the course. If you didn’t set up a folder like that on your computer, do so now.

Once you have a folder with the name PUBL0055, make sure your seminar1.R file is saved within it. Now create a new R script, and save that in your folder with the name seminar2.R. Use this script to record all the code that you are working on this week. Each week you should start a new script and save it in this folder.

This week we will also be loading some data into R for analysis. Add a subfolder to your main PUBL0055 folder, and give it the name data. Whether you are working on a Mac or a PC, your PUBL0055 folder should now contain your two scripts and the data subfolder:

PUBL0055/
  data/
  seminar1.R
  seminar2.R

Working directories

When you open RStudio, the first thing you should include at the beginning of your script is code to set the “working directory”. This tells R where to look for scripts and data when you run your code. If you are working on a Mac and have saved your PUBL0055 folder on the desktop of your computer, for example, then you can tell R to work from that folder by running the following code:

setwd("~/Desktop/PUBL0055")

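If you were working on a Windows PC, you might use something like the following, replacing your_username with your own Windows user name (note that R expects forward slashes in paths):

setwd("C:/Users/your_username/Desktop/PUBL0055")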

You can adjust the code above to direct R to look to wherever the relevant folder is stored on your computer. For instance, if your PUBL0055 folder is kept inside your UCL folder, you could use setwd("~/Desktop/UCL/PUBL0055"), and so on.
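Before changing the working directory, it can help to check where R is currently looking and whether the target folder actually exists. The snippet below is a minimal sketch of that check; the path and the object name course_dir are only examples and should be adapted to your own computer.

```r
# Where is R currently looking for files?
getwd()

# Example path only: change this to wherever your PUBL0055 folder lives
course_dir <- file.path("~", "Desktop", "PUBL0055")

# Only change the working directory if the folder actually exists
if (dir.exists(course_dir)) {
  setwd(course_dir)
} else {
  message("Folder not found: ", course_dir)
}
```

If the message appears, the folder is probably stored somewhere else or has a slightly different name.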

Loading data

Once you have downloaded the data, put the .csv file into the data folder that you created earlier in the seminar, and then open the R script that you are using for this week. Now load the data using the read.csv() function:

chechen <- read.csv("data/chechen.csv")

This function loads the data stored in "chechen.csv" into R, and the assignment operator (<-) that we learned last week stores it in a new object. Once the data is loaded, you should see the chechen object appear in the Environment pane of RStudio.

The Environment pane shows that this data has 318 rows (units) and 6 columns (variables). We will describe these below.
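If you would like to see what read.csv() does without relying on the downloaded file, the sketch below writes a tiny made-up .csv file to a temporary location and reads it back in. The file contents and the object names toy_path and toy are invented for illustration and are not part of the Chechnya data.

```r
# Write a tiny made-up CSV file to a temporary location (illustration only)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("village,fire,deaths",
             "A,1,3",
             "B,0,NA",
             "C,1,0"),
           toy_path)

# read.csv() returns a data.frame; "NA" in the file is read as a missing value
toy <- read.csv(toy_path)

dim(toy)    # number of rows, then number of columns
nrow(toy)   # rows (units)
ncol(toy)   # columns (variables)
names(toy)  # variable names
```

The same dim(), nrow() and ncol() calls can be applied to the chechen object to confirm the number of rows and columns reported in the Environment pane.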

2.2.2 Indiscriminate Violence and Insurgency

A common view is that indiscriminate violence by the state can lead to increases in insurgent attacks by creating more cooperative relationships between citizens and insurgents. In particular, there is a large literature that relies on case-study evidence to suggest that when a state collectively targets a noncombatant population, this can provoke much greater levels of insurgent violence. An empirical difficulty in answering this question is that places that are subject to state-sponsored indiscriminate violence are likely to differ in many ways from places that are not subject to such violence.

In an attempt to overcome this problem (which is an example of the confounding bias that we discussed in the lecture), Lyall collected data on 159 events in which Russian artillery shelled a village. For each such event the data records the village where the shelling took place and whether it was in Groznyy (Chechnya’s capital), how many people were killed, and the number of insurgent attacks in the 90 days before and the 90 days after the date of the event. The data is then augmented with the same information for a set of demographically and geographically similar villages that were not shelled during the same time periods. The main explanatory variable used in Lyall’s analysis is therefore whether or not a village was struck by Russian artillery fire, which Lyall interprets as an instance of indiscriminate violence.

The names and descriptions of variables in the data file chechen.csv are:

Name        Description
village     Name of village
groznyy     Variable indicating whether a village is in Groznyy (1) or not (0)
fire        Whether Russians struck a village with artillery fire (1) or not (0)
deaths      Estimated number of individuals killed during Russian artillery fire, or NA if not fired on
preattack   The number of insurgent attacks in the 90 days before being fired on
postattack  The number of insurgent attacks in the 90 days after being fired on

Note that the same village may appear in the dataset several times as shelled and/or not shelled because Russian attacks occurred at different times and locations.

To get a sense of what this data.frame contains, use the functions below:

  1. head(chechen) – shows the first six (by default) rows of the data
  2. str(chechen) – shows the “structure” of the data
  3. View(chechen) – opens a spreadsheet-style viewer of the data

Question 1

For this question, we will learn to use the table() function, which provides counts of the number of observations in our data that take distinct values for a given variable or pair of variables.

The table function can be used to provide the number of observations that fall into a given category for a single variable. To do this, simply provide the name of the variable of interest as the first argument to the table function:

table(data_name$variable_name)

Or it can be used to provide the number of observations that fall into each combination of categories of two different variables. To do this, provide both variable names to the table function:

table(data_name$variable_name_one, data_name$variable_name_two)

Look at the help file for this function for more information (?table).

Use this function to answer the following questions:

  1. How many villages were shelled by Russians? How many were not?
  2. How many villages were located in Groznyy? How many were not?
  3. Of the villages that were shelled by the Russians, how many were located in Groznyy?
  4. Of the villages that were not shelled by the Russians, how many were located in Groznyy?

Reveal answer


table(chechen$fire)
##   0   1 
## 159 159

The table shows an equal number of shelled and non-shelled observations: 159 of each.


table(chechen$groznyy)
##   0   1 
## 298  20

298 villages were located outside of Groznyy and 20 were located within Groznyy.

3 and 4

table(chechen$fire, chechen$groznyy)
##       0   1
##   0 146  13
##   1 152   7

When applied to two variables, the counts for the first variable are indicated by the rows of the resulting matrix, and the counts for the second variable are indicated by the columns. So, in this case, of the 159 villages that were shelled by the Russians (second row), 152 were located outside of Groznyy and 7 were located in Groznyy. By contrast, 146 of the non-shelled villages were outside of Groznyy and 13 were located within Groznyy.
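To see how the rows and columns of a two-way table line up with the two variables, it can help to try the function on a very small example. The sketch below uses made-up data (not the Chechnya data); the object names toy and tab are hypothetical.

```r
# Made-up data for illustration only (not the Chechnya data)
toy <- data.frame(fire    = c(1, 1, 0, 0, 1, 0),
                  groznyy = c(0, 1, 0, 0, 0, 1))

# dnn labels the two dimensions so rows and columns are easier to read
tab <- table(toy$fire, toy$groznyy, dnn = c("fire", "groznyy"))
tab

# Pick out one cell by its row and column labels:
# shelled observations (fire = 1) outside Groznyy (groznyy = 0)
tab["1", "0"]

# Add row and column totals
addmargins(tab)
```

Labelling the dimensions is optional, but it makes it harder to mix up which variable is shown in the rows and which in the columns.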

Question 2

In this question, we will investigate whether artillery attacks on villages in Groznyy were more lethal than attacks on villages outside of Groznyy. Note that for this question, you will have to use the mean function to calculate the mean of various subsets of the data. However, you will find that if you simply apply the mean function to the chechen$deaths variable it does not produce the desired result:

mean(chechen$deaths)
## [1] NA

R returns NA because for some villages (those that were not subject to Russian attacks) we have no information on the number of deaths. In short, NA is the R value for missing data. We can tell the mean() function to estimate the mean only for those villages that we do have data for, and ignore the other villages by setting an additional argument for the mean function: mean(dataset_name$var_name, na.rm = TRUE).

mean(chechen$deaths, na.rm = TRUE)
## [1] 1.666667
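The behaviour of NA values is easiest to see on a very short vector. The sketch below uses a made-up vector x (not part of the data) to show what mean() returns with and without na.rm = TRUE, and how is.na() can be used to count missing values.

```r
# Made-up vector for illustration only
x <- c(2, NA, 4)

mean(x)                # NA: one missing value makes the mean missing
mean(x, na.rm = TRUE)  # 3: the mean of 2 and 4
is.na(x)               # FALSE TRUE FALSE
sum(is.na(x))          # 1: the number of missing values
```

Applying sum(is.na()) to a variable such as chechen$deaths is one way to check how many observations are dropped when na.rm = TRUE is used.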

Conduct comparisons between Groznyy and non-Groznyy observations in terms of the mean level of deaths. What do you find?

Reveal answer

mean(chechen$deaths[chechen$groznyy == 1], na.rm = TRUE)
mean(chechen$deaths[chechen$groznyy == 0], na.rm = TRUE)
## [1] 3.714286
## [1] 1.572368

Artillery attacks killed on average 3.71 people in Groznyy but only 1.57 people in villages outside Groznyy.
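The square-bracket subsetting used in this answer works by keeping only the elements for which the condition inside the brackets is TRUE. The sketch below illustrates this with two short made-up vectors (not the Chechnya data), and also shows tapply(), an alternative base R function that computes the same group means in one call.

```r
# Made-up data for illustration only
deaths  <- c(4, 1, NA, 2, 0)
groznyy <- c(1, 0, 0, 1, 0)

groznyy == 1            # TRUE FALSE FALSE TRUE FALSE
deaths[groznyy == 1]    # 4 2: keeps the elements where the condition is TRUE

mean(deaths[groznyy == 1], na.rm = TRUE)   # 3
mean(deaths[groznyy == 0], na.rm = TRUE)   # 0.5

# tapply() computes the same group means in one call
tapply(deaths, groznyy, mean, na.rm = TRUE)
```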

Question 3

Compare the average (mean) number of insurgent attacks for shelled villages and non-shelled villages using the postattack variable. Would you conclude that indiscriminate violence reduces insurgent attacks? Why or why not?

Reveal answer

attacks_shelled <- mean(chechen$postattack[chechen$fire == 1])
attacks_not_shelled <- mean(chechen$postattack[chechen$fire == 0])

attacks_shelled - attacks_not_shelled
## [1] -0.5534591

The estimated difference in means shows that shelled villages see slightly fewer insurgent attacks than non-shelled villages. The average number of insurgent attacks is 1.5 for observations of villages that were shelled versus 2.05 for the others, a difference of about 0.55 attacks. By itself, this comparison suggests that indiscriminate violence may slightly reduce insurgent attacks, though the difference is not large.

Question 4

In the question above, we used the variable fire to calculate the difference in means for the number of insurgent attacks in villages that were and were not attacked by the Russians. Is this difference in means likely to represent the causal effect of indiscriminate violence? Why or why not? Which assumptions are required to give this difference a causal interpretation? Give some thought to these questions, and write down your reasoning before reading the answer below.

Reveal answer

Recall our discussion of confounding from the lecture. There we argued that making causal statements on the basis of evidence drawn from observational studies is difficult because confounding differences between treatment and control observations mean that the difference in means can result in a biased estimate of the average treatment effect. In essence, to make a causal statement on the basis of observational evidence of this type, one needs to assume that there are no confounding differences between treatment and control groups. That is, the only way in which these groups differ on average is with respect to whether or not they were subject to indiscriminate violence.

In this case, are there any plausible sources of confounding? Put another way, are the places that experienced Russian artillery fire likely to be similar on all characteristics other than their receipt of the “treatment” of being shelled? At face value this seems unlikely, as there are many dimensions on which these groups of villages may differ. Unfortunately, it is not possible for us to assess the extent of these possible differences here, as we have not provided data on other characteristics of the villages. Although the set of non-shelled villages was selected by the researcher in a way to make them as demographically and geographically similar as possible to the set of shelled villages, we should remember that it is difficult to rule out the possibility of confounding bias in observational data of this sort.

Question 5

Considering only the pre-shelling periods, what is the difference between the average number of insurgent attacks for observations describing a shelled village and observations that do not? What does this suggest to you about the validity of the comparison used for question 3?

Reveal answer

pre_attacks_shelled <- mean(chechen$preattack[chechen$fire == 1])
pre_attacks_not_shelled <- mean(chechen$preattack[chechen$fire == 0])

pre_attacks_shelled
pre_attacks_not_shelled
## [1] 2.113208
## [1] 2.150943

Despite the fact that we cannot rule out the possibility of confounding, the evidence here is encouraging. In these periods, the average number of insurgent attacks was similar for villages that were later shelled and villages that were not shelled. If systematic differences between the two groups were confounding the treatment effects estimated in question 3, we would also expect there to be differences in the number of insurgent attacks in the period before the Russian shelling. As a consequence, although this is an observational study where artillery fire is not formally randomized, the similarity of these averages increases the credibility of the comparison of insurgent attacks between villages hit by Russian fire and those not.

Question 6

Create a new variable called diffattack by calculating the difference in the number of insurgent attacks in the before and after periods using the following code:

chechen$diffattack <- chechen$postattack - chechen$preattack

Here we are using the assignment operator to create a new variable in our chechen data. Positive values of this variable would indicate that the number of insurgent attacks increased after Russian shelling, and negative values indicate that the number of insurgent attacks decreased after shelling.

Using this variable, assess whether, for the villages that were shelled, the number of insurgent attacks increased after the villages were fired upon. What is the substantive interpretation of this result? Does this represent the causal effect of shelling on insurgent attacks? Why or why not?

Reveal answer

mean(chechen$diffattack[chechen$fire == 1]) 
## [1] -0.6163522

In villages that experienced shelling, the average number of insurgent attacks decreased by 0.62 from the period before to the period after the shelling. This is an example of a “before and after” design, which examines how the outcome variable changes from the pretreatment period to the posttreatment period for the same set of units. This estimate can only be considered a causal effect if we are willing to assume that, in the absence of the shelling, there would have been no change in the average number of insurgent attacks between these two time periods.

Question 7

Compute the mean difference in the diffattack variable between shelled and non-shelled villages. Does this analysis support the claim that indiscriminate violence reduces insurgency attacks? Is the validity of this analysis improved over the analyses you conducted in the previous questions? Why or why not? Specifically, explain what additional factor this analysis addresses when compared to the analyses conducted in the previous questions.

Reveal answer

diff_shelled <- mean(chechen$diffattack[chechen$fire == 1]) 
diff_not_shelled <- mean(chechen$diffattack[chechen$fire == 0]) 
diff_shelled - diff_not_shelled
## [1] -0.5157233

In the villages that did not experience shelling, the average number of insurgent attacks decreased by about 0.1 between the two periods. Compare this to the same before-and-after difference for shelled villages, where insurgent attacks decreased by 0.62. In other words, the decrease in insurgent attacks was larger in shelled villages than in non-shelled villages, by about 0.52 attacks. The results support the conclusion that indiscriminate violence reduces insurgent attacks.

This analysis is an example of a “difference in differences” design. What are the two differences we are comparing here? We have the difference in pre-and-post shelling insurgent attacks in shelled villages, and we are comparing that to the same difference in non-shelled villages. The key advantage of this analysis is that it takes into account any common time trend that exists for the two types of observations. If we assume that the trend in numbers of attacks pre- and post-shelling among villages that were not in fact shelled is what we would have observed in shelled villages had they not been shelled, then the difference in the differences between pre- and post-shelling attacks between the two village types can be attributed to the Russian artillery fire.
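The arithmetic behind the difference in differences can be checked by hand using the numbers printed in the answers above. The sketch below does this; the object names are hypothetical, the inputs are the rounded values as printed, and the change for non-shelled villages is not printed directly but is backed out from the question 7 output, so the two routes agree only up to rounding.

```r
# Pre-period means as printed in the answer to question 5
pre_shelled     <- 2.113208
pre_not_shelled <- 2.150943

# Before-and-after changes: the shelled value is printed in question 6;
# the non-shelled value is inferred as -0.6163522 - (-0.5157233)
change_shelled     <- -0.6163522
change_not_shelled <- -0.1006289

# Difference in differences: change among shelled minus change among non-shelled
did <- change_shelled - change_not_shelled
did

# Equivalent route: (post-period gap) minus (pre-period gap)
post_gap <- -0.5534591                      # printed in question 3
pre_gap  <- pre_shelled - pre_not_shelled
post_gap - pre_gap
```

Both routes give roughly -0.52, which shows that the difference in differences is simply the post-shelling gap between the two groups after subtracting the gap that already existed before the shelling.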

2.3 Homework

2.3.1 Demographic Change and Exclusionary Attitudes

This week’s homework uses data based on: Enos, R. D. 2014. “Causal Effect of Intergroup Contact on Exclusionary Attitudes.” Proceedings of the National Academy of Sciences 111(10): 3699–3704. You can download this data from the link at the top of the page. Once you have done so, store it in the data subfolder you created earlier. Then start a new R script which you should save as homework2.R.

Enos conducted a randomized field experiment assessing the extent to which individuals living in suburban communities around Boston, Massachusetts, were affected by exposure to demographic change.

Subjects in the experiment were individuals riding on the commuter train line and were overwhelmingly white. Every morning, multiple trains pass through various stations in suburban communities that were used for this study. For pairs of trains leaving the same station at roughly the same time, one was randomly assigned to receive the treatment and one was designated as a control. Because treatment was assigned at random within each pair, the usual benefits of randomization apply to this dataset.

The treatment in this experiment was the presence of two native Spanish-speaking ‘confederates’ (a term used in experiments to indicate that these individuals worked for the researcher, unbeknownst to the subjects) on the platform each morning prior to the train’s arrival. The presence of these confederates, who would appear as Hispanic foreigners to the subjects, was intended to simulate the kind of demographic change anticipated for the United States in coming years. For those individuals in the control group, no such confederates were present on the platform. The treatment was administered for 10 days. Participants were asked questions related to immigration policy both before the experiment started and after the experiment had ended. The names and descriptions of variables in the data set boston.csv are:

Name           Description
age            Age of individual at time of experiment
male           Sex of individual, male (1) or female (0)
income         Income group in dollars (not exact income)
white          Indicator variable for whether individual identifies as white (1) or not (0)
college        Indicator variable for whether individual attended college (1) or not (0)
usborn         Indicator variable for whether individual is born in the US (1) or not (0)
treatment      Indicator variable for whether an individual was treated (1) or not (0)
ideology       Self-placement on ideology spectrum from Very Liberal (1) through Moderate (3) to Very Conservative (5)
numberim.pre   Policy opinion on question about increasing the number of immigrants allowed in the country from Increased (1) to Decreased (5)
numberim.post  Same question as above, asked later
remain.pre     Policy opinion on question about allowing the children of undocumented immigrants to remain in the country from Allow (1) to Not Allow (5)
remain.post    Same question as above, asked later
english.pre    Policy opinion on question about passing a law establishing English as the official language from Not Favor (1) to Favor (5)
english.post   Same question as above, asked later

Question 1

The benefit of randomly assigning individuals to the treatment or control groups is that the two groups should be similar, on average, in terms of their other characteristics, or “covariates”. This is referred to as “covariate balance.”

Use the mean function to determine whether the treatment and control groups are balanced with respect to the age (age) and income (income) variables. Also, compare the proportion of males (male) in the treatment and control groups. Interpret these numbers.

(Hint: to calculate the proportion of observations with a given attribute on a binary variable, you can just use mean(data_frame_name$variable_name).)

Reveal answer

## Load data

boston <- read.csv("data/boston.csv", header = TRUE)
## Mean age for treatment and control units

mean_age_treated <- mean(boston$age[boston$treatment == 1])
mean_age_control <- mean(boston$age[boston$treatment == 0])

mean_age_treated - mean_age_control
## [1] -3.912299
## Mean income levels for treatment and control units

mean_income_treated <- mean(boston$income[boston$treatment == 1])
mean_income_control <- mean(boston$income[boston$treatment == 0])

mean_income_treated - mean_income_control
## [1] -15972.59
## Proportion "male" for treatment and control units

prop_male_treated <- mean(boston$male[boston$treatment == 1])
prop_male_control <- mean(boston$male[boston$treatment == 0])

prop_male_treated - prop_male_control
## [1] -0.06096257

Despite the randomization of treatment assignment, there are some differences in the average characteristics of treatment and control units. For example, the average age of treated individuals is 40.4, whereas it is 44.3 for control units. Similarly, while 53% of treated individuals are male, 59% of control individuals are male. Most notably, the average income of the treated group is approximately $16,000 lower than it is for the control group.

Overall, while the treatment and control groups are relatively well balanced, there remain some potentially problematic confounding differences between these groups. This is an example of the point made in lecture: although randomized experiments provide unbiased estimates on average, any given instance of randomization may not create perfect balance across all covariates. That is, you might be unlucky! That is why it is often important to run replication studies of randomized experiments to ensure that the results we obtain are not simply because we were lucky/unlucky in any particular randomization of the treatment.
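When checking balance on several covariates, repeating the same two lines for each variable becomes tedious. One possible shortcut, sketched below on made-up data (not the Boston data), is to wrap the difference in means in a small helper function and apply it to several columns at once. The names toy and balance_diff are hypothetical.

```r
# Made-up data for illustration only (not the Boston data)
toy <- data.frame(treatment = c(1, 1, 1, 0, 0, 0),
                  age       = c(30, 40, 50, 35, 45, 55),
                  male      = c(1, 0, 1, 1, 1, 0))

# Difference in means (treated minus control) for one covariate
balance_diff <- function(x, treatment) {
  mean(x[treatment == 1], na.rm = TRUE) - mean(x[treatment == 0], na.rm = TRUE)
}

# Apply the helper to several covariates at once
sapply(toy[, c("age", "male")], balance_diff, treatment = toy$treatment)
```

In this toy example the treated group is five years younger on average, while the proportion of males is the same in both groups.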

Question 2

Individuals in the experiment were asked “Do you think the number of immigrants from Mexico who are permitted to come to the United States to live should be increased, left the same, or decreased?” The response to this after the experiment is in the variable numberim.post. The variable is coded on a 1 – 5 scale. Responses with values of 1 are inclusionary (‘pro-immigration’) and responses with values of 5 are exclusionary (‘anti-immigration’). Calculate the mean value of this variable for the treatment and control groups. What is the difference in means? What does the result suggest about the effects of intergroup contact on exclusionary attitudes?

Reveal answer

## Calculate the mean in each group (note that na.rm = TRUE is required here)

treat_mean <- mean(boston$numberim.post[boston$treatment == 1], na.rm = TRUE)
control_mean <- mean(boston$numberim.post[boston$treatment == 0], na.rm = TRUE)

## Calculate the difference in means

treat_mean - control_mean 

The difference in means suggests that the treatment group reported, on average, 0.39 points higher on the 5-point scale than the control group. As higher values of the outcome variable indicate more exclusionary attitudes, this suggests that contact with the Spanish-speaking confederates increases exclusionary attitudes, at least in this experiment. Because the responses of individuals in the treatment group were more exclusionary than those of the control group, we would conclude on the basis of this experiment that exposure to potential demographic change causes increases in exclusionary attitudes.

Question 3

Does having attended college influence the effect of being exposed to ‘outsiders’ on exclusionary attitudes? Another way to ask the same question is this: is there evidence of a differential impact of treatment, conditional on attending college versus not attending college? Calculate the difference in means between treatment and control observations amongst those who attended college and those who did not. Interpret your results.

(Hint: You may want to subset the data using more than one logical condition here. For example, if I wanted to subset the data to include only the observations which were treated and went to college, I could use boston$numberim.post[boston$treatment == 1 & boston$college == 1].)

Reveal answer

## First calculate the mean outcome for treatment and control observations *who went to college*

treat_college_mean <- mean(boston$numberim.post[boston$treatment == 1 & boston$college == 1],
                     na.rm = TRUE)

control_college_mean <- mean(boston$numberim.post[boston$treatment == 0 & boston$college == 1],
                     na.rm = TRUE)

## Now calculate the mean outcome for treatment and control observations *who did not go to college*

treat_nocollege_mean <- mean(boston$numberim.post[boston$treatment == 1 & boston$college == 0],
                     na.rm = TRUE)

control_nocollege_mean <- mean(boston$numberim.post[boston$treatment == 0 & boston$college == 0],
                     na.rm = TRUE)

## Difference in means for college observations

diff_college <- treat_college_mean - control_college_mean

## Difference in means for non-college observations

diff_nocollege <- treat_nocollege_mean - control_nocollege_mean

## Print both differences

diff_college
diff_nocollege
## [1] 0.4929467
## [1] -0.4285714

The average treatment effect (using the numberim.post variable) among those with a college education is an increase in exclusionary attitudes of about 0.49 points. Among those without a college education, there is a decrease in exclusionary attitudes of about 0.43 points. Both of these effects are on a 5-point scale. At face value this suggests that the effects of “outgroup” contact on exclusionary attitudes might differ according to education.

Question 4

Calculate the number of observations used to calculate each of the mean outcome values you used in the answer for question 3. What does this suggest about the reliability of the conclusions you drew from that analysis?

Reveal answer

table(boston$treatment, boston$college)
##      0  1
##   0  7 61
##   1  8 47

Using the table function reveals that some of the averages calculated above are based on a very small number of observations. In particular, the vast majority of the individuals in our sample attended college. Only 15 individuals did not. Accordingly, we might worry that the averages we have calculated above for non-college individuals may capture idiosyncrasies of these individuals rather than anything general about the broader population of non-college educated individuals.
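One further point to keep in mind is that table() counts every row, whereas a mean calculated with na.rm = TRUE only uses the rows where the outcome is not missing. The sketch below uses made-up data (not the Boston data) to show how the two counts can differ; whether this matters for boston.csv depends on how many responses are missing in each group, which is not shown here.

```r
# Made-up data for illustration only (not the Boston data)
toy <- data.frame(treatment     = c(1, 1, 1, 0, 0, 0),
                  college       = c(1, 1, 0, 1, 0, 0),
                  numberim.post = c(3, NA, 4, 2, NA, 5))

# Counts of all rows in each treatment-by-college cell
table(toy$treatment, toy$college)

# Counts of rows that actually contribute to each mean,
# i.e. those where the outcome is not missing
answered <- !is.na(toy$numberim.post)
table(toy$treatment[answered], toy$college[answered])
```

In the toy data, two of the six rows have a missing outcome, so the second table has smaller counts in two of its cells.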

How many observations is “enough”? The short answer is: it depends. The long answer: we will cover this extensively in future weeks!