1 Causal Inference and Potential Outcomes

1.1 Lecture review

This week we introduced the potential outcomes framework for causal inference. There are excellent introductions to this framework in Angrist and Pischke, 2009 (Chapter 1) and in Gerber and Green, 2012 (Chapter 1). The Paul Holland article on Statistics and Causal Inference provides an excellent history of the framework, and relates the conception of causality we focus on to long-standing philosophical discussions. If you are looking for inspiration, and a fun read, then the first half of David Freedman’s article on Statistical Models and Shoe Leather (pp. 291-300) provides an interesting history of John Snow’s (no, not the guy from Game of Thrones) study of cholera in 19th century London. The second half of that article is also worth reading, as it makes a fairly strong argument for why “statistical technique can seldom be an adequate substitute for good design, relevant data, and testing predictions against reality in a variety of settings.”

1.2 Seminar

In today’s seminar we will be focussing on the basics of using R, which is the statistical programming language we will be using throughout the course. There are some basic instructions on this page, which should help you to learn your way around R, and which we will follow in class. You will then also be asked to complete an online learning activity as part of this week’s homework.

1.2.1 Getting Started

Install R and RStudio on your computer by downloading them from the following sources:

1.2.2 RStudio

Let’s get acquainted with R. When you start RStudio for the first time, you’ll see three panes:

1.2.3 Console

The Console in RStudio is the simplest way to interact with R. You can type some code at the Console and when you press ENTER, R will run that code. Depending on what you type, you may see some output in the Console or, if you make a mistake, a warning or an error message.

Let’s familiarize ourselves with the console by using R as a simple calculator:

2 + 4
[1] 6

Now that we know how to use the + sign for addition, let’s try some other mathematical operations such as subtraction (-), multiplication (*), and division (/).

10 - 4
[1] 6
5 * 3
[1] 15
7 / 2
[1] 3.5
You can use the cursor or arrow keys on your keyboard to edit your code at the console:
- Use the UP and DOWN keys to re-run something without typing it again
- Use the LEFT and RIGHT keys to edit

Take a few minutes to play around at the console and try different things out. Don’t worry if you make a mistake, you can’t break anything easily!
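
If you are not sure what to try, here are a few more things to type at the console. This is only a small sketch of other arithmetic that R supports; the particular numbers are arbitrary and chosen for illustration.

```r
2^3          # exponentiation: 2 to the power of 3
10 %% 3      # remainder after dividing 10 by 3
(2 + 4) * 3  # parentheses control the order of operations
2 + 4 * 3    # without them, multiplication is done before addition
```

Compare the last two results to see how the parentheses change the answer.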

1.2.4 Functions

A function is a set of instructions that carries out a specific task. Functions often require some input and generate some output. For example, instead of using the + operator for addition, we can use the sum function to add two or more numbers.

sum(1, 4, 10)
[1] 15

In the example above, 1, 4, 10 are the inputs and 15 is the output. A function call always requires parentheses, or round brackets (). Inputs to the function are called arguments and go inside the brackets. The output of a function is displayed on the screen, but we also have the option of saving it. More on this later.
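
The same pattern of arguments inside round brackets applies to other functions. The sketch below uses sqrt and max, two other functions that come with R, purely as illustrations; the inputs are arbitrary.

```r
# A function with a single argument
sqrt(16)

# A function with several arguments
max(1, 4, 10)

# The output of one function can be used as the input to another:
# sum(1, 4, 10, 1) is evaluated first, then sqrt is applied to the result
sqrt(sum(1, 4, 10, 1))
```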

1.2.5 Getting Help

Another useful function in R is help which we can use to display online documentation. For example, if we wanted to know how to use the sum function, we could type help(sum) and look at the online documentation.


The question mark ? can also be used as a shortcut to access online help.

?sum


Use the toolbar button shown in the picture above to expand and display the help in a new window.

Help pages for functions in R follow a consistent layout and generally include these sections:

Description: A brief description of the function
Usage: The complete syntax or grammar including all arguments (inputs)
Arguments: Explanation of each argument
Details: Any relevant details about the function and its arguments
Value: The output value of the function
Examples: Examples of how to use the function

1.2.6 The Assignment Operator

Now we know how to provide inputs to a function using parentheses, or round brackets (), but what about the output of a function?

We use the assignment operator <- for creating or updating objects. If we wanted to save the result of sum(1, 4, 10), we would do the following:

myresult <- sum(1, 4, 10)

The line above creates a new object called myresult in our environment and saves the result of sum(1, 4, 10) in it. To see what’s in myresult, just type it at the console:

myresult
[1] 15

Take a look at the Environment pane in RStudio and you’ll see myresult there.

To delete all objects from the environment, you can use the broom button as shown in the picture above.
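
The assignment operator can also update an existing object, and single objects can be removed without clearing the whole environment. The following is a minimal sketch; the object name doubled is made up for this example.

```r
myresult <- sum(1, 4, 10)   # create the object (value 15)
myresult <- myresult + 5    # update it: the old value is overwritten
myresult

doubled <- myresult * 2     # 'doubled' is a hypothetical name for illustration
doubled

rm(doubled)                 # remove just this one object
exists("doubled")           # check whether an object with this name still exists
```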

We called our object myresult but we can call it anything as long as we follow a few simple rules. Object names can contain upper or lower case letters (A-Z, a-z), numbers (0-9), underscores (_) or a dot (.) but all object names must start with a letter. Choose names that are descriptive and easy to type.

Good Object Names    Bad Object Names
result               a
myresult             x1
my.result            this.name.is.just.too.long

1.2.7 Sequences

We often need to create sequences when manipulating data. For instance, you might want to perform an operation on the first 10 rows of a dataset, so you need a way to select the range you’re interested in.

There are two ways to create a sequence. Let’s try to create a sequence of numbers from 1 to 10 using the two methods:

  1. Using the colon : operator. If you’re familiar with spreadsheets then you might’ve already used : to select cells, for example A1:A20. In R, you can use the : to create a sequence in a similar fashion:
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
  2. Using the seq function we get the exact same result:
seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10

The seq function has a number of options which control how the sequence is generated. For example, to create a sequence from 0 to 100 in increments of 5, we can use the optional by argument. Notice how we wrote by = 5 as the third argument. It is common practice to specify the name of an argument when the argument is optional.

seq(0, 100, by = 5)
 [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80
[18]  85  90  95 100

Take a look at the help page for seq to see what other options are available.
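
As a taste of what the help page describes, here is a short sketch of a few other ways to call seq. The particular start and end values are arbitrary.

```r
# Counting down: when the first value is larger, seq steps by -1
seq(10, 1)

# length.out asks for a given number of evenly spaced values
seq(0, 1, length.out = 5)

# by can be negative to count down in larger steps
seq(100, 0, by = -25)
```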


Now it’s your turn:

  • Create a sequence of odd numbers between 0 and 100 and save it in an object called odd_numbers
odd_numbers <- seq(1, 100, 2)
  • Next, display odd_numbers on the console to verify that you did it correctly
odd_numbers
 [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
[24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
[47] 93 95 97 99
  • What do the numbers in square brackets [ ] mean? Look at the number of values displayed in each line to find out the answer.

  • Use the length function to find out how many values are in the object odd_numbers.
    • HINT: Try help(length) and look at the examples section at the end of the help screen.
length(odd_numbers)
[1] 50
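
One way to check your answer to the question about the square brackets is to pick out single values by their position. This is only a sketch: selecting a value by position with square brackets is an idea we return to when subsetting data frames.

```r
odd_numbers <- seq(1, 100, 2)

# How many values are there?
length(odd_numbers)

# The 24th and 47th values: compare them with the values printed
# immediately after [24] and [47] in the output above
odd_numbers[24]
odd_numbers[47]
```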

1.2.8 Scripts

The Console is great for simple tasks, but if you’re working on a project you will most likely want to save your work in some sort of document or file. Scripts in R are just plain text files that contain R code. You can edit a script just like you would edit a file in any word processing or note-taking application.

Create a new script using the menu or the toolbar button as shown below.

Once you’ve created a script, it is generally a good idea to give it a meaningful name and save it immediately. For our first session save your script as seminar1.R

Familiarize yourself with the script window in RStudio, and especially the two buttons labeled Run and Source

There are a few different ways to run your code from a script.

One line at a time: Place the cursor on the line you want to run and hit CTRL-ENTER or use the Run button
Multiple lines: Select the lines you want to run and hit CTRL-ENTER or use the Run button
Entire script: Use the Source button
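
To make this concrete, here is a sketch of what seminar1.R might contain at this point. The contents are only an illustration built from the commands used earlier on this page; your own script can look different.

```r
# seminar1.R
# Seminar 1: getting started with R
# Lines starting with # are comments and are ignored by R

# Save the result of a function in an object
myresult <- sum(1, 4, 10)
print(myresult)

# Create a sequence of odd numbers and count them
odd_numbers <- seq(1, 100, by = 2)
print(length(odd_numbers))
```

When a whole script is run with the Source button, results are typically not displayed automatically, which is why this sketch wraps them in print(). Running lines with CTRL-ENTER behaves like typing them at the console.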

1.2.9 Data frames

A data frame is an object that holds data in a tabular format similar to how spreadsheets work. Variables are generally kept in columns and observations are in rows.

Although you can create a data frame manually, in most cases you will create a data frame by loading a dataset from a file. For now however, we will simply use a dataset that comes pre-installed with R.
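
For reference, creating a data frame manually might look like the sketch below. The object name students, its column names, and all of its values are invented for illustration and are not part of any course dataset.

```r
# Hypothetical data: each argument to data.frame() becomes one column
students <- data.frame(
  name  = c("Ana", "Ben", "Cleo"),
  age   = c(23, 25, 22),
  score = c(67.5, 72.0, 58.5)
)

students        # display the data frame
nrow(students)  # number of observations (rows)
ncol(students)  # number of variables (columns)
```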

Let’s take a look at a macroeconomic dataset called longley. The longley dataset is provided as a data frame of 7 variables and 16 observations.


The help screen describes each of the 7 variables. Now let’s see what’s in the longley dataset.

longley
     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
1952         98.1 346.999      193.2        359.4    113.270 1952   63.639
1953         99.0 365.385      187.0        354.7    115.094 1953   64.989
1954        100.0 363.112      357.8        335.0    116.219 1954   63.761
1955        101.2 397.469      290.4        304.8    117.388 1955   66.019
1956        104.6 419.180      282.2        285.7    118.734 1956   67.857
1957        108.4 442.769      293.6        279.8    120.445 1957   68.169
1958        110.8 444.546      468.1        263.7    121.950 1958   66.513
1959        112.6 482.704      381.3        255.2    123.366 1959   68.655
1960        114.2 502.601      393.1        251.4    125.368 1960   69.564
1961        115.7 518.173      480.6        257.2    127.852 1961   69.331
1962        116.9 554.894      400.7        282.7    130.081 1962   70.551
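
Printing a whole data frame works for a small dataset like this one, but larger datasets are easier to inspect with summary functions. The sketch below shows a few functions that come with R and that confirm the 7 variables and 16 observations mentioned above.

```r
dim(longley)      # number of rows, then number of columns
names(longley)    # the variable (column) names
head(longley, 3)  # only the first 3 rows
str(longley)      # a compact overview of each variable and its type
```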

We can also look at the longley dataset graphically using the View function, which displays the data frame like a spreadsheet.

View(longley)


In order to access individual columns of a data frame we use the dollar sign $. For example, let’s see how to access the Year column:

longley$Year
 [1] 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960
[15] 1961 1962
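
A column accessed with $ can be passed to a function like any other input. As a brief sketch, using functions that come with R:

```r
range(longley$Year)         # the earliest and latest year in the data
mean(longley$Employed)      # the average of the Employed column
length(longley$Unemployed)  # the number of observations in a column
```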

Often we want to access certain observations (rows) or certain columns (variables) or a combination of the two without looking at the entire dataset all at once. We can use square brackets to subset data frames. In square brackets we put a row and a column coordinate separated by a comma. The row coordinate goes first and the column coordinate second. So longley[10, 3] returns the 10th row and third column of the data frame. If we leave the column coordinate empty this means we would like all columns. So, longley[10,] returns the 10th row of the dataset. If we leave the row coordinate empty, R returns the entire column. longley[,3] returns the third column of the dataset.

longley[10, 3] # element in 10th row, 3rd column
[1] 282.2
longley[10, ] # entire 10th row
     GNP.deflator    GNP Unemployed Armed.Forces Population Year Employed
1956        104.6 419.18      282.2        285.7    118.734 1956   67.857
longley[, 3] # entire 3rd column
 [1] 235.6 232.5 368.2 335.1 209.9 193.2 187.0 357.8 290.4 282.2 293.6
[12] 468.1 381.3 393.1 480.6 400.7

We can look at the first five rows of a dataset to get a better understanding of it with the colon in brackets like so: longley[1:5,]. We could display the second and fifth columns of the dataset by using the c() function in brackets like so: longley[, c(2,5)].
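
Written out as code, the two examples just described look like this. The third line is an additional sketch that selects columns by name rather than by position, which can be easier to read.

```r
longley[1:5, ]                        # first five rows, all columns
longley[, c(2, 5)]                    # all rows, second and fifth columns
longley[1:5, c("Year", "Employed")]   # first five rows, two columns chosen by name
```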

It’s your turn. Display all columns of the longley dataset and show rows 10 to 15. Next display all columns of the dataset and rows 4 and 7.

longley[10:15, ] # elements in 10th to 15th row, all columns
     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1956        104.6 419.180      282.2        285.7    118.734 1956   67.857
1957        108.4 442.769      293.6        279.8    120.445 1957   68.169
1958        110.8 444.546      468.1        263.7    121.950 1958   66.513
1959        112.6 482.704      381.3        255.2    123.366 1959   68.655
1960        114.2 502.601      393.1        251.4    125.368 1960   69.564
1961        115.7 518.173      480.6        257.2    127.852 1961   69.331
longley[c(4, 7), ] # elements in 4th and 7th row, all columns
     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
1953         99.0 365.385      187.0        354.7    115.094 1953   64.989

1.2.10 Plots

Now let’s create some plots from the longley dataset. First let’s create a scatterplot with the Year variable on the x-axis and Employed on the y-axis.

plot(longley$Year, longley$Employed)

To create a line plot instead, we use the same function with one additional argument type = "l".

plot(longley$Year, longley$Employed, type = "l")

Now it’s your turn.

  • Use online help for the plot function and find out how to create a plot that includes both points and lines.
plot(longley$Year, longley$Employed, type = "b")
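
Plots are usually easier to read with clear axis labels and a title. The sketch below adds these using the xlab, ylab and main arguments of plot; the label text itself is just a suggestion.

```r
# Points and lines, with axis labels and a title
plot(longley$Year, longley$Employed,
     type = "b",
     xlab = "Year",
     ylab = "Employed",
     main = "Employment in the longley dataset")
```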

1.3 Homework

  1. Learning R

    One of the most common obstacles at the beginning of this course is learning how to use R. Depending on your previous experience with statistical software or other programming languages, you may find R easy or difficult to adopt. Nevertheless, it is worth investing some time to acquire some R skills for at least a couple of reasons. 1) We will be using R throughout the next 10 weeks so you need to learn it. 2) You may wish to do some statistical analysis in your dissertation, and R is a good choice for that because it is a) well-developed and b) free. 3) Knowing how to programme in R is a skill that is valuable to many employers.

    The main task for this week, then, is to familiarise yourself with the R language. In particular, I would like you all to complete the Introduction to R short-course on DataCamp. This will give you a helpful overview of the basic structures used in R, and will mean that we can move much faster from next week. It should take a couple of hours for you to complete, you can start and stop at any time, and the course is free.

  2. Potential Outcomes review

    The following questions are designed to help you get familiar with the potential outcomes framework for causal inference that we discussed in the lecture.

    1. Explain the notation \(Y_{0i}\).

      The potential outcome for subject \(i\) if this subject were untreated. Put another way: the untreated potential outcome for subject \(i\).

    2. Explain the notation \(Y_{1i}\).

      The potential outcome for subject \(i\) if this subject were treated. Put another way: the treated potential outcome for subject \(i\).

    3. Contrast the meaning of \(Y_{0i}\) with the meaning of \(Y_i\).

      The first is the potential outcome for subject \(i\) if this subject were untreated. The second is simply the observed outcome for subject \(i\).

    4. Can we observe both \(Y_{0i}\) and \(Y_{1i}\) for any individual unit at the same time?

      No, recall that:

      \(Y_{0i}\) = the potential outcome for \(i\) under control.

      \(Y_{1i}\) = the potential outcome for \(i\) under treatment.

      Only one of the two potential outcomes for \(i\) can ever be realized, as a subject cannot be under control and treatment simultaneously. Consequently, observing both potential outcomes is not possible. This is known as the “fundamental problem of causal inference”.

    5. If \(D_i\) is a binary variable that gives the treatment status for subject \(i\) (1 if treated, 0 if control), what is the meaning of \(E[Y_{0i}|D_i = 1]\)?

      The expected value of the potential outcome for subject \(i\) if the subject were untreated, given that this subject actually receives treatment. Put another way: the expected value of the untreated potential outcome for a subject in the treatment group.

    6. The table below contains the potential outcomes (\(Y_{1i}\) and \(Y_{0i}\)) and the treatment indicator (\(D_i\)) from a hypothetical experiment with 6 units. Complete the following calculations by hand.

      Unit \(Y_{1i}\) \(Y_{0i}\) \(D_i\)
      1 2 2 1
      2 3 -1 1
      3 -1 9 1
      4 17 8 0
      5 12 9 0
      6 9 1 0
      1. List the observed outcomes (\(Y_i\)) for the experiment based on the table above.

        2, 3, -1, 8, 9, 1

        Note that in a real experiment, these are the only values (along with the treatment assignment) we would observe. The other potential outcomes (i.e. \(Y_{1i}\) for observations with \(D_i = 0\) and \(Y_{0i}\) for observations with \(D_i = 1\)) are unobservable.

      2. Calculate the “true” average treatment effect (ATE) based on the potential outcomes.

        \(\tau_\text{ATE} = \frac{0 + 4 - 10 + 9 + 3 + 8}{6} = \frac{14}{6} = 2.33\)

      3. Calculate the “true” average treatment effect on the treated (ATT) based on the potential outcomes.

        \(\tau_\text{ATT} = \frac{0 + 4 - 10}{3} = \frac{-6}{3} = -2\)

      4. Calculate the “estimated” average treatment effect based on the naive difference in group means for treatment and control conditions from the observed outcomes. Explain the difference between this estimate and the “true” average treatment effect.

        \(\hat{\tau}_\text{ATE} = E[Y_i|D_i = 1] - E[Y_i|D_i=0] = \frac{2+3-1}{3} - \frac{8+9+1}{3} = \frac{4}{3} - \frac{18}{3} = -4.67\)

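        Recall that the expected difference in group means (DIGM) decomposes as \(E[DIGM] = E[\tau_i|D_i=1] + E[Y_{0i}|D_i = 1] - E[Y_{0i}|D_i=0]\), that is, the ATT plus selection bias. The difference in group means is therefore an unbiased estimator of the ATE only when a) the ATE is equal to the ATT, and b) there is no selection bias. In this case neither is true, and so this estimated ATE is very different from the “true” ATE:

        \(\tau_\text{ATT} = \frac{0 + 4 - 10}{3} = -2\)

        \(\text{Selection bias} = \frac{2 - 1 + 9}{3} - \frac{8 + 9 + 1}{3} = \frac{10}{3} - \frac{18}{3} = -2.67\)

        Adding the two terms gives \(-2 + (-2.67) = -4.67\), which matches the naive difference in group means.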
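
The hand calculations in question 6 can also be checked in R. The sketch below enters the table as three vectors and reproduces each quantity. The object names (y1, y0, d and so on) are our own choices, and the line computing the observed outcome uses the standard rule that we observe \(Y_{1i}\) for treated units and \(Y_{0i}\) for control units.

```r
# Potential outcomes and treatment indicator from the table in question 6
y1 <- c(2, 3, -1, 17, 12, 9)   # treated potential outcomes
y0 <- c(2, -1, 9, 8, 9, 1)     # untreated potential outcomes
d  <- c(1, 1, 1, 0, 0, 0)      # 1 if treated, 0 if control

# Observed outcome: y1 for treated units, y0 for control units
y <- d * y1 + (1 - d) * y0

# Unit-level treatment effects (never observable in a real experiment)
tau <- y1 - y0

ate  <- mean(tau)                              # "true" ATE
att  <- mean(tau[d == 1])                      # "true" ATT
digm <- mean(y[d == 1]) - mean(y[d == 0])      # naive difference in group means
selection_bias <- mean(y0[d == 1]) - mean(y0[d == 0])

ate
att
digm
selection_bias
att + selection_bias   # should equal the naive difference in group means
```

The final line confirms the decomposition: the ATT plus the selection bias reproduces the naive estimate.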