8 Sampling, Uncertainty, and Confidence Intervals

8.1 Overview

In the lecture this week, we changed focus from thinking about causal inference, and started instead to think about statistical inference. In particular, we considered the problem of making inferences about population-level quantities of interest when all we can observe is a sample of data that is taken from that population. We discussed sampling variation, and the concept of the sampling distribution: the distribution of estimates that would result from repeatedly sampling from the population and calculating our quantity of interest for every sample. We saw that while we cannot observe the sampling distribution directly, we know that it will always be approximately normally distributed (thanks to the central limit theorem), and that we can estimate the standard deviation of the sampling distribution by calculating the standard error. Finally, we saw that we can quantify the uncertainty in our estimates by constructing a confidence interval.

In seminar this week, we will:

  1. Practice calculating the difference in means, the standard error, and various confidence intervals.
  2. Practice interpreting those quantities.
  3. Learn to use the t.test() function to calculate confidence intervals.

Before coming to the seminar

  1. Please read chapter 6, “Probability” and chapter 7, “Uncertainty” in Quantitative Social Science: An Introduction

8.2 Seminar

In this exercise, we examine the hypothesis that family members of people who died in the terrorist attacks of 9/11 subsequently became more politically engaged. Estimating the specific and long-term political effects of terrorist attacks on the families of victims is important for understanding the effects of terrorist activities. This exercise is based on: Hersh, E. D. 2013. “Long-Term Effect of September 11 on the Political Behavior of Victims’ Families and Neighbors.Proceedings of the National Academy of Sciences 110(52): 20959–63.

Download the data file, victims.csv, from the link above, and store it in your data folder as you have done in previous weeks. Then load the data into R:

victims <- read.csv("data/victims.csv")

The data contains observations of 7054 individuals and includes the following variables:

Name Description
treatment Families of victims (1) vs control group (0)
ge20xx Voting in the 20xx general election (1=voted, 0 = did not vote)
fam.members Number of family members living with voter at their address
age Voter’s age
pct.white Proportion of non-Hispanic white voters living on the same block
median.income Median income of voters living on the same block
female 1 if the voter is female, 0 otherwise

The treatment group in this study (i.e. observations where treatment = 1) are individuals living in New York who were related to people who died in the 9/11 attacks. The control group (treatment = 0) is also made up of individuals living in New York who were similar to the treated group with regard to demographics, prior political activities, and family and neighborhood characteristics, but who had not had family members die in the attacks.

For the purposes of this exercise, we will treat the data as being a random sample which is drawn from a broader population of units.

Question 1

Use the difference in means to estimate the average treatment effect of being a family member of a victim on voter turnout in 2000 (i.e. the year before the 9/11 attacks). Is it always the case that the difference in means that you calculate in this sample will be equal to the difference in means in the population? Why or why not?

Reveal answer

## Difference in means
treat_mean_00 <- mean(victims$ge2000[victims$treatment == 1])
control_mean_00 <- mean(victims$ge2000[victims$treatment == 0])
diff_00 <- treat_mean_00 - control_mean_00
## [1] -0.001831524

The difference in means here suggests that turnout was 0.18 percentage points lower for the treated group than the control group.

No, sampling variation means that the difference in means that we calculate in any particular sample will be different from the difference in means that exists in the population. Sometimes we will overestimate the difference in means (it will be too big), and sometimes we will underestimate the difference in means (it will be too small).

Question 2

Calculate the following quantities:

a. The standard error of the difference in means

Reveal answer

The formula for the standard error of the difference in means is \(SE(\hat{Y}_{X=1} - \hat{Y}_{X=0}) = \sqrt{\frac{Var(Y_{X=1})}{n_{X=1}} + \frac{Var(Y_{X=0})}{n_{X=0}}}\)

## Standard error
treat_var_00 <- var(victims$ge2000[victims$treatment == 1])
control_var_00 <- var(victims$ge2000[victims$treatment == 0])

treat_n_00 <- sum(victims$treatment == 1)
control_n_00 <- sum(victims$treatment == 0)

st_err_00 <- sqrt(treat_var_00/treat_n_00 + control_var_00/control_n_00)
## [1] 0.01550951

b. The 95% confidence interval, using the appropriate critical values from the standard normal distribution.

Reveal answer

The formula for the 95% confidence interval is \(\bar{Y}_{X = 1} - \bar{Y}_{X = 0} \pm 1.96 * SE(\bar{Y}_{X = 1} - \bar{Y}_{X = 0})\)

## 95% confidence interval
upper_00_95 <- diff_00 + 1.96 * st_err_00
lower_00_95 <- diff_00 - 1.96 * st_err_00
## [1] 0.02856712
## [1] -0.03223017

c. The 99% confidence interval, using the appropriate critical values from the standard normal distribution.

Reveal answer

The formula for the 99% confidence interval is \(\bar{Y}_{X = 1} - \bar{Y}_{X = 0} \pm 2.58 * SE(\bar{Y}_{X = 1} - \bar{Y}_{X = 0})\)

## 99% confidence interval
upper_00_99 <- diff_00 + 2.58 * st_err_00
lower_00_99 <- diff_00 - 2.58 * st_err_00
## [1] 0.03818302
## [1] -0.04184607

d. Provide a brief interpretation, with a particular focus on the meaning of the confidence intervals you just computed.

Reveal answer

The difference in means for the year 2000, before the 9/11 attacks took place, functions as a “placebo” test in this example. That is, there is no reason to think that the family members of 9/11 victims should differ in their political behaviour in elections before 9/11 from individuals who did not know someone who died in the attacks. In other words, if the treatment and control groups were similar prior to the administration of treatment, then the treatment effect for 2000 should be equal to zero. Although the difference in means calculation shows a very small difference between treatment and control groups in 2000 (i.e. the difference in means is not exactly zero in this sample), the 95% confidence interval ranges from -0.032 to 0.029. As this interval includes the value of zero, we conclude that there is no significant difference between treatment and control turnout rates in the election in 2000.

Question 3

Repeat the steps above, again comparing treatment and control observations and calculating the 95% confidence interval, but in this case for turnout the 2002 election. What is the substantive conclusion that you draw from this analysis?

Reveal answer

## Difference in means
treat_mean_02 <- mean(victims$ge2002[victims$treatment == 1])
control_mean_02 <- mean(victims$ge2002[victims$treatment == 0])
diff_02 <- treat_mean_02 - control_mean_02

## Standard error
treat_var_02 <- var(victims$ge2002[victims$treatment == 1])
control_var_02 <- var(victims$ge2002[victims$treatment == 0])

treat_n_02 <- sum(victims$treatment == 1)
control_n_02 <- sum(victims$treatment == 0)

st_err_02 <- sqrt(treat_var_02/treat_n_02 + control_var_02/control_n_02)

## 95% confidence interval
upper_02_95 <- diff_02 + 1.96 * st_err_02
lower_02_95 <- diff_02 - 1.96 * st_err_02
## [1] 0.04921145
## [1] -0.01115881

Consistent with the hypothesis, we find a positive difference in means for the 2002 election. On average, the turnout rate of individuals in the control group was 37.74%, whereas the turnout rate of individuals who were related to people who had died in the 9/11 attacks was 39.64%, implying a small positive sample average treatment effect.

However, the confidence interval that we have calculated ranges from -0.01 to 0.05 implying that we cannot be confident that the true treatment effect in the population is positive.

Question 4

a. Calculate the difference in mean turnout (and the associated 95% confidence intervals) between treatment and control units for all other election years in the data (2004, 2006, 2008, 2010, and 2012).

Rather than calculating the confidence intervals “by hand” as you did above, here use the t.test() function. The main arguments for this function are given in the table below:

Arguments Description
x The outcome values for one group of observations.
y The outcome values for another group of observations.
conf.level The level of confidence that we want to use to construct the confidence intervals.

Note that you can extract the estimated confidence intervals from the output of t.test() by using the $ sign. For instance, if your save the output of the function as t_test_output, then you can extract the confidence intervals by using t_test_output$conf.int.

Reveal answer

First, let’s confirm that the t.test() function provides the same results for the confidence intervals as those we calculated above:

t_test_00 <- t.test(x = victims$ge2000[victims$treatment == 1],
       y = victims$ge2000[victims$treatment == 0],
       conf.level = 0.95)


## [1] -0.03225057  0.02858752
## attr(,"conf.level")
## [1] 0.95
## [1] -0.03223017
## [1] 0.02856712

Yes, the confidence intervals are the same up to the 4th decimal place. As we discussed in the lecture, the small differences in the values after that point come from the fact that the t.test() function assumes that the sampling distribution for the difference in means follows a “t-distribution”, whereas we assumed that the sampling distribution follows a normal distribution. When the sample size is large, as it is here, these two approaches will always result in very similar values.

Now let’s calculate the confidence intervals for all the other years in the sample:

t_test_02 <- t.test(x = victims$ge2002[victims$treatment == 1],
       y = victims$ge2002[victims$treatment == 0],
       conf.level = 0.95)

t_test_04 <- t.test(x = victims$ge2004[victims$treatment == 1],
       y = victims$ge2004[victims$treatment == 0],
       conf.level = 0.95)

t_test_06 <- t.test(x = victims$ge2006[victims$treatment == 1],
       y = victims$ge2006[victims$treatment == 0],
       conf.level = 0.95)

t_test_08 <- t.test(x = victims$ge2008[victims$treatment == 1],
       y = victims$ge2008[victims$treatment == 0],
       conf.level = 0.95)

t_test_10 <- t.test(x = victims$ge2010[victims$treatment == 1],
       y = victims$ge2010[victims$treatment == 0],
       conf.level = 0.95)

t_test_12 <- t.test(x = victims$ge2012[victims$treatment == 1],
       y = victims$ge2012[victims$treatment == 0],
       conf.level = 0.95)
## [1] -0.01117918  0.04923182
## attr(,"conf.level")
## [1] 0.95
## [1] -0.03254799  0.02752645
## attr(,"conf.level")
## [1] 0.95
## [1] -0.007044625  0.053736783
## attr(,"conf.level")
## [1] 0.95
## [1] -0.008306075  0.051660301
## attr(,"conf.level")
## [1] 0.95
## [1] 0.007591085 0.068989192
## attr(,"conf.level")
## [1] 0.95
## [1] 0.000505756 0.061804922
## attr(,"conf.level")
## [1] 0.95

b. Interpret these results, commenting both on the estimated difference in means in each year, and also on the confidence intervals you constructed. What do these results suggest about the effects of terrorism on the political behaviour of family members of those who are killed in terrorist attacks?

Reveal answer

With the exception of the 2004 election, we observe a consistently positive effect of the attacks on voter turnout, and it does not appear to decay with time. Subject to the validity of the difference in means as a strategy for estimating causal effects (that is, subject to the assumption that the treatment and control groups were balanced prior to the administration of the treatment), this analysis lends support to the hypothesis that people who are personally affected by terrorist attacks become more engaged in political affairs than people who are not personally affected.

However, the confidence intervals for the 2002, 2004, 2006, and 2008 elections all include the value of 0. Since these intervals include zero, this implies that the turnout rates of treatment and control groups do not differ significantly from each other in these election years. In other words, this means that it is plausible that the causal effect in the population may not be positive. The only intervals that include only positive values are for the years of 2010 and 2012, nearly ten years after the attacks.

Question 5 (Hard question)

a. Use the code below to construct a data.frame which includes one observation per election year, and where the variables are the estimated difference in means in each year, and the upper and lower bounds of the confidence interval in each year. To create variables, use the c() function which concatenates many values together to form a vector.

# Create a variable for the years in the data
years <- c(2000, 2002, 2004, 2006, 2008, 2010, 2012)

# Calculate the difference in means for all remaining years
diff_04 <- mean(victims$ge2004[victims$treatment == 1]) - mean(victims$ge2004[victims$treatment == 0])
diff_06 <- mean(victims$ge2006[victims$treatment == 1]) - mean(victims$ge2006[victims$treatment == 0])
diff_08 <- mean(victims$ge2008[victims$treatment == 1]) - mean(victims$ge2008[victims$treatment == 0])
diff_10 <- mean(victims$ge2010[victims$treatment == 1]) - mean(victims$ge2010[victims$treatment == 0])
diff_12 <- mean(victims$ge2012[victims$treatment == 1]) - mean(victims$ge2012[victims$treatment == 0])

# Create a variable that includes the difference in means
difference_in_means <- c(diff_00, diff_02, diff_04, diff_06, diff_08, diff_10, diff_12)

# Create a variable that includes the *lower* bound of the confidence interval for each year
lower_ci <- c(t_test_00$conf.int[1], t_test_02$conf.int[1], t_test_04$conf.int[1], t_test_06$conf.int[1], t_test_08$conf.int[1], t_test_10$conf.int[1], t_test_12$conf.int[1])

# Create a variable that includes the *upper* bound of the confidence interval for each year
upper_ci <- c(t_test_00$conf.int[2], t_test_02$conf.int[2], t_test_04$conf.int[2], t_test_06$conf.int[2], t_test_08$conf.int[2], t_test_10$conf.int[2], t_test_12$conf.int[2])

# Combine all variables into a single data.frame to use with the plot() function
all_cis <- data.frame(years, difference_in_means, lower_ci, upper_ci)
##   years difference_in_means     lower_ci   upper_ci
## 1  2000        -0.001831524 -0.032250566 0.02858752
## 2  2002         0.019026320 -0.011179176 0.04923182
## 3  2004        -0.002510773 -0.032547992 0.02752645
## 4  2006         0.023346079 -0.007044625 0.05373678
## 5  2008         0.021677113 -0.008306075 0.05166030
## 6  2010         0.038290139  0.007591085 0.06898919
## 7  2012         0.031155339  0.000505756 0.06180492

b. Use the all_cis object with the plot() function to create a plot with years on the x-axis and difference_in_means on the y-axis.

Then, use the segments() function to draw the confidence intervals of the plot. segments() can be used to draw vertical lines on plots. It takes three main arguments. x0 tells R where on the x-axis the line should be drawn. y0 tells R where on the y-axis the line should begin. y1 tells R where on the y-axis the line should end.

Finally, add a horizontal line which intersects the y-axis at the value of 0 using the abline() function.

Reveal answer

plot(x = all_cis$years, 
     y = all_cis$difference_in_means, 
     xlab = "Year",
     ylab = "Average treatment effect",
     ylim = c(-0.04,0.07)) # the ylim argument sets the range of the yaxis to be wide enough to contain all the confidence intervals stored in the all_cis data.frame
segments(x0 = all_cis$years, # x0 contains the x-axis coordinates of the lines
         y0 = all_cis$lower_ci, # y0 contains the y-axis coordinate of the lower bound of the confidence interval
         y1 = all_cis$upper_ci) # y1 contains the y-axis coordinate of the upper bound of the confidence interval
abline(h = 0) # This code draws a horizontal line at the value of 0 on the y-axis

c. Interpret the plot you have created.

Reveal answer

The plot simply visualises the difference in means and the confidence intervals that you created in question 4: it shows that the average treatment effect is positive in all years except for 2004, but also that the confidence intervals overlap with zero in all years except for 2010 and 2012.

8.3 Homework

Question 1

To examine the validity of the difference in means estimates that you used above by checking whether possible confounders are balanced between the treatment and control groups. In particular, compare the difference in means for fam.members, age, pct.white, and median.income between the treatment and control groups. For each of these comparisons, also calculate the respective 95% confidence intervals. Provide a brief interpretation of the results. What can you conclude about the potential for confouding between the treatment and control groups?

Reveal answer

diff_fam_members <- mean(victims$fam.members[victims$treatment == 1]) - mean(victims$fam.members[victims$treatment == 0])
t_test_fam_members <- t.test(x = victims$fam.members[victims$treatment == 1],
                             y = victims$fam.members[victims$treatment == 0],
                             conf.level = 0.95)
## [1] -0.0598509
## [1] -0.14165792  0.02195612
## attr(,"conf.level")
## [1] 0.95
diff_age <- mean(victims$age[victims$treatment == 1]) - mean(victims$age[victims$treatment == 0])

t_test_age <- t.test(x = victims$age[victims$treatment == 1],
                             y = victims$age[victims$treatment == 0],
                             conf.level = 0.95)
## [1] -0.1607827
## [1] -1.0928107  0.7712453
## attr(,"conf.level")
## [1] 0.95
diff_pct.white <- mean(victims$pct.white[victims$treatment == 1]) - mean(victims$pct.white[victims$treatment == 0])
t_test_pct.white <- t.test(x = victims$pct.white[victims$treatment == 1],
                             y = victims$pct.white[victims$treatment == 0],
                             conf.level = 0.95)
## [1] 0.001567236
## [1] -0.01452926  0.01766374
## attr(,"conf.level")
## [1] 0.95
diff_median.income <- mean(victims$median.income[victims$treatment == 1]) - mean(victims$median.income[victims$treatment == 0])
t_test_median.income <- t.test(x = victims$median.income[victims$treatment == 1],
                             y = victims$median.income[victims$treatment == 0],
                             conf.level = 0.95)

## [1] 522.727
## [1] -1487.766  2533.220
## attr(,"conf.level")
## [1] 0.95

If the treatment and control groups are properly randomized, the means on the six possible confounders should be similar across the two groups. While none of the point-estimates are exactly zero, the value of zero is included in each of the six confidence intervals. That means that, at the 95% confidence level, we cannot be confident that the true population difference in means is not equal to zero. In other words, the magnitude of observed differences between them is consistent with what we would expect to see purely by chance. This suggests that confounding is not a concern here, and that the difference in means calculated above are unbiased estimates of the true causal effect in each election.

Question 2 (Hard question)

For this question you will need to use the sample() function. This function randomly samples values from a vector. For instance, if I have a vector named my_vector, I can sample 100 values from that vector with:

sample(x = my_vector, size = 100, replace = TRUE)

Where the x argument is the vector from which I would like to sample, the size argument says how many observations I would like to sample, and the replace argument says whether I would like to sample from the vector with or without replacement.

a. Plot a histogram of the pct.white variable using the hist() function.

Reveal answer


b. Use the sample() function to sample 200 values from the pct.white variable, and store the output as an object called pct_white_sample. Make sure that you set replace = TRUE in the sample function. What is the mean of the sampled values?

Reveal answer

pct_white_sample <- sample(victims$pct.white, 200, replace = TRUE)

## [1] 0.7251459

c. Create a function which can be used to repeat this sampling process by adapting the code below:

sample_pct_white <- function(){
  # Add code here which samples 200 values from the pct white variable and stores the output as an object called tmp_sample
  # Add code here which calculates the mean of tmp_sample

Reveal answer

sample_pct_white <- function(){
  tmp_sample <- sample(victims$pct.white, 200, replace = TRUE)

d. Use the replicate() function to repeat the sampling process 1000 times. Store the 1000 mean values in an object called sampling_distribution. (Look at this week’s lecture notes to see an example of the replicate function.)

Reveal answer

sampling_distribution <- replicate(1000, sample_pct_white())

e. Create a histogram of the sampling_distribution. What is the shape of the distribution? Why does the histogram take this distinctive shape?

Reveal answer


The distribution follows approximately a normal distribution. This is guaranteed by the central limit theorem, despite the fact that the underlying distribution of pct_white is not normally distributed, so long as the sample size is large (here it is 200).