3 Describing quantitative data

3.1 Overview

In the lecture this week, we discuss descriptive statistics and visualisation of data, starting with single variables and then looking at examples with multiple variables.

In seminar this week, we will cover the following topics:

  1. Calculating variable summaries, tables and correlations
  2. Working with missing data
  3. Scatterplots, boxplots, histograms

We will also spend time thinking about the assumptions required to make causal claims from analyses of observational data.

Before coming to the seminar

  1. Please read sections 3.1 to 3.6 of chapter 3, “Measurement”, in Quantitative Social Science: An Introduction

3.2 Seminar

Can transphobia be reduced through in-person conversations and perspective-taking exercises, or active processing? Following up on previous research that had been shown to be fabricated, two researchers conducted a door-to-door canvassing experiment in South Florida targeting anti-transgender prejudice in order to answer this question. Canvassers held single, approximately 10-minute conversations with voters that encouraged actively taking the perspective of others, to see if these conversations could markedly reduce prejudice.

Broockman, David and Joshua Kalla. 2016. “Durably reducing transphobia: a field experiment on door-to-door canvassing.” Science, Vol. 352, No. 6282, pp. 220-224.

In the experiment, the authors first recruited registered voters (\(n=68378\)) via mail for an online baseline survey, presented as the first in a series of surveys. They then randomly assigned respondents of this baseline survey (\(n=1825\)) to either a treatment group targeted with the intervention (\(n=913\)) or a placebo group targeted with a conversation about recycling (\(n=912\)). For the intervention, 56 canvassers first knocked on voters’ doors unannounced. Then, canvassers asked to speak with the subject on their list and confirmed the person’s identity if the person came to the door. A total of several hundred individuals (\(n=501\)) came to their doors in the two conditions. For logistical reasons unrelated to the original study, we further reduce this dataset to \(n=488\), which is the full sample that appears in the transphobia.csv data.

The canvassers then engaged in a series of strategies previously shown to facilitate active processing under the treatment condition: canvassers informed voters that they might face a decision about the issue (whether to vote to repeal the law protecting transgender people); canvassers asked voters to explain their views; and canvassers showed a video that presented arguments on both sides. Canvassers defined the term “transgender” at this point and, if they were transgender themselves, noted this. The canvassers next attempted to encourage “analogic perspective-taking”. Canvassers first asked each voter to talk about a time when they themselves were judged negatively for being different. The canvassers then encouraged voters to see how their own experience offered a window into transgender people’s experiences, hoping to facilitate voters’ ability to take transgender people’s perspectives. The intervention ended with another attempt to encourage active processing by asking voters to describe if and how the exercise changed their mind. All of the preceding steps together constitute the “treatment.”

The placebo group was reminded that recycling was most effective when everyone participates. The canvassers talked about how they were working on ways to decrease environmental waste and asked the voters who came to the door about their support for a new law that would require supermarkets to charge for bags instead of giving them away for free. This was meant to mimic the effect of canvassers interacting with the voters in face-to-face conversation on a topic different from transphobia.

The authors then asked the individuals who came to their doors in either condition (\(n=488\)) to complete follow-up online surveys via email, presented as a continuation of the baseline survey. These follow-up surveys began 3 days, 3 weeks, 6 weeks, and 3 months after the intervention. For the purposes of this exercise, we will be using the tolerance.t# variables (where # is 0 through 4, with 0 denoting the baseline survey) as the main outcome variables of interest. The authors constructed these dependent variables tolerance.t# as indexes by using several other measures that are not included in this exercise. In building this index, the authors scaled the variables such that they have a mean of 0 and standard deviation of 1 for the placebo group. Higher values indicate higher tolerance and lower values indicate lower tolerance, relative to the placebo group.
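To make the scaling concrete, here is a minimal sketch on made-up numbers. The vectors raw_index and placebo are hypothetical and are not part of transphobia.csv; the authors' actual index is built from several survey items that are not included here. The sketch only illustrates what it means to scale a score so that the placebo group has mean 0 and standard deviation 1:

```r
# Hypothetical raw index scores and group indicator (not from transphobia.csv)
raw_index <- c(2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 5.0, 1.5)
placebo   <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Mean and standard deviation of the placebo group only
placebo_mean <- mean(raw_index[placebo])
placebo_sd   <- sd(raw_index[placebo])

# Rescale everyone relative to the placebo group
scaled_index <- (raw_index - placebo_mean) / placebo_sd

mean(scaled_index[placebo])  # 0, up to rounding error
sd(scaled_index[placebo])    # 1, up to rounding error
```

On this scale, a value of 0.5 would mean "half a placebo-group standard deviation above the placebo-group mean".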

The data set is the file transphobia.csv. Variables that begin with vf_ come from the voter file. Variables in this dataset are described below:

Name                 Description
vf_age               Age
vf_party             Party: D=Democrats, R=Republicans and N=Independents
vf_racename          Race: African American, Caucasian, Hispanic
vf_female            Gender: 1 if female, 0 if male
treat_ind            Treatment assignment: 1=treatment, 0=placebo
treatment.delivered  Intervention was actually delivered (=TRUE) vs. was not (=FALSE)
tolerance.t0         Outcome tolerance variable at Baseline
tolerance.t1         (see above) Captured at 3 days after Baseline
tolerance.t2         (see above) Captured at 3 weeks after Baseline
tolerance.t3         (see above) Captured at 6 weeks after Baseline
tolerance.t4         (see above) Captured at 3 months after Baseline

Loading data

Once you have downloaded the data, put the .csv into the data folder that you created last week, and then create a new R script for this week. Then load the data using the read.csv() function as we did last week:

transphobia <- read.csv("data/transphobia.csv")

Remember, the above command assumes that your working directory is set to the same folder where you have created your new R script, and that the data file is saved in a folder named data within the folder where your script lives. If you are having trouble setting this up, revisit last week’s seminar assignment.
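If read.csv() reports that it cannot open the file, a quick check along the following lines may help. This is only a sketch: the name data_path is introduced here for illustration, and the code assumes the folder layout described above.

```r
# Show the folder R is currently working in
getwd()

# Path to the data file, relative to the working directory
data_path <- file.path("data", "transphobia.csv")

# Only try to load the file if R can see it from the working directory
if (file.exists(data_path)) {
  transphobia <- read.csv(data_path)
} else {
  message("Cannot find ", data_path, " from ", getwd())
}
```

If the message appears, the working directory is probably not the folder that contains your script and the data folder.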

Question 1

For this question, we will learn to use the summary() function, which provides basic summary statistics for one or more variables. Depending on whether the variable is interval level or nominal/ordinal, you will get a different looking summary:

summary(transphobia$vf_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   36.00   51.00   49.32   62.00   90.00
summary(transphobia$vf_party)
##    Length     Class      Mode 
##       488 character character

You can also use the summary() function on the entire data frame transphobia to get summaries of every variable at once.
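The summary of a character variable such as vf_party is not very informative, because it only reports the length and type. A minimal sketch of two ways to get category counts instead is shown below, using a small made-up data frame (toy is hypothetical and is not the seminar data):

```r
# Hypothetical toy data (not the seminar data)
toy <- data.frame(
  age   = c(25, 40, 63, 31),
  party = c("D", "R", "D", "N")
)

summary(toy)                   # one summary per column
party_counts <- table(toy$party)
party_counts                   # counts per category
summary(factor(toy$party))     # the same counts, via a factor
```

Both approaches should give the same counts; table() is the one used in the rest of this seminar.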

The first thing we are going to do, building on ideas from last week, is check whether it looks like the experiment achieved balance between treatment and control on some of the other variables. Note that there are two treatment variables in the data set, treat_ind and treatment.delivered. The former is whether someone was assigned to the treatment group; the latter is whether they actually received the treatment. Using the table() command that we learned previously, we can calculate the cross-tabulation of these two variables:

table(transphobia$treat_ind,transphobia$treatment.delivered)
##    
##     FALSE TRUE
##   0   241   11
##   1    51  185

When you use the table() command with two arguments, as in the above example, the first variable forms the rows and the second variable forms the columns. So the top right cell, for example, indicates that there were 11 people for whom treat_ind = 0 and treatment.delivered = TRUE. That is, people who received the treatment even though they were not supposed to. (Note: this happened because some of the canvassers made errors in following the experimental protocol.)

We can translate these into proportions using the prop.table() command:

prop.table(table(transphobia$treat_ind,transphobia$treatment.delivered))
##    
##          FALSE       TRUE
##   0 0.49385246 0.02254098
##   1 0.10450820 0.37909836

By default, the prop.table command calculates proportions of the entire table. Frequently it is useful to just calculate these within rows or columns:

prop.table(table(transphobia$treat_ind,transphobia$treatment.delivered),1) # 1 if by row, 2 if by column
##    
##          FALSE       TRUE
##   0 0.95634921 0.04365079
##   1 0.21610169 0.78389831

We can see from these tables that there was non-compliance with the experimental protocol. Not everyone who was supposed to receive the treatment did so. (What proportion did, according to the above output?) Some people who were not supposed to receive the treatment nonetheless did so. (What proportion, according to the above output?)
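If you are unsure which margin you have used, one check is to see which sums equal 1. The sketch below uses a small made-up table of counts (the object counts is hypothetical, not the seminar data) to illustrate the difference between the two margins:

```r
# Hypothetical 2 x 2 table of counts (not the seminar data)
counts <- matrix(c(8, 2,
                   3, 7),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(assigned  = c("0", "1"),
                                 delivered = c("FALSE", "TRUE")))

by_row <- prop.table(counts, 1)  # proportions within each row
by_col <- prop.table(counts, 2)  # proportions within each column

rowSums(by_row)  # each row sums to 1
colSums(by_col)  # each column sums to 1
```

Row proportions answer "of those assigned to a group, what share had the treatment delivered?", while column proportions answer "of those with a given delivery status, what share were assigned to each group?".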

Use the prop.table() function to assess whether there is any difference in the proportion of Democrats (D), Independents (N) and Republicans (R):

  1. Between those who were supposed to receive treatment (treat_ind = 1) and those who were not (treat_ind = 0)?
  2. Between those who actually received treatment (treatment.delivered = TRUE) and those who did not (treatment.delivered = FALSE)?

Reveal answer

1

prop.table(table(transphobia$treat_ind,transphobia$vf_party),1)
##    
##             D         N         R
##   0 0.4642857 0.2539683 0.2817460
##   1 0.5296610 0.2372881 0.2330508

The proportion of Republicans was somewhat lower (0.23 vs 0.28), and the proportion of Democrats somewhat higher (0.53 vs 0.46), among those who were randomly assigned to receive treatment vs those who were not.

2

prop.table(table(transphobia$treatment.delivered,transphobia$vf_party),1)
##        
##                 D         N         R
##   FALSE 0.4417808 0.2705479 0.2876712
##   TRUE  0.5765306 0.2091837 0.2142857

The imbalance in treatment delivery was larger than the imbalance after treatment assignment. The proportion of Republicans was lower (0.21 vs 0.29), and the proportion of Democrats higher (0.58 vs 0.44), among those who actually received treatment vs those who did not.

Question 2

The data set only includes individuals for whom the baseline survey was successfully conducted; however, the study design involved four follow-up waves of the survey. Not all participants in the study were successfully contacted for all of these follow-up waves. As a result, there is missing data in the data set. You can see this immediately if you use the head() command on the data set. There are NA values for the measure of tolerance in all post-baseline waves for the third observation in the data set, and for the final wave t4 for the first observation in the data set.

head(transphobia)
##   vf_age vf_party      vf_racename vf_female tolerance.t0 tolerance.t1
## 1     29        D African American         0  -1.94030332   -2.2304567
## 2     35        D African American         1  -0.08454013    0.8072470
## 3     72        N African American         1  -0.08454013           NA
## 4     63        D African American         1  -0.08454013   -0.0961481
## 5     51        N        Caucasian         1   0.14480496   -0.3215813
## 6     26        D African American         1  -0.07212884   -0.0961481
##   tolerance.t2 tolerance.t3 tolerance.t4 treatment.delivered treat_ind
## 1 -0.679679229   -1.1695175           NA               FALSE         0
## 2  0.742843981    0.9462945   0.27998715                TRUE         1
## 3           NA           NA           NA               FALSE         0
## 4 -0.157508021    0.0242335  -0.03044013                TRUE         1
## 5 -0.156093923   -0.2441662   0.28148349               FALSE         0
## 6 -0.008982646   -0.1092216  -0.31239390               FALSE         0

You cannot use the operator == to test for the value NA in R. For example, the command transphobia$tolerance.t1 == NA will not tell you which observations are NA for the variable transphobia$tolerance.t1, but will simply return all NA values. To identify which observations have NA values, you must use the is.na() function.

head(transphobia$tolerance.t1)
## [1] -2.2304567  0.8072470         NA -0.0961481 -0.3215813 -0.0961481
head(is.na(transphobia$tolerance.t1))
## [1] FALSE FALSE  TRUE FALSE FALSE FALSE

Thus, if you wanted to assess whether men or women were more likely to go missing in wave 1 (post-baseline), you could do the following:

prop.table(table(transphobia$vf_female,is.na(transphobia$tolerance.t1)),1)
##    
##         FALSE      TRUE
##   0 0.8125000 0.1875000
##   1 0.8892857 0.1107143

While 19% of men (vf_female = 0) from the baseline survey were missing at wave 1, only 11% of women (vf_female = 1) were missing at wave 1, suggesting different rates of attrition.
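Because is.na() returns TRUE/FALSE values, it can also be summed or averaged to count missing values. The following is a minimal sketch on a made-up data frame (toy and its columns are hypothetical, not the seminar data):

```r
# Hypothetical toy data with some missing values (not the seminar data)
toy <- data.frame(
  t0 = c(0.1, -0.3, 0.5, 0.2),
  t1 = c(0.2, NA, 0.4, NA),
  t2 = c(NA, NA, 0.6, 0.1)
)

toy$t1 == NA          # comparing with NA gives NA for every element

n_missing    <- colSums(is.na(toy))   # number of NA values per column
prop_missing <- colMeans(is.na(toy))  # proportion of NA values per column

n_missing
prop_missing
```

This works because R treats TRUE as 1 and FALSE as 0 when doing arithmetic.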

Use the functions described above and the variables tolerance.t1, tolerance.t2, tolerance.t3 and tolerance.t4 to assess whether:

  1. Men or women were more likely to be missing in each post-baseline wave of the survey
  2. Democrats, Republicans or Independents were more likely to be missing in each post-baseline wave of the survey

Reveal answer

1

prop.table(table(transphobia$vf_female,is.na(transphobia$tolerance.t1)),1)
prop.table(table(transphobia$vf_female,is.na(transphobia$tolerance.t2)),1)
prop.table(table(transphobia$vf_female,is.na(transphobia$tolerance.t3)),1)
prop.table(table(transphobia$vf_female,is.na(transphobia$tolerance.t4)),1)
##    
##         FALSE      TRUE
##   0 0.8125000 0.1875000
##   1 0.8892857 0.1107143
##    
##         FALSE      TRUE
##   0 0.7740385 0.2259615
##   1 0.8035714 0.1964286
##    
##         FALSE      TRUE
##   0 0.7692308 0.2307692
##   1 0.8142857 0.1857143
##    
##         FALSE      TRUE
##   0 0.7355769 0.2644231
##   1 0.7928571 0.2071429

The proportion of men who failed to respond to subsequent waves of the survey is always higher than for women, but the difference doesn’t get any larger after the first wave t1.

2

prop.table(table(transphobia$vf_party,is.na(transphobia$tolerance.t1)),1)
prop.table(table(transphobia$vf_party,is.na(transphobia$tolerance.t2)),1)
prop.table(table(transphobia$vf_party,is.na(transphobia$tolerance.t3)),1)
prop.table(table(transphobia$vf_party,is.na(transphobia$tolerance.t4)),1)
##    
##         FALSE      TRUE
##   D 0.8512397 0.1487603
##   N 0.8750000 0.1250000
##   R 0.8492063 0.1507937
##    
##         FALSE      TRUE
##   D 0.7685950 0.2314050
##   N 0.8583333 0.1416667
##   R 0.7698413 0.2301587
##    
##         FALSE      TRUE
##   D 0.7685950 0.2314050
##   N 0.8416667 0.1583333
##   R 0.8015873 0.1984127
##    
##         FALSE      TRUE
##   D 0.7851240 0.2148760
##   N 0.8083333 0.1916667
##   R 0.6984127 0.3015873

The proportion of Independents who are missing in later waves is always lower than for those who are Democrats or Republicans, but the relative levels of missingness for the latter two groups are inconsistent. They are similar in t1 and t2, but then differ in opposite directions in t3 and t4.
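Repeating the same line for every wave works, but it is easy to make copy-and-paste errors. One possible alternative, sketched here on a made-up data frame (toy, waves and miss_by_group are hypothetical names, not part of the seminar data), is to loop over the wave names with sapply() and compute the proportion missing within each group with tapply():

```r
# Hypothetical toy data (not the seminar data)
toy <- data.frame(
  female = c(0, 0, 1, 1, 1, 0),
  y.t1   = c(NA, 0.2, 0.1, NA, 0.3, 0.5),
  y.t2   = c(NA, NA, 0.1, NA, 0.3, 0.5)
)

waves <- c("y.t1", "y.t2")

# For each wave, the proportion of missing values within each group
miss_by_group <- sapply(waves, function(w) {
  tapply(is.na(toy[[w]]), toy$female, mean)
})

miss_by_group  # rows are groups, columns are waves
```

Each column of the result corresponds to the TRUE column of one of the row-wise prop.table() outputs shown above.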

Question 3

Let’s look at the distribution of tolerance in the first wave, and how it differs between those who were supposed to receive treatment and those who were not (treat_ind).

boxplot(tolerance.t1~treat_ind,data=transphobia)

The median (the thick line in each box) seems to be higher in the treatment group than in the control group. The group means point in the same direction:

mean(transphobia$tolerance.t1[transphobia$treat_ind == 0],na.rm=TRUE)
## [1] 0.009213218
mean(transphobia$tolerance.t1[transphobia$treat_ind == 1],na.rm=TRUE)
## [1] 0.1535359
  1. Check whether the means between the treatment and control group were already different in the baseline survey before the treatment was delivered.
  2. Calculate the difference in means between treatment and control groups for all waves of the survey.

Reveal answer

1

mean(transphobia$tolerance.t0[transphobia$treat_ind == 0],na.rm=TRUE)
mean(transphobia$tolerance.t0[transphobia$treat_ind == 1],na.rm=TRUE)
## [1] 0.009480114
## [1] -0.03055836

They were different, but not by nearly as much as after the treatment. We will look at this in more detail in the homework.

2

mean(transphobia$tolerance.t0[transphobia$treat_ind == 1],na.rm=TRUE) - mean(transphobia$tolerance.t0[transphobia$treat_ind == 0],na.rm=TRUE)

mean(transphobia$tolerance.t1[transphobia$treat_ind == 1],na.rm=TRUE) - mean(transphobia$tolerance.t1[transphobia$treat_ind == 0],na.rm=TRUE)

mean(transphobia$tolerance.t2[transphobia$treat_ind == 1],na.rm=TRUE) - mean(transphobia$tolerance.t2[transphobia$treat_ind == 0],na.rm=TRUE)

mean(transphobia$tolerance.t3[transphobia$treat_ind == 1],na.rm=TRUE) - mean(transphobia$tolerance.t3[transphobia$treat_ind == 0],na.rm=TRUE)

mean(transphobia$tolerance.t4[transphobia$treat_ind == 1],na.rm=TRUE) - mean(transphobia$tolerance.t4[transphobia$treat_ind == 0],na.rm=TRUE)
## [1] -0.04003847
## [1] 0.1443226
## [1] 0.1173435
## [1] 0.2469228
## [1] 0.1762387

The effect of the treatment was persistent across waves / over time. This is one of the core results of this study.
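The five near-identical lines above could also be written with a small helper function. This is a sketch on made-up data: the function diff_in_means() and the data frame toy are hypothetical and only illustrate the pattern.

```r
# Hypothetical toy data (not the seminar data)
toy <- data.frame(
  treat = c(0, 0, 0, 1, 1, 1),
  y.t0  = c(0.0, 0.2, -0.2, 0.1, -0.1, 0.0),
  y.t1  = c(0.1, NA, -0.1, 0.4, 0.2, NA)
)

# Difference in means (treated minus control), ignoring missing values
diff_in_means <- function(y, treat) {
  mean(y[treat == 1], na.rm = TRUE) - mean(y[treat == 0], na.rm = TRUE)
}

# Apply the helper to each outcome column in turn
wave_diffs <- sapply(toy[, c("y.t0", "y.t1")], diff_in_means, treat = toy$treat)
wave_diffs
```

Note that na.rm = TRUE drops different respondents in different waves, so each difference is calculated on whoever responded in that wave.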

Question 4

One possibility is that the treatment could have been polarising, which is to say that it might not just change the average level of tolerance but also change the degree of dispersion in tolerance. Put differently, even if some people responded positively, others might have responded negatively. If we look at the histograms for tolerance.t1 among those assigned to treatment (treat_ind == 1) versus those who were not (treat_ind == 0), we see some evidence that this could be the case:

par(mfrow=c(2,1)) 
hist(transphobia$tolerance.t1[transphobia$treat_ind == 0])
hist(transphobia$tolerance.t1[transphobia$treat_ind == 1])

par(mfrow=c(1,1))

Note that the par() command above is used to set graphical parameters, in this case to put two plots in a 2 row x 1 column grid (mfrow=c(2,1)), and then to return to the original single plot setting (mfrow=c(1,1)). The par() command has a bewildering array of options which you can look up in its help file.
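A slightly more general pattern is to save the old settings when changing them, and restore them afterwards. The sketch below uses simulated values rather than the seminar data (x0 and x1 are hypothetical), and also gives both histograms the same x-axis range, which can make differences in spread easier to compare by eye:

```r
pdf(NULL)  # draw to a null device so no file is written; omit this line in RStudio

# Simulated values for two hypothetical groups (not the seminar data)
set.seed(1)
x0 <- rnorm(100)
x1 <- rnorm(100, mean = 0.2, sd = 1.2)

old_par <- par(mfrow = c(2, 1))   # par() returns the previous settings
shared_range <- range(c(x0, x1))  # same x-axis limits for both plots

hist(x0, xlim = shared_range, main = "Group 0")
hist(x1, xlim = shared_range, main = "Group 1")

par(old_par)                      # restore the previous layout
```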

The command sd() calculates the standard deviation for a set of values in R.
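As a reminder of what sd() computes, the following sketch reproduces it by hand on a small made-up vector (x is hypothetical). R's sd() uses the sample formula, dividing by n - 1 rather than n:

```r
# Hypothetical values (not the seminar data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Square root of the sum of squared deviations from the mean, divided by n - 1
by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1))

by_hand
sd(x)    # should match the by-hand value
```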

  1. Replace the mean() command in your code for Question 3 with sd(), and calculate the difference in standard deviations between treatment and control groups in wave 0 and in wave 1.

Reveal answer

1

sd(transphobia$tolerance.t0[transphobia$treat_ind == 1],na.rm=TRUE) - sd(transphobia$tolerance.t0[transphobia$treat_ind == 0],na.rm=TRUE)

sd(transphobia$tolerance.t1[transphobia$treat_ind == 1],na.rm=TRUE) - sd(transphobia$tolerance.t1[transphobia$treat_ind == 0],na.rm=TRUE)
## [1] 0.114463
## [1] 0.1405

There is a difference in standard deviation between the treatment group and the control group in wave 1, but it appears to have been present (if very slightly weaker) in the baseline wave before the treatment happened, so there is little evidence of a polarising effect of the treatment.

Question 5

How similar are individuals’ tolerance scores across different waves? We might look at this visually, by using a scatter plot with the tolerance score for one wave on the x-axis, and the tolerance score for another wave on the y-axis:

plot(transphobia$tolerance.t0,transphobia$tolerance.t1)

Clearly these are positively correlated. We can calculate the correlation coefficient \(\rho\) with the command:

cor(transphobia$tolerance.t0,transphobia$tolerance.t1,use="pairwise.complete.obs")
## [1] 0.8290495
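The use argument matters here because of the missing data. A minimal sketch on made-up vectors (x and y are hypothetical) shows what happens with and without it:

```r
# Hypothetical values with missing data (not the seminar data)
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, NA, 10)

cor(x, y)  # NA, because some values are missing

# Uses only the pairs where both x and y are observed (the first three)
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")
r_pairwise

cor(x[1:3], y[1:3])  # the same value, computed on the complete pairs directly
```

With only two variables, "pairwise.complete.obs" simply drops any observation where either value is missing.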

Both scatterplots and correlation coefficients compare two continuous/interval-level variables at a time, but sometimes it is useful to look at all the possible comparisons of a set of variables. In this case, we have five measurements of the tolerance scale, so we might want to look at all the possible pairwise comparisons at once. We can do that graphically using the pairs() command, and a new way of selecting subsets of variables by name from a data frame:

tolerance_allwaves <- transphobia[,c("tolerance.t0","tolerance.t1","tolerance.t2","tolerance.t3","tolerance.t4")]
pairs(tolerance_allwaves)

We can also create a matrix of pairwise correlation coefficients using the following command. Note that it is not necessary to wrap the cor() command in the round() command; the latter simply makes it easier to read the correlations by rounding them to two decimal places.

round(cor(tolerance_allwaves,use="pairwise.complete.obs"),2)
##              tolerance.t0 tolerance.t1 tolerance.t2 tolerance.t3 tolerance.t4
## tolerance.t0         1.00         0.83         0.81         0.83         0.82
## tolerance.t1         0.83         1.00         0.87         0.87         0.87
## tolerance.t2         0.81         0.87         1.00         0.92         0.90
## tolerance.t3         0.83         0.87         0.92         1.00         0.91
## tolerance.t4         0.82         0.87         0.90         0.91         1.00

These correlations provide an additional confirmation that the treatment did something enduring. The correlations between the baseline wave t0 and the other waves are generally lower than the correlations between the post-treatment waves t1, t2, t3 and t4. That is, there was more change in tolerance scores at the individual level between the baseline wave and the first post-treatment wave than there was between subsequent waves.

3.3 Homework

Question 1

Define new variables called missing_t1, missing_t2, etc for all of the post-treatment waves.

Use table() and prop.table() to assess whether a respondent going missing in later waves was associated with whether they were assigned to treatment.

Calculate the mean level of baseline wave tolerance among those who were missing in wave 1, and those who were not missing in wave 1. Why might it matter whether these are different?

Reveal answer

transphobia$missing_t1 <- is.na(transphobia$tolerance.t1)
transphobia$missing_t2 <- is.na(transphobia$tolerance.t2)
transphobia$missing_t3 <- is.na(transphobia$tolerance.t3)
transphobia$missing_t4 <- is.na(transphobia$tolerance.t4)

prop.table(table(transphobia$treat_ind,transphobia$missing_t1),1)
prop.table(table(transphobia$treat_ind,transphobia$missing_t2),1)
prop.table(table(transphobia$treat_ind,transphobia$missing_t3),1)
prop.table(table(transphobia$treat_ind,transphobia$missing_t4),1)

mean(transphobia$tolerance.t0[transphobia$missing_t1 == TRUE])
mean(transphobia$tolerance.t0[transphobia$missing_t1 == FALSE])
##    
##         FALSE      TRUE
##   0 0.8809524 0.1190476
##   1 0.8305085 0.1694915
##    
##         FALSE      TRUE
##   0 0.8015873 0.1984127
##   1 0.7796610 0.2203390
##    
##         FALSE      TRUE
##   0 0.8214286 0.1785714
##   1 0.7669492 0.2330508
##    
##         FALSE      TRUE
##   0 0.8055556 0.1944444
##   1 0.7288136 0.2711864
## [1] -0.2131327
## [1] 0.02415432

Respondents assigned to receive the treatment were more likely to be missing in all subsequent waves of the survey. The treatment effect on missingness varies between 2 and 8 percentage points across the different waves.

Respondents who were missing in wave 1 were substantially less tolerant in the baseline survey (by 0.24 points on the baseline tolerance scale). This is potentially a problem because if the treatment causes the less tolerant people to opt out of the subsequent waves of the survey, we might worry that any difference in tolerance between those who were treated and those who were not in subsequent waves is not because the treatment made people more tolerant, but instead because the treatment made the intolerant people drop out of the survey.

Question 2

As we saw, there were some issues with the implementation of the study, including some compliance problems with treatment, and attrition problems that we just examined. As an alternative way of analyzing the data, do the following.

First, subset the entire data set to use only the observations which were not missing at t1.

Second, using the variable treatment.delivered as the treatment variable, and tolerance.t0 and tolerance.t1 as the before and after outcomes, treat the study as a difference-in-differences design. Calculate the mean value of tolerance.t1 and tolerance.t0 for each value of treatment.delivered, and then calculate the difference in differences. Are the results generally similar to those we found previously?

Reveal answer

transphobia2 <- transphobia[transphobia$missing_t1 == FALSE,]

diff_in_diff <- (mean(transphobia2$tolerance.t1[transphobia2$treatment.delivered == TRUE]) - mean(transphobia2$tolerance.t0[transphobia2$treatment.delivered == TRUE])) - 
  (mean(transphobia2$tolerance.t1[transphobia2$treatment.delivered == FALSE]) - mean(transphobia2$tolerance.t0[transphobia2$treatment.delivered == FALSE]))

diff_in_diff
## [1] 0.2034032

The difference-in-differences is 0.203, which is slightly larger than the simple difference in means at wave 1, but broadly similar. Despite the non-compliance with the treatment and attrition, the evidence is pretty strong that there was a causal effect of the treatment.
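The arithmetic behind this calculation can be seen on a small made-up example (the data frame toy and its columns are hypothetical, not the seminar data). Because every row has both a before and an after value, the mean of the individual changes within a group equals the change in that group's means:

```r
# Hypothetical toy data with no missing values (not the seminar data)
toy <- data.frame(
  delivered = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  before    = c(0.0, 0.2, 0.1, 0.1, 0.3, 0.2),
  after     = c(0.4, 0.5, 0.6, 0.2, 0.3, 0.4)
)

# Change for each individual, then the average change within each group
change       <- toy$after - toy$before
group_change <- tapply(change, toy$delivered, mean)
group_change

# Difference in differences: change among treated minus change among untreated
did <- unname(group_change["TRUE"] - group_change["FALSE"])
did
```

Here the treated group's average change is 0.4 and the untreated group's is 0.1, so the difference in differences is 0.3.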

Question 3

Use the data set you constructed in Question 2, which only includes respondents who were not missing at wave 1. Construct a scatterplot with age (vf_age) on the x-axis and tolerance.t0 on the y-axis, using the observations for which treatment.delivered == TRUE. Calculate the correlation between age and tolerance.t0. Now, generate the same plot and correlation using tolerance.t1 for the same observations. What does the change or lack of change in this correlation tell us about the likely effect of the treatment on people of different ages?

Reveal answer

par(mfrow=c(1,2)) # two plots side by side
plot(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE],
     transphobia2$tolerance.t0[transphobia2$treatment.delivered == TRUE],
     main="Baseline Tolerance")
plot(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE],
     transphobia2$tolerance.t1[transphobia2$treatment.delivered == TRUE],
     main="Post-Treatment Wave 1 Tolerance")
par(mfrow=c(1,1)) # return to the single-plot layout

cor(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE],
     transphobia2$tolerance.t0[transphobia2$treatment.delivered == TRUE])

cor(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE],
     transphobia2$tolerance.t1[transphobia2$treatment.delivered == TRUE])
## [1] -0.2686877
## [1] -0.1847027

The correlation between age and tolerance is negative in the baseline wave. This indicates that older people tend to have lower measured tolerance than younger people. It appears that the treatment reduces the magnitude of this negative correlation between age and tolerance, which indicates that the effect of the treatment was more positive on older people than on younger people.