3 Describing quantitative data
In the lecture this week, we discuss descriptive statistics and visualisation of data, starting with single variables and then looking at examples with multiple variables.
In the seminar this week, we will cover the following topics:
- Calculating variable summaries, tables and correlations
- Working with missing data
- Scatterplots, boxplots, histograms
We will also spend time thinking about the assumptions required to make causal claims from analyses of observational data.
Before coming to the seminar
- Please read sections 3.1 to 3.6 of chapter 3, “Measurement”, in Quantitative Social Science: An Introduction
Can transphobia be reduced through in-person conversations and perspective-taking exercises, or active processing? Following up on previous research that had been shown to be fabricated, two researchers conducted a door-to-door canvassing experiment in South Florida targeting anti-transgender prejudice in order to answer this question. Canvassers held single, approximately 10-minute conversations with voters that encouraged actively taking the perspective of others, to see if these conversations could markedly reduce prejudice.
Broockman, David and Joshua Kalla. 2016. “Durably reducing transphobia: a field experiment on door-to-door canvassing.” Science, Vol. 352, No. 6282, pp. 220-224.
In the experiment, the authors first recruited registered voters (\(n=68378\)) via mail for an online baseline survey, presented as the first in a series of surveys. They then randomly assigned respondents to this baseline survey (\(n=1825\)) to either a treatment group targeted with the intervention (\(n=913\)) or a placebo group targeted with a conversation about recycling (\(n=912\)). For the intervention, 56 canvassers first knocked on voters’ doors unannounced. Then, canvassers asked to speak with the subject on their list and confirmed the person’s identity if the person came to the door. A total of 501 individuals (\(n=501\)) came to their doors in the two conditions. For logistical reasons unrelated to the original study, we further reduce this dataset to 488 observations (\(n=488\)), which is the full sample that appears in the
The canvassers then engaged in a series of strategies previously shown to facilitate active processing under the treatment condition: canvassers informed voters that they might face a decision about the issue (whether to vote to repeal the law protecting transgender people); canvassers asked voters to explain their views; and canvassers showed a video that presented arguments on both sides. Canvassers defined the term “transgender” at this point and, if they were transgender themselves, noted this. The canvassers next attempted to encourage “analogic perspective-taking”. Canvassers first asked each voter to talk about a time when they themselves were judged negatively for being different. The canvassers then encouraged voters to see how their own experience offered a window into transgender people’s experiences, hoping to facilitate voters’ ability to take transgender people’s perspectives. The intervention ended with another attempt to encourage active processing by asking voters to describe if and how the exercise changed their mind. Together, all of the steps above constitute the “treatment.”
The placebo group was reminded that recycling was most effective when everyone participates. The canvassers talked about how they were working on ways to decrease environmental waste and asked the voters who came to the door about their support for a new law that would require supermarkets to charge for bags instead of giving them away for free. This was meant to mimic the effect of canvassers interacting with the voters in face-to-face conversation on a topic different from transphobia.
The authors then asked the individuals who came to their doors in either condition (\(n=488\)) to complete follow-up online surveys via email presented as a continuation of the baseline survey. These follow-up surveys began 3 days, 3 weeks, 6 weeks, and 3 months after the intervention when the baseline survey was also conducted. For the purposes of this exercise, we will be using the
tolerance.t# variables (where
# is 0 through 4) as the main outcome variables of interest. The authors constructed these dependent variables
tolerance.t# as indexes by using several other measures that are not included in this exercise. In building this index, the authors scaled the variables such that they have a mean of 0 and standard deviation of 1 for the placebo group. Higher values indicate higher tolerance, lower values indicate lower tolerance relative to the placebo group.
The data set is the file
transphobia.csv. Variables that begin with
vf_ come from the voter file. Variables in this dataset are described below:
|treatment.delivered|Intervention was actually delivered (= TRUE)|
|tolerance.t0|Outcome tolerance variable at Baseline|
|tolerance.t1|(see above) Captured at 3 days after Baseline|
|tolerance.t2|(see above) Captured at 3 weeks after Baseline|
|tolerance.t3|(see above) Captured at 6 weeks after Baseline|
|tolerance.t4|(see above) Captured at 3 months after Baseline|
Once you have downloaded the data, put the
.csv into the
data folder that you created last week, and then create a new R script for this week. Then load the data using the
read.csv() function as we did last week:
Remember, the above command assumes that your working directory is set to the same folder where you have created your new R script, and that the data file is saved in a folder named
data within the folder where your script lives. If you are having trouble setting this up, revisit last week’s seminar assignment.
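A minimal sketch of the loading step, assuming the file sits at data/transphobia.csv relative to your working directory and that the data frame is called transphobia (the name used in the rest of this seminar). The file.exists() check is an addition for illustration and is not part of the original instructions:

```r
# Assumed location of the file, relative to the working directory
path <- "data/transphobia.csv"

if (file.exists(path)) {
  # Read the csv file into a data frame called transphobia
  transphobia <- read.csv(path)
  # Quick check that the data loaded: number of rows and columns
  print(dim(transphobia))
} else {
  # If the file is not found, show where R is currently looking
  message("Could not find ", path, " from working directory ", getwd())
}
```

If the message branch runs, the usual cause is that the working directory is not the folder containing your script.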
For this question, we will learn to use the
summary() function, which provides basic summary statistics for one or more variables. Depending on whether the variable is interval level or nominal/ordinal, you will get a different looking summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   18.00   36.00   51.00   49.32   62.00   90.00
##    Length     Class      Mode
##       488 character character
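The commands that produced the two summaries above are not shown. As a hedged sketch, the following uses a small made-up data frame called toy (hypothetical, standing in for transphobia) to show how summary() behaves for a numeric variable and for a character variable:

```r
# Toy data frame standing in for transphobia (values are made up)
toy <- data.frame(
  vf_age   = c(29, 35, 72, 63, 51, 26),
  vf_party = c("D", "D", "N", "D", "N", "R"),
  stringsAsFactors = FALSE
)

# Numeric variable: minimum, quartiles, median, mean and maximum
age_summary <- summary(toy$vf_age)
print(age_summary)

# Character variable: only length, class and mode are reported
party_summary <- summary(toy$vf_party)
print(party_summary)

# For categorical variables, table() is usually more informative
print(table(toy$vf_party))
```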
You can also use the
summary() function on the entire data frame
transphobia to get summaries of every variable at once.
The first thing we are going to do, building on ideas from last week, is check whether it looks like the experiment achieved balance between treatment and control on some of the other variables. Note that there are two treatment variables in the data set,
treat_ind and
treatment.delivered. The former is whether someone was assigned to the treatment group, the latter is whether they actually received the treatment. Using the
table() command that we learned previously, we can calculate the cross-tabulation of these two variables:
##    
##     FALSE TRUE
##   0   241   11
##   1    51  185
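The command behind a cross-tabulation like the one above is not shown. A minimal sketch, using a made-up data frame called toy (hypothetical) with the same two variable names:

```r
# Toy data (made up) with assignment and delivery indicators
toy <- data.frame(
  treat_ind           = c(0, 0, 0, 1, 1, 1, 1, 0),
  treatment.delivered = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# The first argument forms the rows, the second argument forms the columns
tab <- table(toy$treat_ind, toy$treatment.delivered)
print(tab)

# Individual cells can be picked out by row and column label
print(tab["0", "TRUE"])
```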
When you do the
table() command with two arguments, as in the above example, the first variable forms the rows, the second variable forms the columns. So the top right cell, for example, indicates that there were 11 people for whom
treat_ind = 0 and
treatment.delivered = TRUE. That is, these are people who received the treatment even though they were not supposed to. (Note: this happened because some of the canvassers made errors in following the experimental protocol.)
We can translate these into proportions using the
prop.table() command:
##    
##          FALSE       TRUE
##   0 0.49385246 0.02254098
##   1 0.10450820 0.37909836
By default, the prop.table command calculates proportions of the entire table. Frequently it is useful to just calculate these within rows or columns:
##    
##          FALSE       TRUE
##   0 0.95634921 0.04365079
##   1 0.21610169 0.78389831
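As a hedged illustration of the difference between whole-table, row and column proportions, the sketch below applies prop.table() to a made-up data frame called toy (hypothetical). The second argument of prop.table() selects the margin: 1 for rows, 2 for columns.

```r
# Toy data (made up) with assignment and delivery indicators
toy <- data.frame(
  treat_ind           = c(0, 0, 0, 1, 1, 1, 1, 0),
  treatment.delivered = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)
tab <- table(toy$treat_ind, toy$treatment.delivered)

# Proportions of the entire table: all cells together sum to 1
p_all <- prop.table(tab)
print(p_all)

# Proportions within rows: each row sums to 1
p_row <- prop.table(tab, 1)
print(p_row)

# Proportions within columns: each column sums to 1
p_col <- prop.table(tab, 2)
print(p_col)
```

Row proportions answer "of those assigned to each group, what share received the treatment?", which is the comparison used in the text.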
We can see from these tables that there was non-compliance with the experimental protocol. Not everyone who was supposed to receive the treatment did so. (What proportion did, according to the above output?) Some people who were not supposed to receive the treatment, nonetheless did so. (What proportion, according to the above output?)
Use the
prop.table() function to assess whether there is any difference in the proportion of Democrats (D), Independents (N) and Republicans (R):
1. Between those who were supposed to receive treatment
treat_ind = 1 and those who were not
treat_ind = 0?
2. Between those who actually received treatment
treatment.delivered = TRUE and those who did not
treatment.delivered = FALSE?
##    
##             D         N         R
##   0 0.4642857 0.2539683 0.2817460
##   1 0.5296610 0.2372881 0.2330508
The proportion of Republicans was somewhat lower (0.23 vs 0.28), and the proportion of Democrats somewhat higher (0.53 vs 0.46), among those who were randomly assigned to receive treatment vs those who were not.
##        
##                 D         N         R
##   FALSE 0.4417808 0.2705479 0.2876712
##   TRUE  0.5765306 0.2091837 0.2142857
The imbalance in treatment delivery was larger than the imbalance after treatment assignment. The proportion of Republicans was lower (0.21 vs 0.29), and the proportion of Democrats higher (0.58 vs 0.44), among those who actually received treatment vs those who did not.
The data set only includes individuals for whom the baseline survey was successfully conducted; however, the study design involved four follow-up waves of the survey. Not all participants in the study were successfully contacted for all of these follow-up waves. As a result, there is missing data in the data set. You can see this immediately if you use the
head() command on the data set. There are
NA values for the measure of tolerance in all post-baseline waves for the third observation in the data set, and for the final wave
t4 for the first observation in the data set.
##   vf_age vf_party      vf_racename vf_female tolerance.t0 tolerance.t1
## 1     29        D African American         0  -1.94030332   -2.2304567
## 2     35        D African American         1  -0.08454013    0.8072470
## 3     72        N African American         1  -0.08454013           NA
## 4     63        D African American         1  -0.08454013   -0.0961481
## 5     51        N        Caucasian         1   0.14480496   -0.3215813
## 6     26        D African American         1  -0.07212884   -0.0961481
##   tolerance.t2 tolerance.t3 tolerance.t4 treatment.delivered treat_ind
## 1 -0.679679229   -1.1695175           NA               FALSE         0
## 2  0.742843981    0.9462945   0.27998715                TRUE         1
## 3           NA           NA           NA               FALSE         0
## 4 -0.157508021    0.0242335  -0.03044013                TRUE         1
## 5 -0.156093923   -0.2441662   0.28148349               FALSE         0
## 6 -0.008982646   -0.1092216  -0.31239390               FALSE         0
You cannot use the operator
== to test for the value
NA in R. For example, the command
transphobia$tolerance.t1 == NA will not tell you which observations are
NA for the variable
transphobia$tolerance.t1, but will simply return all
NA values. To identify which observations have
NA values, you must use the
is.na() function:
## [1] -2.2304567  0.8072470         NA -0.0961481 -0.3215813 -0.0961481
## [1] FALSE FALSE  TRUE FALSE FALSE FALSE
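A short sketch of the contrast between == NA and is.na(), using a made-up vector x (hypothetical) rather than the real data:

```r
# Made-up vector with one missing value
x <- c(-2.23, 0.81, NA, -0.10, -0.32, -0.10)

# Comparing with == returns NA for every element, which is not useful
print(x == NA)

# is.na() returns TRUE where a value is missing and FALSE otherwise
missing <- is.na(x)
print(missing)

# Because TRUE counts as 1, sum() gives the number of missing values
print(sum(missing))

# which() gives the positions of the missing values
print(which(missing))
```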
Thus, if you wanted to assess whether men or women were more likely to go missing in wave 1 (post-baseline), you could do the following:
##    
##         FALSE      TRUE
##   0 0.8125000 0.1875000
##   1 0.8892857 0.1107143
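The command for this table is not shown. One plausible form, sketched on a made-up data frame called toy (hypothetical) and assuming the gender indicator is named vf_female as in the head() output, combines table(), is.na() and row-wise prop.table():

```r
# Toy data (made up): gender indicator and wave 1 tolerance with some NAs
toy <- data.frame(
  vf_female    = c(0, 0, 0, 0, 1, 1, 1, 1),
  tolerance.t1 = c(0.2, NA, -0.5, NA, 0.1, 0.4, NA, -0.3)
)

# Rows: vf_female; columns: whether tolerance.t1 is missing (TRUE) or not
# Margin 1 gives the proportion missing within each gender
miss_tab <- prop.table(table(toy$vf_female, is.na(toy$tolerance.t1)), 1)
print(miss_tab)
```

The TRUE column is the attrition rate for each group, so comparing its two rows gives the comparison described in the text.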
Apparently, while 19% of men (
female = 0) from the baseline survey were missing at wave 1, 11% of women (
female = 1) were missing at wave 1, suggesting different rates of attrition.
Use the functions described above and the variables
tolerance.t1 to tolerance.t4 to assess whether
1. Men or women were more likely to be missing in each post-baseline wave of the survey
2. Democrats, Republicans or Independents were more likely to be missing in each post-baseline wave of the survey
##    
##         FALSE      TRUE
##   0 0.8125000 0.1875000
##   1 0.8892857 0.1107143
##    
##         FALSE      TRUE
##   0 0.7740385 0.2259615
##   1 0.8035714 0.1964286
##    
##         FALSE      TRUE
##   0 0.7692308 0.2307692
##   1 0.8142857 0.1857143
##    
##         FALSE      TRUE
##   0 0.7355769 0.2644231
##   1 0.7928571 0.2071429
The proportion of men who failed to respond to subsequent waves of the survey is always higher than for women, but the difference does not get any larger after the first wave.
##    
##         FALSE      TRUE
##   D 0.8512397 0.1487603
##   N 0.8750000 0.1250000
##   R 0.8492063 0.1507937
##    
##         FALSE      TRUE
##   D 0.7685950 0.2314050
##   N 0.8583333 0.1416667
##   R 0.7698413 0.2301587
##    
##         FALSE      TRUE
##   D 0.7685950 0.2314050
##   N 0.8416667 0.1583333
##   R 0.8015873 0.1984127
##    
##         FALSE      TRUE
##   D 0.7851240 0.2148760
##   N 0.8083333 0.1916667
##   R 0.6984127 0.3015873
The proportion of independents who are missing in later waves is always lower than for those who are Democrats or Republicans, but the relative levels of missingness for the latter two groups are inconsistent. They are similar in
t1 and t2, but then differ in opposite directions in t3 and t4.
Let’s look at the distribution of tolerance in the first wave, and how it differs between those who were supposed to receive treatment and those who were not (treat_ind):
The mean seems to be higher in the treatment group than in the control group.
## [1] 0.009213218
## [1] 0.1535359
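The commands behind the two means above are not shown. A hedged sketch of computing a mean within each assignment group while ignoring missing values, using a made-up data frame called toy (hypothetical):

```r
# Toy data (made up): assignment indicator and wave 1 tolerance with NAs
toy <- data.frame(
  treat_ind    = c(0, 0, 0, 1, 1, 1),
  tolerance.t1 = c(-0.2, 0.1, NA, 0.3, NA, 0.5)
)

# Without na.rm = TRUE, a single NA makes the whole mean NA
print(mean(toy$tolerance.t1[toy$treat_ind == 0]))

# Mean among those not assigned to treatment, dropping missing values
mean_control <- mean(toy$tolerance.t1[toy$treat_ind == 0], na.rm = TRUE)

# Mean among those assigned to treatment, dropping missing values
mean_treated <- mean(toy$tolerance.t1[toy$treat_ind == 1], na.rm = TRUE)

print(mean_control)
print(mean_treated)

# Difference in means between the two groups
print(mean_treated - mean_control)

# tapply() computes the same group means in one call
group_means <- tapply(toy$tolerance.t1, toy$treat_ind, mean, na.rm = TRUE)
print(group_means)
```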
- Check whether the means between the treatment and control group were already different in the baseline survey before the treatment was delivered.
- Calculate the difference in means between treatment and control groups for all waves of the survey.
## [1] 0.009480114
## [1] -0.03055836
They were different (the treatment group's mean was about 0.04 lower than the control group's at baseline), but not by nearly as much as after the treatment. We will look at this in more detail in the homework.
mean(transphobia$tolerance.t0[transphobia$treat_ind == 1], na.rm=TRUE) -
  mean(transphobia$tolerance.t0[transphobia$treat_ind == 0], na.rm=TRUE)
mean(transphobia$tolerance.t1[transphobia$treat_ind == 1], na.rm=TRUE) -
  mean(transphobia$tolerance.t1[transphobia$treat_ind == 0], na.rm=TRUE)
mean(transphobia$tolerance.t2[transphobia$treat_ind == 1], na.rm=TRUE) -
  mean(transphobia$tolerance.t2[transphobia$treat_ind == 0], na.rm=TRUE)
mean(transphobia$tolerance.t3[transphobia$treat_ind == 1], na.rm=TRUE) -
  mean(transphobia$tolerance.t3[transphobia$treat_ind == 0], na.rm=TRUE)
mean(transphobia$tolerance.t4[transphobia$treat_ind == 1], na.rm=TRUE) -
  mean(transphobia$tolerance.t4[transphobia$treat_ind == 0], na.rm=TRUE)
## [1] -0.04003847
## [1] 0.1443226
## [1] 0.1173435
## [1] 0.2469228
## [1] 0.1762387
The effect of the treatment was persistent across waves / over time. This is one of the core results of this study.
One possibility is that the treatment could have been polarising, which is to say that it might not just change the average level of tolerance but also change the degree of dispersion in tolerance. Put differently, even if some people responded positively, others might have responded negatively. If we look at the histograms for
tolerance.t1 among those assigned to treatment (
treat_ind == 1) versus those who were not (
treat_ind == 0), we see some evidence that this could be the case:
Note that the
par() command above is used to set graphical parameters, in this case to put two plots in a 2 row x 1 column grid (
mfrow=c(2,1)), and then to return to the original single plot setting (
mfrow=c(1,1)). The
par() command has a bewildering array of options which you can look up in its help file.
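The plotting code itself is not shown here. A minimal sketch of the pattern described, assuming simulated values in a made-up data frame called toy (hypothetical) and axis limits chosen only for illustration:

```r
# Simulated toy data (made up): 50 control and 50 treated respondents
set.seed(1)
toy <- data.frame(
  treat_ind    = rep(c(0, 1), each = 50),
  tolerance.t1 = c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 0.15, sd = 1.2))
)

# Stack two plots in a 2 row x 1 column grid, saving the old setting
old_par <- par(mfrow = c(2, 1))

# Same x-axis limits in both panels so the spreads can be compared by eye
hist(toy$tolerance.t1[toy$treat_ind == 0],
     main = "Not assigned to treatment", xlab = "tolerance.t1", xlim = c(-4, 4))
hist(toy$tolerance.t1[toy$treat_ind == 1],
     main = "Assigned to treatment", xlab = "tolerance.t1", xlim = c(-4, 4))

# Return to the previous single plot setting, equivalent to par(mfrow = c(1, 1))
par(old_par)
```

Saving the value returned by par() and restoring it afterwards is one design choice; calling par(mfrow=c(1,1)) directly, as described in the text, has the same effect here.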
sd() calculates the standard deviation of a set of values in R.
- Replace the
mean() command in your code for Question 3 with
sd(), and calculate the difference in standard deviations between treatment and control groups in wave 0 and in wave 1.
## [1] 0.114463
## [1] 0.1405
There is a difference in standard deviation between the treatment group and the control group in wave 1, but it appears to have been present (if very slightly weaker) in the baseline wave before the treatment happened, so there is little evidence of a polarising effect of the treatment.
How similar are individuals’ tolerance scores across different waves? We might look at this visually, by using a scatter plot with the tolerance score for one wave on the x-axis, and the tolerance score for another wave on the y-axis:
Clearly these are positively correlated. We can calculate the correlation coefficient \(\rho\) with the command:
## [1] 0.8290495
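Neither the scatter plot command nor the correlation command is shown. A hedged sketch on a made-up data frame called toy (hypothetical); the use = "complete.obs" argument is an assumption about how missing values were handled, since cor() returns NA by default when any value is missing:

```r
# Toy data (made up): tolerance in two waves, with one NA in each
toy <- data.frame(
  tolerance.t0 = c(-1.9, -0.1, 0.1, 0.5, 1.2, NA),
  tolerance.t1 = c(-2.2, 0.8, -0.3, 0.4, NA, 0.2)
)

# Scatter plot: first argument on the x-axis, second on the y-axis
plot(toy$tolerance.t0, toy$tolerance.t1,
     xlab = "Tolerance at baseline (t0)", ylab = "Tolerance at wave 1 (t1)")

# With missing values and no use argument, the result is NA
r_default <- cor(toy$tolerance.t0, toy$tolerance.t1)
print(r_default)

# Use only the rows where both variables are observed
r <- cor(toy$tolerance.t0, toy$tolerance.t1, use = "complete.obs")
print(r)
```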
Both scatterplots and correlation coefficients compare two continuous/interval-level variables at a time, but sometimes it is useful to look at all the possible comparisons of a set of variables. In this case, we have five measurements of the tolerance scale, so we might want to look at all the possible pairwise comparisons at once. We can do that graphically using the
pairs() command, and a new way of selecting subsets of variables by name from a data frame:
We can also create a matrix of pairwise correlation coefficients using the following command. Note that it is not necessary to wrap the
cor() command in the
round() command, the latter simply makes it easier to read the correlations by rounding them to two digits.
##              tolerance.t0 tolerance.t1 tolerance.t2 tolerance.t3 tolerance.t4
## tolerance.t0         1.00         0.83         0.81         0.83         0.82
## tolerance.t1         0.83         1.00         0.87         0.87         0.87
## tolerance.t2         0.81         0.87         1.00         0.92         0.90
## tolerance.t3         0.83         0.87         0.92         1.00         0.91
## tolerance.t4         0.82         0.87         0.90         0.91         1.00
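The pairs() and cor() commands are not shown. As a sketch under stated assumptions, the code below selects columns by name from a simulated data frame called toy (hypothetical) and uses use = "pairwise.complete.obs", which is one of several ways to handle missing values and may not be the option used to produce the matrix above:

```r
# Simulated toy data (made up): three correlated tolerance measures plus age
set.seed(2)
base <- rnorm(40)
toy <- data.frame(
  tolerance.t0 = base,
  tolerance.t1 = base + rnorm(40, sd = 0.5),
  tolerance.t2 = base + rnorm(40, sd = 0.5),
  vf_age       = sample(18:90, 40, replace = TRUE)
)
# Introduce some missing values in a later wave
toy$tolerance.t2[c(3, 7)] <- NA

# Select a subset of variables by name using a character vector
vars <- c("tolerance.t0", "tolerance.t1", "tolerance.t2")

# Matrix of all pairwise scatter plots
pairs(toy[, vars])

# Matrix of pairwise correlations, rounded to two digits for readability
cor_mat <- round(cor(toy[, vars], use = "pairwise.complete.obs"), 2)
print(cor_mat)
```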
These correlations provide an additional confirmation that the treatment did something enduring. The correlations between the baseline wave
t0 and the other waves are generally lower than the correlations between the post-treatment waves
t3. That is, there was more change in tolerance scores at the individual-level between the baseline wave and the first post-treatment wave than there was between subsequent waves.
Define new variables called
missing_t1,
missing_t2, etc. for all of the post-treatment waves.
Use
prop.table to assess whether a respondent going missing in later waves was associated with whether they were assigned to treatment.
Calculate the mean level of baseline wave tolerance among those who were missing in wave 1, and those who were not missing in wave 1. Why might it matter whether these are different?
transphobia$missing_t1 <- is.na(transphobia$tolerance.t1)
transphobia$missing_t2 <- is.na(transphobia$tolerance.t2)
transphobia$missing_t3 <- is.na(transphobia$tolerance.t3)
transphobia$missing_t4 <- is.na(transphobia$tolerance.t4)
prop.table(table(transphobia$treat_ind,transphobia$missing_t1),1)
prop.table(table(transphobia$treat_ind,transphobia$missing_t2),1)
prop.table(table(transphobia$treat_ind,transphobia$missing_t3),1)
prop.table(table(transphobia$treat_ind,transphobia$missing_t4),1)
mean(transphobia$tolerance.t0[transphobia$missing_t1 == TRUE])
mean(transphobia$tolerance.t0[transphobia$missing_t1 == FALSE])
##    
##         FALSE      TRUE
##   0 0.8809524 0.1190476
##   1 0.8305085 0.1694915
##    
##         FALSE      TRUE
##   0 0.8015873 0.1984127
##   1 0.7796610 0.2203390
##    
##         FALSE      TRUE
##   0 0.8214286 0.1785714
##   1 0.7669492 0.2330508
##    
##         FALSE      TRUE
##   0 0.8055556 0.1944444
##   1 0.7288136 0.2711864
## [1] -0.2131327
## [1] 0.02415432
Respondents assigned to receive the treatment were more likely to be missing in all subsequent waves of the survey. The difference in missingness between those assigned to treatment and those not assigned varies between roughly 2 and 8 percentage points across the different waves.
Respondents who were missing in wave 1 of the survey were substantially less tolerant in the baseline survey (by about 0.24 points on the baseline tolerance scale) than those who were not missing. This is potentially a problem because if the treatment causes the less tolerant people to opt out of the subsequent waves of the survey, we might worry that any difference in tolerance between those who were treated and those who were not in subsequent waves is not because the treatment made people more tolerant, but instead because the treatment made the intolerant people drop out of the survey.
As we saw, there were some issues with the implementation of the study, including some compliance problems with treatment, and attrition problems that we just examined. As an alternative way of analyzing the data, do the following.
First, subset the entire data set to use only the observations that were not missing at wave 1 (t1).
Second, using the variable
treatment.delivered as the treatment variable, and
tolerance.t0 and tolerance.t1 as the before and after outcomes, treat the study as a difference-in-differences design. Calculate the mean value of
tolerance.t0 and tolerance.t1 for each value of
treatment.delivered, and then calculate the difference in differences. Are the results generally similar to those we found previously?
transphobia2 <- transphobia[transphobia$missing_t1 == FALSE,]
diff_in_diff <-
  (mean(transphobia2$tolerance.t1[transphobia2$treatment.delivered == TRUE]) -
     mean(transphobia2$tolerance.t0[transphobia2$treatment.delivered == TRUE])) -
  (mean(transphobia2$tolerance.t1[transphobia2$treatment.delivered == FALSE]) -
     mean(transphobia2$tolerance.t0[transphobia2$treatment.delivered == FALSE]))
diff_in_diff
## [1] 0.2034032
The difference-in-differences is 0.203, which is slightly larger than the simple difference in means at wave 1, but broadly similar. Despite the non-compliance with the treatment and attrition, the evidence is pretty strong that there was a causal effect of the treatment.
Use the data set you constructed in Question 2, which only includes respondents who were not missing at wave 1. Construct a scatterplot with age on the x-axis and
tolerance.t0 on the y-axis, using the observations for which
treatment.delivered == TRUE. Calculate the correlation between age and
tolerance.t0. Now, generate the same plot and correlation using
tolerance.t1 for the same observations. What does the change or lack of change in this correlation tell us about the likely effect of the treatment on people of different ages?
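par(mfrow=c(1,2))
plot(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE], transphobia2$tolerance.t0[transphobia2$treatment.delivered == TRUE], main="Baseline Tolerance")
plot(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE], transphobia2$tolerance.t1[transphobia2$treatment.delivered == TRUE], main="Post-Treatment Wave 1 Tolerance")
par(mfrow=c(1,1))
cor(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE], transphobia2$tolerance.t0[transphobia2$treatment.delivered == TRUE])
cor(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE], transphobia2$tolerance.t1[transphobia2$treatment.delivered == TRUE])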
## [1] -0.2686877
## [1] -0.1847027
The correlation between age and tolerance is negative in the baseline wave. This indicates that older people tend to have lower measured tolerance than younger people. The negative correlation between age and tolerance is smaller in magnitude after the treatment (-0.18 compared with -0.27), which suggests that the effect of the treatment was more positive on older people than on younger people.