3 Description and measurement

3.1 Overview

In the lecture this week, we discuss descriptive statistics and visualisation of data, starting with single variables and then looking at examples with multiple variables.

In seminar this week, we will cover the following topics:

Calculating variable summaries, tables and correlations
Working with missing data
Scatterplots, boxplots, histograms

Before coming to the seminar

Please read:

Chapter 16 (“Measure Your Mission”), especially sections “Measuring the Wrong Outcome or Treatment” and “Strategic Adaptation and Changing Relationships”, in Thinking Clearly with Data (essential)
Sections 2.6, and 3.1 to 3.6, in Quantitative Social Science: An Introduction (recommended)

3.2 Seminar

Is a state’s use of indiscriminate violence associated with insurgent attacks? In today’s seminar, we will analyse the relationship between indiscriminate violence and insurgent attacks using data about Russian artillery fire in Chechnya from 2000 to 2005. The data in this exercise is based on Lyall, J. 2009. “Does Indiscriminate Violence Incite Insurgent Attacks?: Evidence from Chechnya.”. You can download the data by clicking on the link above.

The names and descriptions of variables in the data file chechen.csv are

Name	Description
`village`	Name of village
`groznyy`	Variable indicating whether a village is in Groznyy (1) or not (0)
`fire`	Whether Russians struck a village with artillery fire (1) or not (0)
`deaths`	Estimated number of individuals killed during Russian artillery fire or NA if not fired on
`preattack`	The number of insurgent attacks in the 90 days before being fired on
`postattack`	The number of insurgent attacks in the 90 days after being fired on

Note that the same village may appear in the dataset several times as shelled and/or not shelled because Russian attacks occurred at different times and locations.

Now load the data using the read.csv() function:

chechen <- read.csv("data/chechen.csv")

Question 1

For this question, we will learn to use the summary() function, which provides basic summary statistics for one or more variables. You could also use the summary() function on the entire data frame chechen to get summaries of every variable at once. Depending on whether the variable is interval level or nominal/ordinal, you will get a different looking summary:

summary(chechen$groznyy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06289 0.00000 1.00000

summary(chechen$village)

##    Length     Class      Mode 
##       318 character character

Using the summary() function, answer the following questions:

What are the mean and median values of the variables preattack and postattack?
What are the maximum and minimum values of the variables preattack and postattack?
How many missing values does the variable deaths have?

Reveal answer

summary(chechen$preattack)  
summary(chechen$postattack)
summary(chechen$deaths)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.132   2.750  27.000 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.774   2.000  22.000 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   1.667   1.500  34.000     159

The means of preattack and postattack are 2.132 and 1.774 insurgent attacks respectively, and the medians are both 0 insurgent attacks.
The minimum is 0 in both variables and 27 and 22 insurgent attacks in preattack and postattack, respectively.
There are 159 missing values in the variable deaths.

Alternatively, you could use the following commands to find out the same things:

# 1. 
mean(chechen$preattack)
mean(chechen$postattack)
median(chechen$preattack)
median(chechen$postattack)

# 2. 
range(chechen$preattack)
range(chechen$postattack)

# 3. 
table(is.na(chechen$deaths))

## [1] 2.132075
## [1] 1.773585
## [1] 0
## [1] 0
## [1]  0 27
## [1]  0 22
## 
## FALSE  TRUE 
##   159   159

Note that you could have answered these questions with some other functions that you learned over the last couple of weeks. Have another look at the lecture materials and previous seminars to find these. You may notice that, for numeric variables, the summary() command only gives the first and third quartile (i.e. the 25th and 75th percentile). We will now use one of the new functions we introduced in the lecture to find out more about the distribution of variables. For instance the quantile() function:

# shows the 0th, 25th, 50th, (median), 75th and 100th percentile:
quantile(chechen$postattack)

##   0%  25%  50%  75% 100% 
##    0    0    0    2   22

# we can also specify specific percentiles:
quantile(chechen$postattack, c(0.05, 0.95))

##   5%  95% 
## 0.00 9.15

Using the quantile() function, answer the following questions:

What is the 80$^{th}$ percentile value of the variable deaths? (Beware of the missing values! Hint: use na.rm=T)`
What is the interquartile range of the variable preattack?

Reveal answer

quantile(chechen$deaths,0.8, na.rm=T)

## 80% 
##   2

The 80th percentile value of deaths is 2.

q<- quantile(chechen$preattack)
q[4]-q[2]

##  75% 
## 2.75

The interquartile range of the variable preattack is 2.75 insurgent attacks

Question 2

Last week we used the function table() to get absolute frequencies for some variable combinations. We can translate absolute frequencies into proportions (i.e. relative frequencies) using the prop.table() command:

prop.table(table(chechen$groznyy,chechen$fire))

##    
##              0          1
##   0 0.45911950 0.47798742
##   1 0.04088050 0.02201258

By default, the prop.table command calculates proportions of the entire table. Frequently it is useful to just calculate these within rows or columns. We can do this by adding a comma and either 1 if we want to calculate proportions by row or 2 if we want to calculate proportions by column. Here, we want to calculate proportions by row:

prop.table(table(chechen$groznyy,chechen$fire),1) # by row

##    
##             0         1
##   0 0.4899329 0.5100671
##   1 0.6500000 0.3500000

prop.table(table(chechen$groznyy,chechen$fire),2) # by column

##    
##              0          1
##   0 0.91823899 0.95597484
##   1 0.08176101 0.04402516

Use the prop.table() function shown above to answer the following questions:

Among the villages located in Groznyy, what proportion was shelled?
Among the villages that were shelled, what proportion was not located in Groznyy?
What is the proportion of villages that were shelled, irrespective of whether they are located in Groznyy or not? (Hint: Use prop.table(table()) for only one variable.)

Reveal answer

The proportion of villages in Groznyy that were shelled is 0.35.
The proportion of the shelled villages that are not located in Groznyy is 0.96.

prop.table(table(chechen$fire))

## 
##   0   1 
## 0.5 0.5

The proportion of shelled villages is 0.5.

Question 3

As we saw above, the variable deaths has missing values, which makes sense as there can be not causalities from the Russians’ attack where there was not Russian attack. You can see this immediately if you use the head() command on the data set.

head(chechen$deaths)

## [1] NA  0 34 NA NA  0

You cannot use the operator == to test for the value NA in R. For example, the command chechen$deaths == NA will not tell you which observations are NA for the variable chechen$deaths, but will simply return all NA values. To identify which observations have NA values, you must use the is.na() function.

head(is.na(chechen$deaths))

## [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

Combining the commands is.na() and prop.table(), check if the missing values are indeed only in places where there was no Russian attack.

Reveal answer

prop.table(table(chechen$fire,is.na(chechen$deaths)))

##    
##     FALSE TRUE
##   0   0.0  0.5
##   1   0.5  0.0

We can clearly see here that all missing values (half of our data) for deaths are were there was no Russian shelling and all non-missing values are were there was Russian shelling.

Question 4

Let’s now make some visualisations at the distribution of insurgent attacks before and after the Russian shelling, and how it differs between those villages who were shelled and those who were not. In the lecture, we used the commands boxplot() and hist(). For example, you could run:

boxplot(chechen$preattack)

hist(chechen$preattack)

Using these commands, produce the following plots, making sure to add some proper axis labels:

A histogram of the variable postattack. Comment on one thing you find noteworthy about the distribution of the data.
A boxplot of the variable postattack. Briefly discuss what the boxplot tells you that the histogram does not.
Two boxplots of the variable postattack for those villages that were struck by artillery fire and those that were not. (Hint: boxplot(variable1 ~ variable2, data = data_name))

Reveal answer

hist(chechen$postattack,
     main = "Histogram of post-shelling insurgency",
     xlab = "Number of post-shelling insurgent attacks")

We notice from the histogram that the values of post-shelling attacks is positively skewed, or skewed to the right this means that most villages had few post-shelling insurgency attacks, while a small number of villages had a large number of post-shelling insurgency attacks.

boxplot(chechen$postattack,
     main = "Boxplot of post-shelling insurgency",
     ylab = "Number of post-shelling insurgent attacks")

The boxplot allows us to easily see the median and middle 50 per cent of values for the number of post-shelling insurgency attacks, as well as show outliers.

boxplot(chechen$postattack~chechen$fire,
     main = "Boxplot of post-shelling insurgency and artillery fire",
     xlab = "Struck by artillery fire",
     names = c("no", "yes"),
     ylab = "Number of post-shelling insurgent attacks")

We notice that the median for the number of post-shelling insurgency attacks is similar whether or not a village was struck by artillery file. Looking at the inter-quartile ranges, we notice that there is more variation among villages that were not struck by artillery fire. We can also see that there are a number of outliers for both boxplots.

3.3 Homework

The data for this week’s homework is the same as we used in the seminar last week. Use the read.csv() function as we did in class:

transphobia <- read.csv("data/transphobia.csv")

Question 1

Define new variables called missing_t1, missing_t2, etc that tell us for all of the post-treatment waves which observations have missing data in these (i.e., for which observations tolerance was not recorded).
Use table() and prop.table to assess whether a respondent going missing in later waves was associated with whether they were assigned to treatment.
Calculate the mean level of baseline wave tolerance among those who were missing in wave 1, and those who were not missing in wave 1. Why might it matter whether these are different?

Reveal answer

transphobia$missing_t1 <- is.na(transphobia$tolerance.t1)
transphobia$missing_t2 <- is.na(transphobia$tolerance.t2)
transphobia$missing_t3 <- is.na(transphobia$tolerance.t3)
transphobia$missing_t4 <- is.na(transphobia$tolerance.t4)

prop.table(table(transphobia$treat_ind,transphobia$missing_t1),1)
prop.table(table(transphobia$treat_ind,transphobia$missing_t2),1)
prop.table(table(transphobia$treat_ind,transphobia$missing_t3),1)
prop.table(table(transphobia$treat_ind,transphobia$missing_t4),1)

mean(transphobia$tolerance.t0[transphobia$missing_t1 == TRUE])
mean(transphobia$tolerance.t0[transphobia$missing_t1 == FALSE])

##    
##         FALSE      TRUE
##   0 0.8809524 0.1190476
##   1 0.8305085 0.1694915
##    
##         FALSE      TRUE
##   0 0.8015873 0.1984127
##   1 0.7796610 0.2203390
##    
##         FALSE      TRUE
##   0 0.8214286 0.1785714
##   1 0.7669492 0.2330508
##    
##         FALSE      TRUE
##   0 0.8055556 0.1944444
##   1 0.7288136 0.2711864
## [1] -0.2131327
## [1] 0.02415432

Respondents assigned to receive the treatment were more likely to be missing in all subsequent waves of the survey. The treatment effect on missingness varies between 2 and 7 percentage points across the different waves.

Respondents who went missing after the first wave of the survey were substantially less tolerant in the baseline survey (0.23 points on the baseline tolerance scale). This is potentially a problem because if the treatment causes the less tolerant people to opt out of the subsequent waves of the survey, we might worry that any difference in tolerance between those who were treated and those who were not in subsequent waves is not because the treatment made people more tolerant, but instead because the treatment made the intolerant people drop out of the survey.

Question 2

Subset the entire data set to use only the observations which were not missing at t1. Then construct a scatterplot with age on the x-axis and tolerance_t0 on the y-axis, using the observations for which treatment.delivered == TRUE. Calculate the correlation between age and tolerance_t0. Now, generate the same plot and correlation using tolerance_t1 for the same observations. What does the change or lack of change in this correlation tell us about the likely effect of the treatment on people of different ages?

Reveal answer

transphobia2 <- transphobia[transphobia$missing_t1 == FALSE,]

par(mfrow=c(1,2))
plot(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE],
     transphobia2$tolerance.t0[transphobia2$treatment.delivered == TRUE],
     main="Baseline Tolerance")
plot(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE],
     transphobia2$tolerance.t1[transphobia2$treatment.delivered == TRUE],
     main="Post-Treatment Wave 1 Tolerance")

cor(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE],
     transphobia2$tolerance.t0[transphobia2$treatment.delivered == TRUE])

cor(transphobia2$vf_age[transphobia2$treatment.delivered == TRUE],
     transphobia2$tolerance.t1[transphobia2$treatment.delivered == TRUE])

## [1] -0.2686877
## [1] -0.1847027

The correlation between age and tolerance is negative in the baseline wave. This indicates that older people tend to have lower measured tolerance than younger people. It appears that the treatment reduces the magnitude of this negative correlation between age and transphobia, which indicates that the effect of the treatment was more positive on older people than on younger people.

Question 3

One possibility is that the treatment could have been polarising, which is to say that it might not just change the average level of tolerance but also change the degree of dispersion in tolerance. Put differently, even if some people responded positively, others might have responded negatively. If we look at the histograms for tolerance.t1 among those assigned to treatment (treat_ind == 1) versus those who were not (treat_ind == 0), we see some evidence that this could be the case:

par(mfrow=c(2,1)) 
hist(transphobia$tolerance.t1[transphobia$treat_ind == 0])
hist(transphobia$tolerance.t1[transphobia$treat_ind == 1])

par(mfrow=c(1,1))

Note that the par() command above is used to set graphical parameters, in this case to put two plots in a 2 row x 1 column grid (mfrow=c(2,1)), and then to return to the original single plot setting (mfrow=c(1,1)). The par() command has a bewildering array of options which you can look up in its help file.

The command sd() calculates the standard deviation for set of values in R.
Calculate the difference in standard deviations between treatment and control groups in wave 0 and in wave 1.

Reveal answer

(sd(transphobia$tolerance.t0[transphobia$treat_ind == 1], na.rm=TRUE) - 
  sd(transphobia$tolerance.t0[transphobia$treat_ind == 0], na.rm=TRUE))

(sd(transphobia$tolerance.t1[transphobia$treat_ind == 1], na.rm=TRUE) - 
    sd(transphobia$tolerance.t1[transphobia$treat_ind == 0], na.rm=TRUE))

## [1] 0.114463
## [1] 0.1405

There is a difference in standard deviation between the treatment group and the control group in wave 1, but it appears to have been present (if very slightly weaker) in the baseline wave before the treatment happened, so there is little evidence of polarising effect of the treatment.

Question 4

How similar are individuals’ tolerance scores across different waves?

We can look at the relationship between the two waves visually by using a scatter plot with the tolerance score for one wave on the x-axis, and the tolerance score for another wave on the y-axis:

plot(transphobia$tolerance.t0,transphobia$tolerance.t1)

Clearly these are positively correlated. We can calculate the correlation coefficient $\rho$ with the command:

cor(transphobia$tolerance.t0,transphobia$tolerance.t1,use="pairwise.complete.obs")

## [1] 0.8290495

Create scatter plots to look at the relationship between subsequent waves.
What is the correlation coefficient between the waves?

Reveal answer

We can look at the relationship between the next waves visually by using a scatter plots. As we are making three scatter plots, we can use the par() command for three plots in a 3 row x 1 column grid (mfrow=c(3,1)).

par(mfrow=c(3,1))
plot(transphobia$tolerance.t1,transphobia$tolerance.t2)
plot(transphobia$tolerance.t2,transphobia$tolerance.t3)
plot(transphobia$tolerance.t3,transphobia$tolerance.t4)

par(mfrow=c(1,1))

We can calculate the correlation coefficient between the waves using the cor() function:

cor(transphobia$tolerance.t1,transphobia$tolerance.t2,use="pairwise.complete.obs")
cor(transphobia$tolerance.t2,transphobia$tolerance.t3,use="pairwise.complete.obs")
cor(transphobia$tolerance.t3,transphobia$tolerance.t4,use="pairwise.complete.obs")

## [1] 0.8749074
## [1] 0.9175045
## [1] 0.9128597

The scatter plots and correlations provide an additional confirmation that the treatment did something enduring.