2.2 Solutions

2.2.1 Exercise 1

Create a new file called assignment2.R in your PUBL0055 folder and write all the solutions in it.

In RStudio, go to the menu and select File > New File > R Script

Make sure to clear the environment and set the working directory.

rm(list = ls())
setwd("~/PUBL0055")

Go to the menu, select File > Save, and name the file assignment2.R.

2.2.2 Exercise 2

Clear the workspace and set the working directory to your PUBL0055 folder.

rm(list = ls())
setwd("~/PUBL0055")

2.2.3 Exercise 3

Load the non-western foreigners dataset from your local drive into R.

load("non_western_foreingners.RData")
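
The later solutions refer to a data frame named fdata, which is presumably the object stored in the .RData file. As a minimal sketch of how to check which objects load() restores, the following uses a hypothetical toy data frame (toy_data) and a temporary file rather than the course dataset. load() invisibly returns the names of the objects it restores, so they can be captured and inspected.

```r
# Hypothetical toy data frame standing in for the course dataset
toy_data <- data.frame(id = 1:3, age = c(25, 40, 61))

# Save it to a temporary .RData file, then remove it from the workspace
tmp_file <- tempfile(fileext = ".RData")
save(toy_data, file = tmp_file)
rm(toy_data)

# load() invisibly returns a character vector of the restored object names
loaded_names <- load(tmp_file)
print(loaded_names)

# The restored object is available again under its original name
str(toy_data)
```

Applied to the course file, the same pattern would show which object name to use in the remaining exercises.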

2.2.4 Exercise 4

What is the level of measurement for each variable in the non-western foreigners dataset?

Variable        Level of measurement
IMMBRIT         interval scaled (continuous)
over.estimate   categorical with 2 categories (also called a binary, dummy, or indicator variable)
RSex            categorical with 2 categories
RAge            interval scaled
Househld        interval scaled
paper           categorical with 2 categories
WWWhourspW      interval scaled
religious       categorical with 2 categories
employMonths    interval scaled
urban           ordinal
health.good     ordinal
HHInc           We do not have enough information to determine whether HHInc is interval scaled or ordinal. If the income bands are equally wide, HHInc would be interval scaled. We will treat the variable as interval scaled.
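
One way to get a first impression of a variable's level of measurement is to count its distinct values. The sketch below uses a hypothetical toy data frame (the column names are made up for illustration). The number of distinct values is only a hint: whether a variable is categorical, ordinal, or interval scaled ultimately depends on what the codes mean, which the codebook tells us.

```r
# Hypothetical toy data frame; the column names are illustrative only
toy <- data.frame(
  binary_var  = c(0, 1, 1, 0, 1),   # two codes, e.g. no/yes
  count_var   = c(1, 2, 2, 4, 3),   # a count, e.g. people in a household
  ordered_var = c(1, 3, 2, 2, 1)    # ranked categories, e.g. poor/fair/good
)

# Number of distinct values per column
distinct_values <- sapply(toy, function(column) length(unique(column)))
print(distinct_values)

# A frequency table shows which codes actually occur
print(table(toy$binary_var))
```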

2.2.5 Exercise 5

Calculate the correct measure of central tendency for RAge, Househld, religious.

The correct measures of central tendency for the three levels of measurement are:

Level of measurement   Central tendency
categorical            Mode
ordinal                Median
interval               Mean

mean(fdata$RAge)
[1] 49.74547
mean(fdata$Househld)
[1] 2.391802
mean(fdata$religious)
[1] 0.4928503

The mean of RAge is 49.75 and the mean of Househld is 2.39. The mode of religious is 0. Note: because religious is binary (coded 0 and 1), its mean is the proportion of 1s, and that proportion tells us the mode. 49.29% of respondents are religious, which is less than half, so more people are not religious and the mode is 0.
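
The link between the mean and the mode of a 0/1 variable can be checked directly. This is a minimal sketch on a hypothetical toy vector (religious_toy), not the course data. It reads the mode off a frequency table and compares it with the proportion of 1s.

```r
# Hypothetical toy vector standing in for a 0/1 variable (1 = religious)
religious_toy <- c(0, 1, 0, 0, 1, 1, 0, 0, 1, 0)

# Frequency table, then the category with the highest count (the mode)
counts <- table(religious_toy)
mode_value <- as.numeric(names(counts)[which.max(counts)])

# For a 0/1 variable the mean is the proportion of 1s
share_of_ones <- mean(religious_toy)

print(counts)
print(mode_value)
print(share_of_ones)
```

If the proportion of 1s is below 0.5 the mode is 0, and if it is above 0.5 the mode is 1.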

2.2.6 Exercise 6

Calculate the correct measure of dispersion for RAge, Househld, religious.

sd(fdata$RAge)
[1] 17.57245
sd(fdata$Househld)
[1] 1.339352
mean(fdata$religious)
[1] 0.4928503

The standard deviation of age is 17.57 and the standard deviation of the number of people in the respondent's household is 1.34. For the binary variable religious, we report the proportion in each category instead: 49.29% of the respondents are religious and 50.71% are not.
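
To see what sd() computes, the sketch below reproduces it by hand on a hypothetical toy vector. R's var() and sd() use the sample formulas, dividing by n - 1 rather than n. The second part shows the category proportions used above as the dispersion summary for a binary variable. All object names are illustrative.

```r
# Hypothetical toy vector standing in for an interval-scaled variable
household_toy <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
n <- length(household_toy)
deviations <- household_toy - mean(household_toy)
manual_var <- sum(deviations^2) / (n - 1)
manual_sd <- sqrt(manual_var)

# The manual result should agree with the built-in function
print(c(manual = manual_sd, built_in = sd(household_toy)))

# For a 0/1 variable, report the share of observations in each category
religious_toy <- c(0, 1, 0, 0, 1)
category_shares <- prop.table(table(religious_toy))
print(category_shares)
```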

2.2.7 Exercise 7

How many respondents identify with the Greens?

fdata$party_self <- factor(
  fdata$party_self, 
  labels = c("Tories", "Labour", "SNP", "Greens", "Ukip", "BNP", "other")
)

One solution is to look at the frequency table:

table(fdata$party_self)

Tories Labour    SNP Greens   Ukip    BNP  other 
   284    280     16     23     31     32    383 

Another solution uses the which() function:

length(which(fdata$party_self=="Greens"))
[1] 23

23 respondents identify with the Green party.
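
A third way to count matches is to sum a logical comparison, because TRUE counts as 1 and FALSE as 0. The sketch below uses a hypothetical toy factor (toy_party). Note that which() silently drops missing values, whereas sum() needs na.rm = TRUE to do the same.

```r
# Hypothetical toy factor with one missing value
toy_party <- factor(c("Labour", "Greens", "Tories", "Greens", NA, "other"))

# which() returns the positions of TRUE values and ignores NA
count_with_which <- length(which(toy_party == "Greens"))

# sum() adds up TRUE values; na.rm = TRUE is needed when NAs are present
count_with_sum <- sum(toy_party == "Greens", na.rm = TRUE)

print(count_with_which)
print(count_with_sum)
```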

2.2.8 Exercise 8

Calculate the variance and standard deviation of IMMBRIT for each party affiliation.

Tories

var(fdata$IMMBRIT[fdata$party_self=="Tories"])
[1] 431.8308
sd(fdata$IMMBRIT[fdata$party_self=="Tories"])
[1] 20.78054

Labour

var(fdata$IMMBRIT[fdata$party_self=="Labour"])
[1] 444.8932
sd(fdata$IMMBRIT[fdata$party_self=="Labour"])
[1] 21.09249

SNP

var(fdata$IMMBRIT[fdata$party_self=="SNP"])
[1] 145
sd(fdata$IMMBRIT[fdata$party_self=="SNP"])
[1] 12.04159

Greens

var(fdata$IMMBRIT[fdata$party_self=="Greens"])
[1] 591.8103
sd(fdata$IMMBRIT[fdata$party_self=="Greens"])
[1] 24.32715

UKIP

var(fdata$IMMBRIT[fdata$party_self=="Ukip"])
[1] 288.2796
sd(fdata$IMMBRIT[fdata$party_self=="Ukip"])
[1] 16.9788

BNP

var(fdata$IMMBRIT[fdata$party_self=="BNP"])
[1] 657.1895
sd(fdata$IMMBRIT[fdata$party_self=="BNP"])
[1] 25.63571

Other

var(fdata$IMMBRIT[fdata$party_self=="other"])
[1] 434.8236
sd(fdata$IMMBRIT[fdata$party_self=="other"])
[1] 20.85242
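
Instead of repeating the same two lines for every party, the grouped statistics can be computed in one call with tapply(). This is a sketch on a hypothetical toy data frame (toy, with made-up columns estimate and party). With the course data, the equivalent call would presumably be tapply(fdata$IMMBRIT, fdata$party_self, var).

```r
# Hypothetical toy data: a numeric estimate and a grouping factor
toy <- data.frame(
  estimate = c(10, 20, 30, 5, 15, 40, 40, 40),
  party    = factor(c("A", "A", "A", "B", "B", "C", "C", "C"))
)

# Apply var() and sd() separately within each level of the factor
group_var <- tapply(toy$estimate, toy$party, var)
group_sd  <- tapply(toy$estimate, toy$party, sd)

print(group_var)
print(group_sd)
```

The result is a named vector with one entry per factor level, so individual groups can be picked out by name, for example group_sd["A"].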

2.2.9 Exercise 9

Find the party affiliation of the oldest and youngest respondents.

First, we find the ages of the oldest and youngest respondents.

min(fdata$RAge)
[1] 17
max(fdata$RAge)
[1] 99

You can also use the range() function, which returns both the minimum and the maximum.

range(fdata$RAge)
[1] 17 99

Then we get the row index of the oldest and youngest respondents

oldest <- which(fdata$RAge == max(fdata$RAge))
youngest <- which(fdata$RAge == min(fdata$RAge))

Finally we can get the party affiliation of those respondents

fdata$party_self[oldest]
[1] other  Labour
Levels: Tories Labour SNP Greens Ukip BNP other
fdata$party_self[youngest]
[1] other
Levels: Tories Labour SNP Greens Ukip BNP other

Two respondents are 99 years old. One identifies with a party other than the six parties we listed, and the other identifies with Labour.

The youngest respondent is 17 and identifies with a party other than the six we listed.
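
Comparing against max() with which() matters when there are ties, as with the two 99-year-olds here. The sketch below, on a hypothetical toy vector (toy_age), contrasts it with which.max(), which returns only the first position of the maximum.

```r
# Hypothetical toy ages with a tie for the maximum
toy_age <- c(34, 99, 17, 99, 52)

# which.max() returns only the first position of the maximum
first_oldest <- which.max(toy_age)

# Comparing against max() returns every position that ties for the maximum
all_oldest <- which(toy_age == max(toy_age))

# Same idea for the minimum
all_youngest <- which(toy_age == min(toy_age))

print(first_oldest)
print(all_oldest)
print(all_youngest)
```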

2.2.10 Exercise 10

Find the 20th, 40th, 60th and 80th percentiles of RAge.

quantile(fdata$RAge, c(.2, .4, .6, .8))
20% 40% 60% 80% 
 33  44  55  66 
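
To make the meaning of a percentile concrete, the sketch below applies quantile() to a hypothetical toy vector containing the integers 1 to 101. With evenly spaced values like these, each percentile falls exactly on a data point, and the share of observations at or below the 20th percentile can be checked directly.

```r
# Hypothetical toy vector: the integers 1 to 101
toy_values <- 1:101

# 20th, 40th, 60th and 80th percentiles (R's default quantile method)
toy_percentiles <- quantile(toy_values, c(0.2, 0.4, 0.6, 0.8))
print(toy_percentiles)

# Share of observations at or below the 20th percentile
share_below_20th <- mean(toy_values <= toy_percentiles[["20%"]])
print(share_below_20th)
```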

2.2.11 Exercise 11

Create a box plot for IMMBRIT grouped by the paper variable to show the difference between IMMBRIT for people who read daily morning newspapers three or more times per week and people who do not.

boxplot( IMMBRIT ~ paper, data = fdata)

The two conditional distributions look identical. This plot shows no difference in the subjective number of immigrants for people who read daily morning newspapers and people who do not.
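
A visual comparison can be backed up with the numbers the box plot is built from. The sketch below uses a hypothetical toy data frame (toy, with made-up columns estimate and reads_paper) and computes the median and quartiles within each group.

```r
# Hypothetical toy data: an estimate and a 0/1 grouping variable
toy <- data.frame(
  estimate    = c(10, 20, 30, 40, 50, 12, 22, 28, 41, 50),
  reads_paper = rep(c(0, 1), each = 5)
)

# Median of the estimate within each group (the thick line in a box plot)
group_medians <- tapply(toy$estimate, toy$reads_paper, median)
print(group_medians)

# Quartiles within each group (the edges of the boxes)
group_quartiles <- tapply(toy$estimate, toy$reads_paper, quantile,
                          probs = c(0.25, 0.75))
print(group_quartiles)
```

If medians and quartiles are close across groups, the boxes will look alike, which is what the plot above suggests for the course data.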

2.2.12 Exercise 12

What is the mean of IMMBRIT for men and for women?

Men

mean(fdata$IMMBRIT[fdata$RSex==1])
[1] 24.53766

Women

mean(fdata$IMMBRIT[fdata$RSex==2])
[1] 32.79159

The mean for men is 24.54 and the mean for women is 32.79.

2.2.13 Exercise 13

What is the numerical difference between those two means?

mean(fdata$IMMBRIT[fdata$RSex==2]) - mean(fdata$IMMBRIT[fdata$RSex==1])
[1] 8.253937

The difference in means between women and men is 8.25. Put differently, on average women overestimate the number of immigrants by more than men do. The difference seems quite large at 8.25 per 100 (8.25 percentage points).
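
Both group means and their difference can also be obtained from a single tapply() call. This is a sketch on a hypothetical toy data frame (toy, with made-up columns estimate and sex), where the grouping variable is assumed to be coded 1 and 2 as RSex is above.

```r
# Hypothetical toy data: an estimate and a grouping variable coded 1 and 2
toy <- data.frame(
  estimate = c(20, 30, 25, 30, 40, 35),
  sex      = c(1, 1, 1, 2, 2, 2)
)

# Mean of the estimate within each group, named by the group code
group_means <- tapply(toy$estimate, toy$sex, mean)
print(group_means)

# Difference in means: group coded 2 minus group coded 1
mean_difference <- group_means[["2"]] - group_means[["1"]]
print(mean_difference)
```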