2 Descriptive Statistics


2.1 Seminar

Create a new script with the following lines at the top, and save it as seminar2.R

rm(list = ls())
setwd("~/PUBL0055")

What are these two lines of code doing? (It is worth paying attention here, because it will make the next 9 weeks easier!)

  1. rm(list = ls()) is just telling R to remove everything from your current environment. For instance, if you create an object like we did in last week’s seminar using something like my_object <- seq(1,10,1), and then you run rm(list = ls()), that object will disappear from the environment panel in RStudio and you will no longer be able to access it. We normally put this line at the top of each script we work with so that we are beginning our analysis fresh each time.
  2. setwd("~/PUBL0055") tells R that you would like to work from (“set”) the folder (or, “working directory”) that was created last week when you ran the initial setup script.
    CAUTION: Do not move or rename this folder unnecessarily. But if you subsequently moved this folder to somewhere else on your computer, you will need to modify the setwd line. For example, if you took the PUBL0055 folder and put it inside your Academic folder on your computer, you would need to run setwd("~/Academic/PUBL0055") instead.

2.1.1 Loading data

The first step in any data analysis project is to load the data into your software of choice. Data come in many different file formats such as .csv, .tab, .dta, etc. Today we will load a dataset which is stored in R’s native file format: .RData. The function to load data from this file format is called: load(). If you managed to set your working directory correctly just now (setwd("~/PUBL0055")), then you should just be able to run the line of code below.

load("non_western_foreingners.RData")

The non-western foreigners data is about the subjective perception of immigrants from non-western countries.

Let’s check the codebook of our data.

Variable Description
IMMBRIT Out of every 100 people in Britain.
How many do you think are immigrants from non-western countries?
over.estimate 1 if estimate is higher than 10.7%.
RSex 1 = male, 2 = female
RAge Age of respondent
Househld Number of people living in respondent’s household
party_self 1 = Conservatives
2 = Labour
3 = SNP
4 = Greens
5 = Ukip
6 = BNP
7 = other
paper Do you normally read any daily morning newspaper 3+ times/week?
WWWhourspW How many hours WWW per week?
religious Do you regard yourself as belonging to any particular religion?
employMonths How many mnths w. present employer?
urban Population density, 4 categories (highest density is 4, lowest is 1)
health.good How is your health in general for someone of your age?
0 = bad
1 = fair
2 = fairly good
3 = good)
HHInc Income bands for household, high number = high HH income

We can look at the variable names in our data with the names() function.

names(fdata)
 [1] "IMMBRIT"       "over.estimate" "RSex"          "RAge"         
 [5] "Househld"      "paper"         "WWWhourspW"    "religious"    
 [9] "employMonths"  "urban"         "health.good"   "HHInc"        
[13] "party_self"   

The dim() function can be used to find out the dimensions of the dataset (dimension 1 = rows, dimension 2 = columns).

dim(fdata)
[1] 1049   13

So, the dim() function tells us that we have data from 1049 respondents with 13 variables for each respondent.

Let’s take a quick peek at the first 10 observations to see what the dataset looks like. By default the head() function returns the first 6 rows, but let’s tell it to return the first 10 rows instead.

head(fdata, n = 10)
   IMMBRIT over.estimate RSex RAge Househld paper WWWhourspW religious
1        1             0    1   50        2     0          1         0
2       50             1    2   18        3     0          4         0
3       50             1    2   60        1     0          1         0
4       15             1    2   77        2     1          2         1
5       20             1    2   67        1     0          1         1
6       30             1    1   30        4     1         14         0
7       60             1    2   56        2     0          5         1
8        7             0    1   49        1     1          8         0
9       30             1    1   40        4     0          3         1
10       2             0    1   61        3     1          0         1
   employMonths urban health.good HHInc party_self
1            72     4           1    13          2
2            72     4           2     3          7
3           456     3           3     9          7
4            72     1           3     8          7
5            72     3           3     9          7
6            72     1           2     9          7
7           180     1           2    13          3
8           156     4           2    14          7
9           264     2           2    11          3
10           72     1           3     8          1

We can visualize the data with the help of a boxplot, so let’s see how the perception of the number of immigrants is distributed.

# how good are we at guessing immigration
boxplot(
  fdata$IMMBRIT, 
  main = "Perception of Immigration from Non-Western Countries",
  ylab = "Subjective number of immigrants per 100 British", 
  frame.plot = FALSE, col = "darkgray"
)

Notice how the $ (dollar sign) is used to access the IMMBRIT variable from fdata. Our data is stored in what is called a data frame in R. We can think of a data frame as a table of rows and columns and similar to what most of us have seen as spreadsheets in programs like Microsoft Excel. Data frames can be created manually with the data.frame() function but are usually created by functions like read.csv() or read_excel() that load datasets from a file or from a web URL.

In the non-western foreignerns dataset, every column has a name and can be accessed using this $ (dollar sign) syntax. For example, household income can be accessed with fdata$HHInc.

2.1.2 Central Tendency

2.1.2.1 Arithmetic mean and median

The arithmetic mean (or average) is the most widely used measure of central tendency. We all know how to calculate the mean by dividing the sum of all observations by the number of observations. In R, we can get the sum with the sum() function:

sum(fdata$IMMBRIT)
[1] 30453

and the number of observations with the length() function:

length(fdata$IMMBRIT)
[1] 1049

So if we were interested in calculating the mean of the subjective proportion of immigrants in Britian (the true level at the time was 10.7), we could manually do the calculation of 30453/1049, but we can also combine these to functions in R in the following way:

sum(fdata$IMMBRIT) / length(fdata$IMMBRIT)
[1] 29.03051

R provides us with an even easier way to do the same with a function called mean(). Similarly, R also provides a median() function for finding the median. Now go ahead and calculate mean and median of IMMBRIT using those two functions.

immbrit_mean <- mean(fdata$IMMBRIT)
immbrit_mean
[1] 29.03051
immbrit_median <- median(fdata$IMMBRIT)
immbrit_median
[1] 25

2.1.2.2 Mode

There is no built-in function in base-R that returns the mode from a dataset. There are numerous packages that provide a mode function, but it’s quite simple to do it ourselves and it’s a good exercise for exploring some of R’s data manipulation functions.

We start by tabulating the data first and since our dataset is quite small, we can easily spot the mode with visual inspection. We will look at household income. The first row is the value of the variable and the second is its frequency in the data.

hhinc <- table(fdata$HHInc)
hhinc

  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
 19  47  50  55  61  73  55  33 212  43  48  32  66  49  38  39 129 

The table() function tells us the number of respondents that fall into a household income band (the higher the value the richer the respondent). For example, 19 respondents are in income band 1, 47 fall into income band 2 and so on. The mode in this case is 9 which occurs 212 times.

But we’d really like to find the mode programmatically instead of manually scanning a long list of numbers. So let’s sort hhinc from largest to the smallest so we can pick out the mode. The sort() function allows us to do that with the decreasing = TRUE option.

sorted_hhinc <- sort(hhinc, decreasing = TRUE)
sorted_hhinc

  9  17   6  13   5   4   7   3  14  11   2  10  16  15   8  12   1 
212 129  73  66  61  55  55  50  49  48  47  43  39  38  33  32  19 

Because we sorted our list from largest to the smallest, the mode is the first element in the list.

hhinc_mode <- sorted_hhinc[1]
hhinc_mode
  9 
212 

2.1.3 Factor Variables

Next, we want to find the maximum guess of the percentage of immigrants and we also want to see the party affiliation of the corresponding respondents. Our party affiliation variable party_self is a categorical variable. Each number (see codebook at the top) refers to a qualitative category. The numbers themselves don’t imply an ordering. In R, when categorical variables are stored as numeric data and if there are more than two distinct categories, we must convert them to factor variables to ensure that categorical data are handled correctly in functions that implement statistical models, tables and graphs.

We can convert categorical variables to factor variables using the factor() function. The factor() function needs the categorical variable and the distinct labels for each category (such as “Conservative”, “Labour”, “SNP” etc.) as the two arguments for creating factor variables.

fdata$party_self <- factor(
  fdata$party_self, 
  labels = c("Tories", "Labour", "SNP", "Greens", "Ukip", "BNP", "other")
)

Let’s examine the new factor variable with a frequency table.

table(fdata$party_self)

Tories Labour    SNP Greens   Ukip    BNP  other 
   284    280     16     23     31     32    383 

Back to our original task: finding the party affiliation of those respondents who answered the maximum to the immigrant question. The max() function tells us the maximum value of IMMBRIT, and with the which() function, we identify the respondents who answered it. Notice, the double equal sign is a logical evaluation that can be true or false (check your cheat-sheet for other logical conditions). We can read the line with the which() function below as: in which observations (rows) in the data set is the value of the IMMBRIT variable equal to the maximum of the IMMBRIT variable.

max(fdata$IMMBRIT)
[1] 100
which(fdata$IMMBRIT == max(fdata$IMMBRIT))
[1] 346 573

Last week we learned that we can index datasets using the square brackets or the dollar sign. We can also combine the two. Below, we index the rows in square brackets and the column with the dollar sign.

imm_max_resp <- which(fdata$IMMBRIT == max(fdata$IMMBRIT))
party_imm_max_resp <- fdata[imm_max_resp, ]$party_self
party_imm_max_resp
[1] Labour Labour
Levels: Tories Labour SNP Greens Ukip BNP other

Now we know that the two respondents who think there is a 1:1 proportion of non-western immigrants to British people both self-identify with the Labour party.

2.1.4 Dispersion

While measures of central tendency represent the typical values for a dataset, we’re often interested in knowing how spread apart the values in the dataset are. The range is the simplest measure of dispersion as it tells us the difference between the highest and lowest values in a dataset. Let’s continue with the perception of immigration example and find out the range of our dataset using the range() and diff() functions.

Your turn, find the maximum and minimum of IMMBRIT using the range() function and get the range by taking the difference of minimum and maximum with the diff() function.

immi_range <- range(fdata$IMMBRIT)
diff(immi_range)
[1] 100

Using a combination of range() and diff() functions we see that the difference between the two extremes (i.e. the highest and the lowest scores) in our dataset is 100.

2.1.4.1 Variance

Although the range gives us the difference between the two end points of the datasets, it still doesn’t quite give us a clear picture of the spread of the data. What we are really interested in is a measure that takes into account how far apart each immigration guess is from the mean. This is where the concept of variance comes in.

We can calculate the variance in R with the var() function. Take the variance of IMMBRIT.

var(fdata$IMMBRIT)
[1] 443.6632

2.1.4.2 Standard Deviation

The standard deviation is simply the square root of the variance we calculated in the last section. We could manually take the square root of the variance to find the standard deviation, but R also provides a convenient sd() function.

Take the standard deviation of IMMBRIT.

sqrt(var(fdata$IMMBRIT)) # one way 
[1] 21.06331
sd(fdata$IMMBRIT)  # another way
[1] 21.06331

2.1.4.3 Percentiles and Quantiles

The median value of a dataset is also known as the 50th percentile. It simply means that 50% of values are below the median and 50% are above it. In addition to the median, the 25th percentile (also known as the lower quartile) and 75th percentile (or the upper quartile) are commonly used to describe the distribution of values in a number of fields including standardized test scores, income and wealth statistics and healthcare measurements such as a baby’s birth weight or a child’s height compared to their respective peer group.

We can calculate percentiles using the quantile() function in R. For example, if we wanted to see the subjective number of immigrants in the 25th, 50th and 75th percentiles, we would call the quantile() function with c(0.25, 0.5, 0.75) as the second argument.

quantile(fdata$IMMBRIT, c(0.25, 0.5, 0.75))
25% 50% 75% 
 10  25  40 

The quantile() function tells us that 25% of respondents think the number is 10, 50% think the number is 25 or below and 75% think there are 40 or below immigrants per 100 British citizens.

2.1.5 Visualizing Data

Do people who self-identify with Ukip also think that there are more immigrants than people who identify with the Greens? To get a first idea, we can illustrate the subjective number of immigrants by party affiliation.

plot(fdata$IMMBRIT ~ fdata$party_self, 
     las = 2, 
     col = c("blue", "red", "yellow", "green", "purple", "brown", "gray"),
     frame.plot = FALSE, 
     xlab = "", 
     ylab = "Subj. # non-western immigrants per 100 British")

Now let’s see if we can visually examine the relationship between the number of months an employee has been with their employer and age using a scatter plot. Before we call the plot() function we need to reset the graphical parameters to make sure that our figure only contains a single plot by calling the par() function and using the mfrow=c(1,1) option.

plot(fdata$employMonths ~ fdata$RAge)

2.1.7 Exercises

  1. Create a new file called assignment2.R in your PUBL0055 folder and write all the solutions in it.
  2. Clear the workspace and set the working directory to your PUBL0055 folder.
  3. Load the non-western foreigners dataset.
  4. What is the level of measurement for each variable in the non-western foreigners dataset?
  5. Calculate the correct measure of central tendency for RAge, Househld, religious.
  6. Calculate the correct measure of dispersion for RAge, Househld, religious.
  7. How many respondents identify with the Greens?
  8. Calculate the variance and standard deviation of IMMBRIT for each party affiliation.
  9. Find the party affiliation of the oldest and youngest respondents.
  10. Find the 20th, 40th, 60th and 80th percentiles of RAge.
  11. Create a box plot for IMMBRIT grouped by the paper variable to show the difference between IMMBRIT for people who read daily morning newspapers three or more times per week and people who do not.
  12. What is the mean of IMMBRIT for men and for women?
  13. What is the numerical difference between those two means?
  14. Save your script, which should now include the answers to all the exercises.
  15. Source your script, i.e. run the entire script all at once. Fix the script if you get any error messages.