8 Unsupervised Scale Measurement I: Interval-Level Indicators

Topics: Learning scale weights from sample covariation. Principal Components Analysis (PCA). Exploratory Factor Analysis (EFA).

Required reading:

8.1 Seminar

In this week’s assignment, we are going to look at a set of variables describing the economic characteristics of parliamentary constituencies in the UK (excluding Northern Ireland) around 2017-2019 (the exact year varies somewhat across the source data).

This data file has 8 variables:

  • ONSConstID - Office for National Statistics Parliamentary Constituency ID
  • ConstituencyName - Constituency Name
  • HouseWageRatio - Ratio of House Prices to Wages
  • UnempConstRate - Unemployment Rate
  • UnempConstRateChange - Unemployment Rate Change since 2010
  • WageMedianConst - Median Wage
  • social_mobility_score - Social Mobility Index
  • deprivation_index_score - Social Deprivation Index

Some of these variables have missing data for constituencies in Scotland and Wales, so we will exclude those constituencies. Use the following commands to load the data and keep only the constituencies whose ID begins with "E" (England):

econ_vars <- read.csv("4_data/week-7-econ-vars.csv")
econ_vars <- econ_vars[is.element(substr(econ_vars$ONSConstID,1,1),c("E")),]
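
To see what the filter above does, here is a minimal sketch on a toy data frame. The IDs and values are made up for illustration and are not taken from the data file; the sketch only assumes that the first character of ONSConstID identifies the country, as the filter itself implies.

```r
# Toy data frame; the IDs and values below are hypothetical.
toy <- data.frame(
  ONSConstID = c("E14000001", "S14000002", "W07000003", "E14000004"),
  UnempConstRate = c(3.1, NA, 4.2, 5.0),
  stringsAsFactors = FALSE
)

# Same logic as the filter above: keep rows whose ID begins with "E".
first_letter <- substr(toy$ONSConstID, 1, 1)
toy_england <- toy[is.element(first_letter, c("E")), ]

# startsWith() is an equivalent base R way to express the same test.
toy_england_alt <- toy[startsWith(toy$ONSConstID, "E"), ]

# Sanity check: count the missing values left in each column.
colSums(is.na(toy_england))
```

Running the same colSums(is.na(...)) check on econ_vars after filtering is a quick way to confirm that no missing values remain before fitting the PCA.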
  1. Assess the correlations between the six economic variables in the data set. Which two economic variables are most highly correlated with one another at the constituency level? Which variable is least correlated with the others at the constituency level? Hint: Use cor() to get the pairwise correlations and either pairs() or ggpairs() from the GGally package to plot them.
  2. Use the command pcafit <- prcomp(econ_vars[,4:9],scale.=TRUE) to calculate the principal components of these six economic variables. Then examine the object pcafit directly and also through summary(pcafit). Which variable has the smallest (in magnitude) “loading” on the first principal component? How does this relate to your answer in Q1?
  3. Construct screeplots using either the type="barplot" or the type="lines" option of the screeplot() command. Given this and the output of summary(pcafit) above, is it clear how many dimensions are needed to describe these data well?
  4. Check the signs of the loadings on PC1 for each variable in the model. For each variable, write a sentence of the form “[The ratio of house prices to wages] is [positively/negatively] correlated with the first principal component”. Do these all make sense collectively? You could also try writing a sentence of the form: “Places that are high on the first principal component are [high/low] in the house to wage ratio, [high/low] in unemployment,…” What does this tell us about what the first principal component is measuring?
  5. Are you able to identify what PC2 is capturing?
  6. Re-do the principal components analysis without the variable that has the smallest magnitude loading on the first principal component. Extract the first principal component from the original analysis with all six variables (using pcafit$x[,1]) and also from this new analysis with five variables. Plot them against one another and check their correlation. Explain why you find what you find.
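
The mechanics of the steps above can be tried out on simulated data before working with the constituency file. The following is a rough sketch under stated assumptions, not a solution: the data frame sim, its columns and the object names fit_all and fit_drop are all hypothetical, and the simulated structure (one latent factor plus a noise variable) is chosen only to make the output easy to read.

```r
set.seed(1)
n <- 200

# Simulated stand-in data (NOT the constituency data): four indicators
# driven by one latent factor, plus one pure-noise variable.
latent <- rnorm(n)
sim <- data.frame(
  a = latent + rnorm(n, sd = 0.5),
  b = -latent + rnorm(n, sd = 0.5),
  c = latent + rnorm(n, sd = 0.7),
  d = -latent + rnorm(n, sd = 0.7),
  noise = rnorm(n)
)

# Pairwise correlations between the variables.
round(cor(sim), 2)

# PCA on standardised variables, as with prcomp(..., scale. = TRUE).
fit_all <- prcomp(sim, scale. = TRUE)

# Proportion of variance explained by each component.
summary(fit_all)$importance["Proportion of Variance", ]

# Loadings on PC1: signs give the direction, magnitudes the weight.
pc1_loadings <- fit_all$rotation[, "PC1"]
weakest <- names(which.min(abs(pc1_loadings)))

# Refit without the variable that has the smallest PC1 loading.
fit_drop <- prcomp(sim[, names(sim) != weakest], scale. = TRUE)

# Compare the two sets of PC1 scores.
pc1_cor <- cor(fit_all$x[, 1], fit_drop$x[, 1])

# Plots are only drawn in an interactive session.
if (interactive()) {
  screeplot(fit_all, type = "lines")
  plot(fit_all$x[, 1], fit_drop$x[, 1],
       xlab = "PC1 (all variables)", ylab = "PC1 (weakest dropped)")
}
```

One thing to keep in mind when comparing the two fits is that the sign of a principal component is arbitrary, so the two sets of PC1 scores may be correlated close to -1 rather than +1. Looking at abs(pc1_cor) avoids being misled by a sign flip.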

References

Bartholomew, David J, Fiona Steele, Jane Galbraith, and Irini Moustaki. 2008. Analysis of Multivariate Social Science Data. Chapman and Hall/CRC.
Everitt, Brian, and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer Science & Business Media.