8 Unsupervised Scale Measurement I: Interval-Level Indicators
Topics: Learning scale weights from sample covariation. Principal Components Analysis (PCA). Exploratory Factor Analysis (EFA).
Required reading:
- Chapter 11, Pragmatic Social Measurement
Further reading:
Theory
- James et al. (2013), Ch 10-10.2
- Everitt and Hothorn (2011), Ch 3 & 5
- Bartholomew et al. (2008), Ch 5 & 7
Applications
8.1 Seminar
This week’s assignment uses a set of variables describing the economic characteristics of UK parliamentary constituencies (excluding Northern Ireland) around 2017-2019; the exact year varies by source.
This data file has 8 variables:
- ONSConstID: Office for National Statistics Parliamentary Constituency ID
- ConstituencyName: Constituency Name
- HouseWageRatio: Ratio of House Prices to Wages
- UnempConstRate: Unemployment Rate
- UnempConstRateChange: Unemployment Rate Change since 2010
- WageMedianConst: Median Wage
- social_mobility_score: Social Mobility Index
- deprivation_index_score: Social Deprivation Index
Some of these variables have missing data for Scotland and Wales, so we will exclude constituencies in those nations. Use the following commands to load the data and keep only the English constituencies:
econ_vars <- read.csv("4_data/week-7-econ-vars.csv")
econ_vars <- econ_vars[is.element(substr(econ_vars$ONSConstID,1,1),c("E")),]
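The filter above keeps rows whose constituency ID begins with "E". As a minimal sketch of what that does, the toy data frame below stands in for the real file; its IDs and values are made up for illustration and are not taken from the course data.

```r
# Toy stand-in for the real file; these IDs and values are invented.
toy <- data.frame(
  ONSConstID = c("E14000001", "S14000002", "W07000003", "E14000004"),
  UnempConstRate = c(3.1, NA, 4.0, 2.5)
)

# Keep rows whose ID starts with "E", mirroring the is.element() filter above.
first_letter <- substr(toy$ONSConstID, 1, 1)
toy_england <- toy[first_letter %in% c("E"), ]

# Sanity checks: how many rows survive, and is anything still missing?
nrow(toy_england)
colSums(is.na(toy_england))
```

Running the same two checks on econ_vars after filtering is a quick way to confirm that the remaining rows have complete data before fitting anything.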
- Assess the correlations between the six economic variables in the data set. Which two economic variables are most highly correlated with one another at the constituency level? Which variable is least correlated with the others at the constituency level? Hint: Use cor() to get the pairwise correlations and either pairs() or ggpairs() from the GGally package to plot them.
- Use the command
pcafit <- prcomp(econ_vars[,4:9],scale.=TRUE)
to calculate the principal components of these six economic variables. Then examine the object pcafit directly and also through summary(pcafit). Which variable has the smallest (in magnitude) “loading” on the first principal component? How does this relate to your answer in Q1?
- Construct scree plots using either the type="barplot" or the type="lines" option of the screeplot() command. Given these plots and the output of summary(pcafit) above, is it clear how many dimensions are needed to describe these data well?
- Check the signs of the PC1 loadings for each variable in the model. For each variable, write a sentence of the form “[The ratio of house prices to wages] is [positively/negatively] correlated with the first principal component”. Do these all make sense collectively? You could also try writing a sentence of the form: “Places that are high on the first principal component are [high/low] in the house to wage ratio, [high/low] in unemployment,…” What does this tell us about what the first principal component is measuring?
- Are you able to identify what PC2 is capturing?
- Re-do the principal components analysis without the variable that has the smallest magnitude loading on the first principal component. Extract the first principal component from the original analysis with all six variables (using pcafit$x[,1]) and also from this new analysis with five variables. Plot them against one another and check their correlation. Explain why you find what you find.
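The mechanics of the steps above can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame sim, its columns a to d, and the single latent factor are all hypothetical, and nothing here uses or reveals the constituency data. Selecting columns by name, as the sketch does, also avoids relying on column positions such as 4:9; it is worth checking names(econ_vars) to confirm which columns those positions pick out.

```r
set.seed(1)
n <- 200
latent <- rnorm(n)                       # hypothetical common factor

# Simulated indicators (assumption: not the assignment data).
sim <- data.frame(
  a = latent + rnorm(n, sd = 0.5),
  b = -latent + rnorm(n, sd = 0.5),      # moves opposite to the factor
  c = latent + rnorm(n, sd = 0.5),
  d = rnorm(n)                           # unrelated to the other three
)

# Pairwise correlations between the indicators.
round(cor(sim), 2)

# PCA on standardised variables; loadings are stored in $rotation.
fit_all <- prcomp(sim, scale. = TRUE)
pc1_loadings <- fit_all$rotation[, 1]
summary(fit_all)

# Proportion of variance per component, the quantity a scree plot displays.
prop_var <- fit_all$sdev^2 / sum(fit_all$sdev^2)
screeplot(fit_all, type = "lines")

# The sign of each PC1 loading matches the sign of that variable's
# correlation with the PC1 scores.
cor_with_pc1 <- cor(sim, fit_all$x[, 1])[, 1]

# Refit without the variable that has the smallest-magnitude PC1 loading.
weakest <- names(which.min(abs(pc1_loadings)))
fit_sub <- prcomp(sim[, setdiff(names(sim), weakest)], scale. = TRUE)

# Compare the two sets of PC1 scores.
plot(fit_all$x[, 1], fit_sub$x[, 1])
pc1_cor <- cor(fit_all$x[, 1], fit_sub$x[, 1])
```

The overall sign of a principal component is arbitrary: the loadings and scores can all flip together without changing the fit. When comparing PC1 across two analyses, it is therefore the magnitude of the correlation that matters, and a value near -1 indicates the same agreement as a value near +1.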
References
Bartholomew, David J., Fiona Steele, Jane Galbraith, and Irini Moustaki. 2008. Analysis of Multivariate Social Science Data. Chapman and Hall/CRC.
Everitt, Brian, and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer Science & Business Media.