10 Unsupervised Class Measurement
Topics: Unsupervised classification (clustering, latent class analysis).
Required reading:
- Chapter 13, Pragmatic Social Measurement
Further reading:
Theory
- Bartholomew et al. (2008), Ch 2 & 10
- Everitt and Hothorn (2011), Ch 6
- James et al. (2013), Ch 10.3
- Linzer & Lewis (2011), “poLCA: An R Package for Polytomous Variable Latent Class Analysis”, Journal of Statistical Software, 42(10). https://www.jstatsoft.org/htaccess.php?volume=42&type=i&issue=10&paper=true
- Grimmer & King (2011), “General Purpose Computer-Assisted Clustering and Conceptualization”, PNAS, 108(7), 2643-2650
- Ahlquist & Breunig (2012), “Model-based Clustering and Typologies in the Social Sciences”, Political Analysis, 20(1), 92-112
10.1 Seminar
The data for this assignment are an extract from US Bureau of Economic Analysis data on the distribution of employment by sector across the 50 US states plus Washington DC (51 units in total) in 2018. Once you have downloaded the data file to your working directory, you can load it into R with the following command:
load("week-10-bea-employment-by-state.Rdata")
bea <- 100*bea # convert proportions to percentages, for easier interpretation
It is recommended to multiply `bea` by 100 to convert everything to percentage points, as this makes the numbers easier to read at various points, but you can leave the values as proportions if you prefer.
For this assignment, you will also need to install and load the R package `mclust`:
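install.packages("mclust")  # one-time installation
library(mclust)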
- What is the standard deviation of employment levels in each sector, across all US states? What do we learn from the relative magnitudes of these standard deviations?
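A minimal sketch for computing these, assuming `bea` is loaded and converted as above:

# Standard deviation of each sector's employment share, across all 51 units
round(apply(bea, 2, sd), 2)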
- If we are using unsupervised methods like principal components analysis or k-means clustering, what is the argument for doing our analysis on unstandardised levels of employment in each sector? What is the argument for doing analysis on standardised levels of employment in each sector, where the standardisation is across states such that each sector has mean 0 and standard deviation 1?
- Use principal components analysis (see class assignment 9) on these data, using all indicators. Fit and save the principal components analysis with and without standardising the indicators. Plot the first (x-axis) and second (y-axis) principal components, with state name labels. Do two of these plots, one for PCA on standardised indicators and one for PCA on unstandardised indicators. You can create a new dataset with standardised indicators as below, but you can also make this choice using the `scale.=` argument of `prcomp()`:
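# Standardise the indicators: each sector gets mean 0, sd 1 across states
bea_std <- scale(bea)

# Fit PCA with and without standardisation (object names are illustrative);
# scale. = TRUE is equivalent to running prcomp() on bea_std
pca_unstd <- prcomp(bea)
pca_std <- prcomp(bea, scale. = TRUE)

# First two principal components with state name labels (repeat for pca_unstd)
plot(pca_std$x[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca_std$x[, 1:2], labels = rownames(pca_std$x), cex = 0.6)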
- Investigate the obvious outlier that will be apparent in the PCA plots. Which indicators lead to this unit being an outlier? What (extremely basic) fact about American political geography explains why this unit is an outlier?
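One way to track it down, as a sketch assuming the `pca_unstd` object from the code above:

bea_m <- as.matrix(bea)  # works whether bea is a matrix or a data frame
scores <- pca_unstd$x[, 1]  # scores on the first principal component
outlier <- names(which.max(abs(scores)))  # the most extreme unit on PC1
outlier

# Compare the outlier's sector shares with the average across all units
round(rbind(outlier = bea_m[outlier, ], average = colMeans(bea_m)), 1)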
- Use k-means clustering on these data, using all indicators and \(k=5\) clusters. As with PCA, fit and save the k-means cluster analysis with and without standardising the indicators. Use the `nstart = 50` option for `kmeans()` to ensure that you get a stable result (this will fit the algorithm 50 times from different start values, and select the best fitting one). Update your principal components plots, colouring by cluster.
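A minimal sketch; the object name `kmeans_unstandardised` is reused by the mapping code further below:

set.seed(123)  # optional: fixes the random start values for reproducibility
kmeans_unstandardised <- kmeans(bea, centers = 5, nstart = 50)
kmeans_standardised <- kmeans(scale(bea), centers = 5, nstart = 50)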
- Describe the five clusters that you find using the unstandardised indicators. What distinguishes them from one another? You will need to explore a bit in the indicator data, and also look at the cluster means in `your_kmeans_object$centers`. You might want to make a nice table of the latter, rounding the values appropriately for easy reading and comparison.
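For example, assuming the `kmeans_unstandardised` object from above:

# Cluster means by sector, transposed and rounded for easier comparison
round(t(kmeans_unstandardised$centers), 1)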
- As part of your investigations you could use the following code to map the clusters. Note that you need to assign your cluster assignments to `cluster_assignments` with something like `cluster_assignments <- your_kmeans_object$cluster` in order for this code to work.
library(ggplot2)
library(usmap)

# Cluster assignment for each state, named by state
cluster_assignments <- kmeans_unstandardised$cluster
classification_df <- data.frame(state = tolower(names(cluster_assignments)),
                                cluster = as.factor(cluster_assignments))

# And here is a US map of the assignments (DC is too small to see unless you zoom way in).
plot_usmap(regions = "states", data = classification_df, values = "cluster") +
  labs(title = "US States",
       subtitle = "Clusters by Distribution of Employment Sectors") +
  scale_fill_brewer(palette = "Dark2") +  # plot_usmap maps clusters to fill, not colour
  theme(panel.background = element_rect(color = "black", fill = "white"),
        legend.position = "right",
        legend.title = element_blank())
- Use the `Mclust()` function from the `mclust` package to fit Gaussian mixture models to the unstandardised data. Using the commands below, fit two 5-cluster models:
- One where all the clusters are assumed to have spherical, equal-volume distributions of indicator values (EII), i.e. where every indicator has the same standard deviation (spherical) and this is the same for all clusters (equal volume), and
- One where all the clusters are assumed to have diagonal, equal-volume distributions of indicator values (EEI), i.e. where each indicator may have a different standard deviation (diagonal), but these are the same for all clusters (equal volume).
Have a look at the mean values of the first indicator for all five clusters for each of the models (EII and EEI), respectively. Also have a look at the variance for the first and second indicators, respectively, as well as the covariance between them (you can access the variance-covariance matrix for equal-shape models via `your_mclust_model$parameters$variance$Sigma`).
library(mclust)
mclust_fit_eii <- Mclust(bea, G = 5, modelNames = "EII")  # spherical, equal volume (and shape)
mclust_fit_eei <- Mclust(bea, G = 5, modelNames = "EEI")  # diagonal, equal volume and shape
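To pull out the quantities mentioned above, something along these lines should work:

# Mean of the first indicator in each of the five clusters, per model
mclust_fit_eii$parameters$mean[1, ]
mclust_fit_eei$parameters$mean[1, ]

# Shared variance-covariance matrix: variances of the first two indicators
# on the diagonal, their covariance off the diagonal (zero in both models)
mclust_fit_eii$parameters$variance$Sigma[1:2, 1:2]
mclust_fit_eei$parameters$variance$Sigma[1:2, 1:2]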
- Compare the cluster assignments from these two models (`$classification` in the saved model object) to those from the k-means clustering. You should be able to find two clusterings that exactly match. Explain why. Remember that the specific numerical labels (1, 2, 3, 4 and 5) are completely arbitrary; it is the groupings that are meaningful.
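A cross-tabulation is a convenient way to compare clusterings despite the arbitrary labels; a sketch, assuming the objects fitted above:

# Rows: k-means clusters; columns: Mclust clusters. An exact match shows up
# as exactly one non-zero cell in each row and column, whatever the labels
table(kmeans_unstandardised$cluster, mclust_fit_eii$classification)
table(kmeans_unstandardised$cluster, mclust_fit_eei$classification)

# mclust's adjusted Rand index equals 1 for identical clusterings
adjustedRandIndex(kmeans_unstandardised$cluster, mclust_fit_eii$classification)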