10 Unsupervised Class Measurement

Topics: Unsupervised classification (clustering, latent class analysis).

Required reading:

Further reading:

Theory


10.1 Seminar

The data for this assignment are an extract from the US Bureau of Economic Analysis data on the distribution of employment by sector across the 50 US states plus Washington DC (51 units in total) in 2018. You can directly load the data file into R from the web with the following command:

load("week-10-bea-employment-by-state.Rdata")
bea <- 100*bea # convert proportions to percentages, for easier interpretation

It is recommended to multiply bea by 100 to convert everything to percentage points, as it makes the numbers easier to read at various points, but you can leave the values as proportions if you prefer.

For this assignment, you will also need to install and load the R package mclust:

install.packages("mclust") # only needed once per R installation
library(mclust)

  1. What is the standard deviation of employment levels in each sector, across all US states? What do we learn from the relative magnitudes of these standard deviations?
  1. If we are using unsupervised methods like principal components analysis or k-means clustering, what is the argument for doing our analysis on unstandardised levels of employment in each sector? What is the argument for doing analysis on standardised levels of employment in each sector, where the standardisation is across states such that each sector has mean 0 and standard deviation 1?
  1. Use principal components analysis (see class assignment 9) on these data, using all indicators. Fit and save the principal components analysis with and without standardising the indicators. Plot the first (x-axis) and second (y-axis) principal components, with state name labels. Make two of these plots, one for PCA on standardised indicators and one for PCA on unstandardised indicators. You can create a new dataset with standardised indicators as below, but you can also make this choice using the argument scale.= of prcomp():
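bea_standardised <- apply(bea,2,scale) # apply to all indicators
rownames(bea_standardised) <- rownames(bea) # apply() drops the row (state) names, so restore them for plot labels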
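The steps so far (indicator standard deviations, then PCA with and without standardisation) can be sketched as follows. This is only an illustration under stated assumptions: it uses R's built-in USArrests data (50 states, 4 indicators) as a stand-in for bea, and the helper name plot_pcs is made up for this example.

```r
# Stand-in data: the built-in USArrests data set (50 states, 4 numeric columns).
# This is NOT the employment extract; swap in `bea` once it is loaded.
X <- USArrests

# Standard deviation of each indicator across states
indicator_sds <- apply(X, 2, sd)
round(indicator_sds, 2)

# PCA without and with standardising the indicators
pca_unstandardised <- prcomp(X, scale. = FALSE)
pca_standardised   <- prcomp(X, scale. = TRUE)

# Hypothetical helper: plot PC1 against PC2, labelling points by row name
plot_pcs <- function(pca, main) {
  plot(pca$x[, 1], pca$x[, 2], type = "n",
       xlab = "First principal component",
       ylab = "Second principal component", main = main)
  text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), cex = 0.6)
}

plot_pcs(pca_unstandardised, "PCA on unstandardised indicators")
plot_pcs(pca_standardised, "PCA on standardised indicators")
```

Comparing indicator_sds with the loadings in pca_unstandardised$rotation is one way to see how much the high-variance indicators drive the unstandardised solution.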
  1. Investigate the obvious outlier that will be apparent in the PCA plots. Which indicators lead to this unit being an outlier? What (extremely basic) fact about American political geography explains why this unit is an outlier?
  1. Use k-means clustering on these data, using all indicators and \(k=5\) clusters. As with PCA, fit and save the k-means cluster analysis with and without standardising the indicators. Use the nstart = 50 option for kmeans() to ensure that you get a stable result (this will run the algorithm 50 times from different start values and select the best-fitting solution). Update your principal components plots, colouring by cluster.
  1. Describe the five clusters that you find using the unstandardised indicators. What distinguishes them from one another? You will need to explore a bit in the indicator data, and also look at the cluster means in your_kmeans_object$centers. You might want to make a nice table of the latter, rounding the values appropriately for easy reading and comparison.
  1. As part of your investigations you could use the following code to map the clusters. Note that you need to save your cluster assignments to cluster_assignments with something like cluster_assignments <- your_kmeans_object$cluster in order for this code to work.
library(ggplot2)
library(usmap)

cluster_assignments <- kmeans_unstandardised$cluster
classification_df <- data.frame(state=tolower(names(cluster_assignments)),
                                cluster=as.factor(cluster_assignments))

# And here is a US map of the assignments (DC is too small to see unless you zoom way in).
plot_usmap(regions="states", data=classification_df, values="cluster") + 
  labs(title="US States",
       subtitle="Clusters by Distribution of Employment Sectors.") + 
  scale_fill_brewer(palette = "Dark2") + 
  theme(panel.background = element_rect(color = "black",fill = "white"), 
        legend.position="right",
        legend.title = element_blank()) 
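The map code above assumes a fitted k-means object called kmeans_unstandardised. A minimal sketch of how such an object might be created and inspected is given here, again using the built-in USArrests data as a stand-in for bea (an assumption made so the example runs on its own); the seed value is arbitrary.

```r
# Stand-in data: built-in USArrests (50 states, 4 indicators), not the BEA extract.
X <- as.matrix(USArrests)
X_standardised <- scale(X)  # scale() keeps the state row names

set.seed(1)  # k-means uses random starts; a seed makes the run reproducible
kmeans_unstandardised <- kmeans(X, centers = 5, nstart = 50)
kmeans_standardised   <- kmeans(X_standardised, centers = 5, nstart = 50)

# Cluster means, rounded for easier comparison across clusters
round(kmeans_unstandardised$centers, 1)

# Number of states in each cluster
table(kmeans_unstandardised$cluster)

# Colour the principal components plot by cluster membership
pca_unstandardised <- prcomp(X, scale. = FALSE)
plot(pca_unstandardised$x[, 1], pca_unstandardised$x[, 2], type = "n",
     xlab = "First principal component", ylab = "Second principal component")
text(pca_unstandardised$x[, 1], pca_unstandardised$x[, 2],
     labels = rownames(X), col = kmeans_unstandardised$cluster, cex = 0.6)
```

Because the cluster vector inherits the row names of the input matrix, names(kmeans_unstandardised$cluster) gives the state names that the mapping code relies on.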
  1. Use the Mclust() function in the library(mclust) to fit Gaussian mixture models to the unstandardised data. Using the commands below, fit two 5-cluster models:
    1. One where all the clusters are assumed to have spherical, equal volume distributions of indicator values (EII), i.e. where every indicator has the same standard deviation (spherical) and this standard deviation is the same for all clusters (equal volume), and
    2. One where all the clusters are assumed to have diagonal, equal volume and shape distributions of indicator values (EEI), i.e. where each indicator might have a different standard deviation (diagonal), but these are the same for all clusters (equal volume and shape).

Have a look at the mean values of the first indicator for all five clusters for each of the models (EII and EEI), respectively. Also have a look at the variances of the first and second indicators, as well as the covariance between them (you can access the variance-covariance matrix for equal shape models via your_mclust_model$parameters$variance$Sigma).

library(mclust)
mclust_fit_eii <- Mclust(bea,G=5,modelNames = "EII") # spherical, equal volume (and shape)
mclust_fit_eei <- Mclust(bea,G=5,modelNames = "EEI") # diagonal, equal volume and shape
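For reference, the two model names can be read off mclust's general covariance decomposition, written out below. The symbols are notation introduced here for illustration: \(\Sigma_k\) is the covariance matrix of cluster \(k\), \(G\) the number of clusters, \(p\) the number of indicators, \(\lambda\) the volume, \(A\) the (diagonal) shape matrix and \(D\) the orientation matrix.

```latex
\[
\Sigma_k = \lambda_k D_k A_k D_k^{\top}, \qquad k = 1, \dots, G
\]
\[
\text{EII:}\quad \Sigma_k = \lambda I
\qquad\qquad
\text{EEI:}\quad \Sigma_k = \lambda A, \quad
A = \operatorname{diag}(a_1, \dots, a_p), \quad \prod_{j=1}^{p} a_j = 1
\]
```

In both models a single matrix is shared by all clusters, which is why one Sigma entry in the fitted object describes every cluster.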
  1. Compare the cluster assignments from these two models ($classification in the saved model object) to those from the k-means clustering. You should be able to find two clusterings that exactly match. Explain why. Remember that the specific numerical labels (1, 2, 3, 4 and 5) are completely arbitrary; it is the groupings that are meaningful.
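Because the labels are arbitrary, comparing two clusterings with == on the label vectors can be misleading. One possible approach, sketched here with made-up toy labels and a hypothetical helper called same_partition, is to cross-tabulate the two label vectors and check that each group in one clustering maps onto exactly one group in the other.

```r
# Hypothetical helper: TRUE if two label vectors define the same grouping,
# regardless of which number is attached to which group.
same_partition <- function(a, b) {
  tab <- table(a, b)
  # Identical groupings: every row and every column has exactly one non-empty cell
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

# Toy label vectors (made up for illustration)
labels_a <- c(1, 1, 2, 2, 3, 3)
labels_b <- c(3, 3, 1, 1, 2, 2)  # same groups as labels_a, relabelled
labels_c <- c(1, 2, 2, 2, 3, 3)  # second unit moved to a different group

table(labels_a, labels_b)           # one non-zero cell per row and column
same_partition(labels_a, labels_b)  # TRUE
same_partition(labels_a, labels_c)  # FALSE
```

With fitted models, the same check could be applied to a pair such as your_kmeans_object$cluster and your_mclust_model$classification.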

References

Bartholomew, David J., Fiona Steele, Jane Galbraith, and Irini Moustaki. 2008. Analysis of Multivariate Social Science Data. Chapman and Hall/CRC.
Everitt, Brian, and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer Science & Business Media.