7 Supervised Class Measurement

Topics: Assessing whether a target concept should be treated as continuous or categorical. Supervised classification (coding, training).

Required reading:

7.1 Seminar

This week we are going to use a set of questions about how respondents think about voting. These are four items, to which respondents could give Strongly Agree, Agree, Neither Agree nor Disagree, Disagree, or Strongly Disagree responses:

  1. (c02_1) Going to vote is a lot of effort
  2. (c02_2) I feel a sense of satisfaction when I vote
  3. (c02_3) It is every citizen’s duty to vote in an election
  4. (c02_4) Most of my friends and family think that voting is a waste of time.
load("4_data/week-8-bes-knowledge.Rdata")
  1. Attempt to specify a coding rule for classifying respondents according to whether they “think they have reasons to vote”. That is, we are trying to code respondents into people who think they have reasons to vote and people who do not. You should both write out in words what your coding rule is and implement it in R code, so that given the values of c02_1, c02_2, c02_3 and c02_4 you can calculate the classification of each respondent. How many people indicate that they have reason to vote and how many do not, under your coding rule? Note: Your coding rule will involve some judgement calls, and there is no one right answer here (there are nonetheless plenty of wrong answers!). You may decide that not all four indicators are relevant to this concept; that is fine. You will need to decide what to do with the small number of respondents who give “Don’t know” responses to one or more of these items; you should aim to classify all respondents. (A hedged sketch of one possible approach to questions 1-3 appears after the discussion below.)
  1. Cross-tabulate your classification against the variables turnout_self and turnout_validated. How well does your classification predict self-reported turnout and validated turnout, respectively?
    Note: Validated turnout is not available for all respondents. It is produced by British Election Study staff checking the marked voter register after the election to see whether the respondent actually voted.
  1. Now cross-tabulate your classification against the different combinations of turnout_self and turnout_validated and find out the following:
    1. the proportion of individuals who said they voted and were recorded as voting (the true positives) who have reasons to vote, according to your measure.
    2. the proportion of individuals who said they voted but were not recorded as actually voting (the false positives) who have reasons to vote, according to your measure.
    3. the proportion of individuals who said they did not vote and were recorded as not voting (the true negatives) who have reasons to vote, according to your measure.

Are the false positives more like the true positives or the true negatives in terms of whether they think they have good reasons to vote, as you have defined it? What might we learn from this? Note: there are too few “false negatives” (people claiming they did not vote when they were recorded as having done so) to learn much from this group, so focus on the true positives, the false positives and the true negatives.
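A minimal sketch of one possible approach to questions 1-3 follows. It assumes the data frame is called bes, that the response options carry the labels shown above, that the turnout variables are coded 0/1 (with missing validated turnout as NA), and that agreement with the satisfaction item (c02_2) or the duty item (c02_3) is what counts as having reasons to vote. Your own rule may reasonably differ, and the actual coding of the variables should be checked first.

```r
# One possible coding rule (a judgement call, not the only defensible one):
# a respondent "has reasons to vote" if they agree or strongly agree with
# either the satisfaction item (c02_2) or the duty item (c02_3).
# "Don't know" and all other responses count as not agreeing, so every
# respondent receives a classification.
agree <- c("Strongly agree", "Agree")
bes$has_reasons <- ifelse(bes$c02_2 %in% agree | bes$c02_3 %in% agree, 1, 0)
table(bes$has_reasons)

# Q2: cross-tabulate the classification against the two turnout measures
table(bes$has_reasons, bes$turnout_self)
table(bes$has_reasons, bes$turnout_validated)

# Q3: proportion with reasons to vote among true positives, false positives
# and true negatives (which() drops rows with missing validated turnout)
mean(bes$has_reasons[which(bes$turnout_self == 1 & bes$turnout_validated == 1)])  # true positives
mean(bes$has_reasons[which(bes$turnout_self == 1 & bes$turnout_validated == 0)])  # false positives
mean(bes$has_reasons[which(bes$turnout_self == 0 & bes$turnout_validated == 0)])  # true negatives
```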

  1. Now let’s take a different approach: rather than theoretically deriving a coding rule, train a model instead. Follow these steps (a sketch of the steps appears after the discussion below):
    1. Fit a logistic regression model predicting self-reported turnout (turnout_self) from the four indicators.
    2. Use the saved model object to construct probability predictions for all respondents using predict(model_object, type = "response") and save these as well.
    3. Construct dichotomous/binary predictions for each respondent, using 0.5 as the threshold.

What proportion of respondents are classified as more likely than not to vote, given the indicators? What proportion of respondents said they were voters? Given that we trained the model on this response data, why are these so different? If this is not obvious, you may want to take the mean of your probability predictions as well as of your binary classifications, and perhaps plot a histogram of the probability predictions.
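A minimal sketch of steps 1-3, again assuming the data frame is called bes and that turnout_self is stored in a form glm() can treat as a binary outcome (0/1 or a two-level factor); prob_self and pred_self are illustrative names, not variables in the data:

```r
# Step 1: logistic regression of self-reported turnout on the four items
model_self <- glm(turnout_self ~ c02_1 + c02_2 + c02_3 + c02_4,
                  data = bes, family = binomial)

# Step 2: probability predictions; newdata = bes keeps one prediction per row
bes$prob_self <- predict(model_self, newdata = bes, type = "response")

# Step 3: binary predictions with a 0.5 threshold
bes$pred_self <- ifelse(bes$prob_self > 0.5, 1, 0)

# Compare the share classified as likely voters with the share of self-reported
# voters (adjust "== 1" to however turnout_self is actually coded)
mean(bes$pred_self, na.rm = TRUE)
mean(bes$turnout_self == 1, na.rm = TRUE)

# A histogram of the probability predictions helps explain the difference
hist(bes$prob_self)
```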

  1. Repeat steps 1-3 of Q4, but this time training on the validated turnout data turnout_validated. You do not need to write up any discussion, just calculate the corresponding quantities in R. What is the correlation between the probability predictions based on training using self-reports and those based on training using validated turnout? If we look at the binary predictions, what proportion of respondents get each of the four possible combinations of predictions from the two models? Note: Use newdata=bes in predict() so that you construct fitted values for all observations (otherwise you will not get predictions for observations dropped due to missing turnout_validated).
  1. Create confusion matrices for your two models. Calculate the total error rate, sensitivity and specificity of each model.
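A minimal sketch of Q5 and Q6, continuing from the Q4 sketch above (prob_self, pred_self, prob_valid and pred_valid are illustrative names, and the confusion-matrix indexing assumes 0/1 coding with 0 in the first row and column of each table):

```r
# Q5: train the same specification on validated turnout
model_valid <- glm(turnout_validated ~ c02_1 + c02_2 + c02_3 + c02_4,
                   data = bes, family = binomial)
bes$prob_valid <- predict(model_valid, newdata = bes, type = "response")
bes$pred_valid <- ifelse(bes$prob_valid > 0.5, 1, 0)

# Correlation between the two sets of probability predictions
cor(bes$prob_self, bes$prob_valid, use = "complete.obs")

# Proportion of respondents in each combination of binary predictions
prop.table(table(self = bes$pred_self, validated = bes$pred_valid))

# Q6: confusion matrices (predicted classification against observed turnout)
conf_self  <- table(predicted = bes$pred_self,  observed = bes$turnout_self)
conf_valid <- table(predicted = bes$pred_valid, observed = bes$turnout_validated)

# Error rate, sensitivity and specificity for the self-report model;
# repeat with conf_valid for the validated-turnout model
error_rate  <- (conf_self[1, 2] + conf_self[2, 1]) / sum(conf_self)
sensitivity <- conf_self[2, 2] / sum(conf_self[, 2])  # true positives / all actual voters
specificity <- conf_self[1, 1] / sum(conf_self[, 1])  # true negatives / all actual non-voters
```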