6  Text Scaling Models

6.1 Estimating Intra-Party Disagreement on Twitter

On July 7th, 2022, Boris Johnson – then Prime Minister of the UK – resigned, triggering one of the more ridiculous periods in contemporary British politics.1 His resignation marked the start of a Conservative Party leadership election (one of two held in the space of a few months in 2022) in which the main challengers were Liz Truss, the then Foreign Secretary, and Rishi Sunak, the then Chancellor of the Exchequer.

  • 1 Which, given the rest of contemporary British politics, is really saying something.

The leadership contest consisted of several rounds of voting among Conservative MPs in Parliament. In the first round of voting, there were several candidates for leadership who were whittled down to a smaller number in the second and subsequent rounds. Eventually, Conservative party members (“real” people rather than politicians) were able to vote on which candidate they preferred.

    In this assignment, we will use a corpus of tweets sent by Conservative MPs during this period to try to estimate a dimension of ideological disagreement within the Conservative Party.2 Our goal will be to try to locate each MP for whom we have Twitter data on this dimension, and use their estimated ideological position to predict whether they voted for Sunak or Truss in the final round of the leadership election.

  • 2 Measuring intra-party disagreement is a popular pastime of political scientists, largely because several important theories of political decision-making rely on spatial metaphors about political preferences. To test these theories, it is helpful to be able to locate different actors in particular ideological spaces.

6.2 Data

For today’s assignment we will use the conservative_mp_tweets.Rdata file, which contains a tibble object named mp_tweets. This tibble contains 114,022 tweets sent by 249 Conservative MPs during 2022. The data contains several variables, including the following:

Variables in the mp_tweets data.

Variable         Description
name             The name of the MP
party            The party of the MP (always “Conservative” here)
text             The text of the tweet
endorsed         Whether the MP endorsed Liz Truss or Rishi Sunak in the first round of the leadership election
endorsed_final   Whether the MP endorsed Liz Truss or Rishi Sunak in the final round of the leadership election
followers        The number of Twitter followers the MP has

    Once you have downloaded the file and stored it somewhere sensible, you can load it into R:

    load("conservative_mp_tweets.Rdata")

    You can take a quick look at the variables in the data by using the glimpse() function from the tidyverse package:

    glimpse(mp_tweets)
    Rows: 114,022
    Columns: 10
    $ name           <chr> "David T. C. Davies", "Chris Green", "Jo Churchill", "J…
    $ username       <chr> "DavidTCDavies", "CGreenUK", "Jochurchill_MP", "Jochurc…
    $ party          <chr> "Conservative", "Conservative", "Conservative", "Conser…
    $ constituency   <chr> "Monmouth", "Bolton West", "Bury St Edmunds", "Bury St …
    $ text           <chr> "Congratulations, Maureen. Very much deserved. https://…
    $ created_at     <chr> "2022-01-01T12:34:47.000Z", "2022-01-05T12:03:10.000Z",…
    $ author_id      <chr> "550076077", "1305528373", "2855674449", "2855674449", …
    $ endorsed       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Rishi …
    $ endorsed_final <chr> "Liz Truss", "Liz Truss", NA, NA, NA, "Liz Truss", NA, …
    $ followers      <dbl> 16978, 14950, 15071, 15071, 14930, 14950, 15071, 14950,…

    6.3 Packages

You will need to load the following packages before beginning the assignment:

    library(tidyverse)
    library(quanteda)
    library(quanteda.textmodels)
    library(quanteda.textplots)

    If you cannot load these libraries, try installing them first.
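For example, all four packages are on CRAN and can be installed in one go:

install.packages(c("tidyverse", "quanteda",
                   "quanteda.textmodels", "quanteda.textplots"))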

    6.4 Supervised Scaling – Wordscores

    We will start by using the “wordscores” method to scale the positions of each of the MPs. Wordscores is a supervised scaling method which requires that we have numeric scores assigned to each text in a “reference” set (analogous to the “training” set in supervised classification tasks), where the numeric values indicate the known position of each text on the dimension of interest.
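As a quick reminder of the mechanics (this is the original Laver, Benoit and Garry (2003) formulation of wordscores): each word \(w\) receives a score equal to the average of the reference positions \(A_r\), weighted by how strongly the word is associated with each reference text,

\[ S_w = \sum_r P_{wr} A_r, \qquad P_{wr} = \frac{F_{wr}}{\sum_{r'} F_{wr'}} \]

where \(F_{wr}\) is the relative frequency of word \(w\) in reference text \(r\). A virgin text \(v\) is then scored as the frequency-weighted average of the scores of the words it contains:

\[ \hat{S}_v = \sum_w F_{wv} S_w \]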

Here, we will use the endorsed variable to assign scores for the reference set. This variable has two values: "Rishi Sunak" if the MP endorsed Rishi Sunak at the beginning of the leadership election, and "Liz Truss" if the MP endorsed Liz Truss. Note that we have many missing values on this variable, as many MPs did not endorse either candidate in the first round of the election. The observations with missing values on this variable are therefore our “virgin” texts: we do not have information on their ideological positions, but we are going to use the wordscores model to learn those positions.
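Before fitting anything, it is worth checking how large the reference set actually is. Something along these lines (an optional check, not part of the numbered tasks) will tally how many MPs endorsed each candidate in the first round:

# Collapse to one row per MP (endorsed is constant within MP),
# then count first-round endorsements (NA = no endorsement)
mp_tweets %>%
  distinct(name, endorsed) %>%
  count(endorsed)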

1. Recode the endorsed variable into a new numeric variable which is equal to -1 for MPs who supported Rishi Sunak and 1 for those who supported Liz Truss. Do the same for the endorsed_final variable.
    Reveal code
    mp_tweets$endorsed_numeric <- NA
    mp_tweets$endorsed_numeric[mp_tweets$endorsed == "Rishi Sunak"] <- -1
    mp_tweets$endorsed_numeric[mp_tweets$endorsed == "Liz Truss"] <- 1
    
    mp_tweets$endorsed_final_numeric <- NA
    mp_tweets$endorsed_final_numeric[mp_tweets$endorsed_final == "Rishi Sunak"] <- -1
    mp_tweets$endorsed_final_numeric[mp_tweets$endorsed_final == "Liz Truss"] <- 1
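An equivalent recode in tidyverse style, using case_when() (which returns NA for any value it does not match), would be:

mp_tweets <- mp_tweets %>%
  mutate(endorsed_numeric = case_when(endorsed == "Rishi Sunak" ~ -1,
                                      endorsed == "Liz Truss" ~ 1),
         endorsed_final_numeric = case_when(endorsed_final == "Rishi Sunak" ~ -1,
                                            endorsed_final == "Liz Truss" ~ 1))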
2. Create a corpus from the mp_tweets data. This corpus will include each tweet as an individual document. However, we will estimate the wordscores model at the level of the MP, which means that we need to collapse the data to the MP level (i.e. we want a single text to represent each MP in the data).

To do so, use the corpus_group() function. This function combines documents that share a value of a given grouping variable by concatenating their texts together. Here, we will use the "name" variable to conduct the grouping.3

  • 3 A nice feature of this function is that any document-level metadata associated with the texts will be passed through to the new corpus object so long as they do not vary within groups. Here, this means that all the MP-level information in mp_tweets will end up being a part of the corpus we create.

Reveal code
    # Create corpus
    mp_tweets_corpus <- corpus(mp_tweets, text_field = "text")
    
    # Group texts by MP name
    mp_tweets_corpus_grouped <- mp_tweets_corpus %>% 
      corpus_group(name)
    
    # View metadata
    glimpse(docvars(mp_tweets_corpus_grouped))
    Rows: 249
    Columns: 10
    $ name                   <chr> "Aaron Bell", "Adam Afriyie", "Alan Mak", "Albe…
    $ username               <chr> "AaronBell4NUL", "AdamAfriyie", "AlanMakMP", "A…
    $ party                  <chr> "Conservative", "Conservative", "Conservative",…
    $ constituency           <chr> "Newcastle-under-Lyme", "Windsor", "Havant", "S…
    $ author_id              <chr> "240808845", "22031058", "2157036506", "3005893…
    $ endorsed               <chr> NA, NA, NA, NA, "Liz Truss", NA, NA, NA, NA, NA…
    $ endorsed_final         <chr> "Liz Truss", NA, "Rishi Sunak", "Liz Truss", "L…
    $ followers              <dbl> 7565, 17938, 11412, 11656, 17334, 7458, 8285, 2…
    $ endorsed_numeric       <dbl> NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, N…
    $ endorsed_final_numeric <dbl> 1, NA, -1, 1, 1, 1, 1, -1, NA, 1, 1, 1, 1, -1, …
3. Use the corpus to create a dfm. Make some sensible feature-selection decisions.4
  • 4 Some of the modelling functions later in the assignment will take a long time to run if you have too many features in your dfm. I’d therefore suggest you make liberal use of the dfm_trim function.

Reveal code
    tory_dfm <- mp_tweets_corpus_grouped %>% 
      tokens(remove_punct = TRUE,
             remove_symbols = TRUE,
             remove_url = TRUE) %>%
      tokens_remove(stopwords("en")) %>%
      dfm()
    
    # Trim common and rare words
    tory_dfm <- tory_dfm %>%
      dfm_trim(min_docfreq = 0.1,
               max_docfreq = .9,
               docfreq_type = "prop") 
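Before moving on, it is sensible to check what the trimming has left you with, for example:

# How many documents and features remain, and which features are most frequent?
ndoc(tory_dfm)
nfeat(tory_dfm)
topfeatures(tory_dfm, 20)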
4. Use the textmodel_wordscores() function to estimate wordscores on the tory_dfm object that you created above. This function requires two arguments:

      1. x, for the dfm you are using to estimate the model
  2. y, for the vector of reference scores associated with each training document (i.e. the variable you created in the answer above).
    Reveal code
    tory_wordscores_model <- textmodel_wordscores(x = tory_dfm,
                                                  y = tory_dfm$endorsed_numeric)
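If you are curious which words the model treats as most Sunak-flavoured or most Truss-flavoured, one optional way to look is the scale plot from quanteda.textplots:

# Plot the estimated word scores; words at the extremes are those most strongly
# associated with the language of Sunak- or Truss-endorsing MPs
textplot_scale1d(tory_wordscores_model, margin = "features")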
5. Use the predict() function to generate wordscores for all of the texts in the tory_dfm data (i.e. both reference texts and virgin texts).
    Reveal code
    tory_dfm$wordscores <- predict(tory_wordscores_model, 
                                   tory_dfm,
                                   rescaling = "lbg")
    Warning: 1 feature in newdata not used in prediction.
6. Make a plot with your estimated wordscores on the x-axis and a binary indicator for whether the MP endorsed Liz Truss in the final round of the leadership election on the y-axis. What does this plot tell you about the predictive validity of the estimated wordscores?
    Reveal code
    docvars(tory_dfm) %>%
      ggplot(aes(x = wordscores, y = endorsed_final_numeric)) + 
      geom_point() + 
      theme_bw() + 
      xlab("Estimated Wordscore") + 
      ylab("Endorsed Truss versus Sunak in Final Round")

    There is a reasonably clear relationship here: MPs with higher wordscores are more likely to have endorsed Liz Truss in the final round of the leadership election. This suggests that the wordscores model is doing a pretty good job of estimating the ideological positions of the MPs relative to one another. Not bad for an analysis based on Twitter data!
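If you want to go beyond eyeballing the plot, one optional way to put a number on this relationship is a simple logistic regression of the final-round endorsement on the estimated wordscore:

# 1 = endorsed Truss in the final round, 0 = endorsed Sunak (NAs are dropped by glm)
endorsement_data <- docvars(tory_dfm) %>%
  mutate(endorsed_truss_final = as.integer(endorsed_final_numeric == 1))

summary(glm(endorsed_truss_final ~ wordscores,
            data = endorsement_data,
            family = binomial))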

    6.5 Unsupervised Scaling – Wordfish

    We will now use an unsupervised model to scale the MPs in the Twitter data. In contrast to wordscores, unsupervised methods do not require any labelled training data. Rather, unsupervised scaling models will place similar documents close together on the estimated dimension, and documents that are different from one another further apart.
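As a reminder, the wordfish model (Slapin and Proksch, 2008) treats the count of word \(j\) in document \(i\) as Poisson-distributed:

\[ y_{ij} \sim \text{Poisson}(\lambda_{ij}), \qquad \log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i \]

where \(\alpha_i\) captures document length/verbosity, \(\psi_j\) captures how common word \(j\) is overall, \(\beta_j\) is the word’s discrimination parameter, and \(\theta_i\) is the document position we are trying to estimate.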

    1. Estimate a “wordfish” model using the textmodel_wordfish() function applied to the tory_dfm object that you created above. This function takes two arguments.

      1. x – the dfm that you are using to estimate the model
      2. dir – a vector of document indexes which indicates the polarity of the estimated dimension5
  • 5 As discussed in the lecture, the wordfish model is identified only up to a sign-flip. To identify the model, we need to fix the relative position of two \(\theta_i\)s (the estimated ideal points) in order to specify the “direction” of the dimension. The dir argument allows us to do this.

Reveal code
    tory_wordfish_model <- textmodel_wordfish(x = tory_dfm,
                                              dir = c(which(tory_dfm$name == "Rishi Sunak"),
                                                      which(tory_dfm$name == "Elizabeth Truss")))
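A quick, optional way to eyeball the fitted model is the one-dimensional scale plot from quanteda.textplots, which shows each MP’s estimated \(\theta_i\) with a confidence interval (it will be crowded with 249 MPs, but the ordering is readable):

textplot_scale1d(tory_wordfish_model, margin = "documents")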
2. Extract the estimated positions for each of the documents from the fitted wordfish model object. These are stored in wordfish_model_object$theta. Calculate the correlation between these estimated positions and the documents’ wordscores that you estimated above. Do the two approaches capture the same dimension of disagreement?
    Reveal code
    tory_dfm$wordfish <- tory_wordfish_model$theta
    
    cor(tory_dfm$wordfish, tory_dfm$wordscores)
    [1] 0.2806349

The correlation between the estimated wordscores and wordfish positions is relatively modest. Recall that wordscores estimates document positions on the basis of the words that best discriminate between groups of texts in the reference set, while wordfish estimates document positions on the basis of whatever is the primary dimension of variation in word use across the texts.
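A scatterplot of the two sets of estimates (optional) makes the same point visually:

# Wordscores versus wordfish estimates for each MP
docvars(tory_dfm) %>%
  ggplot(aes(x = wordscores, y = wordfish)) +
  geom_point() +
  theme_bw() +
  xlab("Wordscores estimate") +
  ylab("Wordfish estimate")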

3. What are the main discriminating words on the estimated wordfish dimension?6
  • 6 You can find the word-discrimination parameters using the beta element of the estimated wordfish model object. You will need to use these to order() the features element of the estimated object.

Reveal code
    tory_wordfish_model$features[order(tory_wordfish_model$beta, decreasing = T)][1:20]
     [1] "recession"   "@nato"       "losses"      "🇩🇪"          "imports"    
     [6] "🇯🇵"          "protocol"    "deepen"      "cooperation" "revenue"    
    [11] "ties"        "@un"         "🇮🇹"          "rises"       "russia’s"   
    [16] "🇪🇺"          "dependence"  "borrowing"   "belfast"     "brussels"   
    tory_wordfish_model$features[order(tory_wordfish_model$beta, decreasing = F)][1:20]
     [1] "derby"                 "@birminghamcg22"       "#worldmentalhealthday"
     [4] "abbey"                 "neil"                  "anne"                 
     [7] "relay"                 "headquarters"          "stoke"                
    [10] "baton"                 "30pm"                  "@age_uk"              
    [13] "peak"                  "stalls"                "drop-in"              
    [16] "@fa"                   "@dcms"                 "nursery"              
    [19] "café"                  "meal"                 

It is very hard to tell from these words what the main dimension is actually capturing substantively. There is some evidence that one end of the dimension is focused more on language related to international politics and the economy (e.g. “@nato”, “@un”, “cooperation”, “borrowing”), but the other set of words is hard to characterise in any straightforward way. These words seem to include a mixture of different issues (“#worldmentalhealthday”, “stoke”, “abbey”, “nursery”) and then a grab-bag of other things.
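Another (optional) way to inspect the same information is the standard wordfish plot of each word’s discrimination parameter \(\beta_j\) against its frequency parameter \(\psi_j\); words far out on the horizontal axis are the ones doing most of the discriminating:

textplot_scale1d(tory_wordfish_model, margin = "features")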

4. Report the names of the 10 MPs at the two extremes of the estimated ideological dimension. Do these names suggest anything to you about the meaning of this dimension?
    Reveal code
    tory_dfm$name[order(tory_wordfish_model$theta, decreasing = T)[1:10]]
     [1] "Elizabeth Truss"      "John Redwood"         "Leo Docherty"        
     [4] "James Cleverly"       "Kwasi Kwarteng"       "Mark Pritchard"      
     [7] "Alok Sharma"          "Anne-Marie Trevelyan" "Amanda Milling"      
    [10] "Marcus Fysh"         
    tory_dfm$name[order(tory_wordfish_model$theta, decreasing = F)[1:10]]
     [1] "Amanda Solloway"   "Anne Marie Morris" "Nigel Huddleston" 
     [4] "David Evennett"    "Pauline Latham"    "Alex Burghart"    
     [7] "Jason McCartney"   "Jack Lopresti"     "Neil Hudson"      
    [10] "Alan Mak"         

    The results of this task are somewhat more informative. The first set of MPs are largely important members of the Conservative party – either cabinet ministers, prominent committee chairs, or people who have held other political leadership positions. The second set of MPs are mostly thoroughly unimportant members of parliament.

    This suggests that the model is essentially discovering that the main source of variation in word use in parliamentarians’ tweets is related to their seniority in the party. The model has learned that high-profile and important MPs tend to use certain sets of words more often (notably, words associated with international politics and the business of government), while less senior and important MPs tend to use other words.

5. Plot the estimated positions of each MP against the number of followers that the MP has (plot the followers on the log scale). What does this plot suggest about the meaning of the estimated wordfish dimension?
    Reveal code
    tory_dfm$wordfish <- tory_wordfish_model$theta
    
    docvars(tory_dfm) %>%
      ggplot(aes(x = log(followers), y = wordfish)) + 
      geom_point() + 
      theme_bw() + 
      xlab("Log twitter followers") + 
      ylab("Wordfish estimate")

    This plot reinforces the interpretation above. There is a clear relationship between the wordfish estimates of “ideological” position and the number of Twitter followers an MP has. This is an example of one of the core issues discussed in the lecture: the wordfish model will recover a dimension that reflects the primary source of linguistic variation. In the texts used here, variation in word use does not appear to be primarily driven by differences in ideology, but rather by differences in prominence/influence/seniority within the Conservative Party. This is a good reminder that we shouldn’t apply wordfish and simply assume that we are automatically measuring political ideology – we need to validate our estimates before interpreting them!
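If you want a single number to accompany the plot, a rank correlation between follower counts and the wordfish estimates (optional; Spearman’s correlation avoids any issues with logging follower counts of zero) will summarise the relationship:

# Rank correlation between Twitter followers and estimated wordfish positions
cor(tory_dfm$followers, tory_dfm$wordfish, method = "spearman")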

    6.6 Homework

    Apply either the wordfish or wordscores methods to one of the other datasets that we have used on the course so far (i.e. Guardian articles, NHS reviews, constitutions, reddit posts, inaugural speeches, etc). Inspect the output of your model and create at least one plot that communicates something interesting about what you have found. Write a short description of what you have done and upload it to this Moodle page.