8  Word Embeddings

8.1 Similarities, Analogies and Dictionary Expansion

Word embeddings are all the rage.[1] Whenever we have represented words as data in previous weeks, we have simply counted how often they occur across and within documents. We have largely viewed words as individual units: strings of text that uniquely identify a given meaning, for which we have had no natural notion of similarity for grouping similar words together.

[1] A good qualitative indicator of the success of an innovation in quantitative methods is when it is discussed in some detail in the London Review of Books.

By contrast, word-embedding approaches represent each unique word in a corpus as a dense, real-valued vector. As we discussed in the lecture, these vectors turn out to encode a lot of information about the ways in which words are used, and this information can be put to good use across a wide range of questions in the social sciences.

    In the seminar today, we will familiarise ourselves with some of the pre-trained word embeddings from the GloVe project. We will use these vectors to discover similarities between words, to compute analogy-based tasks, and to supplement the dictionary-based approaches to measurement that we covered in week 2.

    8.2 Packages

You will need to load the following packages before beginning the assignment:

    library(tidyverse)
    library(quanteda)
    library(text2vec)
    # If you cannot load these libraries, try installing them first. E.g.: 
    # install.packages("text2vec")

    8.3 Data

Today we will be using the pre-trained GloVe embeddings, which can be downloaded from the link at the top of the seminar page. Note that the file which contains the word embeddings is very large! It may therefore take a minute or two to download, depending on your internet connection.

Despite the large size of the file, we are actually using one of the smaller versions of the GloVe embeddings, which were trained on a combination of Wikipedia and news data. The embeddings are of dimension 300 and cover some 400,000 words. Note that you could replicate any part of the assignment with larger versions of the GloVe embeddings by downloading them from the GloVe project website, but any differences for the applications here are likely to be small.
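If you did want to work with one of the raw GloVe text files instead of the .Rdata object provided, a minimal sketch for building an equivalent matrix is below. The file name glove.6B.300d.txt is an assumption about which version you downloaded, and for files of this size data.table::fread would be considerably faster than readLines().

# Read the raw GloVe text file: each line is a word followed by 300 values
# (assumes "glove.6B.300d.txt" has been downloaded and unzipped)
glove_raw <- readLines("glove.6B.300d.txt")
glove_split <- strsplit(glove_raw, split = " ")

# Build a 400,000 x 300 numeric matrix with words as row names
glove <- t(sapply(glove_split, function(x) as.numeric(x[-1])))
rownames(glove) <- sapply(glove_split, `[`, 1)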

    8.4 Word Similarities

1. Load the GloVe embeddings into R using the load() function.
    load("glove_embeddings.Rdata")
2. Look at the dimensions of the glove embeddings object. How many rows does this object have? How many columns? What do these represent?
    dim(glove)
    [1] 400000    300

Rows represent words. Columns represent the dimensions of the embedding.

3. Write a function to calculate the cosine similarity between a selected word and every other word in the glove embeddings object.[2] It will be useful for you to use the sim2() function[3] from the text2vec package here. Your function should take two inputs:

      1. target_word – the word for which you would like to calculate similarities
      2. n – the number of nearest neighbouring words returned
[2] In contrast to previous weeks, I have not given you the full solution here. Instead, I have provided some starter code below. You should be able to work out what goes in each part of this function by looking at the examples in the lecture slides. I will release the full code when I release the homework solutions at the end of the week.

[3] Note that this function requires two main arguments: 1) x – a matrix of embeddings; 2) y – a second matrix of embeddings for which you would like to compute similarities. It is important to note that both of these inputs must be in matrix form. As the embedding we extract for a single word (and, later, the mean embedding vector) is actually a numeric vector, we have to transform it to a matrix that has the same number of columns as the glove embeddings object. To do so, use the matrix() function, setting the nrow argument equal to 1.
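To illustrate the point in that footnote, here is a small sketch of the conversion (assuming the glove object is loaded and “king” is in its vocabulary):

# Extracting a single row from the matrix returns a plain numeric vector
v <- glove["king", ]
is.matrix(v)   # FALSE

# sim2() needs a matrix, so reshape to one row of 300 columns
v_mat <- matrix(v, nrow = 1)
dim(v_mat)     # 1 300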

    similarities <- function(target_word, n){
    
      # Extract embedding of target word
      target_vector <- glove[which(rownames(glove) %in% target_word),]  
      
      # Calculate cosine similarity between target word and other words
      target_sim <- sim2(glove, matrix(target_vector, nrow = 1))
      
      # Report nearest neighbours of target word
      names(sort(target_sim[,1], decreasing = T))[1:n]
    
    }
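As a sanity check on what sim2() computes, you can compare it to the cosine similarity formula coded by hand. This is just an illustration, and the word pair is arbitrary:

# Cosine similarity between two word vectors, computed directly
a <- glove["text", ]
b <- glove["document", ]
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# The same quantity via sim2(), which should agree
sim2(matrix(a, nrow = 1), matrix(b, nrow = 1))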
4. On the basis of the GloVe word embeddings, what are the 7 most similar words to the words “quantitative”, “text”, and “analysis”?
    similarities("quantitative", n = 7)
    [1] "quantitative" "qualitative"  "empirical"    "measurement"  "analysis"    
    [6] "methodology"  "analytical"  
    similarities("text", n = 7)
    [1] "text"     "texts"    "document" "read"     "messages" "written"  "copy"    
    similarities("analysis", n = 7)
    [1] "analysis"    "analyses"    "study"       "data"        "studies"    
    [6] "analyzed"    "methodology"

    8.5 Word Analogies

1. Write a function that computes analogies of the form “a is to b as c is to ___”. For instance, if a is “man”, b is “king”, and c is “woman”, then the missing word should be “queen”.
      Your function will need to take four arguments. Three arguments should correspond to the words included in the analogy. The fourth should be an argument specifying the number of nearest neighbouring words returned. Again, I have provided some starter code below, which you should be able to complete by consulting the lecture slides.
    analogies <- function(a, b, c, n){
      
      # Extract vectors for each of the three words in analogy task
      a_vec <- glove[which(rownames(glove) == a),]
      b_vec <- glove[which(rownames(glove) == b),]
      c_vec <- glove[which(rownames(glove) == c),]
      
      # Generate analogy vector (vector(c) - vector(a) + vector(b))
      target <- c_vec - a_vec + b_vec
      
  # Calculate cosine similarity between analogy vector and all other vectors
      target_sim <- sim2(glove, matrix(target, nrow = 1))
      
      # Report nearest neighbours of analogy vector
      names(sort(target_sim[,1], decreasing = T))[1:n]
    
    }
2. Use the function you created above to find the word-embedding answers to the following analogy completion tasks.

      • Einstein is to scientist as Picasso is to ___?
      • Arsenal is to football as Yankees is to ___?
      • Actor is to theatre as doctor is to ___?
    analogies("man", "woman", "king", 6)
    [1] "king"     "queen"    "monarch"  "throne"   "princess" "mother"  
    analogies("einstein", "scientist", "picasso", 6)
    [1] "picasso"   "painter"   "painting"  "artist"    "paintings" "scientist"
    analogies("arsenal", "football", "yankees", 6)
    [1] "baseball"   "yankees"    "sox"        "football"   "basketball"
    [6] "braves"    
    analogies("actor", "theatre", "doctor", 6)
    [1] "theatre"  "doctor"   "hospital" "medical"  "theater"  "doctors" 

    This is really very impressive! Although these examples are cherry-picked, it is notable that the embeddings clearly encode some important aspects of word meaning.
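One wrinkle visible above is that the input words themselves tend to rank among the nearest neighbours of the analogy vector (e.g. “king” and “picasso” appear first). A small variant of the function, sketched here under the assumption that all three words are in the vocabulary, simply drops the input words before reporting:

analogies_filtered <- function(a, b, c, n){

  # Generate the analogy vector exactly as before
  target <- glove[c, ] - glove[a, ] + glove[b, ]

  # Calculate cosine similarity between analogy vector and all other vectors
  target_sim <- sim2(glove, matrix(target, nrow = 1))

  # Drop the three input words, then report the top n remaining neighbours
  neighbours <- names(sort(target_sim[,1], decreasing = T))
  neighbours <- neighbours[!neighbours %in% c(a, b, c)]
  neighbours[1:n]

}

analogies_filtered("man", "woman", "king", 6)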

3. Come up with some of your own analogies and try them here.

    8.6 Dictionary Expansion

In seminar 2 we used the Moral Foundations Dictionary to score some Reddit posts in terms of their moral content. In this exercise we will expand the “care” category of the MFD using the GloVe embeddings.

    For this part of the assignment, you will need to have access to the files that we used in seminar 2. These were mft_dictionary.csv and mft_texts.csv. Go and locate them now (or redownload them if you need them).

    Once you have found these files, load them into R:

    mft_dictionary_words <- read_csv("mft_dictionary.csv")
    mft_texts <- read_csv("mft_texts.csv")
    1. Create a vector of the MFT “Care” words.
    care_words <- mft_dictionary_words$word[mft_dictionary_words$foundation == "care"]
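Before extracting embeddings, it can be worth checking how many of these dictionary terms actually appear in the GloVe vocabulary; a quick sketch:

# Proportion of care words that have a GloVe embedding
mean(care_words %in% rownames(glove))

# Any care words missing from the embedding vocabulary
care_words[!care_words %in% rownames(glove)]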
2. Extract the embeddings from the glove object relating to the care words.
    care_embeddings <- glove[rownames(glove) %in% care_words,]
3. Calculate the mean embedding vector of the care words. To do this, use the colMeans() function, which will calculate the mean of each column of the matrix.
    care_embeddings_mean <- colMeans(care_embeddings)
4. Calculate the similarity between the mean care vector and every other word in the glove embeddings object. To do so, use the sim2() function again.
    target_sim <- sim2(x = glove,
                       y = matrix(care_embeddings_mean, nrow = 1))
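The result should be a one-column matrix containing one cosine similarity per vocabulary word; a quick check:

# One similarity per word in the vocabulary: 400,000 rows, 1 column
dim(target_sim)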
5. What are the 500 words that have the highest cosine similarity with the mean care vector? How many of these words are in the original dictionary?
    top500 <- names(sort(target_sim[,1], decreasing = T))[1:500]
    
table(top500 %in% care_words)
    
    FALSE  TRUE 
      386   114 
6. Examine the words that are among the top 500 you calculated above but which are not in the original care dictionary. Do these represent the concept of care?
top500[!top500 %in% care_words]
      [1] "traumatized"     "victimized"      "cruelly"         "helpless"       
      [5] "innocent"        "unborn"          "sufferings"      "endure"         
      [9] "terrified"       "frightened"      "sick"            "callous"        
     [13] "wronged"         "civilians"       "viciously"       "senseless"      
     [17] "terrorizing"     "frighten"        "beatings"        "horrific"       
     [21] "risking"         "inhumane"        "unspeakable"     "confess"        
     [25] "humiliation"     "defenseless"     "loneliness"      "terrorize"      
     [29] "grief"           "bereaved"        "injustice"       "subjecting"     
     [33] "vengeful"        "innocents"       "misfortune"      "abuse"          
     [37] "neglect"         "plight"          "treating"        "traumatised"    
     [41] "stigmatized"     "oppression"      "starving"        "humiliate"      
     [45] "oftentimes"      "grievous"        "unjustly"        "grieving"       
     [49] "humiliated"      "betrayed"        "spared"          "mutilation"     
     [53] "lest"            "ashamed"         "heinous"         "enemies"        
     [57] "abusive"         "barbaric"        "elderly"         "starve"         
     [61] "fear"            "neglecting"      "terrible"        "deprived"       
     [65] "dignity"         "deprive"         "beings"          "dying"          
     [69] "mercilessly"     "hatred"          "grievously"      "oneself"        
     [73] "vicious"         "succumb"         "feelings"        "schoolmates"    
     [77] "fearful"         "horrible"        "horrifying"      "sparing"        
     [81] "hardships"       "afraid"          "risked"          "indiscriminate" 
     [85] "unnecessarily"   "bystanders"      "heartless"       "shame"          
     [89] "ordeal"          "offends"         "subjected"       "enslavement"    
     [93] "handicapped"     "disturbed"       "humanity"        "sacrificing"    
     [97] "degrading"       "oppressed"       "treat"           "retribution"    
    [101] "maiming"         "injure"          "terrify"         "terminally"     
    [105] "unbearable"      "betray"          "suicidal"        "pretending"     
    [109] "seriously"       "perpetrated"     "destitute"       "maimed"         
    [113] "perpetrator"     "insult"          "demeaning"       "depraved"       
    [117] "betraying"       "selfishness"     "aiding"          "traumas"        
    [121] "taunt"           "brutal"          "belittling"      "sickening"      
    [125] "despair"         "needless"        "orphans"         "repress"        
    [129] "terrorized"      "hardship"        "scared"          "infirm"         
    [133] "husbands"        "savagery"        "townspeople"     "mentally"       
    [137] "punishing"       "disciplining"    "sins"            "savagely"       
    [141] "escaping"        "teasing"         "sickness"        "distraught"     
    [145] "captors"         "pregnant"        "oppress"         "spouse"         
    [149] "indignity"       "intolerable"     "intimidated"     "forgiveness"    
    [153] "countrymen"      "cursed"          "conscience"      "knowing"        
    [157] "trauma"          "deserve"         "punishes"        "strangling"     
    [161] "wickedness"      "sinful"          "punished"        "cruelties"      
    [165] "helplessness"    "punish"          "disrespect"      "perceived"      
    [169] "appalling"       "hysterical"      "bystander"       "disfigurement"  
    [173] "treachery"       "homosexuals"     "debilitating"    "evils"          
    [177] "taunts"          "vindictive"      "betrayal"        "tolerate"       
    [181] "provokes"        "thoughtless"     "avenge"          "intimidate"     
    [185] "feel"            "offend"          "despicable"      "needlessly"     
    [189] "malnourished"    "sexually"        "tolerating"      "bodily"         
    [193] "terrifying"      "passivity"       "abduct"          "betrays"        
    [197] "trusting"        "frightening"     "expose"          "misguided"      
    [201] "barbarity"       "severely"        "perceive"        "imprison"       
    [205] "horrendous"      "depriving"       "sadistic"        "committing"     
    [209] "unworthy"        "treated"         "unjust"          "slaughtering"   
    [213] "painful"         "debilitated"     "scolding"        "enslave"        
    [217] "aftereffects"    "disrespectful"   "victimised"      "awaken"         
    [221] "humankind"       "resentful"       "suffocate"       "hypocritical"   
    [225] "cowardly"        "fleeing"         "judgmental"      "ridicule"       
    [229] "insecurities"    "complicit"       "misery"          "caretakers"     
    [233] "detriment"       "physically"      "tenderly"        "habitually"     
    [237] "malnutrition"    "punishment"      "misbehaving"     "villagers"      
    [241] "emotionally"     "relatives"       "selfish"         "deserving"      
    [245] "destitution"     "gravely"         "exposing"        "companionship"  
    [249] "squeamish"       "brutally"        "guilt"           "imprisoning"    
    [253] "orphaned"        "criminals"       "injustices"      "adolescents"    
    [257] "stigma"          "disobedient"     "sacrificed"      "survivors"      
    [261] "remorse"         "empathetic"      "wanton"          "insults"        
    [265] "unsuspecting"    "enslaving"       "aggression"      "involuntarily"  
    [269] "outweighs"       "ignorance"       "screams"         "addicted"       
    [273] "ostracized"      "subordinates"    "immoral"         "caregivers"     
    [277] "merciless"       "perpetrators"    "powerlessness"   "mutilating"     
    [281] "ills"            "depredations"    "stab"            "affection"      
    [285] "ill-treatment"   "ill"             "blinded"         "callously"      
    [289] "insulted"        "manipulative"    "verbally"        "unintentionally"
    [293] "distract"        "illness"         "ungrateful"      "deprivation"    
    [297] "pretend"         "undermines"      "disfigured"      "impotent"       
    [301] "cursing"         "inconvenience"   "motivates"       "witnessing"     
    [305] "aggressors"      "jealousy"        "suffocating"     "children"       
    [309] "sacrifices"      "jealous"         "hideous"         "ostracism"      
    [313] "perverted"       "horrified"       "insensitive"     "confronting"    
    [317] "illnesses"       "befriend"        "scourge"         "willful"        
    [321] "repression"      "revenge"         "annoy"           "scarred"        
    [325] "curses"          "throats"         "jailers"         "horribly"       
    [329] "repressed"       "bigotry"         "excruciating"    "befriending"    
    [333] "coerce"          "traumatic"       "ugliness"        "revulsion"      
    [337] "exposes"         "patronizing"     "stepmother"      "newborns"       
    [341] "indulging"       "promiscuous"     "confront"        "ministering"    
    [345] "unconscionable"  "insidious"       "tenderness"      "indifference"   
    [349] "gruesome"        "trampling"       "embittered"      "despise"        
    [353] "cowardice"       "hopelessness"    "underlings"      "horrors"        
    [357] "scold"           "atrocious"       "banish"          "persecutions"   
    [361] "brainwashed"     "indignities"     "banishing"       "pretended"      
    [365] "abhorrent"       "wrongs"          "genitals"        "counseled"      
    [369] "atrocities"      "susceptible"     "ferocity"        "indifferent"    
    [373] "autistic"        "deprivations"    "demoralizing"    "unfaithful"     
    [377] "transgressions"  "brainwashing"    "ghastly"         "insufferable"   
    [381] "taunting"        "pillage"         "believing"       "unconscious"    
    [385] "afflicting"      "muggers"        

    It is relatively uncontroversial to suggest that these words are representative of the concept of “care”.

7. What does your answer to the previous question suggest about this dictionary expansion approach?

The idea here is that using the word embeddings has allowed us to automatically expand the set of words associated with a concept that we previously measured using a dictionary of terms that were manually selected. This means that we can use information about word similarity to supplement our existing measurement approaches. Whether this leads to performance improvements in terms of classification is the subject of this week’s homework.
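Note that taking a fixed top 500 is only one way to select expansion words. An alternative, sketched below, is to keep every word whose similarity to the mean care vector exceeds some threshold; the 0.4 cut-off here is an arbitrary illustration, not a recommendation:

# Select expansion words by a similarity threshold rather than a fixed n
sims <- target_sim[,1]
expansion_words <- names(sims[sims > 0.4])
length(expansion_words)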

    8.7 Homework

    The mft_texts object includes a series of variables that record the human annotations of which category each text falls into. In this week’s homework, you will again use dictionary-based methods to score the texts, and compare the dictionary scores to those human codings. If you have forgotten how to apply dictionaries, go back and look at the material in seminar two.

1. Create a new dictionary which includes two categories. The first should be a care_original category, which contains only the words from the original care dictionary. The second should be a care_embedding category, which contains both the original care words and the top 500 words that you extracted in the last section.
    # Create a dfm
    mft_dfm <- mft_texts %>% 
      corpus(text_field = "text") %>% 
      tokens(remove_punct = TRUE) %>% 
      dfm() %>%
      dfm_trim(min_termfreq = 5)
    
    # Create a care dictionary
    care_dictionary <- dictionary(list(care_original = care_words,
                                       care_embedding = c(top500[!top500%in%care_words], care_words)))
2. Use the dictionary you just constructed to score the texts in the mft_texts object. Create variables in that object that indicate whether a given dictionary classifies each text as a care text or not (i.e. classify a text as a care text if it contains any words from the relevant dictionary).
    # Score the texts using the dictionaries
    care_dfm_dictionary <- dfm_lookup(mft_dfm, care_dictionary)
    mft_texts$care_original <- as.numeric(care_dfm_dictionary[,1]) > 0
    mft_texts$care_embedding <- as.numeric(care_dfm_dictionary[,2]) > 0
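Before computing performance statistics, it can be informative to see how many texts each dictionary flags; a quick check:

# Number of texts classified as "care" by each dictionary
table(mft_texts$care_original)
table(mft_texts$care_embedding)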
3. Create a confusion matrix which compares the human annotations to the scores generated by the dictionary analysis. Which performs better: the original dictionary or the word-embedding approach?
    # Calculate confusion matrix
    care_original_confusion <- table( 
      dictionary = mft_texts$care_original > 0,
      human_coding = mft_texts$care)
    
    care_embedding_confusion <- table( 
      dictionary = mft_texts$care_embedding > 0,
      human_coding = mft_texts$care)
    
    # Calculate performance statistics
    library(caret)
    confusionMatrix(care_original_confusion, positive = "TRUE")
    Confusion Matrix and Statistics
    
              human_coding
    dictionary FALSE  TRUE
         FALSE 11304  2621
         TRUE   1842  2119
                                              
                   Accuracy : 0.7505          
                     95% CI : (0.7441, 0.7568)
        No Information Rate : 0.735           
        P-Value [Acc > NIR] : 1.219e-06       
                                              
                      Kappa : 0.3239          
                                              
     Mcnemar's Test P-Value : < 2.2e-16       
                                              
                Sensitivity : 0.4470          
                Specificity : 0.8599          
             Pos Pred Value : 0.5350          
             Neg Pred Value : 0.8118          
                 Prevalence : 0.2650          
             Detection Rate : 0.1185          
       Detection Prevalence : 0.2215          
          Balanced Accuracy : 0.6535          
                                              
           'Positive' Class : TRUE            
                                              
    confusionMatrix(care_embedding_confusion, positive = "TRUE")
    Confusion Matrix and Statistics
    
              human_coding
    dictionary FALSE  TRUE
         FALSE 10115  1764
         TRUE   3031  2976
                                              
                   Accuracy : 0.7319          
                     95% CI : (0.7254, 0.7384)
        No Information Rate : 0.735           
        P-Value [Acc > NIR] : 0.8265          
                                              
                      Kappa : 0.366           
                                              
     Mcnemar's Test P-Value : <2e-16          
                                              
                Sensitivity : 0.6278          
                Specificity : 0.7694          
             Pos Pred Value : 0.4954          
             Neg Pred Value : 0.8515          
                 Prevalence : 0.2650          
             Detection Rate : 0.1664          
       Detection Prevalence : 0.3358          
          Balanced Accuracy : 0.6986          
                                              
           'Positive' Class : TRUE            
                                              

In this instance, the accuracy of our predictions decreases slightly when incorporating the words from the word-embedding similarity scores, but the sensitivity of the classification increases substantially. This suggests that the additional words improve our ability to identify texts that express the concept of care, albeit at the cost of introducing a larger number of false positives.
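To put numbers on that trade-off, you could compute precision, recall, and F1 directly from the two confusion matrices; a sketch using the tables created above:

# Precision, recall, and F1 from a dictionary-vs-human confusion table
prf <- function(tab){
  tp <- tab["TRUE", "TRUE"]    # dictionary TRUE, human TRUE
  fp <- tab["TRUE", "FALSE"]   # dictionary TRUE, human FALSE
  fn <- tab["FALSE", "TRUE"]   # dictionary FALSE, human TRUE
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

prf(care_original_confusion)
prf(care_embedding_confusion)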

    You should upload a paragraph explaining the results from the final question of the homework to this Moodle page.