8: Word Embeddings

Jack Blumenau & Gloria Gennaro

Introduction to Word Embeddings

Sparse Representations of Words

Up to this point in the course, we have implicitly used sparse representations of words.

Words were represented as one-hot encodings, i.e. word-specific vectors that take the value 1 in the position for that word and 0 everywhere else. E.g.

\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0, & 0, & 1, & 0, & ..., & 0 \end{bmatrix} \end{align}\]

\[\begin{align} w_{\text{deficit}} &= \begin{bmatrix} 0, & 0, & 0, & 1, & ..., & 0 \end{bmatrix} \end{align}\]

The problem with these representations is that they contain no notion of similarity between words.

The dot product (and hence the cosine similarity) between two distinct one-hot word vectors is zero:

\[ cos(\theta) = w_{\text{deficit}}^Tw_{\text{debt}} = \frac{\mathbf{w_{\text{deficit}}} \cdot \mathbf{w_{\text{debt}}}}{\left|\left| \mathbf{w_{\text{deficit}}} \right|\right| \left|\left| \mathbf{w_{\text{debt}}} \right|\right|}= 0 \]

This is true for any pair of words, which is clearly nonsense as some pairs of words are more similar to each other than other pairs of words.
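
A toy illustration in R, using a five-word vocabulary (the vector values are illustrative):

# One-hot vectors for "debt" and "deficit" in a toy five-word vocabulary
w_debt    <- c(0, 0, 1, 0, 0)
w_deficit <- c(0, 0, 0, 1, 0)

# Dot product (and hence cosine similarity) between distinct one-hot vectors
sum(w_debt * w_deficit)  # 0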

Problems with Sparse Word Representations

Mechanical problems:

  1. Similarity

    • Documents might have zero term overlap, but have nearly identical meanings
    • E.g. “Quantitative text analysis is very successful.” vs “Natural language processing is tremendously effective.”
  2. Classification/Dictionaries/Supervised scaling

    • We may know or learn that one word is connected to a concept, but that doesn’t tell us anything about other similar words
    • If we learn that “turmeric” is highly predictive of the concept of interest, shouldn’t we also learn something about “garlic”, “saffron”, and “ginger”?
  3. Topic models/Unsupervised scaling

    • If “bank”, “economy”, “interest”, and “rates” have high probability under a topic, shouldn’t “monetary” also have high probability?

Substantive problem: Our representation does not encode any information about the “meaning” of a word. The binary representation implies that all words are distinct to an equal degree, which seems a strong assumption.

\(\Rightarrow\) We would prefer a representation that allows us to capture similarities between word meanings.

Distributional Semantics

The distributional hypothesis: the meaning of a word can be derived from the distribution of contexts in which it appears.

  • We can learn about the meaning of a word by investigating the distribution of words that show up around the word

    • “You shall know a word by the company it keeps!” J.R. Firth (1957)
    • “The meaning of words lies in their use” Ludwig Wittgenstein (1953)
  • The hypothesis implies that words that appear in similar “contexts” will share similar meanings

  • This simple (and old) idea is one of the most influential and successful ideas in modern natural language processing

  • Word embedding approaches represent the distributional “meaning” of a word as a vector in multidimensional space

Distributional Semantics

  • When a word \(w\) appears in a text, its “context” is the set of words that appear nearby (within a fixed-size window)

  • We use the many contexts of \(w\) to build up a representation of \(w\)

Contexts of “banking”:

pre                                keyword   post
can be delivered for the           banking   industry in Europe . I
instance I am referring to         banking   . It is not only
that , if the second               banking   directive comes into force without
the future of the British          banking   industry within the European Community
the future of the British          banking   industry within the European Community
the Government expect the second   banking   directive to come into force

Contexts of “finance”:

pre                                        keyword   post
during the passage of the                  Finance   Bill , but I can
is referring to taxpayers '                finance   and public sector funding ,
world when it comes to                     finance   . The industry is being
of the European Council of                 Finance   Ministers with his Community colleagues
of the European Council of                 Finance   Ministers with his Community colleagues
practically involved in local government   finance   . It must have come
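
Context tables like these can be produced with quanteda’s kwic() function. A minimal sketch, assuming the debates_corpus object used later in this lecture:

library(quanteda)

# Keyword-in-context: 5 words either side of each occurrence of "banking"
kwic(tokens(debates_corpus), pattern = "banking", window = 5)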

Word Embedding Overview

  1. The meaning of each word is based on the distribution of terms with which it co-occurs

  2. We represent this meaning using a vector for each word

  3. Vectors are constructed such that similar words are close to each other in “semantic” space

  4. We build this space automatically by seeing which words are close to one another in texts

Dense Representations of Words

Our goal will be to build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts (measuring similarity as the dot product)

\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0.73 \\ 0.04 \\ 0.07 \\ -0.18 \\ 0.81 \\ -0.97 \end{bmatrix} \end{align}\]

\[\begin{align} w_{\text{deficit}} &= \begin{bmatrix} 0.63 \\ 0.14 \\ 0.02 \\ -0.58 \\ 0.43 \\ -0.66 \end{bmatrix} \end{align}\]

These representations are known as word embeddings because we “embed” words into a low-dimensional space (low compared to the vocabulary size).
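
As a quick sanity check, the cosine similarity of the two toy vectors above is high:

w_debt    <- c(0.73, 0.04, 0.07, -0.18, 0.81, -0.97)
w_deficit <- c(0.63, 0.14, 0.02, -0.58, 0.43, -0.66)

# Cosine similarity between the two dense vectors
sum(w_debt * w_deficit) / (sqrt(sum(w_debt^2)) * sqrt(sum(w_deficit^2)))  # ~0.90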

Advantages of Word Embeddings

Low-dimensional word embeddings offer three core advantages over simple word counts. They:

  1. Encode similarity between words

    • We no longer have word similarities of zero!
    • Each word is a vector, with the vectors of similar words closer together than vectors of very different words
  2. Allow for “automatic generalization”

    • Imagine that we discover “fantastic” is a good predictor of positive reviews, but we never observe the word “extraordinary” in our training corpus
    • Because “fantastic” and “extraordinary” will have similar word vectors, we can share information across words, and apply what we have learned about one word to our understanding of another
    • Can lead to large performance gains for prediction and topic modelling tasks
  3. Provide a measure of meaning

    • We will often be interested in the meaning of words as a quantity in its own right
    • Very specific meaning of “meaning”: two words share a meaning when used in similar contexts

Contrasting Approaches

The material that we study this week is different on several dimensions from the approaches on the course so far:

                 Traditional Unsupervised   Traditional Supervised   Word Embeddings
Bag of words?    Yes                        Yes                      No
Inputs           DFM                        DFM                      Feature co-occurrence matrix
Outputs          Topics; Scales             Categorization           Word vectors

Estimating Word Embeddings

Design Choices in Word Embeddings

  1. Data

    • High-quality embeddings require a large amount of training data
    • Usually trained on large external corpora (e.g. Wikipedia, news articles, web pages)
    • Embeddings will reflect the language used in the training documents
  2. Context Window Size

    • If meaning is defined by a word’s context, we need to define context
    • Usually implemented as a symmetric window of some length around each word
    • The size of the window dictates the kind of information the embedding will capture
  3. Dimension of the Embedding

    • The embedding for each word will typically be between 50 and 500 elements long
    • The embedding encodes information about the contexts a word appears in, so in theory larger embeddings are able to encode more information
    • In practice, medium-dimensional embeddings (100-300) work very well
  4. Algorithm

    • There are several algorithms that allow us to learn embeddings from data
    • Count-based, sparse approaches (e.g. co-occurrence vectors)
    • Neural network, dense approaches (e.g. “Word2Vec” and “GloVe”)

Co-occurrence Vectors

One simple “embedding” can be produced by counting the occurrences of each term within a fixed window of every other term and storing these counts in a feature co-occurrence matrix:

library(quanteda)

# Build a feature co-occurrence matrix with a symmetric 3-word window
debates_fcm <- 
  debates_corpus %>%
  tokens() %>%
  tokens_tolower() %>%
  fcm(context = "window",
      window = 3,
      tri = FALSE)

save(debates_fcm, file = "../data/debates_fcm.Rdata")

debates_fcm
Feature co-occurrence matrix of: 330,573 by 330,573 features.
          features
features   before     the  house proceeds      to choice      of       a speaker       ,
  before      948  119349  17938       89   28687    213   21406   18847     277   47978
  the      119349 3792540 580486     4167 3629927  13841 6681361  628636   18557 3962556
  house     17938  580486   2006      112  150325    176  211720   33338    1401  124959
  proceeds     89    4167    112        6     817      1    2572     197       0     870
  to        28687 3629927 150325      817 1053882   7231  736722  957468    5362 1291600
  choice      213   13841    176        1    7231    298    9407   10098      31    8430
  of        21406 6681361 211720     2572  736722   9407  755212 1244577    2787 1491035
  a         18847  628636  33338      197  957468  10098 1244577  161464    2750  985453
  speaker     277   18557   1401        0    5362     31    2787    2750     412   83717
  ,         47978 3962556 124959      870 1291600   8430 1491035  985453   83717 1766536
[ reached max_feat ... 330,563 more features, reached max_nfeat ... 330,563 more features ]
  • The word “house” appears within three words of the word “before” 17938 times in this corpus

  • The word “speaker” never appears within three words of “proceeds” in this corpus

  • Etc

Typically, these vectors are then weighted by the pointwise mutual information:

\[PMI(w,c) = \log\frac{P(w,c)}{P(w)P(c)}\]

\(\rightarrow\) PMI compares the probability of two words occurring together to what this probability would be if the words were unrelated (i.e. statistically independent)

\(\rightarrow\) this puts larger weights on word pairs that co-occur more often than we would expect by chance, and lower weights on words that co-occur with many other words (a similar intuition to tf-idf).
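
A minimal sketch of this weighting, written for a small dense matrix X (a real fcm is sparse, so production code would use sparse-matrix operations instead):

# (Positive) PMI weighting of a co-occurrence matrix X
ppmi <- function(X) {
  total <- sum(X)
  p_wc  <- X / total            # joint probability P(w, c)
  p_w   <- rowSums(X) / total   # marginal probability P(w)
  p_c   <- colSums(X) / total   # marginal probability P(c)
  pmi   <- log(p_wc / outer(p_w, p_c))
  pmax(pmi, 0)                  # clip negative values: "positive" PMI
}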

Co-occurrence Vectors

Given this representation, we can calculate the cosine similarity between the word vectors of target words to find the closest other words in the embedding space:

library(quanteda.textstats)

word_similarities <- textstat_simil(debates_fcm,
                                    debates_fcm[which(featnames(debates_fcm) %in% c("election", "health", "banking")),],
                                    method = "cosine",
                                    margin = "documents")


sort(word_similarities[,1], decreasing = TRUE)[1:10]
  election  elections referendum    general  electoral    elected  manifesto      party       vote     labour 
 1.0000000  0.2674648  0.2168565  0.2080293  0.1890573  0.1839103  0.1837215  0.1829961  0.1822697  0.1813167 
sort(word_similarities[,2], decreasing = TRUE)[1:10]
    health     mental        nhs       care    service   services     social healthcare  education    medical 
 1.0000000  0.3665237  0.2921888  0.2862733  0.2525538  0.2429016  0.2328676  0.2268613  0.2253347  0.2222046 
sort(word_similarities[,3], decreasing = TRUE)[1:10]
     banking        banks      lending    financial         bank institutions    corporate     consumer   regulatory      markets 
   1.0000000    0.2769244    0.2317889    0.2298129    0.2213940    0.2132315    0.2081518    0.2057080    0.2055228    0.2046225 

Co-occurrence Vectors

  • Co-occurence vectors clearly capture something about the meaning of words

  • However, the length of these vectors grows with the size of the vocabulary of the corpus

    • The vectors above each have length 330,573
  • One consequence is that they tend to be very sparse (most words fail to occur with most other words)

    • E.g. the sparsity of the example above is 99.9%
  • In most applications, sparse vectors like these tend to perform less well than dense vectors

    • Similarity; classification; unsupervised learning, etc
  • We would therefore prefer a low-dimensional representation that didn’t suffer from these sparsity issues

Word2Vec Overview

Word2Vec (Mikolov et al., 2013) is a set of related methods for learning dense word vectors.

One version of Word2Vec – skip-gram with negative sampling – follows this basic process:

  1. Start with a very large corpus of text (e.g. all of Wikipedia)

  2. Represent each word in the vocabulary as a vector, \(\mu_j\)

    • Initialise each vector with random numbers
  3. Go through each position, \(t\), in the text, where each position has

    • A center word, \(w_t\) (the “target” word)
    • Context words, \(w_o\) (“outside” words)
  4. Calculate the probability of observing \(w_o\) given \(w_t\) (or vice versa), using the similarity of the word vectors for \(w_o\) and \(w_t\)

  5. Adjust the values of the word vectors, \(\mu_j\), to …

    • …maximize the probability of observing true context words
    • …minimize the probability of observing other words from the corpus

Word2Vec Intuition

What is the probability of observing the context words given the center word, ‘UN’?
\[\begin{equation} \text{important }\underbrace{\text{\textcolor{orange}{to get }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{UN }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{agreement as }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{the last hope of demonstrating international agreement }\end{equation}\]\(p(w_o = \text{to}|w_t = \text{UN})\)
\(p(w_o = \text{get}|w_t = \text{UN})\)
\(p(w_o = \text{agreement}|w_t = \text{UN})\)
\(p(w_o = \text{as}|w_t = \text{UN})\)

What is the probability of observing the context words given the center word, ‘agreement’?
\[\begin{equation} \text{important to }\underbrace{\text{\textcolor{orange}{get UN }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{agreement }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{as the }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{last hope of demonstrating international agreement }\end{equation}\]\(p(w_o = \text{get}|w_t = \text{agreement})\)
\(p(w_o = \text{UN}|w_t = \text{agreement})\)
\(p(w_o = \text{as}|w_t = \text{agreement})\)
\(p(w_o = \text{the}|w_t = \text{agreement})\)

What is the probability of observing the context words given the center word, ‘as’?
\[\begin{equation} \text{important to get }\underbrace{\text{\textcolor{orange}{UN agreement }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{as }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{the last }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{hope of demonstrating international agreement }\end{equation}\]\(p(w_o = \text{UN}|w_t = \text{as})\)
\(p(w_o = \text{agreement}|w_t = \text{as})\)
\(p(w_o = \text{the}|w_t = \text{as})\)
\(p(w_o = \text{last}|w_t = \text{as})\)

What is the probability of observing the context words given the center word, ‘the’?
\[\begin{equation} \text{important to get UN }\underbrace{\text{\textcolor{orange}{agreement as }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{the }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{last hope }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{of demonstrating international agreement }\end{equation}\]\(p(w_o = \text{agreement}|w_t = \text{the})\)
\(p(w_o = \text{as}|w_t = \text{the})\)
\(p(w_o = \text{last}|w_t = \text{the})\)
\(p(w_o = \text{hope}|w_t = \text{the})\)

What is the probability of observing the context words given the center word, ‘last’?
\[\begin{equation} \text{important to get UN agreement }\underbrace{\text{\textcolor{orange}{as the }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{last }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{hope of }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{demonstrating international agreement }\end{equation}\]\(p(w_o = \text{as}|w_t = \text{last})\)
\(p(w_o = \text{the}|w_t = \text{last})\)
\(p(w_o = \text{hope}|w_t = \text{last})\)
\(p(w_o = \text{of}|w_t = \text{last})\)

What is the probability of observing the context words given the center word, ‘hope’?
\[\begin{equation} \text{important to get UN agreement as }\underbrace{\text{\textcolor{orange}{the last }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{hope }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{of demonstrating }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{international agreement }\end{equation}\]\(p(w_o = \text{the}|w_t = \text{hope})\)
\(p(w_o = \text{last}|w_t = \text{hope})\)
\(p(w_o = \text{of}|w_t = \text{hope})\)
\(p(w_o = \text{demonstrating}|w_t = \text{hope})\)

Word2Vec Objective Function

  • The objective of the Word2Vec model is to maximise the average log probability:

\[ \frac{1}{T}\sum_{t=1}^T \sum_{-c \leq j \leq c, j\neq0} \log p(w_{t+j}|w_t) \]

where the probability, \(p(w_{t+j}|w_t)\), is defined as:

\[\begin{equation} p(w_{t+j}|w_t) = \frac{\exp(v_{o}^T \cdot v_{t})}{\sum_{w=1}^W\exp(v_{w}^T \cdot v_{t})} \end{equation}\]

  • This is an example of the softmax function, which maps arbitrary values to a probability distribution.
  • Goal: Learn values of the word-embeddings, \(v_w\), that maximise the probability of observing the context words that we actually do observe.
  • Approach: We learn the values of the word-embeddings by stochastic gradient descent – we gradually adjust the values of the embeddings to maximise the probabilities above
  • Magic: Doing no more than this allows the algorithm to learn word vectors that capture word similarity and meaningful directions in a word space
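
To make the softmax concrete, here is a toy sketch with random embeddings; the word list, dimensions, and values are invented for illustration, not a trained model:

# Toy vocabulary of 5 words with random 3-dimensional embeddings
set.seed(1)
V <- matrix(rnorm(5 * 3), nrow = 5,
            dimnames = list(c("un", "agreement", "to", "get", "as"), NULL))

# p(context word | target word) under the softmax
softmax_prob <- function(V, target, context) {
  scores <- drop(V %*% V[target, ])        # dot products v_w . v_t for all w
  exp(scores[context]) / sum(exp(scores))  # normalise over the vocabulary
}

softmax_prob(V, "un", "agreement")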

Learning Skip-Gram Embeddings (Negative Sampling)

The denominator of the softmax function defined above is very computationally expensive to evaluate repeatedly, and so Word2Vec recasts the problem as a supervised learning problem.

  1. Select the word at position \(t\) and select the “positive” words (\(Y=1\)) that fall in its context (i.e. \(\pm 2\) words)
  2. For each true context word, select \(K\) “negative” context words (\(Y=0\)) at random from the entire corpus
  3. Run a logistic regression, with the positive/negative variable as outcome, and the dot product between the words’ embeddings as predictor (\(v_{o}^T \cdot v_{t}\))
  4. Adjust the values of the embeddings to better distinguish between the positive and negative context words

The tables below illustrate these steps for the target word “UN”.
Step 1: the positive context words (\(Y = 1\)):

\(t\)   \(o\)         \(Y\)
UN    to          1
UN    get         1
UN    agreement   1
UN    as          1

Step 2: add randomly sampled negative words (\(Y = 0\)):

\(t\)   \(o\)         \(Y\)
UN    to          1
UN    get         1
UN    agreement   1
UN    as          1
UN    castle      0
UN    when        0
UN    chair       0
UN    yoyo        0
UN    pancake     0
UN    whilst      0
UN    tremor      0
UN    foot        0

Step 3: compute the dot products between the current embeddings:

\(t\)   \(o\)         \(Y\)   \(v_{o}^T \cdot v_{t}\)
UN    to          1     0.172
UN    get         1     0.226
UN    agreement   1     0.619
UN    as          1     0.351
UN    castle      0     0.112
UN    when        0     -0.559
UN    chair       0     -0.323
UN    yoyo        0     0.371
UN    pancake     0     0.106
UN    whilst      0     0.021
UN    tremor      0     0.052
UN    foot        0     0.105

Step 4: after adjusting the embeddings, the positive pairs score higher and the negative pairs score lower:

\(t\)   \(o\)         \(Y\)   \(v_{o}^T \cdot v_{t}\)
UN    to          1     0.279
UN    get         1     0.465
UN    agreement   1     0.36
UN    as          1     0.164
UN    castle      0     -0.174
UN    when        0     -0.23
UN    chair       0     0.043
UN    yoyo        0     -0.19
UN    pancake     0     -0.049
UN    whilst      0     -0.106
UN    tremor      0     -0.02
UN    foot        0     -0.208

The goal of the learning algorithm is to maximise the similarity of the target and positive context vectors, and minimize the similarity between the target and negative context word vectors.
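
The following is a stylised sketch of a single update for one (target, context, label) triple, using the logistic (sigmoid) loss on the dot product; the learning rate and functional form illustrate the logic, they are not the word2vec implementation:

sigmoid <- function(x) 1 / (1 + exp(-x))

# One gradient step for a single (v_t, v_o, Y) training example
ns_update <- function(v_t, v_o, y, eta = 0.025) {
  p    <- sigmoid(sum(v_t * v_o))     # predicted P(Y = 1) for this pair
  grad <- p - y                       # gradient of the log-loss wrt the score
  list(v_t = v_t - eta * grad * v_o,  # move the target vector
       v_o = v_o - eta * grad * v_t)  # move the context vector
}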

Word2Vec Intuition

Question: Why do the estimated word-embeddings encode information about word similarity?

  • The predicted probability of a context word is high when the dot product between the context word’s embedding and the target word’s embedding is high

    • \(p(w_{t+j}|w_t) = \frac{\exp(v_{o}^T \cdot v_{t})}{\sum_{w=1}^W\exp(v_{w}^T \cdot v_{t})}\)

    • \(v_{o}^T \cdot v_{t} = \sum_{i=1}^n v_{o,i}v_{t,i}\)

  • This encourages the model to find embedding vectors that are similar to one another for words that occur together frequently in the corpus

  • It also encourages the model to find embedding vectors that are similar to one another for words that appear in similar contexts in the corpus, even if they rarely appear together. E.g.

    • “worldcom” and “scandal” appear frequently together
    • “enron” and “scandal” appear frequently together
    • But “worldcom” and “enron” appear infrequently together
    • “worldcom” and “enron” will still need relatively proximate embeddings in order to predict the occurrence of “scandal” in both contexts

GloVe: Global Vectors for Word Representation

  • The GloVe algorithm builds directly on the idea of the co-occurrence vectors that we discussed previously
  • It is a weighted least squares model that learns dense vectors from the word-word co-occurrence counts
  • GloVe models the log of the number of times that each word appears in the context of each other word:

\[\min_\theta J(\theta) \ \ \ \text{where} \ \ \ J(\theta) = \sum_{i=1}^V\sum_{j=1}^V f(X_{i,j}) \left(v_i^T \cdot v_j - \log(X_{i,j})\right)^2\]

  • Where

    • \(X\) is a word-word co-occurrence matrix
    • \(X_{i,j}\) is the number of times word \(j\) appears in the context of word \(i\)

Intuition:

  • The GloVe model tries to make the dot product between the word vectors for \(i\) and \(j\) equal to the log of the co-occurrence count of the two words
  • \(f(X_{i,j})\) is a weighting function that caps the influence of very frequent co-occurrences (so that very common words do not dominate) and down-weights very rare ones
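
A minimal sketch of evaluating this objective for a given set of embeddings, using dense matrices and the standard weighting function with cap x_max (the bias terms of the full GloVe model are omitted to match the simplified objective above):

# GloVe loss for embeddings V (vocab x dim) and co-occurrence matrix X
glove_loss <- function(V, X, x_max = 100, alpha = 0.75) {
  f  <- pmin((X / x_max)^alpha, 1)  # weighting function f(X_ij)
  S  <- V %*% t(V)                  # all dot products v_i . v_j
  nz <- X > 0                       # sum over observed co-occurrences only
  sum(f[nz] * (S[nz] - log(X[nz]))^2)
}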

GloVe vs Word2Vec

  • The core difference between the two models is that GloVe is a model for the global co-occurrence counts, while Word2Vec is an “online” model which trains progressively on a moving window
  • The GloVe model has some advantages over Word2Vec

    • Very fast
    • Easily scales to very large corpora
    • Good performance on small corpora
  • In both models, the researcher has to make several decisions that can be consequential to the estimated word vectors

    • Context-window size
    • Embedding dimensionality
    • Pre-trained versus local fit

GloVe Embeddings

glove <- readRDS("../data/glove.rds")
str(glove)
 num [1:400000, 1:300] 0.0466 -0.2554 -0.1256 -0.0769 -0.2576 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:400000] "the" "," "." "of" ...
  ..$ : NULL
glove[1,]
  [1]  0.0465600  0.2131800 -0.0074364 -0.4585400 -0.0356390  0.2364300 -0.2883600  0.2152100 -0.1348600 -1.6413000 -0.2609100  0.0324340  0.0566210 -0.0432960 -0.0216720  0.2247600 -0.0751290 -0.0670180 -0.1424700  0.0388250 -0.1895100  0.2997700  0.3930500  0.1788700 -0.1734300 -0.2117800
 [27]  0.2361700 -0.0636810 -0.4231800 -0.1166100  0.0937540  0.1729600 -0.3307300  0.4911200 -0.6899500 -0.0924620  0.2474200 -0.1799100  0.0979080  0.0831180  0.1529900 -0.2727600 -0.0389340  0.5445300  0.5373700  0.2910500 -0.0073514  0.0478800 -0.4076000 -0.0267590  0.1791900  0.0109770
 [53] -0.1096300 -0.2639500  0.0739900  0.2623600 -0.1508000  0.3462300  0.2575800  0.1197100 -0.0371350 -0.0715930  0.4389800 -0.0407640  0.0164250 -0.4464000  0.1719700  0.0462460  0.0586390  0.0414990  0.5394800  0.5249500  0.1136100 -0.0483150 -0.3638500  0.1870400  0.0927610 -0.1112900
 [79] -0.4208500  0.1399200 -0.3933800 -0.0679450  0.1218800  0.1670700  0.0751690 -0.0155290 -0.1949900  0.1963800  0.0531940  0.2517000 -0.3484500 -0.1063800 -0.3469200 -0.1902400 -0.2004000  0.1215400 -0.2920800  0.0233530 -0.1161800 -0.3576800  0.0623040  0.3588400  0.0290600  0.0073005
[105]  0.0049482 -0.1504800 -0.1231300  0.1933700  0.1217300  0.4450300  0.2514700  0.1078100 -0.1771600  0.0386910  0.0815300  0.1466700  0.0636660  0.0613320 -0.0755690 -0.3772400  0.0158500 -0.3034200  0.2837400 -0.0420130 -0.0407150 -0.1526900  0.0749800  0.1557700  0.1043300  0.3139300
[131]  0.1930900  0.1942900  0.1518500 -0.1019200 -0.0187850  0.2079100  0.1336600  0.1903800 -0.2555800  0.3040000 -0.0189600  0.2014700 -0.4211000 -0.0075156 -0.2797700 -0.1931400  0.0462040  0.1997100 -0.3020700  0.2573500  0.6810700 -0.1940900  0.2398400  0.2249300  0.6522400 -0.1356100
[157] -0.1738300 -0.0482090 -0.1186000  0.0021588 -0.0195250  0.1194800  0.1934600 -0.4082000 -0.0829660  0.1662600 -0.1060100  0.3586100  0.1692200  0.0725900 -0.2480300 -0.1002400 -0.5249100 -0.1774500 -0.3664700  0.2618000 -0.0120770  0.0831900 -0.2152800  0.4104500  0.2913600  0.3086900
[183]  0.0788640  0.3220700 -0.0410230 -0.1097000 -0.0920410 -0.1233900 -0.1641600  0.3538200 -0.0827740  0.3317100 -0.2473800 -0.0489280  0.1574600  0.1898800 -0.0266420  0.0633150 -0.0106730  0.3408900  1.4106000  0.1341700  0.2819100 -0.2594000  0.0552670 -0.0524250 -0.2578900  0.0191270
[209] -0.0220840  0.3211300  0.0688180  0.5120700  0.1647800 -0.2019400  0.2923200  0.0985750  0.0131450 -0.1065200  0.1351000 -0.0453320  0.2069700 -0.4842500 -0.4470600  0.0033305  0.0029264 -0.1097500 -0.2332500  0.2244200 -0.1050300  0.1233900  0.1097800  0.0489940 -0.2515700  0.4031900
[235]  0.3531800  0.1865100 -0.0236220 -0.1273400  0.1147500  0.2735900 -0.2186600  0.0157940  0.8175400 -0.0237920 -0.8546900 -0.1620300  0.1807600  0.0280140 -0.1434000  0.0013139 -0.0917350 -0.0897040  0.1110500 -0.1670300  0.0683770 -0.0873880 -0.0397890  0.0141840  0.2118700  0.2857900
[261] -0.2879700 -0.0589960 -0.0324360 -0.0047009 -0.1705200 -0.0347410 -0.1148900  0.0750930  0.0995260  0.0481830 -0.0737750 -0.4181700  0.0041268  0.4441400 -0.1606200  0.1429400 -2.2628000 -0.0273470  0.8131100  0.7741700 -0.2563900 -0.1157600 -0.1198200 -0.2136300  0.0284290  0.2726100
[287]  0.0310260  0.0967820  0.0067769  0.1408200 -0.0130640 -0.2968600 -0.0799130  0.1950000  0.0315490  0.2850600 -0.0874610  0.0090611 -0.2098900  0.0539130

Context-window Size

The size of the context window determines which type of word meaning is represented in the embedding space

  • Small context windows (\(\pm\) 1-3 words) \(\rightarrow\) syntactic meaning

    • E.g. putting, bringing, taking, giving, providing, etc
  • Medium context windows (\(\pm\) 5-10 words) \(\rightarrow\) semantic meaning

    • E.g. crimes, crime, offences, offence, prosecutions, murder, etc
  • Large context windows (\(\pm\) 10+ words) \(\rightarrow\) topical meaning

    • E.g. tourism, visitors, museum, holiday, cafe, etc

\(\Rightarrow\) the appropriate window size will depend on the research question.

Embedding Dimensions

The size of the embedding vectors determines the complexity of the model that we fit

  • We have an embedding for each word, so increasing the embedding dimension by 1 adds \(V\) new parameters to estimate (one for each word in the vocabulary)

  • Too many dimensions: higher chance of modelling noise

  • Too few dimensions: higher chance of missing important subtleties in meaning

  • General guidance: about 150-300 is fine (though this is not a very satisfying answer)

What do word-embedding dimensions “mean”?

  • We can now generate multidimensional vectors for each of our words which (as we will see shortly) are very successful in capturing semantic relations among words
  • This implies that a meaningful semantic structure must be present in the respective vector spaces
  • However, it is very difficult to answer questions such as “what do high and low values of the \(i\)th embedding dimension mean?”
  • Example:
rownames(glove)[order(glove[,1], decreasing = F)][1:6]
[1] "samiul"     "stuffit"    "guangwei"   "decompress" "resend"     "sife"      
rownames(glove)[order(glove[,22], decreasing = F)][1:6]
[1] "cheesecloth" "globe.com"   "zubaie"      "metrohealth" "25-march"    "30-aug"     
rownames(glove)[order(glove[,300], decreasing = F)][1:6]
[1] "republish"     "12,000-page"   "transmittable" "affray"        "6-pica"        "spongiform"   
  • Generating “interpretable” word embeddings is the subject of ongoing work

Local Versus Pre-Trained

Either the Word2Vec or GloVe method can be applied to any large corpus of text. For applied research, there are typically two choices:

  • Locally-trained embeddings

    • Collect a large corpus
    • Estimate a word-embedding model
    • Use the word-embeddings
    • Advantages: Can capture “local” meanings of words which may differ from more general use
    • Disadvantages: More computationally expensive and requires more coding decisions/effort
  • Pre-trained embeddings

    • Download a pre-trained set of word-embeddings
    • Use the word-embeddings
    • Advantages: Usually high-quality embeddings trained on billions of texts
    • Disadvantages: May miss local variation in word meaning

Rodriguez and Spirling, 2020 suggest that pre-trained embeddings normally perform as well as locally-trained variants on most tasks.

Visualisation

  • It is common to see visualisations of word embeddings in lower dimensions

  • There are many approaches to dimensionality reduction

    • Principal Component Analysis (PCA)
    • t-distributed stochastic neighbour embeddings (t-SNE)
  • These visualisations often give helpful insights into the ways that language is used in the data that was used to train the models
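
A minimal sketch, using base R’s prcomp for PCA on a handful of words from the glove matrix loaded earlier (t-SNE would use e.g. the Rtsne package):

# Project a few word embeddings onto their first two principal components
words <- c("king", "queen", "man", "woman", "paris", "berlin")
pca   <- prcomp(glove[words, ], rank. = 2)

plot(pca$x, type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x, labels = words)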

Break

Using Word-Embeddings

Similarity

  • A key advantage of word embeddings: we can compute the similarity between words (or collections of words)

  • The similarity between two words can be calculated as the cosine of the angle between the embedding vectors:

\[cos(\theta) = \frac{\mathbf{w}_i \cdot \mathbf{w}_j}{\left|\left| \mathbf{w}_i \right|\right| \left|\left| \mathbf{w}_j \right|\right|}\]

  • We can then sort the words in order of their similarity with the target word and report the “nearest neighbours”
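
In base R this calculation is a one-liner; the sim2() function used in the demonstrations below computes the same quantity for whole matrices at once (the example assumes “debt” and “deficit” appear in the GloVe vocabulary loaded earlier):

# Cosine similarity between two embedding vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(glove["debt", ], glove["deficit", ])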

Similarity Demonstration

library(text2vec)

# Extract target embedding
target <- glove[which(rownames(glove) %in% c("taxes", "quantitative", "enron")),]

# Calculate cosine similarity
target_sim <- sim2(glove,
                   target)

# Report nearest neighbours
sort(target_sim[,1], decreasing = T)[1:10]
    taxes       tax    income    paying  taxation       pay  revenues      fees    excise     costs 
1.0000000 0.8410683 0.6800698 0.6292870 0.6281890 0.6165617 0.5964931 0.5957133 0.5911929 0.5867159 
# Report nearest neighbours
sort(target_sim[,3], decreasing = T)[1:10]
 quantitative   qualitative     empirical   measurement      analysis   methodology    analytical      analyses methodologies     numerical 
    1.0000000     0.6452297     0.5144293     0.4902190     0.4807779     0.4792444     0.4602726     0.4536292     0.4420934     0.4269905 
# Report nearest neighbours
sort(target_sim[,2], decreasing = T)[1:10]
     enron   worldcom   skilling     fastow     dynegy   andersen executives accounting        aig   auditors 
 1.0000000  0.6551277  0.6350623  0.5476151  0.5255642  0.5219762  0.5188566  0.5134948  0.4963848  0.4742722 

Analogies

  • One surprising feature of word-embeddings is that they can capture more nuanced features of language than simple similarity

  • Among the most widely discussed features of word embeddings is their ability to capture analogies via their geometry

  • Analogies are linguistic expressions which describe processes of transferring information from one subject (the analogue) to another (the target)

  • Example:

    • Apple is to tree as grape is to ____
    • King is to man as _____ is to woman
  • Word embeddings have some ability to “solve” analogies of this form using vector addition and subtraction

\[\text{vector(king)} - \text{vector(man)} + \text{vector(woman)} \approx \text{vector(queen)}\]

Example process:

  1. Compute vector \(\text{vector(king)} - \text{vector(man)} + \text{vector(woman)}\)

  2. Calculate cosine similarity between new vector and all word vectors

  3. Report most similar vectors (normally excluding those for king, man and woman)

Analogies Demonstration

# Extract vectors
king <- glove[which(rownames(glove) == "king"),]
man <- glove[which(rownames(glove) == "man"),]
woman <- glove[which(rownames(glove) == "woman"),]

# Generate analogy vector
target <- king - man + woman

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(target, nrow = 1))

# Print output
sort(target_sim[,1], decreasing = T)[1:10]
     king     queen   monarch    throne  princess    mother  daughter   kingdom    prince elizabeth 
0.8065858 0.6896163 0.5575491 0.5565375 0.5518684 0.5142154 0.5133157 0.5025345 0.5017740 0.4908031 
# Extract vectors
paris <- glove[which(rownames(glove) == "paris"),]
france <- glove[which(rownames(glove) == "france"),]
germany <- glove[which(rownames(glove) == "germany"),]

# Generate analogy vector
target <- paris - france + germany

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(target, nrow = 1))

# Print output
sort(target_sim[,1], decreasing = T)[1:10]
   berlin frankfurt   germany    munich   cologne      bonn    vienna   hamburg   leipzig    german 
0.8082348 0.7182159 0.6976348 0.6616810 0.6388244 0.6297188 0.6096600 0.6015804 0.5951980 0.5929443 
# Extract vectors
taller <- glove[which(rownames(glove) == "taller"),]
tall <- glove[which(rownames(glove) == "tall"),]
thin <- glove[which(rownames(glove) == "thin"),]

# Generate analogy vector
target <- taller - tall + thin

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(target, nrow = 1))

# Print output
sort(target_sim[,1], decreasing = T)[1:10]
   thinner       thin    thicker     taller    slimmer   narrower noticeably     softer   slightly     weaker 
 0.6823011  0.6795648  0.6409053  0.4766686  0.4659788  0.4645112  0.4491847  0.4438605  0.4299319  0.4269721 

Implication: Word-embedding vectors encode linguistic regularities that capture the relations between words.

Caveats:

  1. Only works with reasonably common words

  2. Only works for certain relations, but not others

  3. Understanding analogy is an open area for research

Dictionary Expansion

  • One helpful application of the word-similarity properties we have just discussed is that we can use them to automatically build more complete dictionaries

  • Process

    1. Start with a small “seed” dictionary

    2. Calculate the average embedding of the words in the dictionary

    3. Calculate the cosine similarity between the dictionary embedding and all other words

    4. Report the most similar words and use them to extend the original dictionary

  • This approach can enable us to find words associated with our concept of interest but which may not occur to the researcher a priori

Dictionary Expansion Application

# Define seed dictionary
seed_dictionary <- c("hate", "dislike", "despise")

# Extract seed words 
hate_words <- glove[which(rownames(glove) %in% seed_dictionary),]

# Calculate mean embedding
hate_words_vec <- colMeans(hate_words)

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(hate_words_vec, nrow = 1))

# Print output
names(sort(target_sim[,1], decreasing = T))[1:40]
 [1] "despise"     "dislike"     "hate"        "loathe"      "hatred"      "hated"       "detest"      "hates"       "disdain"     "distrust"    "adore"       "resent"      "disliked"    "distaste"    "hating"      "dislikes"    "admire"      "loathing"    "feelings"    "bigotry"     "antipathy"  
[22] "affection"   "envy"        "animosity"   "intolerance" "equate"      "abhor"       "despises"    "hostility"   "despised"    "profess"     "admiration"  "criticize"   "liking"      "fear"        "perceive"    "disrespect"  "mistrust"    "disliking"   "hateful"    

Applications

Application 1

Are female politicians less aggressive than male politicians? (Hargrave and Blumenau, 2022)

In lecture 2 we investigated the claim that male and female politicians have distinct styles. Previously, we applied an existing sentiment dictionary to a corpus of parliamentary texts. Today, we will supplement this approach by using word-embeddings to automatically expand the set of words we use to score speeches.

Aggressive Word Dictionary

library(quanteda)
aggression_words <- read.csv("aggression_words.csv")[,1]
print(aggression_words)
  [1] "irritated"         "stupid"            "stubborn"          "accusation"        "acuse"             "accusations"       "accusing"          "anger"             "angered"           "annoyance"         "annoyed"           "attack"            "insult"            "insulting"        
 [15] "insulted"          "betray"            "betrayed"          "blame"             "blamed"            "blaming"           "bitter"            "bitterly"          "bitterness"        "complain"          "complaining"       "confront"          "confrontation"     "fibber"           
 [29] "fabricator"        "phony"             "fibber"            "sham"              "deceived"          "deceive"           "disgrace"          "villain"           "good-for-nothing"  "hypocrite"         "deception"         "steal"             "needlessly"        "needless"         
 [43] "criticise"         "criticised"        "criticising"       "blackened"         "fiddled"           "fiddle"            "problematic"       "lawbreakers"       "offenders"         "offend"            "unacceptbale"      "leech"             "phoney"            "appalling"        
 [57] "incapable"         "farcical"          "absurd"            "ludicrous"         "nonsense"          "laughable"         "nonsensical"       "ridiculous"        "outraged"          "hysterial"         "adversarial"       "aggressive"        "shady"             "stereotyping"     
 [71] "unhelpful"         "unnatural"         "assaulted"         "assault"           "assaulting"        "half-truths"       "petty"             "humiliate"         "humiliating"       "confrontational"   "hate"              "hatred"            "furious"           "hostile"          
 [85] "hostility"         "nasty"             "obnoxious"         "sleeze"            "sleezy"            "inadequacy"        "faithless"         "neglectful"        "neglect"           "neglected"         "wrong"             "failure"           "failures"          "failed"           
 [99] "fail"              "scapegoat"         "cruel"             "cruelty"           "demonise"          "demonised"         "tactic"            "trick"             "trickery"          "deceit"            "dishonest"         "deception"         "devious"           "deviouness"       
[113] "shenanigans"       "fraudulence"       "fraudulent"        "fraud"             "swindling"         "archaic"           "sly"               "slyness"           "silly"             "silliness"         "scandal"           "scandalous"        "slander"           "slanderous"       
[127] "libellous"         "disreputable"      "dishonourable"     "shameful"          "atrocious"         "gimmick"           "immoral"           "ridicule"          "antagonistic"      "antagonise"        "ill-mannered"      "spiteful"          "spite"             "vindictive"       
[141] "prejudice"         "prejudices"        "disregard"         "arrogant"          "arrogance"         "embarrasment"      "embarrass"         "embarrasing"       "distasteful"       "provoke"           "provoked"          "petulant"          "ignorance"         "stupidity"        
[155] "idiot"             "idiotic"           "annoying"          "dodgy"             "untrue"            "penny-pinching"    "attacking"         "ironic"            "irony"             "outrageous"        "hackery"           "crass"             "backchat"          "rude"             
[169] "ill-judged"        "ragbag"            "mess"              "hash"              "fiasco"            "shambles"          "shambolic"         "farce"             "botch"             "botched"           "blunder"           "mischievous"       "mischief"          "undermine"        
[183] "straightjacket"    "groan"             "abuse"             "chaos"             "chaotic"           "dull"              "predictable"       "negligent"         "grotesque"         "scapegoats"        "hypocrisy"         "bogus"             "counterproductive" "betrayal"         
[197] "patronise"         "patronising"       "reprehensible"     "fool"              "foolish"           "abysmal"           "disgraceful"       "woeful"            "inferior"          "sneaky"            "scaremongering"    "scaremonger"       "coward"            "cowardly"         
[211] "ignorant"          "intolerant"        "unacceptbale"      "condemn"           "short-sighted"     "ashamed"           "falsehood"         "blackmail"         "clownery"          "debased"           "debase"            "hypocracy"         "mislead"           "misleading"       
[225] "smokescreen"       "subterfuge"        "horrendous"        "despicable"        "deplorable"       

Although this is a reasonable-looking list of aggressive words, are there other words that MPs might use to criticise each other in parliamentary debate?

Estimating Word Embeddings

  • In this instance, we will use the GloVe model to estimate a local set of word embeddings

  • The hope is that this will allow us to pick up on the ways in which aggressive words are used in the specific context of parliamentary debate

library(text2vec)

# Load data

load("debates_fcm.Rdata")

## Fit GloVe model

glove = GlobalVectors$new(rank = 150, x_max = 2500L)
debate_main = glove$fit_transform(debates_fcm, n_iter = 500,
                                  convergence_tol = 0.005,
                                  n_threads = 3, learning_rate = 0.14)

## Extract word embeddings
debate_context = glove$components
  
word_vectors = debate_main + t(debate_context)

save(word_vectors, file = "word_vectors_150.Rdata")
str(word_vectors)
 num [1:14683, 1:150] 0.0983 0.1094 0.2122 0.1579 0.1368 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:14683] "house" "proceeds" "choice" "speaker" ...
  ..$ : NULL

Incorporating Word Embeddings

With our word-embeddings in hand, we can then use them to create a dictionary embedding by averaging over the embeddings for each word:

## Extract word embeddings of words in dictionary
target_words <- word_vectors[aggression_words,] 

## Calculate mean embedding for this dictionary
target_vector <- colMeans(target_words)

## Distance between each word in the vocabulary and the mean embedding
cos_sim <- sim2(word_vectors, 
                matrix(target_vector, nrow = 1)) 

## Store results
word_scores <- data.frame(score = cos_sim[,1], 
                          in_original_dictionary = dimnames(cos_sim)[[1]]%in%aggression_words)

Incorporating Word Embeddings

word_scores <- word_scores[order(word_scores$score, decreasing = T),]
head(word_scores, 30)
                   score in_original_dictionary
disgraceful    0.6828765                   TRUE
shameful       0.6611104                   TRUE
outrageous     0.6553395                   TRUE
scaremongering 0.6348777                   TRUE
utterly        0.6147277                  FALSE
cynical        0.6142637                  FALSE
frankly        0.6091110                  FALSE
scandalous     0.6077674                   TRUE
dishonest      0.6039909                   TRUE
embarrassing   0.5921001                  FALSE
absurd         0.5897929                   TRUE
ridiculous     0.5887514                   TRUE
ludicrous      0.5873914                   TRUE
deplorable     0.5846311                   TRUE
incompetence   0.5773249                  FALSE
misguided      0.5683095                  FALSE
irresponsible  0.5675159                  FALSE
pathetic       0.5667197                  FALSE
appalling      0.5536031                   TRUE
dreadful       0.5514060                  FALSE
nonsense       0.5435853                   TRUE
bizarre        0.5403646                  FALSE
complacency    0.5319133                  FALSE
ashamed        0.5275484                   TRUE
illogical      0.5250856                  FALSE
arrogant       0.5205944                   TRUE
incompetent    0.5176657                  FALSE
shocking       0.5158744                  FALSE
accusation     0.5158350                   TRUE
arrogance      0.5152852                   TRUE

Scoring Speeches

  • In addition to using this approach to find words we might have missed, we now have scores associated with each word that indicate the relevance of the word to the concept of interest
  • We can use these word-weights to score individual speeches

\[Score_i = \frac{\sum_{w=1}^W Sim_w N_{w,i}}{\sum_{w=1}^W N_{w,i}}\]

  • \(Sim_w\) is the cosine similarity between the embedding of word \(w\) and the dictionary embedding

  • \(N_{w,i}\) is the tf-idf count of word \(w\) in speech \(i\)

  • \(Score_i\) is therefore a weighted average of how strongly the words in speech \(i\) relate to the concept contained in the seed dictionary (see the sketch after this list)

  • Key advantage: speech scores reflect the ways that aggressive words are used in the context of parliamentary debate
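
A minimal sketch of this scoring rule, assuming speeches_dfm is a tf-idf-weighted dfm of the debates and word_scores is the data frame created above; these names are illustrative rather than replication code:

library(quanteda)

score_speeches <- function(speeches_dfm, word_scores) {
  # Keep only features that have an embedding-based similarity score
  common <- intersect(featnames(speeches_dfm), rownames(word_scores))
  N   <- as.matrix(speeches_dfm[, common])  # N_{w,i}: tf-idf weighted counts
  sim <- word_scores[common, "score"]       # Sim_w: similarity weights
  # Weighted average per speech (speeches with no scored words return NaN)
  as.vector(N %*% sim) / rowSums(N)
}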

Comparison with Traditional Dictionaries

Application 2

How do words change in meaning over time? (Hamilton et al., 2018)

Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Hamilton et al. estimate word embeddings (co-occurrence vectors and Word2Vec) on the Google Books corpus to evaluate changes in word meaning over two centuries.

  • Frequent words change meaning at a slower rate; rare words change meaning faster
  • Words with multiple meanings change at a faster rate; words with a single meaning change more slowly

Extensions

Bias in Word-Embeddings

  • An important substantive finding about word-embedding methods is that they can learn human biases in the semantic relationships they encode into the vector space
  • This occurs because they are trained on human-generated data: if biased relations between words occur frequently in natural language texts, the word-embeddings learn those biases

“There is nothing about doing data analysis that is neutral. What and how data is collected, how the data is cleaned and stored, what models are constructed, and what questions are asked – all of this is political.” Danah Boyd, NYU

  • Another important theme of current work is in “de-biasing” word-embedding methods

Bias in Word-Embeddings, Example

# Extract vectors
doctor <- glove[which(rownames(glove) == "doctor"),]
father <- glove[which(rownames(glove) == "father"),]
mother <- glove[which(rownames(glove) == "mother"),]

# Generate analogy vector
target <- (doctor - father) + mother

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(target, nrow = 1))

# Print output
sort(target_sim[,1], decreasing = T)[1:10]
   doctor     nurse   doctors     woman   patient    mother physician  pregnant  hospital   medical 
0.8397708 0.6648028 0.6255664 0.5923487 0.5839312 0.5719679 0.5527085 0.5417390 0.5404372 0.5336439 

Racial Bias in Word-Embeddings (Garg et al., 2018, PNAS)

Polysemes

  • A polyseme is a word or phrase that has multiple meanings
  • For example:

    • Pike

      • A sharp point or staff
      • A type of elongated fish
      • A railroad line
      • The future (coming down the pike)
      • A type of body position (as in diving)
      • Etc
  • We typically have a single word-embedding vector for each word, which will be a (weighted) average of the contexts in which these different usages occur
  • Contextual Word Embeddings (BERT; ELMo) are one potential solution: estimate a separate vector for every token, not just every type

Contextual Word Embeddings

Rodriguez et al., 2023 provide a demonstration of combining regression methods with contextual word embeddings to answer substantive social science research questions.

Nearest Neighbours of “Equality” in 1885 and 2005

Equality: 1885    Equality: 2005
enactment         gender
abolition         gays
slavery           lesbians
amendment         transgender
abrogation        lgbt

Nearest Neighbours of “Trump” and “trump”

Trump             trump
president         declarer
impeaching        trumps
assailing         colloquies
president-elect   four-point
impeach           upend

Conclusion

Summing Up

  • Word embedding methods provide a representation of word “meaning” by encoding information about the contexts in which words occur

  • These vectors result in a rich representation that allows us to measure the similarity between different words

  • There are several modelling decisions to make when estimating word embeddings, including modelling approach, context-window size, and embedding length

  • Embeddings can be a useful addition to many of the methods we have previously studied on the course

Seminars

In today’s seminar, we will learn how to use pre-trained word embeddings to measure word similarities, solve analogy tasks, and conduct dictionary expansion.

(Please download the GloVe embeddings from the link on the course webpage before coming to class.)