7: Word Embeddings

Jack Blumenau

Module feedback

Could we also go over the application of ‘content’ argument for the topic model (in stm function)?

Can we have the code that produces the graph you showed us in the lecture slides?

Please give us more guidance in cleaning text related data to better conduct analysis.

Response: I will add materials on these questions to the course website.

Module feedback

have difficulty in coming up with research questions and get access to appropriate data sources

Please could you release examples of good research projects from students from previous years?

Response: I will add some examples to the course website. If you are struggling, please come to see me during office hours.

Module feedback

The slides are a bit dense, it can make it difficult to focus on the content when you’re fighting to get everything written down.

We go through content quite quickly, and it’s hard to get all the seminar work done in class, but I’m liking the course!

The lecture slides are very long and hard to get through. We understand there’s so much to get through but sometimes it’s too much. The seminar is amazing.

Response: I will try to slim things down and to go slower

Introduction to Word Embeddings

Sparse Representations of Words

Up until this point of the course, we have always implicitly used representations of words that are sparse.

Words were represented as one-hot encodings, i.e. word-specific vectors that take the value of 1 only for that word, and 0 for all others. E.g.

\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0, & 0, & 1, & 0, & ..., & 0 \end{bmatrix} \end{align}\]

\[\begin{align} w_{\text{deficit}} &= \begin{bmatrix} 0, & 0, & 0, & 1, & ..., & 0 \end{bmatrix} \end{align}\]

The problem with this representation is that it contains no notion of similarity between words.

The dot product between the vectors of any two distinct words is zero, and so is their cosine similarity:

\[ \cos(\theta) = \frac{\mathbf{w_{\text{deficit}}} \cdot \mathbf{w_{\text{debt}}}}{\left|\left| \mathbf{w_{\text{deficit}}} \right|\right| \left|\left| \mathbf{w_{\text{debt}}} \right|\right|}= \frac{0}{1 \times 1} = 0 \]

This is true for any pair of words, which is clearly nonsense as some pairs of words are more similar to each other than other pairs of words.
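The point can be checked numerically. The following is a minimal sketch in base R, using a small made-up vocabulary (an assumption for illustration, not the course corpus); `one_hot()` and `cosine()` are hypothetical helper names.

```r
# Toy vocabulary (hypothetical; not the course corpus)
vocab <- c("the", "of", "debt", "deficit", "tax")

# One-hot encode a word: 1 in the word's own position, 0 elsewhere
one_hot <- function(word, vocab) as.numeric(vocab == word)

w_debt    <- one_hot("debt", vocab)
w_deficit <- one_hot("deficit", vocab)

# Cosine similarity: dot product divided by the product of the vector norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(w_debt, w_deficit)  # two distinct words
cosine(w_debt, w_debt)     # a word with itself
```

Whichever two distinct words are chosen, the 1s never line up, so the similarity is zero; only a word compared with itself scores 1.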

Problems with Sparse Word Representations

Mechanical problems:

Similarity
- Documents might have zero term overlap, but have nearly identical meanings
- E.g. “Quantitative text analysis is very successful.” vs “Natural language processing is tremendously effective.”
Classification/Dictionaries/Supervised scaling
- We may know or learn that one word is connected to a concept, but that doesn’t tell us anything about other similar words
- If we learn that “turmeric” is highly predictive of the concept of interest, shouldn’t we also learn something about “garlic”, “saffron”, and “ginger”?
Topic models/Unsupervised scaling
- If “bank”, “economy”, “interest”, and “rates” have high probability under a topic, shouldn’t “monetary” also have high probability?

Substantive problem: Our representation does not encode any information about the “meaning” of a word. The binary representation implies that all words are distinct to an equal degree, which seems a strong assumption.

$\Rightarrow$ We would prefer a representation that allows us to capture similarities between word meanings.

Distributional Semantics

The distributional hypothesis: the meaning of a word can be derived from the distribution of contexts in which it appears.

We can learn about the meaning of a word by investigating the distribution of words that show up around the word
- “You shall know a word by the company it keeps!” J.R. Firth (1957)
- “The meaning of words lies in their use” Ludwig Wittgenstein (1953)
The hypothesis implies that words that appear in similar “contexts” will share similar meanings
This simple (and old) idea is one of the most influential and successful ideas in modern natural language processing
Word embedding approaches represent the distributional “meaning” of a word as a vector in multidimensional space

Distributional Semantics

When a word $w$ appears in a text, its “context” is the set of words that appear nearby (within a fixed-size window)
We use the many contexts of $w$ to build up a representation of $w$

pre	keyword	post
can be delivered for the	banking	industry in Europe . I
instance I am referring to	banking	. It is not only
that , if the second	banking	directive comes into force without
the future of the British	banking	industry within the European Community
the future of the British	banking	industry within the European Community
the Government expect the second	banking	directive to come into force

pre	keyword	post
during the passage of the	Finance	Bill , but I can
is referring to taxpayers '	finance	and public sector funding ,
world when it comes to	finance	. The industry is being
of the European Council of	Finance	Ministers with his Community colleagues
of the European Council of	Finance	Ministers with his Community colleagues
practically involved in local government	finance	. It must have come

Word Embedding Overview

The meaning of each word is based on the distribution of terms with which it co-occurs
We represent this meaning using a vector for each word
Vectors are constructed such that similar words are close to each other in “semantic” space
We build this space automatically by seeing which words are close to one another in texts

Dense Representations of Words

Our goal will be to build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts (measuring similarity as the dot product)

\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0.73 \\ 0.04 \\ 0.07 \\ -0.18 \\ 0.81 \\ -0.97 \end{bmatrix} \end{align}\]

\[\begin{align} w_{\text{deficit}} &= \begin{bmatrix} 0.63 \\ 0.14 \\ 0.02 \\ -0.58 \\ 0.43 \\ -0.66 \end{bmatrix} \end{align}\]

These representations are known as word embeddings because we “embed” words into a low-dimensional space (low compared to the vocabulary size).

Advantages of Word Embeddings

Low-dimensional word embeddings offer three core advantages over simple word counts. They:

Encode similarity between words
- We no longer have word similarities of zero!
- Each word is a vector, with the vectors of similar words closer together than vectors of very different words
Allow for “automatic generalization”
- Imagine that we discover “fantastic” is a good predictor of positive reviews, but we never observe the word “extraordinary” in our training corpus
- Because “fantastic” and “extraordinary” will have similar word vectors, we can share information across words, and apply what we have learned about one word to our understanding of another
- Can lead to large performance gains for prediction and topic modelling tasks
Provide a measure of meaning
- We will often be interested in the meaning of words as a quantity in its own right
- Very specific meaning of “meaning”: two words share a meaning when used in similar contexts

Contrasting Approaches

The material that we study this week is different on several dimensions from the approaches on the course so far:

	Traditional Unsupervised	Traditional Supervised	Word Embeddings
Bag of words?	Yes	Yes	No
Inputs	DFM	DFM	Feature co-occurrence matrix
Outputs	Topics;Scales	Categorization	Word vectors

Estimating Word Embeddings

Design Choices in Word Embeddings

Data
- High-quality embeddings require a large amount of training data
- Usually trained on large external corpora (e.g. Wikipedia, news articles, web pages)
- Embeddings will reflect the language used in the training documents

Context Window Size
- If meaning is defined by a word’s context, we need to define context
- Usually implemented as a symmetric window of some length around each word
- The size of the window dictates the kind of information the embedding will capture

Dimension of the Embedding
- The embedding for each word will typically be between 50 and 500 elements long
- The embedding encodes information about the contexts a word appears in, so in theory larger embeddings are able to encode more information
- In practice, medium-dimensional embeddings (100-300) work very well

Algorithm
- There are several algorithms that allow us to learn embeddings from data
- Count-based, sparse approaches (e.g. co-occurrence vectors)
- Neural network, dense approaches (e.g. “Word2Vec” and “GloVe”)

Co-occurrence Vectors

One simple “embedding” is produced by counting the occurrences of each term within a fixed window of every other term and storing these counts in a feature co-occurrence matrix:

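# quanteda provides tokens() and fcm(); magrittr provides the %>% pipe
library(quanteda)
library(magrittr)

# Count co-occurrences of every pair of features within a 3-word window
debates_fcm <- 
  debates_corpus %>%
  tokens() %>%
  tokens_tolower() %>%
  fcm(context = "window",
      window = 3,
      tri = FALSE) # tri = FALSE keeps the full symmetric matrix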

save(debates_fcm, file = "../data/debates_fcm.Rdata")

debates_fcm

Feature co-occurrence matrix of: 327,761 by 327,761 features.
          features
features   before     the  house proceeds      to choice      of       a speaker       ,
  before      948  119351  17939       89   28688    213   21405   18847     277   47978
  the      119351 3792520 580515     4167 3629935  13840 6681402  628635   18558 3962528
  house     17939  580515   2006      112  150334    176  211732   33340    1401  124961
  proceeds     89    4167    112        6     817      1    2572     197       0     870
  to        28688 3629935 150334      817 1053898   7231  736710  957469    5363 1291597
  choice      213   13840    176        1    7231    298    9407   10100      31    8430
  of        21405 6681402 211732     2572  736710   9407  755200 1244591    2787 1491028
  a         18847  628635  33340      197  957469  10100 1244591  161468    2751  985450
  speaker     277   18558   1401        0    5363     31    2787    2751     412   83721
  ,         47978 3962528 124961      870 1291597   8430 1491028  985450   83721 1766522
[ reached max_feat ... 327,751 more features, reached max_nfeat ... 327,751 more features ]

The word “house” appears within three words of the word “before” 17939 times in this corpus
The word “speaker” never appears within three words of “proceeds” in this corpus
Etc

Typically, these vectors are then weighted by the pointwise mutual information:

\[PMI(w,c) = \log\frac{P(w,c)}{P(w)P(c)}\]

$\rightarrow$ PMI compares the probability of two words occurring together to what this probability would be if the words were unrelated (i.e. statistically independent)

$\rightarrow$ larger weights on word pairs that co-occur more often than their individual frequencies would predict, lower weights on pairs involving features that commonly occur with many other features (similar intuition to tf-idf).
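To make the weighting concrete, here is a small sketch in base R that applies the PMI formula to a made-up 3-by-3 co-occurrence matrix. The counts are invented for illustration, and truncating negative values at zero ("positive PMI") is a common convention assumed here rather than something taken from the slides.

```r
# Toy symmetric co-occurrence counts (made-up numbers for illustration)
words <- c("bank", "money", "the")
X <- matrix(c( 0,  4, 10,
               4,  0,  6,
              10,  6, 20),
            nrow = 3, byrow = TRUE, dimnames = list(words, words))

# Joint probability of each (word, context) pair
p_wc <- X / sum(X)

# Marginal probabilities of words (rows) and contexts (columns)
p_w <- rowSums(p_wc)
p_c <- colSums(p_wc)

# PMI(w, c) = log( P(w, c) / (P(w) P(c)) )
pmi <- log(p_wc / outer(p_w, p_c))

# Pairs that never co-occur have PMI of -Inf, so truncate at zero
ppmi <- pmax(pmi, 0)
ppmi
```

In this toy matrix "bank" co-occurs with "the" more often than with "money" in raw counts, but "the" co-occurs with everything, so the PMI for ("bank", "money") ends up larger than the PMI for ("bank", "the").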

Co-occurrence Vectors

Given this representation, we can calculate the cosine similarity between the word vectors of target words to find the closest other words in the embedding space:

library(quanteda.textstats)

word_similarities <- textstat_simil(debates_fcm,
                                    debates_fcm[which(featnames(debates_fcm) %in% c("election", "health", "banking")),],
                                    method = "cosine",
                                    margin = "documents")


sort(word_similarities[,1], decreasing = TRUE)[1:10]

  election  elections referendum    general  electoral    elected  manifesto      party       vote     labour 
 1.0000000  0.2697683  0.2190709  0.2078489  0.1907036  0.1864098  0.1858062  0.1847441  0.1841158  0.1833788

sort(word_similarities[,2], decreasing = TRUE)[1:10]

    health     mental        nhs       care    service   services     social healthcare  education    medical 
 1.0000000  0.3688996  0.2943339  0.2893435  0.2541648  0.2453370  0.2348707  0.2285000  0.2264886  0.2239171

sort(word_similarities[,3], decreasing = TRUE)[1:10]

     banking        banks      lending    financial         bank institutions    corporate     consumer   regulatory      markets 
   1.0000000    0.2769608    0.2318190    0.2301063    0.2215138    0.2132605    0.2086125    0.2057390    0.2057376    0.2046487

Co-occurrence Vectors

Co-occurrence vectors clearly capture something about the meaning of words
However, the length of these vectors grows with the size of the vocabulary of the corpus
- The vectors above each have length 327761
One consequence is that they tend to be very sparse (most words fail to occur with most other words)
- E.g. the sparsity of the example above is 99.9%
In most applications, sparse vectors like these tend to perform less well than dense vectors
- Similarity; classification; unsupervised learning, etc
We would therefore prefer a low-dimensional representation that didn’t suffer from these sparsity issues

Word2Vec Overview

Word2Vec (Mikolov et al, 2013) is a set of related methods for learning dense word vectors.

One version of Word2Vec – skip-gram with negative sampling – follows this basic process:

Start with a very large corpus of text (e.g. all of Wikipedia)
Represent each word in the vocabulary as a vector, $v_j$
- Initialise each vector with random numbers
Go through each position, $t$, in the text, where each position has
- A center word, $w_t$ (the “target” word)
- Context words, $w_o$ (the “outside” words)
Calculate the probability of observing $w_o$ given $w_t$ (or vice versa), using the similarity of the word vectors for $w_o$ and $w_t$
Adjust the values of the word vectors, $v_j$, to …
- …maximise the probability of observing true context words
- …minimise the probability of observing other words from the corpus

Word2Vec Intuition

What is the probability of observing the context words given the center word, ‘UN’?
\[\begin{equation} \text{important }\underbrace{\text{\textcolor{orange}{to get }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{UN }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{agreement as }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{the last hope of demonstrating international agreement }\end{equation}\]
$p(w_o = \text{to}|w_t = \text{UN})$
$p(w_o = \text{get}|w_t = \text{UN})$
$p(w_o = \text{agreement}|w_t = \text{UN})$
$p(w_o = \text{as}|w_t = \text{UN})$

What is the probability of observing the context words given the center word, ‘agreement’?
\[\begin{equation} \text{important to }\underbrace{\text{\textcolor{orange}{get UN }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{agreement }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{as the }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{last hope of demonstrating international agreement }\end{equation}\]
$p(w_o = \text{get}|w_t = \text{agreement})$
$p(w_o = \text{UN}|w_t = \text{agreement})$
$p(w_o = \text{as}|w_t = \text{agreement})$
$p(w_o = \text{the}|w_t = \text{agreement})$

What is the probability of observing the context words given the center word, ‘as’?
\[\begin{equation} \text{important to get }\underbrace{\text{\textcolor{orange}{UN agreement }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{as }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{the last }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{hope of demonstrating international agreement }\end{equation}\]
$p(w_o = \text{UN}|w_t = \text{as})$
$p(w_o = \text{agreement}|w_t = \text{as})$
$p(w_o = \text{the}|w_t = \text{as})$
$p(w_o = \text{last}|w_t = \text{as})$

What is the probability of observing the context words given the center word, ‘the’?
\[\begin{equation} \text{important to get UN }\underbrace{\text{\textcolor{orange}{agreement as }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{the }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{last hope }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{of demonstrating international agreement }\end{equation}\]
$p(w_o = \text{agreement}|w_t = \text{the})$
$p(w_o = \text{as}|w_t = \text{the})$
$p(w_o = \text{last}|w_t = \text{the})$
$p(w_o = \text{hope}|w_t = \text{the})$

What is the probability of observing the context words given the center word, ‘last’?
\[\begin{equation} \text{important to get UN agreement }\underbrace{\text{\textcolor{orange}{as the }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{last }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{hope of }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{demonstrating international agreement }\end{equation}\]
$p(w_o = \text{as}|w_t = \text{last})$
$p(w_o = \text{the}|w_t = \text{last})$
$p(w_o = \text{hope}|w_t = \text{last})$
$p(w_o = \text{of}|w_t = \text{last})$

What is the probability of observing the context words given the center word, ‘hope’?
\[\begin{equation} \text{important to get UN agreement as }\underbrace{\text{\textcolor{orange}{the last }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\underbrace{\text{\textcolor{red}{hope }}}_{\substack{\text{Center word at }\\ \text{position $t$}}}\underbrace{\text{\textcolor{orange}{of demonstrating }}}_{\substack{\text{Context words}\\ \text{window of size 2}}}\text{international agreement }\end{equation}\]
$p(w_o = \text{the}|w_t = \text{hope})$
$p(w_o = \text{last}|w_t = \text{hope})$
$p(w_o = \text{of}|w_t = \text{hope})$
$p(w_o = \text{demonstrating}|w_t = \text{hope})$
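The sliding window illustrated above can be sketched in a few lines of base R. This is only an illustration of how (center, context) pairs could be enumerated for a window of size 2; `context_pairs()` is a hypothetical helper, and splitting on spaces stands in for proper tokenisation.

```r
# The example sentence from the slides, split into tokens on spaces
sentence <- paste("important to get UN agreement as the last hope",
                  "of demonstrating international agreement")
tokens <- strsplit(sentence, " ")[[1]]

# Return a data frame of (center, context) pairs for a symmetric window
context_pairs <- function(tokens, window = 2) {
  pairs <- lapply(seq_along(tokens), function(t) {
    # Positions within the window around t, clipped to the text boundaries
    lo  <- max(1, t - window)
    hi  <- min(length(tokens), t + window)
    idx <- setdiff(lo:hi, t)
    data.frame(center = tokens[t], context = tokens[idx],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

pairs <- context_pairs(tokens, window = 2)

# Context words observed around the center word "UN"
pairs[pairs$center == "UN", ]
```

Each row of `pairs` corresponds to one of the probabilities $p(w_o|w_t)$ listed above; words near the start or end of the text simply have fewer context words.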

Word2Vec Objective Function

The objective of the Word2Vec model is to maximise the average log probability:

\[ \frac{1}{T}\sum_{t=1}^T \sum_{-c \leq j \leq c, j\neq0} \log p(w_{t+j}|w_t) \]

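where the probability, $p(w_{t+j}|w_t)$, is defined as:

\[\begin{equation} p(w_{t+j}|w_t) = \frac{\text{exp}(v_{o}^T \cdot v_{t})}{\sum_{w=1}^W\text{exp}(v_{w}^T \cdot v_{t})} \end{equation}\]

Here $v_o$ is the embedding of the context word $w_{t+j}$, $v_t$ is the embedding of the center word $w_t$, and the sum in the denominator runs over all $W$ words in the vocabulary.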

This is an example of the softmax function, which maps arbitrary values to a probability distribution.

Goal: Learn values of the word-embeddings, $v_w$, that maximise the probability of observing the context words that we actually do observe.

Approach: We learn the values of the word-embeddings by stochastic gradient descent – we gradually adjust the values of the embeddings to maximise the probabilities above

Doing no more than this allows the algorithm to learn word vectors that capture word similarity and meaningful directions in a word space
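As a rough illustration of the softmax step, the sketch below computes $p(w_o|w_t)$ for every word in a tiny vocabulary. The vocabulary, the 3-dimensional random embeddings, and the helper `softmax_prob()` are all assumptions for illustration, and a single embedding per word is used, as in the simplified notation on these slides.

```r
set.seed(1)

# Tiny hypothetical vocabulary with random 3-dimensional embeddings
vocab <- c("UN", "agreement", "get", "castle", "pancake")
V <- matrix(rnorm(length(vocab) * 3),
            nrow = length(vocab),
            dimnames = list(vocab, NULL))

# Softmax probability of every word appearing as context for a center word
softmax_prob <- function(V, center) {
  # Dot product between each word's vector and the center word's vector
  scores <- as.vector(V %*% V[center, ])
  names(scores) <- rownames(V)
  # Exponentiate and normalise so that the values sum to one
  exp(scores) / sum(exp(scores))
}

p <- softmax_prob(V, "UN")
p
sum(p)
```

The normalising sum runs over the whole vocabulary, which is trivial with five words but expensive when the vocabulary contains hundreds of thousands of words.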

Learning Skip-Gram Embeddings (Negative Sampling)

The denominator of the softmax function defined above is very computationally expensive to evaluate repeatedly, and so Word2Vec recasts the problem as a supervised learning problem.

Select word from position $t$ and select the “positive” words ($Y=1$) that fall in its context (i.e. $\pm 2$ words)

For each true context word, select $K$ “negative” context words ($Y=0$) at random from the entire corpus

Run a logistic regression, with the positive/negative variable as outcome, and the dot product between the words’ embeddings as predictor ($v_{o}^T \cdot v_{t}$)

Adjust the values of the embeddings to better distinguish between the positive and negative context words

$t$	$o$	$Y$
UN	to	1
UN	get	1
UN	agreement	1
UN	as	1

$t$	$o$	$Y$
UN	to	1
UN	get	1
UN	agreement	1
UN	as	1
UN	castle	0
UN	when	0
UN	chair	0
UN	yoyo	0
UN	pancake	0
UN	whilst	0
UN	tremor	0
UN	foot	0

$t$	$o$	$Y$	$v_{o}^T \cdot v_{t}$
UN	to	1	0.172
UN	get	1	0.226
UN	agreement	1	0.619
UN	as	1	0.351
UN	castle	0	0.112
UN	when	0	-0.559
UN	chair	0	-0.323
UN	yoyo	0	0.371
UN	pancake	0	0.106
UN	whilst	0	0.021
UN	tremor	0	0.052
UN	foot	0	0.105

$t$	$o$	$Y$	$v_{o}^T \cdot v_{t}$
UN	to	1	0.279
UN	get	1	0.465
UN	agreement	1	0.36
UN	as	1	0.164
UN	castle	0	-0.174
UN	when	0	-0.23
UN	chair	0	0.043
UN	yoyo	0	-0.19
UN	pancake	0	-0.049
UN	whilst	0	-0.106
UN	tremor	0	-0.02
UN	foot	0	-0.208

The goal of the learning algorithm is to maximise the similarity between the target and positive context word vectors, and to minimise the similarity between the target and negative context word vectors.
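One way to see what "better distinguishing" means is to compute the logistic loss implied by the dot products in the last two tables above. The sketch below assumes that $P(Y=1) = \sigma(v_o^T \cdot v_t)$ with no intercept, and treats the two tables as the dot products before and after an adjustment of the embeddings; that reading is an interpretation rather than something the slides state. `sgns_loss()` is a hypothetical helper name.

```r
# Logistic (sigmoid) function
sigmoid <- function(x) 1 / (1 + exp(-x))

# Negative log-likelihood of the positive/negative labels given dot products
sgns_loss <- function(dots, y) {
  p <- sigmoid(dots)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Labels: four true context words followed by eight sampled negatives
y <- c(rep(1, 4), rep(0, 8))

# Dot products copied from the two tables above
dots_first  <- c(0.172, 0.226, 0.619, 0.351,
                 0.112, -0.559, -0.323, 0.371, 0.106, 0.021, 0.052, 0.105)
dots_second <- c(0.279, 0.465, 0.36, 0.164,
                 -0.174, -0.23, 0.043, -0.19, -0.049, -0.106, -0.02, -0.208)

loss_first  <- sgns_loss(dots_first, y)
loss_second <- sgns_loss(dots_second, y)
c(first = loss_first, second = loss_second)
```

The loss is lower for the second set of dot products, mainly because most of the negative words now have negative dot products with "UN".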

Word2Vec Intuition

Question: Why do the estimated word-embeddings encode information about word similarity?

The predicted probability of a context word is high when the dot product between the context word’s embedding and the target word’s embedding is high
- $p(w_{t+j}|w_t) = \frac{\text{exp}(v_{o}^T \cdot v_{t})}{\sum_{w=1}^W\text{exp}(v_{w}^T \cdot v_{t})}$
- $v_{o}^T \cdot v_{t} = \sum_{i=1}^n v_{o,i}v_{t,i}$
This encourages the model to find embedding vectors that are similar to one another for words that occur together frequently in the corpus
It also encourages the model to find embedding vectors that are similar to one another for words that appear in similar contexts in the corpus, even if they rarely appear together. E.g.
- “worldcom” and “scandal” appear frequently together
- “enron” and “scandal” appear frequently together
- But “worldcom” and “enron” appear infrequently together
- “worldcom” and “enron” will still need relatively proximate embeddings in order to predict the occurrence of “scandal” in both contexts

GloVe: Global Vectors for Word Representation

The GloVe algorithm builds directly on the idea of the co-occurrence vectors that we discussed previously

It is a weighted least squares model that learns dense vectors from the word-word co-occurrence counts

GloVe models the log of the number of times that each word appears in the context of each other word:

\[\min_\theta J(\theta) \ \ \ \text{where} \ \ \ J(\theta) = \sum_{i=1}^V\sum_{j=1}^Vf(X_{i,j}) (v_i^T \cdot v_j - \text{log}(X_{i,j}))^2\]

Where
- $X$ is a word-word co-occurrence matrix
- $X_{i,j}$ is the number of times word $j$ appears in the context of word $i$

Intuition:

The GloVe model tries to make the dot product between the word vectors for $i$ and $j$ equal to the log of the number of times the two words co-occur

$f(X_{i,j})$ is a weighting function that gives little weight to rare co-occurrences and caps the weight given to very frequent ones (so that neither noisy rare pairs nor very common words dominate)
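The objective can be sketched directly in base R. The weighting function below follows the form given in the original GloVe paper, $f(x) = (x/x_{max})^{\alpha}$ for $x < x_{max}$ and 1 otherwise, with the paper's defaults of $x_{max} = 100$ and $\alpha = 0.75$. The co-occurrence counts and embeddings are made up, and the loss follows the simplified formula on this slide (a single vector per word, no bias terms), so it is an illustration rather than the full GloVe model.

```r
# GloVe weighting function: increasing in the count, capped at 1
glove_weight <- function(x, x_max = 100, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# Simplified GloVe loss for embeddings V (rows = words) and counts X.
# Zero counts are skipped: log(0) is undefined and their weight is zero.
glove_loss <- function(V, X) {
  nz   <- X > 0
  pred <- V %*% t(V)  # dot product for every pair of words
  sum(glove_weight(X[nz]) * (pred[nz] - log(X[nz]))^2)
}

# Made-up symmetric co-occurrence counts and random 2-dimensional embeddings
words <- c("bank", "money", "river")
X <- matrix(c(  0, 150, 5,
              150,   0, 1,
                5,   1, 0),
            nrow = 3, byrow = TRUE, dimnames = list(words, words))
set.seed(1)
V <- matrix(rnorm(6), nrow = 3, dimnames = list(words, NULL))

glove_weight(c(1, 5, 150))
glove_loss(V, X)
```

Fitting the model would mean adjusting `V` to make `glove_loss(V, X)` as small as possible, for example by gradient descent.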

GloVe vs Word2Vec

The core difference between the two models is that GloVe is a model for the global co-occurrence counts, while Word2Vec is an “online” model which trains progressively on a moving window

The GloVe model has some advantages over Word2Vec
- Very fast
- Easily scales to very large corpora
- Good performance on small corpora

However, across many practical applications, there is no clear evidence that one model outperforms the other (Rodriguez and Spirling, 2020)

In both models, the researcher has to make several decisions that can be consequential to the estimated word vectors
- Context-window size
- Embedding dimensionality
- Pre-trained versus local fit

GloVe Embeddings

glove <- readRDS("../data/glove.rds")

str(glove)

 num [1:400000, 1:300] 0.0466 -0.2554 -0.1256 -0.0769 -0.2576 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:400000] "the" "," "." "of" ...
  ..$ : NULL

glove[1,]

  [1]  0.0465600  0.2131800 -0.0074364 -0.4585400 -0.0356390  0.2364300 -0.2883600  0.2152100 -0.1348600 -1.6413000 -0.2609100  0.0324340  0.0566210 -0.0432960 -0.0216720  0.2247600 -0.0751290 -0.0670180 -0.1424700  0.0388250 -0.1895100  0.2997700  0.3930500  0.1788700 -0.1734300 -0.2117800
 [27]  0.2361700 -0.0636810 -0.4231800 -0.1166100  0.0937540  0.1729600 -0.3307300  0.4911200 -0.6899500 -0.0924620  0.2474200 -0.1799100  0.0979080  0.0831180  0.1529900 -0.2727600 -0.0389340  0.5445300  0.5373700  0.2910500 -0.0073514  0.0478800 -0.4076000 -0.0267590  0.1791900  0.0109770
 [53] -0.1096300 -0.2639500  0.0739900  0.2623600 -0.1508000  0.3462300  0.2575800  0.1197100 -0.0371350 -0.0715930  0.4389800 -0.0407640  0.0164250 -0.4464000  0.1719700  0.0462460  0.0586390  0.0414990  0.5394800  0.5249500  0.1136100 -0.0483150 -0.3638500  0.1870400  0.0927610 -0.1112900
 [79] -0.4208500  0.1399200 -0.3933800 -0.0679450  0.1218800  0.1670700  0.0751690 -0.0155290 -0.1949900  0.1963800  0.0531940  0.2517000 -0.3484500 -0.1063800 -0.3469200 -0.1902400 -0.2004000  0.1215400 -0.2920800  0.0233530 -0.1161800 -0.3576800  0.0623040  0.3588400  0.0290600  0.0073005
[105]  0.0049482 -0.1504800 -0.1231300  0.1933700  0.1217300  0.4450300  0.2514700  0.1078100 -0.1771600  0.0386910  0.0815300  0.1466700  0.0636660  0.0613320 -0.0755690 -0.3772400  0.0158500 -0.3034200  0.2837400 -0.0420130 -0.0407150 -0.1526900  0.0749800  0.1557700  0.1043300  0.3139300
[131]  0.1930900  0.1942900  0.1518500 -0.1019200 -0.0187850  0.2079100  0.1336600  0.1903800 -0.2555800  0.3040000 -0.0189600  0.2014700 -0.4211000 -0.0075156 -0.2797700 -0.1931400  0.0462040  0.1997100 -0.3020700  0.2573500  0.6810700 -0.1940900  0.2398400  0.2249300  0.6522400 -0.1356100
[157] -0.1738300 -0.0482090 -0.1186000  0.0021588 -0.0195250  0.1194800  0.1934600 -0.4082000 -0.0829660  0.1662600 -0.1060100  0.3586100  0.1692200  0.0725900 -0.2480300 -0.1002400 -0.5249100 -0.1774500 -0.3664700  0.2618000 -0.0120770  0.0831900 -0.2152800  0.4104500  0.2913600  0.3086900
[183]  0.0788640  0.3220700 -0.0410230 -0.1097000 -0.0920410 -0.1233900 -0.1641600  0.3538200 -0.0827740  0.3317100 -0.2473800 -0.0489280  0.1574600  0.1898800 -0.0266420  0.0633150 -0.0106730  0.3408900  1.4106000  0.1341700  0.2819100 -0.2594000  0.0552670 -0.0524250 -0.2578900  0.0191270
[209] -0.0220840  0.3211300  0.0688180  0.5120700  0.1647800 -0.2019400  0.2923200  0.0985750  0.0131450 -0.1065200  0.1351000 -0.0453320  0.2069700 -0.4842500 -0.4470600  0.0033305  0.0029264 -0.1097500 -0.2332500  0.2244200 -0.1050300  0.1233900  0.1097800  0.0489940 -0.2515700  0.4031900
[235]  0.3531800  0.1865100 -0.0236220 -0.1273400  0.1147500  0.2735900 -0.2186600  0.0157940  0.8175400 -0.0237920 -0.8546900 -0.1620300  0.1807600  0.0280140 -0.1434000  0.0013139 -0.0917350 -0.0897040  0.1110500 -0.1670300  0.0683770 -0.0873880 -0.0397890  0.0141840  0.2118700  0.2857900
[261] -0.2879700 -0.0589960 -0.0324360 -0.0047009 -0.1705200 -0.0347410 -0.1148900  0.0750930  0.0995260  0.0481830 -0.0737750 -0.4181700  0.0041268  0.4441400 -0.1606200  0.1429400 -2.2628000 -0.0273470  0.8131100  0.7741700 -0.2563900 -0.1157600 -0.1198200 -0.2136300  0.0284290  0.2726100
[287]  0.0310260  0.0967820  0.0067769  0.1408200 -0.0130640 -0.2968600 -0.0799130  0.1950000  0.0315490  0.2850600 -0.0874610  0.0090611 -0.2098900  0.0539130

Context-window Size

The size of the context window determines which type of word meaning is represented in the embedding space

Small context windows ($\pm$ 1-3 words) $\rightarrow$ syntactic meaning
- E.g. putting, bringing, taking, giving, providing, etc
Medium context windows ($\pm$ 5-10 words) $\rightarrow$ semantic meaning
- E.g. crimes, crime, offences, offence, prosecutions, murder, etc
Large context windows ($\pm$ 10+ words) $\rightarrow$ topical meaning
- E.g. tourism, visitors, museum, holiday, cafe, etc

$\Rightarrow$ the size of the window will depend on the research question.

Embedding Dimensions

The size of the embedding vectors determines how complex the model is that we wish to fit

We have an embedding for each word, so increasing the embedding dimension by 1 increases the number of parameters to estimate by $V$ (the total number of words in the vocabulary)
Too many dimensions: higher chance of modelling noise
Too few dimensions: higher chance of missing important subtleties in meaning
General guidance: about 150-300 is fine (though this is not a very satisfying answer)
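As a rough worked example, take the dimensions of the pre-trained GloVe matrix loaded earlier ($V = 400{,}000$ words and $d = 300$ dimensions) and ignore any additional parameters that a particular algorithm may estimate. The number of embedding parameters is then:

\[ V \times d = 400{,}000 \times 300 = 120{,}000{,}000 \]

Adding one more dimension would add a further $V = 400{,}000$ parameters.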

What do word-embedding dimensions “mean”?

We can now generate multidimensional vectors for each of our words which, as we will see shortly, are very successful in capturing semantic relations among words

This implies that a meaningful semantic structure must be present in the respective vector spaces

However, it is very difficult to answer questions such as “what do high and low values of the $i$th embedding dimension mean?”

Example:

rownames(glove)[order(glove[,1], decreasing = F)][1:6]

[1] "samiul"     "stuffit"    "guangwei"   "decompress" "resend"     "sife"

rownames(glove)[order(glove[,22], decreasing = F)][1:6]

[1] "cheesecloth" "globe.com"   "zubaie"      "metrohealth" "25-march"    "30-aug"

rownames(glove)[order(glove[,300], decreasing = F)][1:6]

[1] "republish"     "12,000-page"   "transmittable" "affray"        "6-pica"        "spongiform"

Generating “interpretable” word embeddings is the subject of ongoing work

Local Versus Pre-Trained

Either the Word2Vec or GloVe methods can be applied to any large corpus of data. For applied research, there are typically two choices:

Locally-trained embeddings
- Collect a large corpus
- Estimate a word-embedding model
- Use the word-embeddings
- Advantages: Can capture “local” meanings of words which may differ from more general use
- Disadvantages: More computationally expensive and requires more coding decisions/effort

Pre-trained embeddings
- Download a pre-trained set of word-embeddings
- Use the word-embeddings
- Advantages: Usually high-quality embeddings trained on billions of words of text
- Disadvantages: May miss local variation in word meaning

Rodriguez and Spirling (2020) suggest that pre-trained embeddings normally perform as well as locally-trained variants on most tasks.

Visualisation

It is common to see visualisations of word embeddings in lower dimensions
There are many approaches to dimensionality reduction
- Principal Component Analysis (PCA)
- t-distributed stochastic neighbour embedding (t-SNE)
These visualisations often give helpful insights into the ways that language is used in the data that was used to train the models
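A minimal PCA sketch using base R's `prcomp()` is shown below. The embedding matrix here is random noise with made-up row names, purely so that the code is self-contained; with real embeddings one would pass a subset of rows from a matrix such as `glove` instead, and only then would the plot be meaningful.

```r
set.seed(42)

# Hypothetical stand-in for an embedding matrix: 6 words, 50 dimensions
words <- c("tax", "taxes", "income", "nurse", "doctor", "hospital")
emb <- matrix(rnorm(length(words) * 50),
              nrow = length(words),
              dimnames = list(words, NULL))

# Principal component analysis on the (centered) embedding dimensions
pca <- prcomp(emb, center = TRUE, scale. = FALSE)

# Coordinates of each word on the first two principal components
coords <- pca$x[, 1:2]

# Plot the words in the reduced two-dimensional space
plot(coords, type = "n", xlab = "PC1", ylab = "PC2")
text(coords, labels = rownames(coords))
```

Because only two components are kept, distances in the plot are an approximation of distances in the full embedding space.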

Break

Using Word-Embeddings

Similarity

A key advantage of word embeddings: we can compute the similarity between words (or collections of words)
The similarity between two words can be calculated as the cosine of the angle between the embedding vectors:

\[cos(\theta) = \frac{\mathbf{w}_i \cdot \mathbf{w}_j}{\left|\left| \mathbf{w}_i \right|\right| \left|\left| \mathbf{w}_j \right|\right|}\]

We can then sort the words in order of their similarity with the target word and report the “nearest neighbours”
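The calculation can be written out by hand in base R. The sketch below uses a tiny made-up embedding matrix (the numbers are invented for illustration) and a hypothetical helper, `nearest()`, which normalises each row to unit length so that dot products equal cosine similarities.

```r
# Toy 3-dimensional embeddings (made-up values, rows = words)
emb <- rbind(
  tax    = c(0.9, 0.1, 0.0),
  taxes  = c(0.8, 0.2, 0.1),
  income = c(0.6, 0.5, 0.1),
  nurse  = c(0.0, 0.2, 0.9)
)

# Return the n words with the highest cosine similarity to a target word
nearest <- function(emb, word, n = 3) {
  # Divide each row by its Euclidean norm so that every row has length 1
  unit <- emb / sqrt(rowSums(emb^2))
  # Dot products of unit vectors are cosine similarities
  sims <- as.vector(unit %*% unit[word, ])
  names(sims) <- rownames(emb)
  head(sort(sims, decreasing = TRUE), n)
}

nearest(emb, "tax")
```

The target word is always its own nearest neighbour with similarity 1, which is why it appears first in ranked lists of this kind.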

Similarity Demonstration

library(text2vec)

# Extract target embedding
target <- glove[which(rownames(glove) %in% c("taxes", "quantitative", "enron")),]

# Calculate cosine similarity
target_sim <- sim2(glove,
                   target)

# Report nearest neighbours
sort(target_sim[,1], decreasing = T)[1:10]

    taxes       tax    income    paying  taxation       pay  revenues      fees    excise     costs 
1.0000000 0.8410683 0.6800698 0.6292870 0.6281890 0.6165617 0.5964931 0.5957133 0.5911929 0.5867159

# Report nearest neighbours
sort(target_sim[,3], decreasing = T)[1:10]

 quantitative   qualitative     empirical   measurement      analysis   methodology    analytical      analyses methodologies     numerical 
    1.0000000     0.6452297     0.5144293     0.4902190     0.4807779     0.4792444     0.4602726     0.4536292     0.4420934     0.4269905

# Report nearest neighbours
sort(target_sim[,2], decreasing = T)[1:10]

     enron   worldcom   skilling     fastow     dynegy   andersen executives accounting        aig   auditors 
 1.0000000  0.6551277  0.6350623  0.5476151  0.5255642  0.5219762  0.5188566  0.5134948  0.4963848  0.4742722

Analogies

One surprising feature of word-embeddings is that they can capture more nuanced features of language than simple similarity
Among the most widely discussed features of word embeddings is their ability to capture analogies via their geometry
Analogies are linguistic expressions which describe processes of transferring information from one subject (the analogue) to another (the target)
Example:
- Apple is to tree as grape is to ____
- King is to man as _____ is to woman

Word embeddings have some ability to “solve” analogies of this form using vector addition and subtraction

\[\text{vector(king)} - \text{vector(man)} + \text{vector(woman)} \approx \text{vector(queen)}\]

Example process:

Compute vector $\text{vector(king)} - \text{vector(man)} + \text{vector(woman)}$
Calculate cosine similarity between new vector and all word vectors
Report most similar vectors (normally excluding those for king, man and woman)
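
The three steps above can be sketched as a small base-R function. This is only an illustration under stated assumptions: the toy matrix `emb` and the helper `solve_analogy` are hypothetical and the values are invented so that the arithmetic works out cleanly. Real embeddings are far noisier.

```r
# Toy embeddings (invented values): dimension 1 loosely tracks "royalty",
# dimension 2 loosely tracks "gender"
emb <- rbind(
  king  = c(0.9,  0.8, 0.1),
  queen = c(0.9, -0.8, 0.1),
  man   = c(0.1,  0.8, 0.2),
  woman = c(0.1, -0.8, 0.2),
  apple = c(0.0,  0.1, 0.9)
)

# Solve "a is to b as ___ is to c"
solve_analogy <- function(emb, a, b, c, n = 3) {
  # Step 1: compute vector(a) - vector(b) + vector(c)
  target <- emb[a, ] - emb[b, ] + emb[c, ]
  # Step 2: cosine similarity between the new vector and all word vectors
  sims <- drop(emb %*% target) / (sqrt(rowSums(emb^2)) * sqrt(sum(target^2)))
  # Step 3: exclude the input words and report the most similar vectors
  sims <- sims[!(names(sims) %in% c(a, b, c))]
  head(sort(sims, decreasing = TRUE), n)
}

result <- solve_analogy(emb, "king", "man", "woman")
print(result)
```

Excluding the input words matters because the analogy vector often remains closest to one of its own inputs.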

Analogies Demonstration

# Extract vectors
king <- glove[which(rownames(glove) == "king"),]
man <- glove[which(rownames(glove) == "man"),]
woman <- glove[which(rownames(glove) == "woman"),]

# Generate analogy vector
target <- king - man + woman

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(target, nrow = 1))

# Print output
sort(target_sim[,1], decreasing = T)[1:10]

     king     queen   monarch    throne  princess    mother  daughter   kingdom    prince elizabeth 
0.8065858 0.6896163 0.5575491 0.5565375 0.5518684 0.5142154 0.5133157 0.5025345 0.5017740 0.4908031

# Extract vectors
paris <- glove[which(rownames(glove) == "paris"),]
france <- glove[which(rownames(glove) == "france"),]
germany <- glove[which(rownames(glove) == "germany"),]

# Generate analogy vector
target <- paris - france + germany

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(target, nrow = 1))

# Print output
sort(target_sim[,1], decreasing = T)[1:10]

   berlin frankfurt   germany    munich   cologne      bonn    vienna   hamburg   leipzig    german 
0.8082348 0.7182159 0.6976348 0.6616810 0.6388244 0.6297188 0.6096600 0.6015804 0.5951980 0.5929443

# Extract vectors
taller <- glove[which(rownames(glove) == "taller"),]
tall <- glove[which(rownames(glove) == "tall"),]
thin <- glove[which(rownames(glove) == "thin"),]

# Generate analogy vector
target <- taller - tall + thin

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(target, nrow = 1))

# Print output
sort(target_sim[,1], decreasing = T)[1:10]

   thinner       thin    thicker     taller    slimmer   narrower noticeably     softer   slightly     weaker 
 0.6823011  0.6795648  0.6409053  0.4766686  0.4659788  0.4645112  0.4491847  0.4438605  0.4299319  0.4269721

Implication: Word-embedding vectors encode certain linguistic regularities in the relations between words.

Caveats:

Only works with reasonably common words
Only works for certain relations, but not others
Understanding analogy is an open area for research

Dictionary Expansion

One helpful application of the word-similarity properties we have just discussed is that we can use them to automatically build more complete dictionaries
Process
1. Start with a small “seed” dictionary
2. Calculate the average embedding of the words in the dictionary
3. Calculate the cosine similarity between the dictionary embedding and all other words
4. Report the most similar words and use them to extend the original dictionary
This approach can enable us to find words associated with our concept of interest but which may not occur to the researcher a priori
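
As a rough sketch, the four steps of the process can be wrapped into a single helper. The function `expand_dictionary`, the toy matrix `emb`, and the out-of-vocabulary seed `"notaword"` are all hypothetical and used only for illustration. Unlike a plain nearest-neighbour listing, this version drops the seed words from the output so that only new candidates are returned.

```r
# Toy embeddings (invented values for illustration)
emb <- rbind(
  hate    = c(0.90, 0.10, 0.00),
  dislike = c(0.80, 0.20, 0.10),
  loathe  = c(0.85, 0.15, 0.05),
  table   = c(0.00, 0.20, 0.90),
  chair   = c(0.10, 0.10, 0.80)
)

expand_dictionary <- function(emb, seed, n = 5) {
  # Step 1: keep only seed words that appear in the vocabulary
  seed <- intersect(seed, rownames(emb))
  # Step 2: average embedding of the seed words
  dict_vec <- colMeans(emb[seed, , drop = FALSE])
  # Step 3: cosine similarity between the dictionary embedding and all words
  sims <- drop(emb %*% dict_vec) / (sqrt(rowSums(emb^2)) * sqrt(sum(dict_vec^2)))
  # Step 4: report the most similar words that are not already in the seed
  candidates <- sims[!(names(sims) %in% seed)]
  head(sort(candidates, decreasing = TRUE), n)
}

candidates <- expand_dictionary(emb, seed = c("hate", "dislike", "notaword"), n = 2)
print(candidates)
```

The candidate list still needs human review: near neighbours can include antonyms and other related but off-concept words.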

Dictionary Expansion Application

# Define seed dictionary
seed_dictionary <- c("hate", "dislike", "despise")

# Extract seed words 
hate_words <- glove[which(rownames(glove) %in% seed_dictionary),]

# Calculate mean embedding
hate_words_vec <- colMeans(hate_words)

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(hate_words_vec, nrow = 1))

# Print output
names(sort(target_sim[,1], decreasing = T))[1:40]

 [1] "despise"     "dislike"     "hate"        "loathe"      "hatred"      "hated"       "detest"      "hates"       "disdain"     "distrust"    "adore"       "resent"      "disliked"    "distaste"    "hating"      "dislikes"    "admire"      "loathing"    "feelings"    "bigotry"     "antipathy"  
[22] "affection"   "envy"        "animosity"   "intolerance" "equate"      "abhor"       "despises"    "hostility"   "despised"    "profess"     "admiration"  "criticize"   "liking"      "fear"        "perceive"    "disrespect"  "mistrust"    "disliking"   "hateful"

Applications

Application 1

Are female politicians less aggressive than male politicians? (Hargrave and Blumenau, 2022)

In lecture 2 we investigated the claim that male and female politicians have distinct styles. Previously, we applied an existing sentiment dictionary to a corpus of parliamentary texts. Today, we will supplement this approach by using word-embeddings to automatically expand the set of words we use to score speeches.

Aggressive Word Dictionary

library(quanteda)
aggression_words <- read.csv("aggression_words.csv")[,1]

print(aggression_words)

  [1] "irritated"         "stupid"            "stubborn"          "accusation"        "acuse"             "accusations"       "accusing"          "anger"             "angered"           "annoyance"         "annoyed"           "attack"            "insult"            "insulting"        
 [15] "insulted"          "betray"            "betrayed"          "blame"             "blamed"            "blaming"           "bitter"            "bitterly"          "bitterness"        "complain"          "complaining"       "confront"          "confrontation"     "fibber"           
 [29] "fabricator"        "phony"             "fibber"            "sham"              "deceived"          "deceive"           "disgrace"          "villain"           "good-for-nothing"  "hypocrite"         "deception"         "steal"             "needlessly"        "needless"         
 [43] "criticise"         "criticised"        "criticising"       "blackened"         "fiddled"           "fiddle"            "problematic"       "lawbreakers"       "offenders"         "offend"            "unacceptbale"      "leech"             "phoney"            "appalling"        
 [57] "incapable"         "farcical"          "absurd"            "ludicrous"         "nonsense"          "laughable"         "nonsensical"       "ridiculous"        "outraged"          "hysterial"         "adversarial"       "aggressive"        "shady"             "stereotyping"     
 [71] "unhelpful"         "unnatural"         "assaulted"         "assault"           "assaulting"        "half-truths"       "petty"             "humiliate"         "humiliating"       "confrontational"   "hate"              "hatred"            "furious"           "hostile"          
 [85] "hostility"         "nasty"             "obnoxious"         "sleeze"            "sleezy"            "inadequacy"        "faithless"         "neglectful"        "neglect"           "neglected"         "wrong"             "failure"           "failures"          "failed"           
 [99] "fail"              "scapegoat"         "cruel"             "cruelty"           "demonise"          "demonised"         "tactic"            "trick"             "trickery"          "deceit"            "dishonest"         "deception"         "devious"           "deviouness"       
[113] "shenanigans"       "fraudulence"       "fraudulent"        "fraud"             "swindling"         "archaic"           "sly"               "slyness"           "silly"             "silliness"         "scandal"           "scandalous"        "slander"           "slanderous"       
[127] "libellous"         "disreputable"      "dishonourable"     "shameful"          "atrocious"         "gimmick"           "immoral"           "ridicule"          "antagonistic"      "antagonise"        "ill-mannered"      "spiteful"          "spite"             "vindictive"       
[141] "prejudice"         "prejudices"        "disregard"         "arrogant"          "arrogance"         "embarrasment"      "embarrass"         "embarrasing"       "distasteful"       "provoke"           "provoked"          "petulant"          "ignorance"         "stupidity"        
[155] "idiot"             "idiotic"           "annoying"          "dodgy"             "untrue"            "penny-pinching"    "attacking"         "ironic"            "irony"             "outrageous"        "hackery"           "crass"             "backchat"          "rude"             
[169] "ill-judged"        "ragbag"            "mess"              "hash"              "fiasco"            "shambles"          "shambolic"         "farce"             "botch"             "botched"           "blunder"           "mischievous"       "mischief"          "undermine"        
[183] "straightjacket"    "groan"             "abuse"             "chaos"             "chaotic"           "dull"              "predictable"       "negligent"         "grotesque"         "scapegoats"        "hypocrisy"         "bogus"             "counterproductive" "betrayal"         
[197] "patronise"         "patronising"       "reprehensible"     "fool"              "foolish"           "abysmal"           "disgraceful"       "woeful"            "inferior"          "sneaky"            "scaremongering"    "scaremonger"       "coward"            "cowardly"         
[211] "ignorant"          "intolerant"        "unacceptbale"      "condemn"           "short-sighted"     "ashamed"           "falsehood"         "blackmail"         "clownery"          "debased"           "debase"            "hypocracy"         "mislead"           "misleading"       
[225] "smokescreen"       "subterfuge"        "horrendous"        "despicable"        "deplorable"

Although this is a reasonable-looking list of aggressive words, are there other words that MPs might use to criticise each other in parliamentary debate?

Estimating Word Embeddings

In this instance, we will use the GloVe model to estimate a local set of word-embeddings
The hope is that this will allow us to pick up on the ways in which aggressive words are used in the specific context of parliamentary debate

library(text2vec)

# Load data

load("debates_fcm.Rdata")

## Fit GLOVE model

glove = GlobalVectors$new(rank = 150, x_max = 2500L)
debate_main = glove$fit_transform(debates_fcm, n_iter = 500, convergence_tol = 0.005, n_threads = 3,   learning_rate = 0.14)

## Extract word embeddings
debate_context = glove$components
  
word_vectors = debate_main + t(debate_context)

save(word_vectors, file = "word_vectors_150.Rdata")

str(word_vectors)

 num [1:14683, 1:150] 0.0983 0.1094 0.2122 0.1579 0.1368 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:14683] "house" "proceeds" "choice" "speaker" ...
  ..$ : NULL

Incorporating Word Embeddings

With our word-embeddings in hand, we can then use them to create a dictionary embedding by averaging over the embeddings for each word:

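## Extract word embeddings of words in dictionary
## (indexing by a word that is missing from the vocabulary would raise an error,
## so keep only the dictionary words that have an embedding)
dictionary_in_vocab <- intersect(aggression_words, rownames(word_vectors))
target_words <- word_vectors[dictionary_in_vocab,]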

## Calculate mean embedding for this dictionary
target_vector <- colMeans(target_words)

## Cosine similarity between each word in the vocabulary and the mean embedding
cos_sim <- sim2(word_vectors, 
                matrix(target_vector, nrow = 1)) 

## Store results
word_scores <- data.frame(score = cos_sim[,1], 
                          in_original_dictionary = dimnames(cos_sim)[[1]]%in%aggression_words)

Incorporating Word Embeddings

word_scores <- word_scores[order(word_scores$score, decreasing = T),]
head(word_scores, 30)

                   score in_original_dictionary
disgraceful    0.6828765                   TRUE
shameful       0.6611104                   TRUE
outrageous     0.6553395                   TRUE
scaremongering 0.6348777                   TRUE
utterly        0.6147277                  FALSE
cynical        0.6142637                  FALSE
frankly        0.6091110                  FALSE
scandalous     0.6077674                   TRUE
dishonest      0.6039909                   TRUE
embarrassing   0.5921001                  FALSE
absurd         0.5897929                   TRUE
ridiculous     0.5887514                   TRUE
ludicrous      0.5873914                   TRUE
deplorable     0.5846311                   TRUE
incompetence   0.5773249                  FALSE
misguided      0.5683095                  FALSE
irresponsible  0.5675159                  FALSE
pathetic       0.5667197                  FALSE
appalling      0.5536031                   TRUE
dreadful       0.5514060                  FALSE
nonsense       0.5435853                   TRUE
bizarre        0.5403646                  FALSE
complacency    0.5319133                  FALSE
ashamed        0.5275484                   TRUE
illogical      0.5250856                  FALSE
arrogant       0.5205944                   TRUE
incompetent    0.5176657                  FALSE
shocking       0.5158744                  FALSE
accusation     0.5158350                   TRUE
arrogance      0.5152852                   TRUE

Scoring Speeches

In addition to using this approach to finding words we might have missed, we now have scores associated with each word that indicate the relevance of the word to the concept of interest

We can use these word-weights to score individual speeches

\[Score_i = \frac{\sum_w^W Sim_w N_{w,i}}{\sum_w^W N_{w,i}}\]

$Sim_w$ is the cosine similarity between the embedding of word $w$ and the dictionary embedding
$N_{w,i}$ is the tf-idf count of word $w$ in speech $i$
$Score_i$ is therefore a weighted average of the word similarities in speech $i$: it is high when many of the words in the speech are relevant to the concept contained in the seed dictionary

Key advantage: speech scores reflect the ways that aggressive words are used in the context of parliamentary debate
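
A minimal sketch of the scoring formula, using invented numbers. The similarity values in `sim`, the weight matrix `N`, and the helper `score_speeches` are assumptions made for illustration; for simplicity the weights here are raw counts, whereas the formula above uses tf-idf weights.

```r
# Sim_w: similarity of each vocabulary word to the dictionary embedding
# (invented values)
sim <- c(disgraceful = 0.68, utterly = 0.61, house = 0.05, budget = 0.02)

# N_{w,i}: word weights, one row per speech and one column per word
N <- rbind(
  speech_1 = c(disgraceful = 2, utterly = 1, house = 1, budget = 0),
  speech_2 = c(disgraceful = 0, utterly = 0, house = 3, budget = 2)
)

score_speeches <- function(N, sim) {
  # Align the similarity scores with the columns of the weight matrix
  sim <- sim[colnames(N)]
  # Numerator: sum over words of Sim_w * N_{w,i}; denominator: sum of N_{w,i}
  drop(N %*% sim) / rowSums(N)
}

scores <- score_speeches(N, sim)
print(scores)
```

Here `speech_1` receives the higher score because most of its weight falls on words that are close to the dictionary embedding.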

Comparison with Traditional Dictionaries

Application 2

How do words change in meaning over time? (Hamilton et al., 2018)

Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Hamilton et al. estimate word-embeddings (using co-occurrence vectors and Word2Vec) on the Google Books corpus to evaluate changes in word meaning over two centuries.

Frequent words change meaning at a slow rate, rare words change meaning faster
Words with multiple meanings change at a fast rate, words with single meanings change slower

Extensions

Bias in Word-Embeddings

An important substantive finding about word-embedding methods is that they can learn human biases in the semantic relationships they encode into the vector space

This occurs because they are trained on human-generated data: if biased relations between words occur frequently in natural language texts, the word-embeddings learn those biases

“There is nothing about doing data analysis that is neutral. What and how data is collected, how the data is cleaned and stored, what models are constructed, and what questions are asked – all of this is political.” Danah Boyd, NYU

Another important theme of current work is in “de-biasing” word-embedding methods

Bias in Word-Embeddings, Example

# Extract vectors
doctor <- glove[which(rownames(glove) == "doctor"),]
father <- glove[which(rownames(glove) == "father"),]
mother <- glove[which(rownames(glove) == "mother"),]

# Generate analogy vector
target <- (doctor - father) + mother

# Calculate cosine similarity with all other vectors
target_sim <- sim2(glove,
                   matrix(target, nrow = 1))

# Print output
sort(target_sim[,1], decreasing = T)[1:10]

   doctor     nurse   doctors     woman   patient    mother physician  pregnant  hospital   medical 
0.8397708 0.6648028 0.6255664 0.5923487 0.5839312 0.5719679 0.5527085 0.5417390 0.5404372 0.5336439
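
One simple way to put a number on this kind of association is to compare how similar a word is to a female anchor word and to a male anchor word. The sketch below is an illustrative assumption, not the measure used in any of the papers cited in this lecture: the toy matrix `emb`, its values, and the helper `gender_lean` are all hypothetical.

```r
# Toy embeddings (invented values): dimension 2 loosely tracks "gender"
emb <- rbind(
  father = c(0.2,  0.9, 0.1),
  mother = c(0.2, -0.9, 0.1),
  doctor = c(0.8,  0.3, 0.2),
  nurse  = c(0.8, -0.5, 0.2)
)

# Cosine similarity between two vectors
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Positive values: the word is closer to the female anchor;
# negative values: the word is closer to the male anchor
gender_lean <- function(emb, word, female = "mother", male = "father") {
  cosine(emb[word, ], emb[female, ]) - cosine(emb[word, ], emb[male, ])
}

leans <- sapply(c("doctor", "nurse"), function(w) gender_lean(emb, w))
print(round(leans, 3))
```

With real embeddings, the choice of anchor words affects the result, so it is common to average over several anchors rather than rely on a single pair.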

Racial Bias in Word-Embeddings (Garg et al., 2018, PNAS)

Polysemes

A polyseme is a word or phrase that has multiple meanings

For example:
- Pike
  - A sharp point or staff
  - A type of elongated fish
  - A railroad line
  - The future (coming down the pike)
  - A type of body position (as in diving)
  - Etc

We typically have a single word-embedding vector for each word, which will be a (weighted) average of the contexts in which these different usages occur

Contextual Word Embeddings (BERT; ELMo) are one potential solution: estimate a separate vector for every token, not just every type

Contextual Word Embeddings

Rodriguez et al., 2023 provide a demonstration of combining regression methods with contextual word-embeddings to answer substantive social science research questions.

Nearest Neighbours of “Equality” in 1885 and 2005
Equality: 1885	Equality: 2005
enactment	gender
abolition	gays
slavery	lesbians
amendment	transgender
abrogation	lgbt

Nearest Neighbours of “Trump” and “trump”
Trump	trump
president	declarer
impeaching	trumps
assailing	colloquies
president-elect	four-point
impeach	upend

Contextual Embeddings and Modern NLP Methods

Modern NLP models (which we will continue to study over the next three weeks) are all built upon the idea of contextual embeddings which learn word meanings dynamically based on context.

Traditional Embeddings	Contextual Embeddings
One vector per word	Different vectors per word use
Meaning is fixed	Meaning depends on sentence
Word similarity is static	Word relationships change dynamically

Conclusion

Summing Up

Word embedding methods provide a representation of word “meaning” by encoding information about the contexts in which words occur
These vectors result in a rich representation that allows us to measure the similarity between different words
There are several modelling decisions to make when estimating word embeddings, including modelling approach, context-window size, and embedding length
Embeddings improve many NLP tasks – classification, topic modelling, etc – by allowing generalisation and semantic reasoning
Over the next three weeks, embeddings will be a key building block of more advanced models

\(t\)	\(o\)	\(Y\)
UN	to	1
UN	get	1
UN	agreement	1
UN	as	1