8  Word Embeddings

8.1 Similarities, Analogies and Dictionary Expansion

Word embeddings are all the rage.[1] Whenever we have represented words as data in previous weeks, we have simply counted how often they occur across and within documents. We have largely viewed words as individual units: strings of text that uniquely identify a given meaning, for which we have had no natural notion of similarity for grouping similar words together.

[1] A good qualitative indicator of the success of an innovation in quantitative methods is when it is discussed in some detail in the London Review of Books.

By contrast, word-embedding approaches represent each unique word in a corpus as a dense, real-valued vector. As we discussed in the lecture, these vectors turn out to encode a lot of information about the ways in which words are used, and this information can be put to good use across a wide range of questions in the social sciences.

    In the seminar today, we will familiarise ourselves with some of the pre-trained word embeddings from the GloVe project. We will use these vectors to discover similarities between words, to compute analogy-based tasks, and to supplement the dictionary-based approaches to measurement that we covered in week 2.

    8.2 Packages

You will need to load the following packages before beginning the assignment:

    library(tidyverse)
    library(quanteda)
    library(text2vec)
    # If you cannot load these libraries, try installing them first. E.g.: 
    # install.packages("text2vec")

    8.3 Data

Today we will be using the pre-trained GloVe embeddings, which can be downloaded from the link at the top of the seminar page. Note that the file which contains the word embeddings is very large! It may therefore take a minute or two to download, depending on your internet connection.

Despite the large size of the file, we are actually using one of the smaller versions of the GloVe embeddings, which were trained on a combination of Wikipedia and news data. The embeddings are of dimension 300 and cover some 400,000 words. Note that you could replicate any part of the assignment with larger versions of the GloVe embeddings by downloading them from the GloVe project website, but any differences for the applications here are likely to be small.
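If you did want to work with one of the raw GloVe text files instead of the .Rdata object provided, a minimal sketch for building an equivalent matrix is below. The file name glove.6B.300d.txt is an assumption about which version you downloaded, and for files of this size data.table::fread would be considerably faster than readLines().

# Read the raw GloVe text file: each line is a word followed by 300 values
# (assumes "glove.6B.300d.txt" has been downloaded and unzipped)
glove_raw <- readLines("glove.6B.300d.txt")
glove_split <- strsplit(glove_raw, split = " ")

# Build a 400,000 x 300 numeric matrix with words as row names
glove <- t(sapply(glove_split, function(x) as.numeric(x[-1])))
rownames(glove) <- sapply(glove_split, `[`, 1)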

    8.4 Word Similarities

1. Load the GloVe embeddings into R using the load() function.
    load("glove_embeddings.Rdata")
2. Look at the dimensions of the glove embeddings object. How many rows does this object have? How many columns? What do these represent?
    dim(glove)
    [1] 400000    300

Rows represent words. Columns represent the dimensions of the embedding.

3. Write a function to calculate the cosine similarity between a selected word and every other word in the glove embeddings object.[2] It will be useful for you to use the sim2() function[3] from the text2vec package here. Your function should take two inputs:

      1. target_word – the word for which you would like to calculate similarities
      2. n – the number of nearest neighbouring words returned
[2] In contrast to previous weeks, I have not given you the full solution here. Instead, I have provided some starter code below. You should be able to work out what goes in each part of this function by looking at the examples in the lecture slides. I will release the full code when I release the homework solutions at the end of the week.

[3] Note that this function requires two main arguments: 1) x – a matrix of embeddings; 2) y – a second matrix of embeddings for which you would like to compute similarities. It is important to note that both of these inputs must be in matrix form. As the embedding we extract for a single word (and, later, the mean embedding vector) is actually a numeric vector, we have to transform it to a matrix that has the same number of columns as the glove embeddings object. To do so, use the matrix() function, setting the nrow argument equal to 1.
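To illustrate the point in that footnote, here is a small sketch of the conversion (assuming the glove object is loaded and “king” is in its vocabulary):

# Extracting a single row from the matrix returns a plain numeric vector
v <- glove["king", ]
is.matrix(v)   # FALSE

# sim2() needs a matrix, so reshape to one row of 300 columns
v_mat <- matrix(v, nrow = 1)
dim(v_mat)     # 1 300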

    similarities <- function(target_word, n){
    
      # Extract embedding of target word
      target_vector <- glove[which(rownames(glove) %in% target_word),]  
      
      # Calculate cosine similarity between target word and other words
      target_sim <- sim2(glove, matrix(target_vector, nrow = 1))
      
      # Report nearest neighbours of target word
      names(sort(target_sim[,1], decreasing = T))[1:n]
    
    }
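As a sanity check on what sim2() computes, you can compare it to the cosine similarity formula coded by hand. This is just an illustration, and the word pair is arbitrary:

# Cosine similarity between two word vectors, computed directly
a <- glove["text", ]
b <- glove["document", ]
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# The same quantity via sim2(), which should agree
sim2(matrix(a, nrow = 1), matrix(b, nrow = 1))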
4. On the basis of the GloVe word embeddings, what are the 7 most similar words to the words “quantitative”, “text”, and “analysis”?
    similarities("quantitative", n = 7)
    [1] "quantitative" "qualitative"  "empirical"    "measurement"  "analysis"    
    [6] "methodology"  "analytical"  
    similarities("text", n = 7)
    [1] "text"     "texts"    "document" "read"     "messages" "written"  "copy"    
    similarities("analysis", n = 7)
    [1] "analysis"    "analyses"    "study"       "data"        "studies"    
    [6] "analyzed"    "methodology"

    8.5 Word Analogies

1. Write a function that computes analogies of the form “a is to b as c is to ___”. For instance, if a is “man”, b is “king”, and c is “woman”, then the missing word should be “queen”.
      Your function will need to take four arguments. Three arguments should correspond to the words included in the analogy. The fourth should be an argument specifying the number of nearest neighbouring words returned. Again, I have provided some starter code below, which you should be able to complete by consulting the lecture slides.
    analogies <- function(a, b, c, n){
      
      # Extract vectors for each of the three words in analogy task
      a_vec <- glove[which(rownames(glove) == a),]
      b_vec <- glove[which(rownames(glove) == b),]
      c_vec <- glove[which(rownames(glove) == c),]
      
      # Generate analogy vector (vector(c) - vector(a) + vector(b))
      target <- c_vec - a_vec + b_vec
      
  # Calculate cosine similarity between analogy vector and all other vectors
      target_sim <- sim2(glove, matrix(target, nrow = 1))
      
      # Report nearest neighbours of analogy vector
      names(sort(target_sim[,1], decreasing = T))[1:n]
    
    }
2. Use the function you created above to find the word-embedding answers to the following analogy completion tasks.

      • Einstein is to scientist as Picasso is to ___?
      • Arsenal is to football as Yankees is to ___?
      • Actor is to theatre as doctor is to ___?
    analogies("man", "woman", "king", 6)
    [1] "king"     "queen"    "monarch"  "throne"   "princess" "mother"  
    analogies("einstein", "scientist", "picasso", 6)
    [1] "picasso"   "painter"   "painting"  "artist"    "paintings" "scientist"
    analogies("arsenal", "football", "yankees", 6)
    [1] "baseball"   "yankees"    "sox"        "football"   "basketball"
    [6] "braves"    
    analogies("actor", "theatre", "doctor", 6)
    [1] "theatre"  "doctor"   "hospital" "medical"  "theater"  "doctors" 

    This is really very impressive! Although these examples are cherry-picked, it is notable that the embeddings clearly encode some important aspects of word meaning.
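One wrinkle visible above is that the input words themselves tend to rank among the nearest neighbours of the analogy vector (e.g. “king” and “picasso” appear first). A small variant of the function, sketched here under the assumption that all three words are in the vocabulary, simply drops the input words before reporting:

analogies_filtered <- function(a, b, c, n){

  # Generate the analogy vector exactly as before
  target <- glove[c, ] - glove[a, ] + glove[b, ]

  # Calculate cosine similarity between analogy vector and all other vectors
  target_sim <- sim2(glove, matrix(target, nrow = 1))

  # Drop the three input words, then report the top n remaining neighbours
  neighbours <- names(sort(target_sim[,1], decreasing = T))
  neighbours <- neighbours[!neighbours %in% c(a, b, c)]
  neighbours[1:n]

}

analogies_filtered("man", "woman", "king", 6)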

3. Come up with some of your own analogies and try them here.

    8.6 Dictionary Expansion

In seminar 2 we used the Moral Foundations Dictionary to score some Reddit posts in terms of their moral content. In this exercise we will expand the “care” category of the MFD using the GloVe embeddings.

    For this part of the assignment, you will need to have access to the files that we used in seminar 2. These were mft_dictionary.csv and mft_texts.csv. Go and locate them now (or redownload them if you need them).

    Once you have found these files, load them into R:

    mft_dictionary_words <- read_csv("mft_dictionary.csv")
    mft_texts <- read_csv("mft_texts.csv")
    1. Create a vector of the MFT “Care” words.
    care_words <- mft_dictionary_words$word[mft_dictionary_words$foundation == "care"]
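Before extracting embeddings, it can be worth checking how many of these dictionary terms actually appear in the GloVe vocabulary; a quick sketch:

# Proportion of care words that have a GloVe embedding
mean(care_words %in% rownames(glove))

# Any care words missing from the embedding vocabulary
care_words[!care_words %in% rownames(glove)]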
2. Extract the embeddings from the glove object relating to the care words.
    care_embeddings <- glove[rownames(glove) %in% care_words,]
3. Calculate the mean embedding vector of the care words. To do this, use the colMeans() function, which will calculate the mean of each column of the matrix.
    care_embeddings_mean <- colMeans(care_embeddings)
4. Calculate the similarity between the mean care vector and every other word in the glove embeddings object. To do so, use the sim2() function again.
    target_sim <- sim2(x = glove,
                       y = matrix(care_embeddings_mean, nrow = 1))
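The result should be a one-column matrix containing one cosine similarity per vocabulary word; a quick check:

# One similarity per word in the vocabulary: 400,000 rows, 1 column
dim(target_sim)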
5. What are the 500 words that have the highest cosine similarity with the mean care vector? How many of these words are in the original dictionary?
    top500 <- names(sort(target_sim[,1], decreasing = T))[1:500]
    
table(top500 %in% care_words)
    
    FALSE  TRUE 
      386   114 
6. Examine the words that are among the top 500 you calculated above but which are not in the original care dictionary. Do these represent the concept of care?
top500[!top500 %in% care_words]
      [1] "traumatized"     "victimized"      "cruelly"         "helpless"       
      [5] "innocent"        "unborn"          "sufferings"      "endure"         
      [9] "terrified"       "frightened"      "sick"            "callous"        
     [13] "wronged"         "civilians"       "viciously"       "senseless"      
     [17] "terrorizing"     "frighten"        "beatings"        "horrific"       
     [21] "risking"         "inhumane"        "unspeakable"     "confess"        
     [25] "humiliation"     "defenseless"     "loneliness"      "terrorize"      
     [29] "grief"           "bereaved"        "injustice"       "subjecting"     
     [33] "vengeful"        "innocents"       "misfortune"      "abuse"          
     [37] "neglect"         "plight"          "treating"        "traumatised"    
     [41] "stigmatized"     "oppression"      "starving"        "humiliate"      
     [45] "oftentimes"      "grievous"        "unjustly"        "grieving"       
     [49] "humiliated"      "betrayed"        "spared"          "mutilation"     
     [53] "lest"            "ashamed"         "heinous"         "enemies"        
     [57] "abusive"         "barbaric"        "elderly"         "starve"         
     [61] "fear"            "neglecting"      "terrible"        "deprived"       
     [65] "dignity"         "deprive"         "beings"          "dying"          
     [69] "mercilessly"     "hatred"          "grievously"      "oneself"        
     [73] "vicious"         "succumb"         "feelings"        "schoolmates"    
     [77] "fearful"         "horrible"        "horrifying"      "sparing"        
     [81] "hardships"       "afraid"          "risked"          "indiscriminate" 
     [85] "unnecessarily"   "bystanders"      "heartless"       "shame"          
     [89] "ordeal"          "offends"         "subjected"       "enslavement"    
     [93] "handicapped"     "disturbed"       "humanity"        "sacrificing"    
     [97] "degrading"       "oppressed"       "treat"           "retribution"    
    [101] "maiming"         "injure"          "terrify"         "terminally"     
    [105] "unbearable"      "betray"          "suicidal"        "pretending"     
    [109] "seriously"       "perpetrated"     "destitute"       "maimed"         
    [113] "perpetrator"     "insult"          "demeaning"       "depraved"       
    [117] "betraying"       "selfishness"     "aiding"          "traumas"        
    [121] "taunt"           "brutal"          "belittling"      "sickening"      
    [125] "despair"         "needless"        "orphans"         "repress"        
    [129] "terrorized"      "hardship"        "scared"          "infirm"         
    [133] "husbands"        "savagery"        "townspeople"     "mentally"       
    [137] "punishing"       "disciplining"    "sins"            "savagely"       
    [141] "escaping"        "teasing"         "sickness"        "distraught"     
    [145] "captors"         "pregnant"        "oppress"         "spouse"         
    [149] "indignity"       "intolerable"     "intimidated"     "forgiveness"    
    [153] "countrymen"      "cursed"          "conscience"      "knowing"        
    [157] "trauma"          "deserve"         "punishes"        "strangling"     
    [161] "wickedness"      "sinful"          "punished"        "cruelties"      
    [165] "helplessness"    "punish"          "disrespect"      "perceived"      
    [169] "appalling"       "hysterical"      "bystander"       "disfigurement"  
    [173] "treachery"       "homosexuals"     "debilitating"    "evils"          
    [177] "taunts"          "vindictive"      "betrayal"        "tolerate"       
    [181] "provokes"        "thoughtless"     "avenge"          "intimidate"     
    [185] "feel"            "offend"          "despicable"      "needlessly"     
    [189] "malnourished"    "sexually"        "tolerating"      "bodily"         
    [193] "terrifying"      "passivity"       "abduct"          "betrays"        
    [197] "trusting"        "frightening"     "expose"          "misguided"      
    [201] "barbarity"       "severely"        "perceive"        "imprison"       
    [205] "horrendous"      "depriving"       "sadistic"        "committing"     
    [209] "unworthy"        "treated"         "unjust"          "slaughtering"   
    [213] "painful"         "debilitated"     "scolding"        "enslave"        
    [217] "aftereffects"    "disrespectful"   "victimised"      "awaken"         
    [221] "humankind"       "resentful"       "suffocate"       "hypocritical"   
    [225] "cowardly"        "fleeing"         "judgmental"      "ridicule"       
    [229] "insecurities"    "complicit"       "misery"          "caretakers"     
    [233] "detriment"       "physically"      "tenderly"        "habitually"     
    [237] "malnutrition"    "punishment"      "misbehaving"     "villagers"      
    [241] "emotionally"     "relatives"       "selfish"         "deserving"      
    [245] "destitution"     "gravely"         "exposing"        "companionship"  
    [249] "squeamish"       "brutally"        "guilt"           "imprisoning"    
    [253] "orphaned"        "criminals"       "injustices"      "adolescents"    
    [257] "stigma"          "disobedient"     "sacrificed"      "survivors"      
    [261] "remorse"         "empathetic"      "wanton"          "insults"        
    [265] "unsuspecting"    "enslaving"       "aggression"      "involuntarily"  
    [269] "outweighs"       "ignorance"       "screams"         "addicted"       
    [273] "ostracized"      "subordinates"    "immoral"         "caregivers"     
    [277] "merciless"       "perpetrators"    "powerlessness"   "mutilating"     
    [281] "ills"            "depredations"    "stab"            "affection"      
    [285] "ill-treatment"   "ill"             "blinded"         "callously"      
    [289] "insulted"        "manipulative"    "verbally"        "unintentionally"
    [293] "distract"        "illness"         "ungrateful"      "deprivation"    
    [297] "pretend"         "undermines"      "disfigured"      "impotent"       
    [301] "cursing"         "inconvenience"   "motivates"       "witnessing"     
    [305] "aggressors"      "jealousy"        "suffocating"     "children"       
    [309] "sacrifices"      "jealous"         "hideous"         "ostracism"      
    [313] "perverted"       "horrified"       "insensitive"     "confronting"    
    [317] "illnesses"       "befriend"        "scourge"         "willful"        
    [321] "repression"      "revenge"         "annoy"           "scarred"        
    [325] "curses"          "throats"         "jailers"         "horribly"       
    [329] "repressed"       "bigotry"         "excruciating"    "befriending"    
    [333] "coerce"          "traumatic"       "ugliness"        "revulsion"      
    [337] "exposes"         "patronizing"     "stepmother"      "newborns"       
    [341] "indulging"       "promiscuous"     "confront"        "ministering"    
    [345] "unconscionable"  "insidious"       "tenderness"      "indifference"   
    [349] "gruesome"        "trampling"       "embittered"      "despise"        
    [353] "cowardice"       "hopelessness"    "underlings"      "horrors"        
    [357] "scold"           "atrocious"       "banish"          "persecutions"   
    [361] "brainwashed"     "indignities"     "banishing"       "pretended"      
    [365] "abhorrent"       "wrongs"          "genitals"        "counseled"      
    [369] "atrocities"      "susceptible"     "ferocity"        "indifferent"    
    [373] "autistic"        "deprivations"    "demoralizing"    "unfaithful"     
    [377] "transgressions"  "brainwashing"    "ghastly"         "insufferable"   
    [381] "taunting"        "pillage"         "believing"       "unconscious"    
    [385] "afflicting"      "muggers"        

    It is relatively uncontroversial to suggest that these words are representative of the concept of “care”.

7. What does your answer to the previous question suggest about this dictionary expansion approach?

The idea here is that using the word embeddings has allowed us to automatically expand the set of words associated with a concept that we previously measured using a dictionary of terms that were manually selected. This means that we can use information about word similarity to supplement our existing measurement approaches. Whether this leads to performance improvements in terms of classification is the subject of this week’s homework.
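Note that taking a fixed top 500 is only one way to select expansion words. An alternative, sketched below, is to keep every word whose similarity to the mean care vector exceeds some threshold; the 0.4 cut-off here is an arbitrary illustration, not a recommendation:

# Select expansion words by a similarity threshold rather than a fixed n
sims <- target_sim[,1]
expansion_words <- names(sims[sims > 0.4])
length(expansion_words)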

    8.7 Homework

    The mft_texts object includes a series of variables that record the human annotations of which category each text falls into. In this week’s homework, you will again use dictionary-based methods to score the texts, and compare the dictionary scores to those human codings. If you have forgotten how to apply dictionaries, go back and look at the material in seminar two.

1. Create a new dictionary which includes two categories. The first should be a care_original category, which contains only the words from the original care dictionary. The second should be a care_embedding category, which contains both the original care words and the top 500 words that you extracted in the last section.
    # Create a dfm
    mft_dfm <- mft_texts %>% 
      corpus(text_field = "text") %>% 
      tokens(remove_punct = TRUE) %>% 
      dfm() %>%
      dfm_trim(min_termfreq = 5)
    
    # Create a care dictionary
    care_dictionary <- dictionary(list(care_original = care_words,
                                       care_embedding = c(top500[!top500%in%care_words], care_words)))
2. Use the dictionary you just constructed to score the texts in the mft_texts object. Create variables in that object that indicate whether a given dictionary classifies each text as a care text or not (i.e. classify a text as a care text if it contains any words from the relevant dictionary).
    # Score the texts using the dictionaries
    care_dfm_dictionary <- dfm_lookup(mft_dfm, care_dictionary)
    mft_texts$care_original <- as.numeric(care_dfm_dictionary[,1]) > 0
    mft_texts$care_embedding <- as.numeric(care_dfm_dictionary[,2]) > 0
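Before computing performance statistics, it can be informative to see how many texts each dictionary flags; a quick check:

# Number of texts classified as "care" by each dictionary
table(mft_texts$care_original)
table(mft_texts$care_embedding)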
3. Create a confusion matrix which compares the human annotations to the scores generated by the dictionary analysis. Which performs better: the original dictionary or the word-embedding approach?
    # Calculate confusion matrix
    care_original_confusion <- table( 
      dictionary = mft_texts$care_original > 0,
      human_coding = mft_texts$care)
    
    care_embedding_confusion <- table( 
      dictionary = mft_texts$care_embedding > 0,
      human_coding = mft_texts$care)
    
    # Calculate performance statistics
    library(caret)
    confusionMatrix(care_original_confusion, positive = "TRUE")
    Confusion Matrix and Statistics
    
              human_coding
    dictionary FALSE  TRUE
         FALSE 11304  2621
         TRUE   1842  2119
                                              
                   Accuracy : 0.7505          
                     95% CI : (0.7441, 0.7568)
        No Information Rate : 0.735           
        P-Value [Acc > NIR] : 1.219e-06       
                                              
                      Kappa : 0.3239          
                                              
     Mcnemar's Test P-Value : < 2.2e-16       
                                              
                Sensitivity : 0.4470          
                Specificity : 0.8599          
             Pos Pred Value : 0.5350          
             Neg Pred Value : 0.8118          
                 Prevalence : 0.2650          
             Detection Rate : 0.1185          
       Detection Prevalence : 0.2215          
          Balanced Accuracy : 0.6535          
                                              
           'Positive' Class : TRUE            
                                              
    confusionMatrix(care_embedding_confusion, positive = "TRUE")
    Confusion Matrix and Statistics
    
              human_coding
    dictionary FALSE  TRUE
         FALSE 10115  1764
         TRUE   3031  2976
                                              
                   Accuracy : 0.7319          
                     95% CI : (0.7254, 0.7384)
        No Information Rate : 0.735           
        P-Value [Acc > NIR] : 0.8265          
                                              
                      Kappa : 0.366           
                                              
     Mcnemar's Test P-Value : <2e-16          
                                              
                Sensitivity : 0.6278          
                Specificity : 0.7694          
             Pos Pred Value : 0.4954          
             Neg Pred Value : 0.8515          
                 Prevalence : 0.2650          
             Detection Rate : 0.1664          
       Detection Prevalence : 0.3358          
          Balanced Accuracy : 0.6986          
                                              
           'Positive' Class : TRUE            
                                              

In this instance, the accuracy of our predictions decreases slightly when incorporating the words from the word-embedding similarity scores, but the sensitivity of the classification increases substantially. This suggests that the additional words improve our ability to identify texts that express the concept of care, albeit at the cost of introducing a larger number of false positives.
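To put numbers on that trade-off, you could compute precision, recall, and F1 directly from the two confusion matrices; a sketch using the tables created above:

# Precision, recall, and F1 from a dictionary-vs-human confusion table
prf <- function(tab){
  tp <- tab["TRUE", "TRUE"]    # dictionary TRUE, human TRUE
  fp <- tab["TRUE", "FALSE"]   # dictionary TRUE, human FALSE
  fn <- tab["FALSE", "TRUE"]   # dictionary FALSE, human TRUE
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

prf(care_original_confusion)
prf(care_embedding_confusion)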

    You should upload a paragraph explaining the results from the final question of the homework to this Moodle page.