8: Representing Sequences of Words as Data

Michael Jacobs

Motivation and Overview

Recap

Last week, we moved from sparse to dense representations of words:

\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & ...& 0 \end{bmatrix} \end{align}\]

\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0.63 & 0.14 & 0.02 & -0.58 & ...& -0.66 \end{bmatrix} \end{align}\]

This came with several advantages:

  • Better at capturing the meanings of words
  • Allows for “automatic generalisation” (do more with less text)

Can we do the same thing now with sequences of words?

Motivating Problem


Do protest groups’ demands get replicated in the media?

Just Stop Oil activist:

This has sparked millions upon millions of conversations worldwide and I think that if even a tenth of those conversations mentioned new licences for oil and gas, that’s worth it.

How can we test this?

Data:

  • Corpus of approx. 1 million tweets by UK environmental protest groups

  • Corpus of approx. 130,000 UK news articles about the environment

Measurement tasks:

  1. Cluster the tweets to discover the claims

  2. Find new instances of claims in the news

Mapping sentences into a semantic space


Recall that word embeddings mapped synonymous or interchangeable words close together.

What is a meaningful semantic space for sentences?

  1. Paraphrasings

  2. Entailment/contradiction

  3. Question-answer relevance

Running Example


Which of these sentences ought to be clustered together?


A. “The attacker’s late goal won the match for his side.”

B. “The striker’s contribution was crucial to the team’s victory.”

C. “The striker was critical to the success of the team.”

D. “The manager was critical of how the striker played.”

E. “The coach fiercely condemned the striker’s performance.”

The ‘correct’ clustering:

  • A, B and C all attribute credit to the striker

  • D and E both criticise the striker

Running Example

We should expect a cosine similarity matrix that looks like:
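
As a rough sketch of that target, the 'correct' clustering above can be turned into an idealised similarity matrix. The values 1.0 (same cluster) and 0.0 (different cluster) are an assumption made for illustration; real cosine similarities would be graded rather than binary.

```python
# Cluster labels taken from the 'correct' clustering above.
clusters = {"A": "credit", "B": "credit", "C": "credit",
            "D": "criticism", "E": "criticism"}
sentences = list(clusters)

# Idealised target (an assumption): 1.0 within a cluster, 0.0 across clusters.
ideal = [[1.0 if clusters[a] == clusters[b] else 0.0 for b in sentences]
         for a in sentences]

for name, row in zip(sentences, ideal):
    print(name, row)
```

The result is a block-diagonal matrix: a 3x3 block for A, B and C, and a 2x2 block for D and E.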

Roadmap


Bag-of-words approaches

  1. Term frequency representation

  2. Aggregate word embeddings

Beyond bag-of-words

  1. Subword tokenisations

  2. Contextualised embeddings

  3. Sentence embeddings

Applications

  1. Clustering

  2. Information retrieval

Bag-of-Words Approaches

Term frequency representation: Recap

Represent each document with a vector of length \(V\) (number of words in vocabulary), representing the count of each unique word-type.

library(quanteda)

dfm1 <- sentences %>%
  tokens() %>%
  tokens_wordstem() %>%
  dfm()

dfm1
Document-feature matrix of: 5 documents, 27 features (67.41% sparse) and 0 docvars.
       features
docs    the attack late goal won match for his side .
  text1   2      1    1    1   1     1   1   1    1 1
  text2   2      0    0    0   0     0   0   0    0 1
  text3   3      0    0    0   0     0   0   0    0 1
  text4   2      0    0    0   0     0   0   0    0 1
  text5   2      0    0    0   0     0   0   0    0 1
[ reached max_nfeat ... 17 more features ]


Calculate the cosine similarity of each pair of rows:

cosine_sim_matrices <- function (matA, matB) {
  matA %*% t(matB) / sqrt( rowSums(matA^2) %*% t(rowSums(matB^2)))
}

cosine_sim_matrices(dfm1, dfm1)
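
For readers following along in Python, the same idea can be sketched with the standard library only. This is a minimal illustration, not a port of the quanteda pipeline: it assumes whitespace tokenisation with no stemming, and the helper names `bow_vector` and `cosine` are made up for this example.

```python
import math
from collections import Counter

def bow_vector(text):
    """Lower-case, separate the full stop, split on whitespace and count tokens."""
    return Counter(text.lower().replace(".", " .").split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

c = bow_vector("The striker was critical to the success of the team.")
d = bow_vector("The manager was critical of how the striker played.")
e = bow_vector("The coach fiercely condemned the striker's performance.")

print(round(cosine(c, d), 2))  # C vs D: different clusters, many shared words
print(round(cosine(d, e), 2))  # D vs E: same cluster, few shared words
```

Under these assumptions, sentence C comes out as more similar to D than to E, because similarity is driven by shared surface tokens such as "critical" rather than by meaning.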

Term frequency representation: Running example

The cosine similarity matrix looks like:


  1. “The attacker’s late goal won the match for his side.”

  2. “The striker’s contribution was crucial to the team’s victory.”

  3. “The striker was critical to the success of the team.”

  4. “The manager was critical of how the striker played.”

  5. “The coach fiercely condemned the striker’s performance.”


Recall, it should look like:

Aggregated word embeddings: Intro

Word embeddings can help with sentences that use synonymous but different words, e.g.

  • “The attacker’s late goal won the match for his side.”
  • “The striker’s contribution was crucial to the team’s victory.”

How can we convert the dense representation of each word used…

V1 V2 V3 V4 V5 V6 V7 ... V300
the 0.047 0.213 -0.007 -0.459 -0.036 0.236 -0.288 ... 0.054
attacker -0.557 0.233 -0.157 0.29 0.371 -0.003 -0.201 ... -0.131
late 0.447 -0.413 -0.353 0.69 0.281 -0.114 -0.485 ... 0.067
goal 0.552 1.332 -0.476 0.117 0.129 -0.217 -0.481 ... -0.176
... ... ... ... ... ... ... ... ... ...
side 0.097 0.717 -0.119 -0.408 0.016 -0.224 -0.086 ... 0.095

… into a single dense representation of the whole sentence?

V1 V2 V3 V4 V5 V6 V7 ... V300
Sentence 1 0.007 0.045 0.019 -0.003 0.1 0.105 0.1 ... -0.017

Aggregated word embeddings: Intuition

We saw last week that arithmetic operations on word vectors produce meaningful results…

\[\text{vector(king)} - \text{vector(man)} + \text{vector(woman)} \approx \text{vector(queen)}\]

We could just add up the vectors of all the words in the sentence to represent the whole sentence…

\[\text{vector(The)} + \text{vector(striker)} + \text{vector(was)} + \text{vector(critical)} + ... \approx \text{vector(sentence)}\]

Aggregated word embeddings: Detail

Aggregation methods:

  • Weighted sum: \(\mathbf{d} = \sum_{i=1}^V \mathbf{w}_i \times f_i\)
  • Weighted average: \(\mathbf{d} = (\sum_{i=1}^V \mathbf{w}_i \times f_i)/(\sum_{i=1}^V f_i)\)

Should we sum or average?


Normalisation:

  • The magnitude of a vector is its Euclidean distance from the origin: \(||\mathbf{d}||= \sqrt{\sum_{j=1}^{\mathcal{D}}d_j^2}\)
  • Normalise a vector by dividing by its magnitude: \(\mathbf{d}_{norm} = \frac{\mathbf{d}}{||\mathbf{d}||}\)

Summing will represent sentences with more words further from the origin.


Normalising ensures that every sentence is the same distance from the origin (whether we summed or averaged).
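
A small sketch of why the choice between summing and averaging stops mattering once we normalise. The three-dimensional word vectors are invented for illustration (they are not real GloVe values), and each word is assumed to occur once, so \(f_i = 1\).

```python
import math

def vec_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(dims) for dims in zip(*vectors)]

def vec_mean(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [s / len(vectors) for s in vec_sum(vectors)]

def magnitude(v):
    """Euclidean distance of a vector from the origin."""
    return math.sqrt(sum(x * x for x in v))

def normalise(v):
    """Rescale a vector to have magnitude 1."""
    m = magnitude(v)
    return [x / m for x in v]

# Toy word vectors (made up for illustration).
word_vectors = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2], [-0.1, 0.6, 0.4]]

summed = vec_sum(word_vectors)
averaged = vec_mean(word_vectors)

print(magnitude(summed) > magnitude(averaged))  # the sum lies further from the origin
print(normalise(summed))
print(normalise(averaged))  # same direction, so the same normalised vector
```

Averaging only divides the sum by a positive constant, which changes the magnitude but not the direction, so cosine similarities are unaffected either way.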


Aggregated word embeddings: Running example

The cosine similarity matrix looks like:


  1. “The attacker’s late goal won the match for his side.”

  2. “The striker’s contribution was crucial to the team’s victory.”

  3. “The striker was critical to the success of the team.”

  4. “The manager was critical of how the striker played.”

  5. “The coach fiercely condemned the striker’s performance.”

Recall, it should look like:

The Limits of ‘Bag-of-Words’


  1. Rare / out-of-vocabulary (OOV) words

    • Some words do not appear frequently enough for a good representation to be learned
    • e.g. domain-specific words, unusual prefixes or suffixes, slang, typos
  2. Polysemy: The meaning of words depends on context, e.g.

    • “The striker was critical to the success of the team.”
    • “The manager was critical of the striker.”
  3. Sequential dependencies: The meaning of a sentence changes depending on word order

    • “The striker criticised the manager.”
    • “The manager criticised the striker.”
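
The word-order problem is easy to demonstrate: the two sentences in the last example have identical bags of words. A minimal sketch, assuming simple whitespace tokenisation (the helper name `bag_of_words` is made up for this example):

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case, drop the final full stop, and count whitespace-separated tokens."""
    return Counter(text.lower().rstrip(".").split())

a = bag_of_words("The striker criticised the manager.")
b = bag_of_words("The manager criticised the striker.")

print(a == b)  # True: any bag-of-words representation treats these as the same text
```

Because the counts are identical, any representation built only from those counts (term frequencies or aggregated static embeddings) must assign both sentences the same vector.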

Moving Beyond Bag-of-Words

Subword tokenisation: Motivation

Out of Vocabulary Problem

  • Using GloVe, we can retrieve embeddings for a vocabulary of 400k words.

  • Despite this, there will still be out-of-vocabulary words that we cannot represent:

    • Domain-specific words: “biochemist”, “cryospheric”
    • Misspellings: “strker”, “strikerr”
    • Unusual suffixes: “solutionise”, “wrongitude”

Solutions

  • Expand the vocabulary?
    • Computationally intensive, some words are too rare.
  • Stemming/lemmatisation?
    • Results in information loss
  • Subword tokenisation
    • Break words into meaningful subtokens
    • Exploits information about the internal structure of words

Subword tokenisation: Detail

Common words are left intact; rarer words are decomposed:

  • ‘biologist’ \(\rightarrow\) ‘biologist’
  • ‘biochemist’ \(\rightarrow\) ‘bio’ + ‘##chemist’
  • ‘okay’ \(\rightarrow\) ‘okay’
  • ‘okeydokey’ \(\rightarrow\) ‘ok’ + ‘##ey’ + ‘##do’ + ‘##key’

How this helps:

  • Learn a representation for subword types, rather than word types
  • Representations for rare words can be reconstructed from subword representations
    • vector(bio) + vector(##chemist) = vector(biochemist)
  • Subword type representations (e.g. ‘##chemist’) can be reused to form representations of other infrequent words
    • e.g. astrochemist, electrochemist, photochemist
  • Very rare words can be broken down into character representations
    • e.g. ‘UCL’ \(\rightarrow\) ‘u’ + ‘##c’ + ‘##l’
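
The decompositions above can be reproduced with a greedy longest-match-first split, which is how WordPiece tokenisers segment a word at inference time. The sketch below is illustrative only: `toy_vocab` is a tiny hypothetical vocabulary chosen to reproduce the slide's examples, whereas a real tokeniser learns a vocabulary of tens of thousands of subwords from data.

```python
def wordpiece_tokenise(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (an assumption for illustration).
toy_vocab = {"biologist", "bio", "##chemist", "okay", "ok",
             "##ey", "##do", "##key", "u", "##c", "##l"}

for w in ["biologist", "biochemist", "okeydokey", "ucl"]:
    print(w, "->", wordpiece_tokenise(w, toy_vocab))
```

With a realistic vocabulary that includes every single character, the `[UNK]` fallback is rarely needed, which is why very rare words end up as character-level pieces.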

Subword tokenisation: Example

The decompositions above follow the WordPiece tokeniser: common words are left intact; rarer words are decomposed.

By recombining the embeddings of subtokens, we can meaningfully represent OOV words.

Contextual Word Embeddings: Intro

Recall sentences (c) and (d):

  • “The striker was critical to the success of the team.”
  • “The manager was critical of the striker.”

Static word embeddings represent each word-type with a vector:

The same sentences, with each token numbered by occurrence of its word-type:

  • “The₁ striker₁ was₁ critical₁ to₁ the₂ success₁ of₁ the₃ team₁.”
  • “The₄ manager₁ was₂ critical₂ of₂ the₅ striker₂.”

Contextual word embeddings represent each word-token with a vector:

Contextual Word Embeddings: Intuition



The intuition (Smith 2020):

  • “With hindsight, we can now see that by representing word types independent of context, we were solving a problem that was harder than it needed to be. Because words mean different things in different contexts, we were requiring that type representations capture all of the possibilities. Moving to word token vectors simplifies things, asking the word token representation to capture only what a word means in this context. For the same reasons that the collection of contexts a word type is found in provide clues about its meaning(s), a particular token’s context provides clues about its specific meaning.”


  • Use the static word embeddings of tokens surrounding the target word to update our representation of the target word in context

Contextual Word Embeddings: Intuition


With static embeddings, we can retrieve a word’s vector by ‘looking it up’ in a big embedding matrix (e.g. pretrained GloVe embeddings)…


Contextual Word Embeddings: Intuition


In contrast, a token’s contextual embedding is a function of the static embedding of its own word-type and the static embeddings of the other words in the sentence…


*This is a simplified representation, to be unpacked next week!

Contextual Word Embeddings: Training


  • The function f() is a neural network which is a complex non-linear function (more next week).
  • Similarly to static embeddings, the function f() can be learned through masked language modelling.

    • “I went to the [MASK] and saw a lion.”
    • “What shall we eat for [MASK] ?”
  • Training is very computationally and data intensive.

    • Google: “training … was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.” (Devlin et al. 2018)
  • Fortunately, we can make use of pre-trained models, such as BERT

Contextual Word Embeddings: BERT



BERT: Bidirectional Encoder Representations from Transformers

  • Language model developed by Google in 2018

  • Achieved state-of-the-art performance on a wide range of NLP tasks

  • Among other things, can produce contextual embeddings


Contextual Word Embeddings: Running example


Distinguishing meanings of ‘critical’






  1. “The attacker’s late goal won the match for his side.”

  2. “The striker’s contribution was crucial to the team’s victory.”

  3. “The striker was critical to the success of the team.”

  4. “The manager was critical of how the striker played.”

  5. “The coach fiercely condemned the striker’s performance.”

Contextual Word Embeddings: Applications


Natural Language Processing Applications:

  1. Named entity recognition: Is this token a named entity?

    • “Apple released a new iPhone.”
    • “Apple crumble is delicious.”
  2. Part-of-speech tagging: Which tokens are nouns, verbs, etc.?


Social Science Applications:

  1. Identifying dehumanising language…

Identifying Dehumanising Metaphors (Card et al.)


we measure the extent to which mentions of immigrants in speeches ‘sound like’ a mention of several metaphorical categories that have been previously discussed in the literature on immigration: ‘animals,’ ‘cargo,’ ‘disease,’ ‘flood/tide,’ ‘machines,’ and ‘vermin’

Identifying Dehumanising Metaphors (Card et al.)

Procedure:

  • Original sentence: “… prevent the dumping of undesirable aliens into this country”

  • Masked sentence: “… prevent the dumping of undesirable [MASK] into this country”

  • ‘Cargo’ dictionary: (things, goods, stuff, …)

What is the probability that the masked token is from the Cargo dictionary?
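
One simple reading of this question is to sum the probabilities that the masked language model assigns to each dictionary term at the [MASK] position. The sketch below assumes that reading; the predicted distribution is invented for illustration and does not come from Card et al. or from any real model.

```python
def dictionary_probability(token_probs, dictionary):
    """Total probability mass placed on terms that belong to the dictionary."""
    return sum(p for token, p in token_probs.items() if token in dictionary)

# Hypothetical model output for the [MASK] position (numbers invented for illustration).
predicted = {"goods": 0.30, "things": 0.15, "people": 0.25,
             "aliens": 0.20, "waste": 0.10}

# Dictionary terms listed on the slide.
cargo_dictionary = {"things", "goods", "stuff"}

cargo_prob = dictionary_probability(predicted, cargo_dictionary)
print(round(cargo_prob, 2))
```

A high value suggests that, in this context, the mention of immigrants 'sounds like' a mention of cargo.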

Identifying Dehumanising Metaphors (Card et al.)



Example dehumanising metaphors related to animals identified by Card et al.:


Prob. | Masked term | Context
0.97 | immigrants | … it would be just as unreasonable to claim that we will not lower american standards by admitting to our country [MASK] that are of lower standards than ours as it is to assert that the breeding of the thoroughbred kentucky horses will not be injured by breeding them with texas mustangs.
0.66 | aliens | it establishes a positive framework to prevent illegal [MASK] from feeding at the public trough.

The story so far


  1. We started with static word embeddings.
  2. Then we saw how we could aggregate these by summing and normalising.
  3. Then we saw how we could use subword tokenisation to capture OOV words.
  4. Then we introduced contextual word embeddings, to better capture meanings in context.
  5. Could we now aggregate the contextual embeddings by summing and normalising?


Sentence-BERT Embeddings: Intro


Sentence-BERT embeddings aggregate contextual embeddings in a way that is optimised for semantic similarity…


*This is a simplified representation, to be unpacked next week!

Sentence-BERT Embeddings: Details


What does g() do?

  1. Pooling across tokens (e.g. averaging)
  2. Refines representation for sentence similarity tasks using a feed forward neural network.

Why not just pool? Why bother with (2)?

  • Helps avoid anisotropy (observations overly concentrated in one part of the space)
  • Better discrimination between near and distant pairs

Training:

  • The f() components are trained with masked language modelling
  • The g() component is trained on human annotated and naturally annotated sentence pairs

Sentence-BERT Embeddings: Training data examples


Human annotated: Stanford Natural Language Inference dataset (570k pairs)

Text | Hypothesis | Label
A man inspects the uniform of a figure in some East Asian country. | The man is sleeping | Contradiction
An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | Neutral
A soccer game with multiple males playing. | Some men are playing a sport. | Entailment


Naturally labelled: WikiAnswers duplicates (77m pairs tagged by WikiAnswers users)

Question 1 | Question 2 | Label
What is population of muslims in india? | How many muslims make up indias 1 billion population? | Duplicate
How can you tell if you have the flu? | What are signs of the flu? | Duplicate

Sentence-BERT Embeddings: Loss function(s)


  • Mean squared error loss: \(\frac{1}{n}\sum_{i=1}^n (\text{paraphrase}(a_i,b_i) - \text{cosine}(s_{a_i}, s_{b_i}))^2\)

    • \(\text{paraphrase}(a_i,b_i)=1\) if \(a_i\) and \(b_i\) paraphrase each other, 0 if not.


  • Contrastive learning: \(\text{max}(\text{cos}(s_{anchor}, s_{negative}) - \text{cos}(s_{anchor},s_{positive}) + \epsilon, 0)\)

    • Uses an anchor sentence, a positive sentence (which paraphrases the anchor) and a negative sentence (which does not paraphrase the anchor)
    • Aims to increase the relative distance between the anchor and negative, compared to the anchor and positive.
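
Both losses can be sketched in a few lines. This is an illustration on toy two-dimensional vectors, not the training code of any particular model. The margin value 0.5 is a hypothetical choice for \(\epsilon\), and the triplet loss is written so that it reaches zero once the positive is more similar to the anchor than the negative by at least the margin.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mse_loss(pairs, labels):
    """Mean squared error between paraphrase labels (1 or 0) and cosine similarities."""
    return sum((y - cosine(a, b)) ** 2 for (a, b), y in zip(pairs, labels)) / len(pairs)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Penalise triplets where the negative is not at least `margin` less similar than the positive."""
    return max(cosine(anchor, negative) - cosine(anchor, positive) + margin, 0.0)

anchor, positive, negative = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]

print(mse_loss([(anchor, positive), (anchor, negative)], [1, 0]))  # well-placed pairs
print(triplet_loss(anchor, positive, negative))   # well-separated triplet
print(triplet_loss(anchor, negative, positive))   # roles swapped: large penalty
```

In training, the gradient of either loss with respect to the embeddings is what nudges paraphrases together and non-paraphrases apart.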

Sentence-BERT Embeddings: Pre-trained models


There are lots of models to choose from…

Sentence-BERT Embeddings: Running Example

The cosine similarity matrix looks like:



  1. “The attacker’s late goal won the match for his side.”

  2. “The striker’s contribution was crucial to the team’s victory.”

  3. “The striker was critical to the success of the team.”

  4. “The manager was critical of how the striker played.”

  5. “The coach fiercely condemned the striker’s performance.”

Pretty similar to what we’re aiming for:

Application I: Clustering

Motivating Problem Revisited


The task: to identify what ‘claims’ environmental protest groups make.


Problem: there are too many to read! (1m tweets)


Could solve with a topic model?

  • Texts are very short (10-40 words)
  • Interested in claims not topics
  • Sentence embeddings achieve more coherent clusters!

Method:

  1. Encode tweets as sentence embeddings.

  2. Cluster using k-means clustering

  3. Label clusters (from most to least dense)

Step 1: Encode tweets as sentence embeddings


How to generate sentence embeddings with 7 lines of Python code…


# Import libraries
import pandas as pd
from sentence_transformers import SentenceTransformer

# Import tweets to be encoded
texts_df = pd.read_csv('tweets_to_be_encoded.csv')
tweets = texts_df['processed_tweet'].tolist()  # pass a plain list of strings to encode()

# Download your preferred sentence-BERT model
model_preferred = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the tweets as sentence embeddings
tweet_embeddings = model_preferred.encode(tweets, show_progress_bar=True)

# Export as a csv file
pd.DataFrame(tweet_embeddings).to_csv('tweet_embeddings.csv')

Step 2: k-means Clustering

Properties:

  • Minimises within cluster sum of squares.
  • Clusters represented by centroid.


Algorithm:

  1. Randomly initialise k centroids.
  2. Assign each observation to the cluster with the nearest centroid.
  3. Calculate new centroids for each cluster.
  4. Repeat 2-3 until convergence.
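
The four steps can be sketched in plain Python. This is a minimal teaching version under stated assumptions, not what a production library does: initial centroids are drawn from the observations themselves, convergence means the assignments stop changing, and an empty cluster keeps its previous centroid. The `init` argument is a convenience added here so that the example is reproducible.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: initialise k centroids (here: k observations chosen at random).
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(c) for c in start]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2, init=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

With sentence embeddings, each point would be a normalised embedding vector rather than a two-dimensional toy coordinate.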


Step 3: Label clusters


Within each cluster:

  1. Calculate distance of each tweet from its centroid

  2. Select the \(m\) nearest as ‘representative examples’
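
These two steps amount to sorting a cluster's members by distance from the centroid and keeping the first \(m\). A minimal sketch, with toy two-dimensional 'embeddings' and placeholder texts standing in for real tweets (both are assumptions for illustration):

```python
def nearest_to_centroid(embeddings, texts, centroid, m):
    """Return the m texts whose embeddings lie closest to the centroid."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, centroid))
    ranked = sorted(zip(embeddings, texts), key=lambda pair: sq_dist(pair[0]))
    return [text for _, text in ranked[:m]]

# Toy cluster: three members and their centroid (made up for illustration).
embeddings = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
texts = ["tweet a", "tweet b", "tweet c"]
centroid = (2.0, 2.0)

representatives = nearest_to_centroid(embeddings, texts, centroid, m=2)
print(representatives)
```

The selected texts are then read by a human (or passed to another labelling method) to name the cluster.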

Example cluster:

Label: Fracking causes methane leaks
Fracking leaks methane into atmosphere, which is 80x more harmful to climate than CO2 - remind me why GOVUK wants to exploit more FFs?
methane is a potent greenhouse gas related to fracking, yet rarely gets a mention. Thanks to James Hansen for this timely intervention:
Fracking's Methane Leakage To Be Focus of Many Studies This Year
Methane leakage isn't as bad as previously thought in the US's big fracking areas
DeSmogBlog US: "Methane Leaks Wipe Out Any Climate Benefit Of Fracking, Satellite Observations Confirm"


Example claims

Claims 1-10
Oppose shale/fracking
Protect UK wildlife
Support sustainable fishing
Support rapid carbon emissions cuts
Climate change has devastating effects
Oppose airport expansion
Support transition to renewables
Tackling climate change makes economic sense
Support green growth
Oppose new oil and gas
Claims 11-20
Protect the Amazon
Palm oil causes deforestation
Investments in renewables make economic sense
Oppose trade deals that reduce environmental standards
Promote biodiversity
Pesticides harm bees
Support sustainable/organic farming
Renewables are the future
Oppose disposable coffee cups
Demand action on air pollution
Claims 21-30
Invest more in (green) public transport
Climate change linked to biodiversity loss
Protect our oceans
Coal less efficient than renewables
Protect endangered species
Oppose fossil fuel lobbying
Fracking causes earthquakes
Oppose nuclear weapons
Oppose fossil fuel company sponsorship / greenwashing
Air pollution harms health

Comparing protest groups

Reclaim the Power top claims
Oppose shale/fracking
Oppose fossil fuel use
Fracking incompatible with climate goals
Public opposed to fracking
Fracking harms health and environment
Fracking does not make economic sense
Oppose 'dash for gas'
Fracking linked to methane leaks
Fracking causes earthquakes
Oppose fossil fuel lobbying
Greenpeace top claims
Protect our oceans
Plastic pollution harms oceans
Protect marine life
Soy production harms environment
Support bottle deposit scheme
Eat less meat
Oppose disposable coffee cups
Demand action on plastics
Protect the Amazon
Protect arctic/antarctic
XR top claims
Time is running out for climate action
Demand climate action from government
Climate change has devastating effects
Save the planet
We are heading for extinction
Address climate change for children's sake
Oppose unsustainable fashion
Promote biodiversity
Climate change hits poorest hardest
Support net zero

Alternative clustering algorithms


  • K-means is designed for spherical clusters and struggles with other shapes.
  • Alternative: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

    1. Distinguish between core and non-core observations based on their proximity to neighbours.

    2. Combine core observations that are near to each other.

    3. Assign non-core observations to the nearest cluster if within \(\epsilon\); otherwise remove them as outliers.
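
A compact sketch of those three steps in plain Python. It makes some simplifying assumptions: an observation counts as its own neighbour, `min_pts` is the hypothetical name for the minimum neighbourhood size that makes an observation 'core', and a non-core observation joins the first cluster that reaches it (as in the standard algorithm) rather than strictly the nearest one.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for outliers."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    neigh = [neighbours(i) for i in range(len(points))]
    # Step 1: a core observation has at least min_pts neighbours within eps.
    core = [len(n) >= min_pts for n in neigh]
    labels = [-1] * len(points)
    cluster = 0
    for i in range(len(points)):
        if not core[i] or labels[i] != -1:
            continue
        # Step 2: grow a cluster by linking core observations within eps of each other.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neigh[current]:
                if labels[j] == -1:
                    labels[j] = cluster      # step 3: non-core neighbours join too
                    if core[j]:
                        frontier.append(j)   # only core observations keep expanding
        cluster += 1
    return labels  # anything still labelled -1 is treated as an outlier

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not fixed in advance, and isolated observations are left unassigned instead of being forced into a cluster.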


Alternative labelling strategies


  1. Manual inspection
  2. Class-based tf-idf
    • Identify keywords associated with each cluster using:
    • \(||tf_{x,c}|| \times \log(1+\frac{A}{f_x})\)
    • \(tf_{x,c}\) = term frequency of word \(x\) in class \(c\)
    • \(A\) = average number of words per class
    • \(f_x\) = frequency of word \(x\) across all classes
  3. Generative language models
    • Prompt with representative examples from cluster
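
The class-based tf-idf weighting in the list above can be sketched as follows: each word's within-class frequency is multiplied by a log term that shrinks for words common across all classes. The sketch assumes that \(||tf_{x,c}||\) means the term frequency normalised by the number of words in the class, and the two toy 'clusters' are invented for illustration.

```python
import math
from collections import Counter

def class_tfidf(class_docs):
    """class_docs maps a cluster label to the list of tokens from all of its documents."""
    counts = {c: Counter(tokens) for c, tokens in class_docs.items()}
    # f_x: frequency of word x across all classes.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of words per class.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {
            # normalised tf (an assumption) times the class-based idf term
            x: (tf / n_words) * math.log(1 + avg_words / total[x])
            for x, tf in counter.items()
        }
    return scores

# Toy clusters (made up for illustration).
class_docs = {
    "fracking": "fracking leaks methane fracking climate".split(),
    "oceans": "plastic harms oceans plastic climate".split(),
}

scores = class_tfidf(class_docs)
for c, word_scores in scores.items():
    print(c, sorted(word_scores, key=word_scores.get, reverse=True))
```

Words that are frequent within one cluster but rare elsewhere rise to the top, while a word shared by every cluster (here 'climate') is pushed down.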

BERTopic

BERTopic is a unified pipeline (in Python) for topic modelling using vector representations of texts


[Figure: the modular BERTopic pipeline, with example options at each stage]

  • Embeddings: SBERT, SpaCy, Transformers
  • Dimensionality Reduction: UMAP, PCA, TruncatedSVD
  • Clustering: HDBSCAN, k-Means, BIRCH
  • Tokenizer: CountVectorizer, Jieba, POS
  • Weighting scheme: c-TF-IDF, c-TF-IDF + BM25, c-TF-IDF + Normalization
  • Representation Tuning (optional): GPT / T5, KeyBERT, MMR

Example configurations shown in the figure:

  • SpaCy, PCA, k-Means, CountVectorizer, c-TF-IDF
  • TF-IDF, TruncatedSVD, BIRCH, CountVectorizer, c-TF-IDF + BM25, GPT

Application II: Information Retrieval

Does the media repeat protest groups’ claims?


Search media corpus for sentences with high cosine similarity to the recovered claims.


  1. Use a two sentence sliding window to segment news articles

  2. Encode each segment in the same embedding space

  3. Measure the cosine similarity between each segment and each claim

  4. Assign segment to claim if \(\text{cos}(claim, segment) > \tau\)
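
Steps 1, 3 and 4 can be sketched without a model by assuming that the segments and claims have already been encoded (step 2). The window is assumed to advance one sentence at a time, the toy two-dimensional vectors stand in for real sentence embeddings, and the threshold \(\tau = 0.8\) is a hypothetical value, not one taken from the study.

```python
import math

def sliding_windows(sentences, size=2):
    """Overlapping segments of `size` consecutive sentences (stride of one sentence)."""
    if len(sentences) <= size:
        return [" ".join(sentences)]
    return [" ".join(sentences[i:i + size]) for i in range(len(sentences) - size + 1)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def match_claims(segment_vecs, claim_vecs, tau):
    """Return (segment index, claim index) pairs whose cosine similarity exceeds tau."""
    return [(i, j)
            for i, s in enumerate(segment_vecs)
            for j, c in enumerate(claim_vecs)
            if cosine(s, c) > tau]

segments = sliding_windows(["First sentence.", "Second sentence.", "Third sentence."])
print(segments)

# Toy embeddings (invented for illustration): two segments and one claim.
segment_vecs = [[1.0, 0.0], [0.0, 1.0]]
claim_vecs = [[1.0, 0.1]]
print(match_claims(segment_vecs, claim_vecs, tau=0.8))
```

Overlapping windows mean a claim that straddles a sentence boundary still falls entirely inside at least one segment.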

Face Validity

Ecocide is a crime

cos_sim | statement
0.82 | If widespread or systematic destruction of the environment ("ecocide") is listed as a crime against humanity, the international community would have a responsibility to prevent and punish that activity. The severity of the categorisation...
0.82 | A global campaign to make "ecocide" a crime under international law is to be launched tomorrow in an attempt to outlaw the worst kinds of environmental destruction. A grassroots movement called End Ecocide on Earth is seeking to have the...
0.79 | A grassroots movement called End Ecocide on Earth is seeking to have the wholesale destruction of ecosystems ranked alongside offences such as genocide and war crimes. The International Criminal Court (ICC) would then be able to prosecut...

Fracking causes earthquakes

cos_sim | statement
0.9 | "Within a day of Cuadrilla restarting fracking in Lancashire, there has already been another earthquake which means they've had to down tools," said Friends of the Earth campaigner Tony Bosworth. "It appears that they cannot frack withou...
0.9 | Page 2 2 The government has rejected an energy company's request to relax rules on earthquakes caused by fracking, despite claims that the limits could prevent it testing Britain's shale gas potential. Cuadrilla has caused nearly 30 trem...
0.9 | Cuadrilla caused what is described as a "micro-seismic event" measuring 1.1 on the Richter scale at Preston New Road in Lancashire yesterday, the strongest of 27 tremors since it resumed fracking two weeks ago. Under the government's "tr...

Support net zero

cos_sim | statement
0.88 | "In the midst of a climate emergency, people across the UK are sending a clear message to the government that we need further and faster action to protect our environment and safeguard our planet for the future. "We were pleased to see g...
0.87 | The Government's pledge to meet a net-zero target by 2050 is not a moment too soon. It must be commended for this bold commitment, echoing how the Climate Change Act gave the UK a world-leading role in tackling the issue that endangers t...
0.87 | Reaching net zero by 2050 is an ambitious target, but it is crucial that we achieve it to ensure we protect our planet for future generations." The Government said it would retain the ability to use international carbon credits, which al...

Validation

How well does cosine similarity capture agreement?

Does the media repeat protest groups’ claims?


Summary

Recap


Moved beyond bag-of-words in three ways:

  1. Subword tokenisation
  2. Contextual embeddings
  3. Sentence-BERT embeddings


Applications for:

  1. Clustering
  2. Information retrieval

Next week



  1. What is happening under the hood of BERT and other neural language models?


  2. How can we use neural language models directly for more precise classification?