Last week, we moved from sparse to dense representations of words
\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & ...& 0 \end{bmatrix} \end{align}\]
\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0.63 & 0.14 & 0.02 & -0.58 & ...& -0.66 \end{bmatrix} \end{align}\]
This came with several advantages:
Can we do the same thing now with sequences of words?
Do protest groups’ demands get replicated in the media?
Just Stop Oil activist:
“This has sparked millions upon millions of conversations worldwide and I think that if even a tenth of those conversations mentioned new licences for oil and gas, that’s worth it.”
How can we test this?
Data:
Corpus of approx. 1 million tweets by UK environmental protest groups
Corpus of approx. 130,000 UK news articles about the environment
Measurement tasks:
Cluster the tweets to discover the claims
Find new instances of claims in the news
Recall that word embeddings mapped synonymous or interchangeable words close together.
What is a meaningful semantic space for sentences?
Paraphrasings
Entailment/contradiction
Question-answer relevance
Which of these sentences ought to be clustered together?
A. “The attacker’s late goal won the match for his side.”
B. “The striker’s contribution was crucial to the team’s victory.”
C. “The striker was critical to the success of the team.”
D. “The manager was critical of how the striker played.”
E. “The coach fiercely condemned the striker’s performance.”
The ‘correct’ clustering:
A, B and C all attribute credit to the striker
D and E both criticise the striker
We should expect a cosine similarity matrix that looks like:
Bag-of-words approaches
Term frequency representation
Aggregate word embeddings
Beyond bag-of-words
Subword tokenisations
Contextualised embeddings
Sentence embeddings
Applications
Clustering
Information retrieval
Represent each document with a vector of length \(V\) (number of words in vocabulary), representing the count of each unique word-type.
```
Document-feature matrix of: 5 documents, 27 features (67.41% sparse) and 0 docvars.
       features
docs    the attack late goal won match for his side .
  text1   2      1    1    1   1     1   1   1    1 1
  text2   2      0    0    0   0     0   0   0    0 1
  text3   3      0    0    0   0     0   0   0    0 1
  text4   2      0    0    0   0     0   0   0    0 1
  text5   2      0    0    0   0     0   0   0    0 1
[ reached max_nfeat ... 17 more features ]
```
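The same document-feature construction can be sketched in a few lines of Python (the example sentences are from the slides; the whitespace tokeniser here is a toy stand-in for what a library like quanteda does with punctuation and clitics):

```python
from collections import Counter

# The five example sentences from the slides
docs = [
    "The attacker's late goal won the match for his side.",
    "The striker's contribution was crucial to the team's victory.",
    "The striker was critical to the success of the team.",
    "The manager was critical of how the striker played.",
    "The coach fiercely condemned the striker's performance.",
]

# Naive whitespace tokeniser (real tools also split punctuation and clitics)
tokenised = [d.lower().replace(".", " .").split() for d in docs]

# Vocabulary = all unique word-types across the corpus
vocab = sorted({tok for doc in tokenised for tok in doc})

# Document-feature matrix: one count vector of length V per document
dfm = [[Counter(doc)[w] for w in vocab] for doc in tokenised]
```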
The cosine similarity matrix looks like:
“The attacker’s late goal won the match for his side.”
“The striker’s contribution was crucial to the team’s victory.”
“The striker was critical to the success of the team.”
“The manager was critical of how the striker played.”
“The coach fiercely condemned the striker’s performance.”
Recall, it should look like:
Word embeddings can help with sentences that use synonymous but different words, e.g.
How can we convert the dense representation of each word used…
 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | ... | V300
---|---|---|---|---|---|---|---|---|---
the | 0.047 | 0.213 | -0.007 | -0.459 | -0.036 | 0.236 | -0.288 | ... | 0.054 |
attacker | -0.557 | 0.233 | -0.157 | 0.29 | 0.371 | -0.003 | -0.201 | ... | -0.131 |
late | 0.447 | -0.413 | -0.353 | 0.69 | 0.281 | -0.114 | -0.485 | ... | 0.067 |
goal | 0.552 | 1.332 | -0.476 | 0.117 | 0.129 | -0.217 | -0.481 | ... | -0.176 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
side | 0.097 | 0.717 | -0.119 | -0.408 | 0.016 | -0.224 | -0.086 | ... | 0.095 |
… into a single dense representation of the whole sentence?
 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | ... | V300
---|---|---|---|---|---|---|---|---|---
Sentence 1 | 0.007 | 0.045 | 0.019 | -0.003 | 0.1 | 0.105 | 0.1 | ... | -0.017 |
We saw last week that arithmetic operations on word vectors produce meaningful results…
\[\text{vector(king)} - \text{vector(man)} + \text{vector(woman)} \approx \text{vector(queen)}\]
We could just add up all the words in the sentence to represent the whole sentence…
\[\text{vector(The)} + \text{vector(striker)} + \text{vector(was)} + \text{vector(critical)} + ... \approx \text{vector(sentence)}\]
Aggregation methods:
Should we sum or average?
Normalisation:
Summing places sentences with more words further from the origin.
Normalising to unit length ensures that every sentence lies the same distance from the origin (whether we summed or averaged).
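A minimal sketch of this aggregation step, using random vectors as a stand-in for pretrained embeddings (GloVe would give 300-dimensional vectors; here everything except the aggregation logic is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a pretrained embedding matrix (e.g. GloVe)
embeddings = {w: rng.normal(size=4) for w in ["the", "striker", "was", "critical"]}

def sentence_vector(tokens, how="mean"):
    """Aggregate word vectors by summing or averaging, then L2-normalise."""
    vecs = np.stack([embeddings[t] for t in tokens if t in embeddings])
    agg = vecs.sum(axis=0) if how == "sum" else vecs.mean(axis=0)
    return agg / np.linalg.norm(agg)  # unit length

s_sum = sentence_vector(["the", "striker", "was", "critical"], how="sum")
s_mean = sentence_vector(["the", "striker", "was", "critical"], how="mean")
```

Note that after normalisation the summed and averaged versions coincide: the mean is the sum divided by the sentence length, and that scaling is exactly what normalisation removes.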
The cosine similarity matrix looks like:
“The attacker’s late goal won the match for his side.”
“The striker’s contribution was crucial to the team’s victory.”
“The striker was critical to the success of the team.”
“The manager was critical of how the striker played.”
“The coach fiercely condemned the striker’s performance.”
Recall, it should look like:
Rare / Out-of-vocabulary words (OOV)
Polysemy: The meaning of words depends on context, e.g. ‘critical’ can mean crucial (as in sentence C) or disapproving (as in sentence D)
Sequential dependencies: The meaning of a sentence changes depending on word order
Out of Vocabulary Problem
Using GloVe, we can retrieve embeddings for a vocabulary of 400k words.
Despite this, there will still be out-of-vocabulary words that we cannot represent.
Solutions
Wordpiece tokeniser: Common words are left intact; rarer words are decomposed:
By recombining the embeddings of subtokens, we can meaningfully represent OOV words.
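The WordPiece idea can be sketched with a greedy longest-match-first tokeniser over a subword vocabulary (the mini-vocabulary below is invented for illustration; BERT's real vocabulary has ~30k pieces):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword tokenisation (WordPiece-style).
    Continuation pieces are prefixed with '##', as in BERT's tokeniser."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece of the remainder is in the vocabulary
    return pieces

# Hypothetical mini-vocabulary: 'fracking' is OOV but decomposable
vocab = {"frack", "##ing", "goal", "the"}
```

Common words like ‘goal’ survive intact, while an OOV word like ‘fracking’ is decomposed into pieces whose embeddings can be recombined.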
Recall sentences C and D:
Static word embeddings represent each word-type with a vector:
Recall sentences C and D:
Contextual word embeddings represent each word-token with a vector:
The intuition (Smith 2020):
With static embeddings, we can retrieve a word’s vector by ‘looking it up’ in a big embedding matrix (e.g. pretrained GloVe embeddings)…
In contrast, a token’s contextual embedding is a function of the static embedding of its own word-type and the static embeddings of the other words in the sentence…
*This is a simplified representation, to be unpacked next week!
Similarly to static embeddings, the function f() can be learned through masked language modelling.
Training is very computationally and data intensive.
BERT: Bidirectional Encoder Representations from Transformers
Language model developed by Google in 2018
Achieved state-of-the-art performance on a wide range of NLP tasks
Among other things, can produce contextual embeddings
Distinguishing meanings of ‘critical’
“The attacker’s late goal won the match for his side.”
“The striker’s contribution was crucial to the team’s victory.”
“The striker was critical to the success of the team.”
“The manager was critical of how the striker played.”
“The coach fiercely condemned the striker’s performance.”
Natural Language Processing Applications:
Named entity recognition: Is this token a named entity?
Social Science Applications:
“we measure the extent to which mentions of immigrants in speeches ‘sound like’ a mention of several metaphorical categories that have been previously discussed in the literature on immigration: ‘animals,’ ‘cargo,’ ‘disease,’ ‘flood/tide,’ ‘machines,’ and ‘vermin’”
Procedure:
Original sentence: “… prevent the dumping of undesirable aliens into this country”
Masked sentence: “… prevent the dumping of undesirable [MASK] into this country”
‘Cargo’ dictionary: (things, goods, stuff, …)
What is the probability that the masked token is from the Cargo dictionary?
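The final step reduces to summing probability mass over the dictionary terms. A toy sketch, where the distribution over the [MASK] token is invented for illustration (in the real procedure it would come from a masked language model such as BERT):

```python
# Hypothetical predicted distribution over the [MASK] token, as a masked
# language model would return (these probabilities are invented)
mask_probs = {"aliens": 0.40, "things": 0.20, "goods": 0.15,
              "people": 0.15, "stuff": 0.05, "animals": 0.05}

# 'Cargo' dictionary from the slides
cargo_dict = {"things", "goods", "stuff"}

# Probability that the masked token is from the Cargo dictionary:
p_cargo = sum(p for term, p in mask_probs.items() if term in cargo_dict)
```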
Example dehumanising metaphors related to animals identified by Card et al.
Prob. | Masked term | Context |
---|---|---|
0.97 | immigrants | … it would be just as unreasonable to claim that we will not lower american standards by admitting to our country [MASK] that are of lower standards than ours as it is to assert that the breeding of the thoroughbred kentucky horses will not be injured by breeding them with texas mustangs. |
0.66 | aliens | it establishes a positive framework to prevent illegal [MASK] from feeding at the public trough. |
We started with static word embeddings…
Then we saw how we could aggregate these by summing and normalising…
Then we saw how we could use subword tokenisation to capture OOV words…
Then we introduced contextual word embeddings, to better capture meanings in context…
We could now aggregate the contextual embeddings by summing and normalising?
Sentence-BERT embeddings aggregate contextual embeddings in a way that is optimised for semantic similarity…
*This is a simplified representation, to be unpacked next week!
What does g() do?
Why not just pool? Why bother with (2)?
Training:
Human annotated: Stanford Natural Language Inference dataset (570k pairs)
Text | Hypothesis | Label |
---|---|---|
A man inspects the uniform of a figure in some East Asian country. | The man is sleeping | Contradiction |
An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | Neutral |
A soccer game with multiple males playing. | Some men are playing a sport. | Entailment |
Naturally labelled: WikiAnswers duplicates (77m pairs tagged by WikiAnswers users)
Question 1 | Question 2 | Label |
---|---|---|
What is population of muslims in india? | How many muslims make up indias 1 billion population? | Duplicate |
How can you tell if you have the flu? | What are signs of the flu? | Duplicate |
Mean squared error loss: \(\frac{1}{n}\sum_{i=1}^n (\text{paraphrase}(a_i,b_i) - \text{cosine}(s_{a_i}, s_{b_i}))^2\)
Contrastive learning: \(\text{max}(\text{cos}(s_{anchor}, s_{negative}) - \text{cos}(s_{anchor},s_{positive}) + \epsilon, 0)\)
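Both objectives can be sketched in NumPy. The contrastive (triplet-style) loss is zero once the anchor is more similar to the positive than to the negative by at least the margin \(\epsilon\); this is a sketch of the objective only, not of how Sentence-BERT is actually trained end to end:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mse_loss(labels, sims):
    """Regression objective: mean squared error between the paraphrase
    label and the predicted cosine similarity of each pair."""
    labels, sims = np.asarray(labels), np.asarray(sims)
    return float(np.mean((labels - sims) ** 2))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Penalise the model unless the anchor is more similar to the
    positive than to the negative by at least the margin."""
    return max(cos(anchor, negative) - cos(anchor, positive) + margin, 0.0)
```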
There are lots of models to choose from…
The cosine similarity matrix looks like:
“The attacker’s late goal won the match for his side.”
“The striker’s contribution was crucial to the team’s victory.”
“The striker was critical to the success of the team.”
“The manager was critical of how the striker played.”
“The coach fiercely condemned the striker’s performance.”
Pretty similar to what we’re aiming for:
The task: to identify what ‘claims’ environmental protest groups make.
Problem: there are too many to read! (1m tweets)
Could solve with a topic model?
The task: to identify what ‘claims’ environmental protest groups make.
Problem: there are too many to read! (1m tweets)
Method:
Encode tweets as sentence embeddings.
Cluster using k-means clustering
Label clusters (from most to least dense)
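The clustering step can be sketched with a minimal implementation of Lloyd's algorithm (in practice you would use a library such as scikit-learn; the blob data here is a hypothetical stand-in for tweet embeddings):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs standing in for tweet embeddings
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
labels, centroids = kmeans(X, k=2)
```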
How to generate sentence embeddings with 7 lines of Python code…
```python
# Import libraries
import pandas as pd
from sentence_transformers import SentenceTransformer

# Import tweets to be encoded
texts_df = pd.read_csv('tweets_to_be_encoded.csv')
tweets = texts_df['processed_tweet'].tolist()

# Download your preferred sentence-BERT model
model_preferred = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the tweets as sentence embeddings
tweet_embeddings = model_preferred.encode(tweets, show_progress_bar=True)

# Export as a csv file
pd.DataFrame(tweet_embeddings).to_csv('tweet_embeddings.csv')
```
Properties:
Algorithm:
Within each cluster:
Calculate distance of each tweet from its centroid
Select the \(m\) nearest as ‘representative examples’
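The two steps above amount to a nearest-to-centroid lookup. A sketch with hypothetical 2-d points standing in for tweet embeddings:

```python
import numpy as np

def representatives(X, centroid, m=3):
    """Return the indices of the m points nearest to the cluster centroid."""
    dists = np.linalg.norm(X - centroid, axis=1)
    return np.argsort(dists)[:m]

# Toy cluster of 'tweet embeddings' (invented 2-d points)
X = np.array([[0.0, 0.0], [0.1, 0.1], [2.0, 2.0], [0.2, 0.0], [3.0, 1.0]])
centroid = X.mean(axis=0)
idx = representatives(X, centroid, m=2)
```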
Example cluster:
Label: Fracking causes methane leaks |
---|
Fracking leaks methane into atmosphere, which is 80x more harmful to climate than CO2 - remind me why GOVUK wants to exploit more FFs? |
methane is a potent greenhouse gas related to fracking, yet rarely gets a mention. Thanks to James Hansen for this timely intervention: |
Fracking's Methane Leakage To Be Focus of Many Studies This Year |
Methane leakage isn't as bad as previously thought in the US's big fracking areas |
DeSmogBlog US: "Methane Leaks Wipe Out Any Climate Benefit Of Fracking, Satellite Observations Confirm" |
Claims 1-10 |
---|
Oppose shale/fracking |
Protect UK wildlife |
Support sustainable fishing |
Support rapid carbon emissions cuts |
Climate change has devastating effects |
Oppose airport expansion |
Support transition to renewables |
Tackling climate change makes economic sense |
Support green growth |
Oppose new oil and gas |
Claims 11-20 |
---|
Protect the Amazon |
Palm oil causes deforestation |
Investments in renewables make economic sense |
Oppose trade deals that reduce environmental standards |
Promote biodiversity |
Pesticides harm bees |
Support sustainable/organic farming |
Renewables are the future |
Oppose disposable coffee cups |
Demand action on air pollution |
Claims 21-30 |
---|
Invest more in (green) public transport |
Climate change linked to biodiversity loss |
Protect our oceans |
Coal less efficient than renewables |
Protect endangered species |
Oppose fossil fuel lobbying |
Fracking causes earthquakes |
Oppose nuclear weapons |
Oppose fossil fuel company sponsorship / greenwashing |
Air pollution harms health |
Reclaim the power top claims |
---|
Oppose shale/fracking |
Oppose fossil fuel use |
Fracking incompatible with climate goals |
Public opposed to fracking |
Fracking harms health and environment |
Fracking does not make economic sense |
Oppose 'dash for gas' |
Fracking linked to methane leaks |
Fracking causes earthquakes |
Oppose fossil fuel lobbying |
Greenpeace top claims |
---|
Protect our oceans |
Plastic pollution harms oceans |
Protect marine life |
Soy production harms environment |
Support bottle deposit scheme |
Eat less meat |
Oppose disposable coffee cups |
Demand action on plastics |
Protect the Amazon |
Protect arctic/antarctic |
XR top claims |
---|
Time is running out for climate action |
Demand climate action from government |
Climate change has devastating effects |
Save the planet |
We are heading for extinction |
Address climate change for children's sake |
Oppose unsustainable fashion |
Promote biodiversity |
Climate change hits poorest hardest |
Support net zero |
Distinguish between core and non-core observations based on their proximity to neighbours.
Combine core observations that are near to each other.
Assign non-core observations to the nearest cluster if they lie within \(\epsilon\); otherwise discard them as outliers.
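The three steps above can be sketched as a minimal DBSCAN-style implementation (a teaching sketch; in practice you would use a library implementation such as scikit-learn's):

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Minimal DBSCAN sketch: find core points, merge nearby cores into
    clusters, then attach non-core points within eps; the rest are
    labelled -1 (outliers)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    core = (dist <= eps).sum(axis=1) >= min_pts  # neighbour count incl. self
    labels = np.full(n, -1)
    cluster = 0
    for i in np.flatnonzero(core):
        if labels[i] != -1:
            continue
        stack = [i]           # grow a cluster by chaining nearby core points
        labels[i] = cluster
        while stack:
            j = stack.pop()
            for k in np.flatnonzero((dist[j] <= eps) & core):
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    # attach non-core points to the nearest cluster if within eps
    for i in np.flatnonzero(~core):
        near = np.flatnonzero((dist[i] <= eps) & core)
        if near.size:
            labels[i] = labels[near[dist[i, near].argmin()]]
    return labels

# Toy data: a tight cluster of four points plus one outlier
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
labels = dbscan(X, eps=0.3, min_pts=3)
```

Unlike k-means, the number of clusters is not fixed in advance, and points that sit far from any dense region are left unassigned rather than forced into a cluster.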
BERTopic is a unified pipeline (in Python) for topic modelling using vector representations of texts.
Search media corpus for sentences with high cosine similarity to the recovered claims.
Use a two sentence sliding window to segment news articles
Encode each segment in the same embedding space
Measure the cosine similarity between each segment and each claim
Assign segment to claim if \(\text{cos}(claim, segment) > \tau\)
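The matching step can be sketched as a thresholded cosine-similarity lookup (the 2-d embeddings below are invented for illustration; real claim and segment embeddings would come from the same sentence-BERT model):

```python
import numpy as np

def assign_segments(claims, segments, tau=0.8):
    """Assign each segment to every claim whose cosine similarity
    exceeds tau. Rows are unit-normalised so the matrix product of the
    two sets gives all pairwise cosine similarities at once."""
    C = claims / np.linalg.norm(claims, axis=1, keepdims=True)
    S = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    sims = S @ C.T  # shape: (n_segments, n_claims)
    return [np.flatnonzero(row > tau) for row in sims]

# Hypothetical embeddings: segment 0 matches claim 0, segment 1 matches nothing
claims = np.array([[1.0, 0.0], [0.0, 1.0]])
segments = np.array([[0.9, 0.1], [0.5, 0.5]])
matches = assign_segments(claims, segments, tau=0.8)
```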
Ecocide is a crime
cos_sim | statement |
---|---|
0.82 | If widespread or systematic destruction of the environment ("ecocide") is listed as a crime against humanity, the international community would have a responsibility to prevent and punish that activity. The severity of the categorisation... |
0.82 | A global campaign to make "ecocide" a crime under international law is to be launched tomorrow in an attempt to outlaw the worst kinds of environmental destruction. A grassroots movement called End Ecocide on Earth is seeking to have the... |
0.79 | A grassroots movement called End Ecocide on Earth is seeking to have the wholesale destruction of ecosystems ranked alongside offences such as genocide and war crimes. The International Criminal Court (ICC) would then be able to prosecut... |
Fracking causes earthquakes
cos_sim | statement |
---|---|
0.9 | "Within a day of Cuadrilla restarting fracking in Lancashire, there has already been another earthquake which means they've had to down tools," said Friends of the Earth campaigner Tony Bosworth. "It appears that they cannot frack withou... |
0.9 | Page 2 2 The government has rejected an energy company's request to relax rules on earthquakes caused by fracking, despite claims that the limits could prevent it testing Britain's shale gas potential. Cuadrilla has caused nearly 30 trem... |
0.9 | Cuadrilla caused what is described as a "micro-seismic event" measuring 1.1 on the Richter scale at Preston New Road in Lancashire yesterday, the strongest of 27 tremors since it resumed fracking two weeks ago. Under the government's "tr... |
Support net zero
cos_sim | statement |
---|---|
0.88 | "In the midst of a climate emergency, people across the UK are sending a clear message to the government that we need further and faster action to protect our environment and safeguard our planet for the future. "We were pleased to see g... |
0.87 | The Government's pledge to meet a net-zero target by 2050 is not a moment too soon. It must be commended for this bold commitment, echoing how the Climate Change Act gave the UK a world-leading role in tackling the issue that endangers t... |
0.87 | Reaching net zero by 2050 is an ambitious target, but it is crucial that we achieve it to ensure we protect our planet for future generations." The Government said it would retain the ability to use international carbon credits, which al... |
How well does cosine similarity capture agreement?
Moved beyond bag-of-words in three ways:
Applications for:
PUBL0099