Last week:
This week:
Neural networks: foundations
Neural language modelling
A neural network is a complex and highly flexible function that converts a set of numerical inputs into a set of numerical outputs.
Inputs:
Var | Value |
---|---|
Age | 24 |
Male | 0 |
Height | 174 |
\(\rightarrow\)
Outputs:
Class | Prob |
---|---|
Right-handed | 0.63 |
Left-handed | 0.36 |
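To make this concrete, here is a minimal sketch of such a function: a tiny feed-forward network, written with numpy and made-up weights, that maps the inputs above to the two class probabilities (a real network would learn these weights from data).

```python
import numpy as np

# A tiny feed-forward network with made-up weights (illustration only).
# Inputs: [age, male, height]; outputs: P(right-handed), P(left-handed).

def softmax(z):
    z = z - z.max()                    # for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([24.0, 0.0, 174.0])       # one person's inputs

W1 = np.random.randn(4, 3) * 0.1       # hidden layer: 4 nodes
b1 = np.zeros(4)
W2 = np.random.randn(2, 4) * 0.1       # output layer: 2 classes
b2 = np.zeros(2)

h = np.maximum(0, W1 @ x + b1)         # each node: weighted sum, then ReLU
p = softmax(W2 @ h + b2)               # convert scores to probabilities

print(dict(zip(["Right-handed", "Left-handed"], p.round(2))))
```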
A neural network is composed of layers of interconnected nodes:
Each node takes in the values from nodes in the previous layer…
… processes them, and passes on the output to the next layer…
Neural networks have an extremely flexible functional form.
This flexibility comes at the price of a very large number of trainable parameters.
More flexible functional forms can approximate more complex relationships…
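As a quick back-of-the-envelope illustration of why this flexibility means many parameters, the snippet below counts the weights and biases in a hypothetical fully connected network (layer sizes chosen arbitrarily for this example).

```python
# Count trainable parameters in a hypothetical fully connected network
# with layers of size 3 -> 4 -> 4 -> 2 (a weight matrix plus a bias vector per layer).
layer_sizes = [3, 4, 4, 2]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    total += n_in * n_out + n_out   # weights + biases for this layer
print(total)  # (3*4+4) + (4*4+4) + (4*2+2) = 16 + 20 + 10 = 46
```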
Selecting a learning rate
The learning rate determines how much model weights are updated in each step of gradient descent.
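For intuition, here is a minimal sketch of the gradient-descent update rule \(w \leftarrow w - \eta \, \nabla L(w)\), applied to the toy loss \(L(w) = (w - 3)^2\); the learning-rate values are illustrative only.

```python
# Gradient descent on a simple quadratic loss, L(w) = (w - 3)^2,
# to show how the learning rate scales each update.
def grad(w):
    return 2 * (w - 3)

for lr in [0.01, 0.1, 1.1]:        # small, moderate, too large
    w = 0.0
    for step in range(50):
        w = w - lr * grad(w)       # the update rule
    print(f"lr={lr}: w after 50 steps = {w:.3f}")
# Small lr moves towards w = 3 slowly, moderate lr converges, too-large lr diverges.
```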
Solutions:
How many epochs to train for?
An epoch is one full pass through the training data
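To make "one full pass" concrete, here is a toy sketch (simulated data, arbitrary learning rate) that fits a single weight by gradient descent and prints the estimate after each epoch.

```python
import numpy as np

# An epoch = one full pass over the training data (simulated here).
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2 * X + rng.normal(scale=0.1, size=100)   # true relationship: y ≈ 2x

w, lr, n_epochs = 0.0, 0.01, 5
for epoch in range(n_epochs):
    for x_i, y_i in zip(X, y):                # every training example seen once per epoch
        w -= lr * 2 * (w * x_i - y_i) * x_i   # gradient of the squared error (w*x - y)^2
    print(f"epoch {epoch + 1}: w = {w:.3f}")  # w approaches 2 across epochs
```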
Solution:
Summary:
A language model is any model that represents text data probabilistically, learning patterns in how words appear together in order to make predictions, classify text, or generate new content.
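As a toy illustration of "learning patterns in how words appear together", the sketch below estimates bigram probabilities \(P(w_t \mid w_{t-1})\) from a tiny made-up corpus; neural language models do the same job with far richer representations.

```python
from collections import Counter

# Toy bigram language model: estimate P(next word | current word) from counts.
corpus = "the cat sat on the mat the cat slept on the sofa".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def next_word_probs(word):
    return {w2: c / unigrams[word] for (w1, w2), c in bigrams.items() if w1 == word}

print(next_word_probs("the"))   # {'cat': 0.5, 'mat': 0.25, 'sofa': 0.25}
```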
\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & ...& 0 \end{bmatrix} \end{align}\]
\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0.63 & 0.14 & 0.02 & -0.58 & ...& -0.66 \end{bmatrix} \end{align}\]
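The two representations above can be sketched in a few lines of numpy: a sparse one-hot vector indexed by the word's position in the vocabulary, versus a dense row of an embedding matrix (random here; in a trained model these values are learned). The toy vocabulary and dimensions are invented for illustration.

```python
import numpy as np

# A one-hot vector vs. a dense embedding for the word 'debt' (toy vocabulary).
vocab = ["the", "cut", "debt", "tax", "spend"]
idx = vocab.index("debt")

one_hot = np.zeros(len(vocab))
one_hot[idx] = 1                                   # [0. 0. 1. 0. 0.]

embedding_matrix = np.random.randn(len(vocab), 5)  # one vector per word (random here)
dense = embedding_matrix[idx]                      # equivalently: one_hot @ embedding_matrix

print(one_hot)
print(dense.round(2))
```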
Information flow is unidirectional
Problem: Does not capture word order well
Contains ‘cycles’, passes information across ‘time steps’
High level summary:
Transformers have become the ‘go-to’ architecture for state-of-the-art NLP, and underlie the recent massive advancements in AI.
Solves the problem of long-range dependencies through a mechanism called attention.
Allows for parallelisation (i.e. computationally efficient)
You have already seen a transformer model: sentence-BERT. Now let’s complicate this simplified representation…
Figure 1: A single transformer block
The attention mechanism captures relationships between tokens.
In each transformer block, the representation for token i is updated by ‘attending to’ the representations of other tokens.
The amount of attention paid to token j when updating token i’s representation is determined (roughly) by how similar their input representations are.
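A minimal numpy sketch of (scaled dot-product) self-attention for three toy tokens: similarity scores between token representations go through a softmax to give attention weights, which then produce each token's updated representation as a weighted average. Real transformers add learned query/key/value projections and multiple attention heads, omitted here for brevity.

```python
import numpy as np

# Scaled dot-product self-attention for a toy sequence of 3 tokens,
# each with a 4-dimensional representation (random numbers for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))              # one row per token

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = X @ X.T / np.sqrt(X.shape[1])   # similarity between every pair of tokens
weights = softmax(scores)                # row i: how much token i attends to each token j
updated = weights @ X                    # weighted average of token representations

print(weights.round(2))
```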
Pays more attention to:
The full transformer model consists of several stacked transformer blocks.
They are composed of units that perform simpler calculations as part of a sequence.
The hidden layers of a neural network construct a latent representation of the data
Inputs are one-hot encodings / precomputed word embeddings
Outputs a probability distribution for the missing or next word.
There are a variety of possible architectures
Neural language models are designed to predict the next/missing word. But that’s not what social scientists do!
Training models with so many parameters is expensive and time consuming.
Use knowledge learned from one task (causal/masked language modelling) to boost performance on a related task.
Makes efficient use of computational resources
Makes efficient use of labelled data
Data: 3.3 billion word corpus from…
Task 1: Masked language modelling
Task 2: Next sentence prediction
Figure 1: Masked language modelling example
Figure 2: Next sentence prediction example
Procedure
Figure 1: Masked language modelling head
Figure 2: BERT encoder with no head
Figure 3: BERT encoder with classification head
Example text | Label |
---|---|
Apple’s latest [MASK] model features improved camera capabilities. | iPhone |
The Eiffel Tower is located in [MASK], France. | Paris |
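Examples like these can be reproduced with the fill-mask pipeline from the transformers library (model weights are downloaded from the HuggingFace Hub on first use); bert-base-uncased is used here simply because its mask token is written [MASK].

```python
from transformers import pipeline

# Masked language modelling with a pretrained BERT model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

preds = fill_mask("The Eiffel Tower is located in [MASK], France.")
for p in preds[:3]:
    print(p["token_str"], round(p["score"], 3))  # 'paris' should rank near the top
```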
Example text | Label |
---|---|
I’ve never been so delighted | positive |
This is unbearably boring | negative |
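A sketch of what attaching a classification head looks like in code: AutoModelForSequenceClassification puts a randomly initialised two-class head on top of the pretrained BERT encoder. Its outputs are meaningless until the model is fine-tuned on labelled examples like those above; the snippet only shows the moving parts and assumes transformers and torch are installed.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pretrained BERT encoder plus a freshly initialised 2-class classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. positive / negative
)

inputs = tokenizer("I've never been so delighted", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # not meaningful until the head is fine-tuned
```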
Model | Key Differences from BERT | Training Objective | Parameters |
---|---|---|---|
BERT | Baseline model | Masked Language Modeling (MLM), Next Sentence Prediction (NSP) | Base 110M, Large 340M |
RoBERTa | No NSP, more training data, dynamic masking | MLM only | Base 125M, Large 355M |
DistilBERT | Distilled version of BERT | Masked Language Modeling (MLM) only | 66M |
ALBERT | Parameter sharing | Masked Language Modeling (MLM) and Sentence Order Prediction (SOP) | Base 12M, Large 18M
Problem: Limits of general pretraining
Solution: Additional pre-training on a corpus drawn from a specific domain - e.g. FinBERT (finance), LegalBert (law), MedBert (medicine)
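A rough sketch of what domain-adaptive pretraining could look like with the transformers Trainer API: continue masked language modelling on in-domain text. The two sentences, the distilroberta-base starting point, and the training settings are placeholders; a real run would use a large domain corpus and many more steps (assumes the transformers and datasets packages).

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder in-domain corpus (a real application would use thousands of documents).
texts = [
    "The atmosphere is a global commons that responds to many types of emissions.",
    "Adaptations are most likely to be stimulated by climatic variability.",
]

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

# Tokenise the corpus and set up random masking of 15% of tokens.
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # continues MLM pretraining on the in-domain text
```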
Example text | Label |
---|---|
Apple’s latest [MASK] model features improved camera capabilities. | iPhone |
The Eiffel Tower is located in [MASK], France. | Paris |
Example text | Label |
---|---|
The atmosphere is a global commons that responds to many types of [MASK] | emissions |
Adaptations are most likely to be stimulated by [MASK] variability | climatic |
Example text | Label |
---|---|
Grid & Infrastructure and Retail today represent the energy world of tomorrow | opportunity |
ANIC recognizes that increased claims activity resulting from catastrophic events | risk |
Domain-adaptive pretraining results in improved performance on masked language modelling within the domain
Model | MLM loss |
---|---|
DistilRoBERTa | 2.238 |
ClimateBERT | 1.157 |
And on downstream classification tasks…
Model | Classification loss |
---|---|
DistilRoBERTa | 0.242 |
ClimateBERT | 0.191 |
Knowledge learned from a general task can be reused for a specific task using transfer learning.
The more similar the pretraining task is to the downstream task, the less fine-tuning is needed.
Is there a general task that is more similar to social science applications than MLM and NSP?
Context (c) | Hypothesis (h) | Status |
---|---|---|
We need overseas workers in the NHS | Immigration is a good thing | Entailment |
Immigrants are taking our jobs | Immigration is a bad thing | Entailment |
Income taxes should be reduced | Immigration is a good thing | Neutral |
The government should call an election | Immigration is a bad thing | Neutral |
We need overseas workers in the NHS | Immigration is a bad thing | Contradiction |
Immigrants are taking our jobs | Immigration is a good thing | Contradiction |
Repurpose NSP head for natural language inference…
Example: classify news headlines into a predefined set of topics {immigration, environment, other}
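With the transformers library this is available as the zero-shot-classification pipeline, which wraps exactly this NLI trick: each candidate label is slotted into a hypothesis template and scored for entailment against the text. The headline and the choice of facebook/bart-large-mnli below are illustrative only.

```python
from transformers import pipeline

# Zero-shot topic classification via an NLI model.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "Government unveils new points-based visa system",   # illustrative headline
    candidate_labels=["immigration", "environment", "other"],
)
print(result["labels"][0], round(result["scores"][0], 3))  # highest-scoring topic
```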
Fine-tuning BERT-NLI for classification avoids throwing away information between training phases.
Classical training sequence:
BERT-NLI for classification training sequence:
Laurer et al. (2023) show that fine-tuning BERT-NLI outperforms fine-tuning BERT-base directly at low training sample sizes.
In terms of software…
In terms of data… (very rough estimates)
Interfaces:
Managing package dependencies:
Key packages:
pandas: dataframe structures (similar to dplyr and tidyr)
numpy: efficient numerical computing (vectorised computations as in R)
sklearn: machine learning functions
transformers: by HuggingFace (no R equivalent!)
Neural networks - in particular, transformer models - underpin recent advancements in NLP and AI
Social scientists can use transformer models for better classification through transfer learning
We can implement this using the transformers package in Python, alongside HuggingFace.
How can generative language models be used directly for social science?