9: Neural Networks, Transfer Learning and Transformer Models

Michael Jacobs

Motivation and Overview

Recap


Last week:

  • Dense vector representations of words \(\rightarrow\) dense vector representations of sentences
  • We did this with the help of neural network language models


This week:

  • How do neural network language models work?
  • How can we train/fine-tune our own model for classification?

Why Use Neural Language Models?


  • Better performance in supervised classification of documents
  • Models can be adapted for new tasks using transfer learning
    • e.g. Fine-tune sentiment classifier to specific domain
    • Zero-shot classification
  • Can also be used for text generation
    • Text summarisation
    • Translation
    • Chatbots

Roadmap

  1. Neural networks: foundations

  2. Neural language modelling

     • Feed forward neural networks
     • Recurrent neural networks
     • Transformer architecture

  3. Transfer learning

     • Pretraining
     • Fine-tuning
     • BERT, BERT+classifier, BERT-NLI

  4. Practical considerations

Neural networks: foundations

What is a Neural Network?


A neural network is a complex and highly flexible function that converts a set of numerical inputs into a set of numerical outputs.


Inputs:

| Var    | Value |
|--------|-------|
| Age    | 24    |
| Male   | 0     |
| Height | 174   |




\(\rightarrow\)

Outputs:

| Class        | Prob |
|--------------|------|
| Right-handed | 0.63 |
| Left-handed  | 0.37 |


What is a Neural Network?


It’s composed of:

  • Input layer
  • Hidden layers
  • Output layer

What is a Neural Network?


Each node takes in the values from nodes in the previous layer…

… processes them, and passes on the output to the next layer…

Inside a single node

Activation functions

Output layer

Why are Neural Networks so Powerful?

  • Neural networks have an extremely flexible functional form.

    • Universal Approximation Theorem: a sufficiently large neural network can approximate any continuous function to arbitrary accuracy.
  • This flexibility comes at the price of a very large number of trainable parameters.

    • N. Param \(\approx L \times (n^2 + n)\)
    • \(L\) is N of layers; \(n\) is nodes per layer
    • e.g. \(L=3\), \(n=100\), N. Param = \(30,300\)
  • This calls for a large quantity of training data
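A quick way to check the parameter count above in code, assuming PyTorch (the slides do not prescribe a library):

```python
import torch.nn as nn

# Three layers of 100 nodes each (inputs also of width 100, to match the formula)
net = nn.Sequential(*[nn.Linear(100, 100) for _ in range(3)])

# Each layer contributes 100*100 weights + 100 biases = 10,100 parameters
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 30300 = 3 * (100^2 + 100)
```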

More flexible functional forms can approximate more complex relationships…

Training a Neural Network

Prerequisites

  • A task: e.g. classify left v right handed
  • A loss function e.g. \(L(\hat{y}, y) = -\sum_k \mathbf{y}_k\text{log}(\mathbf{\hat{y}}_k)\)
  • Labelled data: [0, 1, 1, 0, 1, 0]

Steps

  1. Initialise the model parameters randomly.
  2. Forward pass through the model.
  3. Calculate loss using outputs.
  4. Backpropagation (work out gradients of loss)
     • Calculate change in loss given change in weight: \(\frac{\partial L(\hat{y}, y)}{\partial w_j}\)
  5. Update parameters using gradient descent
     • \(w_{t+1} = w_{t} - \eta \frac{\partial L}{\partial w_{t}}\)
  6. Repeat until convergence (see the sketch below).
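A minimal sketch of this loop in code. PyTorch is an assumption here (the slides don't prescribe a library), and the network and labelled data are toy placeholders based on the handedness example above:

```python
import torch
import torch.nn as nn

# Toy network: 3 inputs (age, male, height) -> 2 classes (right-/left-handed).
# 1. Parameters are initialised randomly when the layers are created.
model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 2))

loss_fn = nn.CrossEntropyLoss()                           # the loss function L(y_hat, y)
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)  # gradient descent with eta = 0.01

inputs = torch.tensor([[24.0, 0.0, 174.0]])               # one labelled example (placeholder)
labels = torch.tensor([1])                                # placeholder coding: 1 = right-handed

for step in range(100):                # 6. repeat until convergence
    optimiser.zero_grad()              #    clear gradients from the previous step
    outputs = model(inputs)            # 2. forward pass through the model
    loss = loss_fn(outputs, labels)    # 3. calculate loss using outputs
    loss.backward()                    # 4. backpropagation: dL/dw for every weight
    optimiser.step()                   # 5. update: w <- w - eta * dL/dw
```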

Training a Neural Network: Considerations

Selecting a learning rate

  • The learning rate determines how much model weights are updated in each step of gradient descent.

    • Too high \(\rightarrow\) The model may overshoot minimum, causing instability.
    • Too low \(\rightarrow\) The model learns slowly and may get stuck in suboptimal solutions.

  • Solutions:

    • Learning rate schedulers: reduce the learning rate as you proceed to avoid overshooting.
    • Optimisers: adjust the learning rate automatically as you go along, based on gradient history.
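A short sketch of both fixes, again assuming PyTorch and reusing `model` from the earlier sketch; the specific optimiser (Adam), the decay factor and the `train_one_epoch` helper are illustrative assumptions:

```python
import torch

# Optimiser that adapts each parameter's effective step size based on gradient history
optimiser = torch.optim.Adam(model.parameters(), lr=5e-5)

# Scheduler that shrinks the base learning rate by 10% after every epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimiser, gamma=0.9)

for epoch in range(10):
    train_one_epoch(model, optimiser)  # hypothetical helper: one full pass over the training data
    scheduler.step()                   # reduce the learning rate before the next epoch
```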

How many epochs to train for?

  • An epoch is one full pass through the training data

    • Too few epochs \(\rightarrow\) Underfitting (the model doesn’t learn enough patterns).
    • Too many epochs \(\rightarrow\) Overfitting (the model memorises the training data instead of generalising).
  • Solution:

    • Early stopping: stop when validation loss appears to have passed minimum (validation set used only for evaluation, not to update the parameters)
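A minimal early-stopping sketch (the `patience` rule and the `train_one_epoch` / `evaluate` helpers are assumptions, not part of the slides):

```python
best_val_loss, patience, bad_epochs = float("inf"), 3, 0

for epoch in range(100):
    train_one_epoch(model, optimiser)     # hypothetical helper: update parameters on training data
    val_loss = evaluate(model, val_data)  # hypothetical helper: loss on the validation set only

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1                   # validation loss has stopped improving
    if bad_epochs >= patience:
        break                             # we appear to have passed the minimum: stop training
```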

Training a Neural Network


Summary:

  • Training is iterative, in contrast to (e.g.) naive Bayes, which is fitted in a single pass.
  • Each iteration incrementally improves performance.
  • Due to the large number of parameters, training is computationally expensive.

Neural language modelling

What is a language model?


A language model is any model that represents text data probabilistically, learning patterns in how words appear together in order to make predictions, classify text, or generate new content.


  1. Causal language model: predicts next word, given word history.
     • “Can you predict the next word in this [MASK]”
     • e.g. GPT (Generative Pretrained Transformers)


  2. Masked language model: predicts missing word, given surrounding words.
     • “Can you guess which [MASK] is missing from this sentence?”
     • e.g. BERT (Bidirectional Encoder Representations from Transformers)
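A masked language model like this can be queried directly with the HuggingFace `transformers` library; the checkpoint name below is just one commonly used choice:

```python
from transformers import pipeline

# Load a pretrained masked language model (BERT) behind a fill-mask pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Ask the model to guess the missing word; it returns the top candidates with probabilities
for prediction in fill_mask("Can you guess which [MASK] is missing from this sentence?"):
    print(prediction["token_str"], round(prediction["score"], 3))
```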

How can we use NN to model language?


What are the inputs?

  • One-hot encodings
  • Precomputed word-type embeddings

What are the outputs?

  • Probability distribution over word-types


What architecture?

  • Feed forward neural network
  • Recurrent neural network
  • Transformers




One-hot encoding of the word-type ‘debt’:

\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & ...& 0 \end{bmatrix} \end{align}\]

Precomputed (dense) embedding of ‘debt’:

\[\begin{align} w_{\text{debt}} &= \begin{bmatrix} 0.63 & 0.14 & 0.02 & -0.58 & ...& -0.66 \end{bmatrix} \end{align}\]

Model Architectures


Feed forward neural network

Information flow is unidirectional

  1. Input: Concatenated word embeddings
  2. Hidden layer
  3. Output: softmax over full vocabulary

Problem: Does not capture word order well

  • Learns different weights for each relative position
  • No explicit modelling of sequences
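A minimal sketch of the feed-forward language model just described, assuming PyTorch; the vocabulary size, context window and layer widths are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, context_size, hidden_dim = 10_000, 100, 4, 128

ffnn_lm = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),              # look up an embedding for each context word
    nn.Flatten(),                                     # 1. concatenate the word embeddings
    nn.Linear(context_size * embed_dim, hidden_dim),  # 2. hidden layer
    nn.ReLU(),
    nn.Linear(hidden_dim, vocab_size),                # 3. one score per word-type ...
    nn.Softmax(dim=-1),                               #    ... turned into a probability distribution
)

context = torch.tensor([[12, 7, 431, 9]])  # indices of the previous 4 words (made up)
next_word_probs = ffnn_lm(context)         # shape (1, vocab_size): prob. dist. over the vocabulary
```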

Model Architectures


Recurrent Neural Network

Contains ‘cycles’, passes information across ‘time steps’

  • Pro: Deals well with short-range sequential dependencies
    • e.g. “It’s time to eat [MASK]”
  • Con: Struggles with long range dependencies
    • e.g. “I’m so hungry, I didn’t have time for [MASK]”
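A comparable sketch with a recurrent layer (again, sizes and token indices are made up for illustration):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10_000, 100)  # same word-embedding lookup as before
rnn = nn.RNN(input_size=100, hidden_size=128, batch_first=True)
to_vocab = nn.Linear(128, 10_000)      # map a hidden state to scores over the vocabulary

tokens = torch.tensor([[15, 2, 87, 6]])            # e.g. "It's time to eat" as word indices
hidden_states, _ = rnn(embedding(tokens))          # one hidden state per time step, carried forward
next_word_logits = to_vocab(hidden_states[:, -1])  # predict the next word from the final time step
```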


The Transformer Architecture


High level summary:

  • Transformers have become the ‘go-to’ architecture for state-of-the-art NLP, and underlie the recent massive advancements in AI.

  • They solve the problem of long-range dependencies through a mechanism called attention.

  • They allow for parallelisation (i.e. they are computationally efficient).

The Transformer Architecture

You have already seen a transformer model: sentence-BERT. Now let’s complicate this simplified representation…

The Transformer Block

Figure 1: A single transformer block


The attention mechanism captures relationships between tokens.

  • Context-signifiers
    • “I walked to the bank of the river
  • Adjective-noun relationships
    • “He is a reckless accountant”
  • Co-references
    • “The chicken did not cross the road because it was too tired

Attention mechanism detail

  • In each transformer block, the representation for token i is updated by ‘attending to’ the representations of other tokens.

  • The amount of attention paid to token j when updating token i’s representation is determined (roughly) by how similar their input representations are.

  • Pays more attention to:

    • nearer words (due to positional embeddings)
    • semantically related words (even if further away)
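A stripped-down sketch of the attention update described above (it omits the learned query/key/value projections, multiple attention heads and positional embeddings that a real transformer block uses):

```python
import torch

def simplified_self_attention(x):
    # x: one row per token, one column per embedding dimension
    scores = x @ x.T / x.shape[-1] ** 0.5    # similarity between every pair of tokens (i, j)
    weights = torch.softmax(scores, dim=-1)  # how much attention token i pays to token j
    return weights @ x                       # each token becomes a weighted mix of all tokens

token_reps = torch.randn(6, 100)                      # 6 token representations of dimension 100
updated_reps = simplified_self_attention(token_reps)  # same shape, now context-dependent
```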

Attention mechanism detail

The Transformer Architecture




The full transformer model consists of several stacked transformer blocks.


Summary so far

  1. Neural networks in general are complex, non-linear functions that map numerical inputs to numerical outputs.
     • They are composed of units that perform simpler calculations as part of a sequence.
     • The hidden layers of a NN construct a latent representation of the data.

  2. Neural networks can be trained using gradient descent, which involves iteratively updating parameters based on labelled examples.

  3. Neural networks can be applied to language modelling, where:
     • Inputs are one-hot encodings / precomputed word embeddings.
     • Outputs are a probability distribution for the missing or next word.
     • There are a variety of possible architectures.

  4. The transformer architecture is a particularly effective architecture for NLP.

Transfer learning

How can we use these models?


  1. Problems:
     • Neural language models are designed to predict the next/missing word. But that’s not what social scientists do!
     • Training models with so many parameters is expensive and time consuming.


  2. Solution: Transfer Learning
     • Use knowledge learned from one task (causal/masked language modelling) to boost performance on a related task.
     • Makes efficient use of computational resources.
     • Makes efficient use of labelled data.

Terminology


  1. Pre-training: Training a model from scratch (initialise parameters randomly).


  2. Fine-tuning: Update the parameters of a model that has already been trained on a large amount of data, using new training data suited to a new task.
     • This might involve modifying the model architecture (see the sketch below):
       • one-hot encodings \(\rightarrow\) contextual embeddings \(\rightarrow\) prob. dist. over words
       • one-hot encodings \(\rightarrow\) contextual embeddings \(\rightarrow\) prob. dist. over sentiment
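With the HuggingFace `transformers` library, swapping the language-modelling head for a classification head is a one-line change; the checkpoint and the three-class sentiment setup below are just examples:

```python
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification

# The same pretrained BERT encoder, loaded with two different heads:
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # -> prob. dist. over words
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,                                                      # -> prob. dist. over 3 classes
)
# The classification head is newly (randomly) initialised, so it still needs fine-tuning.
```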

BERT Pre-training


Data: 3.3 billion word corpus from…

  • English Wikipedia
  • BooksCorpus


Task 1: Masked language modelling


Task 2: Next sentence prediction


Figure 1: Masked language modelling example


Figure 2: Next sentence prediction example

BERT + Classifier head

Procedure

  1. BERT pre-training
  • Masked language modelling (MLM)
  • Next sentence prediction (NSP)
  1. Modify architecture:
  1. Remove MLM/NSP head(s)
  1. Add a classifier head
  1. Train classifier


Figure 1: Masked language modelling head


Figure 2: BERT encoder with no head


Figure 3: BERT encoder with classification head

BERT + Classifier head: Example



  1. Pretraining (general domain)
     • Task: Masked language modelling
     • Data: Common Crawl
  2. Training (downstream tasks)
     • Task: classification
     • Data: hand-coded paragraphs (see the sketch below)

| Example text                                                        | Label  |
|----------------------------------------------------------------------|--------|
| Apple’s latest [MASK] model features improved camera capabilities.  | iPhone |
| The Eiffel Tower is located in [MASK], France.                      | Paris  |


| Example text                  | Label    |
|-------------------------------|----------|
| I’ve never been so delighted  | positive |
| This is unbearably boring     | negative |
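A hedged sketch of the downstream training step with the HuggingFace `Trainer`; the IMDB sentiment dataset, the checkpoint and the hyperparameters are stand-ins for whatever hand-coded data and settings you would actually use:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # example labelled data: positive/negative movie reviews
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-sentiment", num_train_epochs=3, learning_rate=2e-5),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(1000)),  # small subset for speed
    eval_dataset=dataset["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()
```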

BERT Variations



| Model      | Key Differences from BERT                   | Training Objective                                               | Parameters            |
|------------|---------------------------------------------|------------------------------------------------------------------|-----------------------|
| BERT       | Baseline model                              | Masked language modelling (MLM), next sentence prediction (NSP)  | Base 110M, Large 340M |
| RoBERTa    | No NSP, more training data, dynamic masking | MLM only                                                         | Base 125M, Large 355M |
| DistilBERT | Distilled version of BERT                   | MLM only                                                         | 66M                   |
| ALBERT     | Parameter sharing                           | MLM and sentence order prediction (SOP)                          | Base 12M, Large 18M   |

Domain adaptive pretraining


Problem: Limits of general pretraining

  • Domain specific words are not included as full words in vocabulary (i.e. broken into subwords)
    • e.g. ‘CO2’, ‘estimation’, ‘ecological’
  • Words may have domain-specific meanings that do not occur frequently enough in pretraining
    • e.g. ‘derivative’ in maths vs finance vs literature

Solution: Additional pre-training on a corpus drawn from the specific domain - e.g. FinBERT (finance), LegalBert (law), MedBert (medicine)
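A sketch of how such additional pretraining might look with HuggingFace `transformers`; the DistilRoBERTa checkpoint mirrors the ClimateBERT setup discussed next, and `domain_corpus.txt` is a placeholder for your own domain texts:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

# "domain_corpus.txt" is a placeholder: one domain-specific text per line
corpus = load_dataset("text", data_files="domain_corpus.txt")["train"]
corpus = corpus.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

# Randomly mask 15% of tokens in each batch, BERT-style
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-lm", num_train_epochs=1),
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()
```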

ClimateBERT

  1. Pretraining (general domain)
     • Task: Masked language modelling
     • Data: Common Crawl
  2. Domain adaptive pretraining (climate specific)
     • Task: Masked language modelling
     • Data: climate-related texts
  3. Training (downstream tasks)
     • Task: classification
     • Data: hand-coded paragraphs

| Example text                                                        | Label  |
|----------------------------------------------------------------------|--------|
| Apple’s latest [MASK] model features improved camera capabilities.  | iPhone |
| The Eiffel Tower is located in [MASK], France.                      | Paris  |


| Example text                                                              | Label     |
|----------------------------------------------------------------------------|-----------|
| The atmosphere is a global commons that responds to many types of [MASK]  | emissions |
| Adaptations are most likely to be stimulated by [MASK] variability        | climatic  |


| Example text                                                                       | Label       |
|-------------------------------------------------------------------------------------|-------------|
| Grid & Infrastructure and Retail today represent the energy world of tomorrow      | opportunity |
| ANIC recognizes that increased claims activity resulting from catastrophic events  | risk        |

ClimateBERT


Domain adaptive pretraining results in improved performance on masked language modelling within the domain

| Model         | Loss  |
|---------------|-------|
| DistilRoBERTa | 2.238 |
| ClimateBERT   | 1.157 |

And on downstream classification tasks…

| Model         | Loss  |
|---------------|-------|
| DistilRoBERTa | 0.242 |
| ClimateBERT   | 0.191 |

NLI as a ‘universal task’


  • Knowledge learned from a general task can be reused for a specific task using transfer learning.

  • The more similar the pretraining task is to the downstream task, the less fine-tuning is needed.

  • Is there a general task that is more similar to social science applications than MLM and NSP?

    • Yes, natural language inference!

What is Natural Language Inference?

  • Natural language inference (NLI) is the reasoning task of determining whether a context entails or contradicts a hypothesis
  • \(context\) entails \(hypothesis\) if \(\Pr(hypothesis|context) \approx 1\)
  • \(context\) contradicts \(hypothesis\) if \(\Pr(hypothesis|context) \approx 0\)
  • \(context\) is neutral w.r.t. \(hypothesis\) if \(\Pr(hypothesis|context) \approx \Pr(hypothesis)\)


| Context (c)                             | Hypothesis (h)              | Status        |
|-----------------------------------------|-----------------------------|---------------|
| We need overseas workers in the NHS     | Immigration is a good thing | Entailment    |
| Immigrants are taking our jobs          | Immigration is a bad thing  | Entailment    |
| Income taxes should be reduced          | Immigration is a good thing | Neutral       |
| The government should call an election  | Immigration is a bad thing  | Neutral       |
| We need overseas workers in the NHS     | Immigration is a bad thing  | Contradiction |
| Immigrants are taking our jobs          | Immigration is a good thing | Contradiction |

How can BERT be used for NLI?

Repurpose NSP head for natural language inference…

BERT-NLI

Zero-shot Classification with BERT-NLI

Example: classify news headlines into a predefined set of topics {immigration, environment, other}

  1. Define a hypothesis for each category
     • H1: The headline is about immigration
     • H2: The headline is about the environment
     • H3: The headline is not about immigration or the environment
  2. Concatenate the headlines and hypotheses:
     • e.g. “[CLS] Headline : ’ Net immigration reaches record high ’ [SEP] The headline is about immigration”
     • e.g. “[CLS] Headline : ’ Net immigration reaches record high ’ [SEP] The headline is about the environment”
     • e.g. “[CLS] Headline : ’ Net immigration reaches record high ’ [SEP] The headline is not about immigration or the environment”
  3. Inference: Use a pre-trained BERT-NLI model to predict the probability of entailment for each hypothesis (use softmax to force probabilities to sum to 1 for each headline)
     • e.g. [0.9, 0.03, 0.07]
  4. Assign the headline to the most probable hypothesis (the pipeline sketch below wraps these steps up)
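These steps are exactly what the HuggingFace zero-shot classification pipeline does under the hood; the checkpoint below is a BART-based NLI model commonly used for this, and a BERT-style NLI checkpoint would work the same way:

```python
from transformers import pipeline

# An NLI-fine-tuned model used as a zero-shot classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "Net immigration reaches record high",
    candidate_labels=["immigration", "environment", "other"],
    hypothesis_template="The headline is about {}.",  # turns each label into a hypothesis
)
print(result["labels"][0], result["scores"][0])       # most probable label and its probability
```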

Fine-tuning BERT-NLI

Fine-tuning BERT-NLI for classification avoids throwing away information between training phases.


Classical training sequence:

  • Pre-training (MLM + NSP) \(\rightarrow\) Modify architecture \(\rightarrow\) Fine-tune for classification


BERT-NLI for classification training sequence:

  • Pre-training (MLM + NSP) \(\rightarrow\) Modify architecture \(\rightarrow\) Fine-tune for NLI \(\rightarrow\) Fine-tune for classification

Fine-tuning BERT-NLI


Laurer et al. (2023) show that fine-tuning BERT-NLI outperforms fine-tuning BERT-base directly at low training sample sizes.

Practical Considerations

How can I use a Transformer model for my research?


In terms of software…

  • Some Python is required
  • HuggingFace API (through Python)

In terms of data… (very rough estimates)

  • To train a classifier head from scratch: >1,000 labelled examples
  • To fine-tune an existing classifier to a new domain: 100–1,000 labelled examples
  • To fine-tune BERT-NLI for classification: >300 labelled examples
  • To perform zero-shot classification: 0 labelled examples (but potentially poor performance)

HuggingFace

Python for R users


Interfaces:

  • Google Colab (simplest for beginners)
  • Jupyter notebooks
  • Spyder (similar layout to RStudio)

Managing package dependencies:

  • Temporary environments (used in Google colab)
  • Virtual environments (requires terminal operations)
  • Conda environments


Key packages:

  • pandas: dataframe structures (similar to dplyr and tidyr)
  • numpy: efficient numerical computing (vectorised computations as in R)
  • sklearn: machine learning functions
  • transformers: by HuggingFace (no R equivalent!)

Summary

Key takeaways


  • Neural networks - in particular, transformer models - underpin recent advancements in NLP and AI

  • Social scientists can use transformer models for better classification through transfer learning

    • BERT \(\rightarrow\) fine-tune classifier head
    • BERT \(\rightarrow\) domain adaptive pretraining \(\rightarrow\) fine-tune classifier head
    • BERT \(\rightarrow\) BERT-NLI \(\rightarrow\) fine-tune NLI head for classification
  • We can implement this in Python using HuggingFace’s transformers package.

Next week



How can generative language models be used directly for social science?