10: Quick overview of Large Language Models

Jack Blumenau & Gloria Gennaro

Large Language Models

Dependencies in Language

  • In this course we have largely focused on bag-of-words models

    • Tend to be a good choice for a wide range of datasets and tasks in text analysis in the social sciences
    • Dictionaries; topic models; Naive Bayes; etc
  • Bag-of-word classifiers based on term frequencies can achieve good performance

  • Yet, when the interdependent nature of words becomes important, more advanced models that capture the dependencies in language can be helpful

Language modelling

  • We have seen several examples of language models, which we have taken to be probabilistic descriptions of word counts in documents

    • Naive Bayes: a distribution over words for each category
    • Topic models: a distribution over words for each topic; a distribution over topics for each document
  • In all instances, we have considered bag-of-words models – models that do not take word order or dependency into account

  • More advanced language models provide probabilistic descriptions for word sequences

  • For instance, given a sequence of words, a language model might try to predict the word that comes next

Language Models

Language modelling: the task of teaching an algorithm to predict/generate what comes next

the students opened their ____

the students opened their books (?)

the students opened their laptops (?)

the students opened their exams (?)

the students opened their minds (?)

More formally: given a sequence of words \(x^1, x^2, ..., x^t\), compute the probability distribution of the next word \(x^{t+1}\):

\[P\left(x^{t+1}|x^{t},...,x^{1}\right)\]

where \(x^{t+1}\) can be any word in the vocabulary \(V\).
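
Applying this prediction step repeatedly, a language model also assigns a probability to an entire sequence of words via the chain rule of probability:

\[P\left(x^{1},...,x^{T}\right) = \prod_{t=1}^{T}P\left(x^{t}|x^{t-1},...,x^{1}\right)\]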

Language Modelling Applications

Language models should be very familiar to you!

Why Should Social Scientists Care About Language Models?

  • Language modelling has become a benchmark test that helps us measure our progress on predicting language use

  • More relevantly to social scientists, language modelling is now a subcomponent of many NLP tasks, including those we have studied on this course

    • Topic modelling
    • Document classification
    • Sentiment analysis
    • etc
  • Virtually all state-of-the-art natural language processing tools are based on language models of different types

  • Social scientists are beginning to adopt these methods!

n-gram Language Models

Question: How might we learn a language model?

Old Answer: Use n-grams!

Idea: Collect statistics about how frequent different n-grams are…

  1. the students opened their books
  2. the students opened their laptops
  3. the students opened their exams
  4. the students opened their minds

…and use these to predict the next word when we see the phrase “the students opened their…”.

\[P(w|\text{students opened their})=\frac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})}\]

Note: n-gram models rely on the Markov assumption: the next word \(x^{t+1}\) depends only on the preceding \(n-1\) words!
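
Under this assumption, the conditional probability simplifies to one that depends only on the last \(n-1\) words, which can be estimated from counts:

\[P\left(x^{t+1}|x^{t},...,x^{1}\right) \approx P\left(x^{t+1}|x^{t},...,x^{t-n+2}\right) = \frac{\text{count}\left(x^{t-n+2},...,x^{t},x^{t+1}\right)}{\text{count}\left(x^{t-n+2},...,x^{t}\right)}\]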

n-gram Language Models

Suppose we are learning a 4-gram language model.

as the proctor started the clock, the students opened their ____

\[P(w|\text{students opened their})=\frac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})}\]

Example:

  • Suppose we have a large corpus of text

  • “students opened their” occurred 1000 times

  • “students opened their books” occurred 400 times

    • \(\rightarrow P(\text{books}|\text{students opened their}) = 0.4\)
  • “students opened their exams” occurred 100 times

    • \(\rightarrow P(\text{exams}|\text{students opened their}) = 0.1\)

In this example, discarding the word “proctor” results in the wrong prediction!
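
A minimal sketch of this counting approach in base R; the toy corpus below is an assumption for illustration:

# Toy corpus of sentences (assumed for illustration)
corpus <- c("the students opened their books",
            "the students opened their laptops",
            "the students opened their exams",
            "the students opened their books")

context <- "students opened their"
candidates <- c("books", "laptops", "exams", "minds")

# Count the context and each "context + w" continuation
# (one match per sentence, which is fine for this toy corpus)
n_context <- sum(grepl(context, corpus, fixed = TRUE))
n_continuation <- sapply(candidates, function(w)
  sum(grepl(paste(context, w), corpus, fixed = TRUE)))

# Estimated P(w | "students opened their")
n_continuation / n_context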

n-gram Language Models

The core problem with n-gram language models is sparsity.

  • What if “students opened their \(w\)” doesn’t occur in the data?

    • \(\text{count}(\text{students opened their } w) = 0\)
    • \(P(w|\text{students opened their})=\frac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})} = 0\)
    • The probability of word \(w\) is zero!
  • What if “students opened their” doesn’t occur in the data?

    • Then we can’t calculate the probability for any word!
  • Increasing the size of the n-gram makes these sparsity problems worse

    • if “students opened their” only occurs 1000 times, “as the proctor started the clock, the students opened their” will occur many fewer times!
    • Trade-off between model accuracy and sparsity
  • Increasing the size of the corpus helps with this problem a bit, but not much

\(\rightarrow\) n-gram models are good for clarifying the intuition behind a language model but are not very useful in practice

Neural Language Models

  • These are models that can capture dependencies between words without running into the sparsity problems that affect n-gram models

  • These models process sequences of inputs and predict sequences of outputs

    • Input: Each context word is associated with an embedding vector. The input vector is a concatenation of those vectors.

    • Output: probability distribution over the next word

  • The key innovation is that they are based on dense representations of words (embeddings), rather than sparse representations

    • Removes the sparsity problem!
    • Better treatment of out-of-vocabulary words
  • Each word in a sequence updates a set of parameters, which are then used to predict the final word in the sequence (a minimal sketch follows this list)

    • The model architecture is often an RNN, which allows for longer sequences and for words further away from the target word to have predictive power
    • The final softmax layer is the most computationally intensive
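
The sketch below illustrates the basic idea with a fixed-window feed-forward language model in base R (an RNN adds recurrence, which is omitted here); all dimensions and parameter values are assumptions chosen for illustration:

# Minimal sketch: fixed-window neural language model forward pass
set.seed(1)
V <- 50   # vocabulary size
d <- 8    # embedding dimension
n <- 3    # number of context words
h <- 16   # hidden layer size

E <- matrix(rnorm(V * d), V, d)          # embedding matrix: one row per word

context_ids <- c(4, 17, 23)              # indices of the n context words
x <- as.vector(t(E[context_ids, ]))      # input: concatenation of their embeddings

W1 <- matrix(rnorm(h * n * d), h, n * d) # input-to-hidden weights
b1 <- rnorm(h)
hidden <- tanh(W1 %*% x + b1)            # hidden representation of the context

W2 <- matrix(rnorm(V * h), V, h)         # hidden-to-output weights
b2 <- rnorm(V)
logits <- W2 %*% hidden + b2
p_next <- exp(logits) / sum(exp(logits)) # softmax: distribution over the next word
which.max(p_next)                        # index of the most probable next word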

Neural Language Models

Advantages

  1. Can process inputs of any length (not limited to a fixed window of 3 or 4 words)

  2. The prediction for \(x^{t+1}\) can use information from many steps back (at least in theory)

  3. The model size does not increase for longer input sequences

Disadvantages

  1. Computation is very slow

  2. In practice, predictions tend to be dominated by words close to the target word in the sequence (i.e. we still lack a way of capturing the importance of “proctor”)

Attention

Consider these two sentences:

As a leading firm in the ___ sector, we hire highly skilled software engineers.

As a leading firm in the ___ sector, we hire highly skilled petroleum engineers.

  • A human finds it easy to predict the missing words on the basis of the difference between “software” and “petroleum”.

  • Word-embedding methods like Word2Vec, which also predict words from their context, struggle here because they weight all words in the context window equally when constructing embeddings

  • Major breakthrough in modern NLP: train algorithms to also “pay attention” to the relevant features for prediction problems (Vaswani et al. 2023)

Attention

Words have a darker shading when they are given more weight in the prediction problem.

Attention heads are filters which, for each word, scan over every other word in the document and pick up predictive interactions
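
A minimal sketch of single-head (scaled dot-product) attention in base R; the tiny four-word “document” and all weight matrices are assumptions for illustration:

# Minimal sketch: single-head scaled dot-product attention
set.seed(1)
n_words <- 4
d <- 3
X <- matrix(rnorm(n_words * d), n_words, d)   # one (static) embedding per word

Wq <- matrix(rnorm(d * d), d, d)              # query projection
Wk <- matrix(rnorm(d * d), d, d)              # key projection
Wv <- matrix(rnorm(d * d), d, d)              # value projection

Q <- X %*% Wq
K <- X %*% Wk
V <- X %*% Wv

scores  <- Q %*% t(K) / sqrt(d)               # how much each word attends to every other word
weights <- exp(scores) / rowSums(exp(scores)) # row-wise softmax: attention weights
Z <- weights %*% V                            # contextual representation of each word

round(weights, 2)                             # larger weights ~ darker shading in the figure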

Transformers

The key innovation of transformer models is the introduction of attention into a neural-network architecture

  • Input: a sequence of words

  • Output: a prediction for what word comes next, and a sequence of contextual embeddings that represents the contextual meaning of each of the input words

  • Attention allows a network to directly extract and use information from arbitrarily large contexts.

  • At each layer, the transformer computes a representation of word \(i\) that combines information from the representation of \(i\) at the previous layer with information from the representations of the neighboring words \(\rightarrow\) contextualized embeddings
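
Schematically (the notation below is an assumption for illustration), the layer-\(\ell\) representation of word \(i\) is a function of all the layer-\((\ell-1)\) representations, with attention determining how much each neighbouring word contributes:

\[h_{i}^{(\ell)} = f\left(h_{i}^{(\ell-1)};\; h_{1}^{(\ell-1)},...,h_{n}^{(\ell-1)}\right)\]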

Transformer-based Language Models

Two main models:

  • Autoregressive models (e.g. GPT):

    • pretrained on the classic language-modelling task: guess the next token having read all the previous ones
    • during training, attention heads only view previous tokens, not subsequent tokens
    • ideal for text generation
  • Autoencoding models (e.g. BERT):

    • pretrained by dropping/shuffling input tokens and trying to reconstruct the original sequence
    • usually build bidirectional representations and get access to the full sequence
    • can be fine-tuned and achieve great results on many tasks, e.g. text classification

R packages

You need a Python installation and a way to call Python code from R (via the reticulate package). Here are some initial steps:

# For importing python code
library(reticulate)

# Specify the path to your Python installation
use_python("/usr/bin/python3") 

reticulate::py_install('transformers', pip = TRUE)

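# Import the Python modules into the R session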
transformer <- reticulate::import('transformers')
tf <- reticulate::import('tensorflow')
builtins <- import_builtins() #built in python methods

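# Download and load the pretrained BERT tokenizer and (TensorFlow) model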
tokenizer <- transformer$AutoTokenizer$from_pretrained('bert-base-uncased')
bert.model <- transformer$TFBertModel$from_pretrained("bert-base-uncased")
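
A minimal usage sketch, assuming the objects created above and that tensorflow is installed in the same Python environment:

# Tokenize a sentence and pass it through BERT to get contextual embeddings
inputs <- tokenizer("We hire highly skilled software engineers",
                    return_tensors = "tf")
outputs <- bert.model(inputs)

# One embedding per input token (dimensions: 1 x n_tokens x 768 for bert-base)
embeddings <- outputs$last_hidden_state$numpy()
dim(embeddings)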

Social Science Applications

A Model To Rule Them All

  • One of the major strengths of LLMs is that they can perform a wide variety of tasks using the same modelling infrastructure

  • Transformer infrastructure can be used for classical NLP tasks:

  • Transformer models give us an ability to generate text, not just measure latent concepts

  • Generative AI can be used for new applications, such as

Conclusion

  • Language models tell a probabilistic story about how texts are generated

  • They try to predict what comes next

  • Modern large language models are built on transformer infrastructure and use dense language representations

  • You can use them to solve many classic NLP tasks

  • You can now update your CV!

Final Q&A