10: Generative Language Models in Social Science Research

Jack Blumenau

Course Outline

  1. Representing Text as Data (I): Bag-of-Words
  2. Similarity, Difference, and Complexity
  3. Language Models (I): Supervised Learning for Text Data
  4. Language Models (II): Topic Models
  5. Collecting Text Data
  6. Causal Inference with Text
  7. Representing Text as Data (II): Word Embeddings
  8. Representing Text as Data (III): Word Sequences
  9. Language Models (III): Neural Networks, Transfer Learning and Transformer Models
  10. Language Models (IV): Generative Language Models in Social Science Research👈

Causal language modelling

Introduction

In last week’s lecture, we learned:

  1. Neural networks are powerful approaches to learning complex, non-linear relationships between inputs and outputs

  2. Neural language models represent words as embeddings, which are passed through hidden layers to predict the next word in a sequence

  3. Transformer models include an attention mechanism, which allows the model to focus on the most relevant words in the input when making predictions

When we discussed applications, we mostly focused on transfer learning: using knowledge learned from one task to improve performance on a related task. E.g.

  • Masked-language modelling \(\rightarrow\) document classification
  • Next-sentence prediction \(\rightarrow\) Natural Language Inference

However, LLMs can also be used for generative or causal language modelling, something that is increasingly important in the social sciences.

Causal Language Modelling

Causal Language Modelling: the task of teaching an algorithm to predict/generate what comes next

the students opened their ____

the students opened their books (?)

the students opened their laptops (?)

the students opened their exams (?)

the students opened their minds (?)

the students opened their refrigerators (?)

More formally: given a sequence of words \(w_1, w_2, ..., w_t\), compute the probability distribution of the next word \(w_{t+1}\):

\[P\left(w_{t+1}|w_{t},...,w_{1}\right)\]

where \(w_{t+1}\) can be any word in the vocabulary \(V\).

Causal Language Modelling Applications

Causal language models will be very familiar to you!

Generative n-gram models

Generative n-gram models

Question: How might we learn a causal language model?

Old Answer: Use n-grams!

Idea: Collect statistics about how frequent different n-grams are…

  1. the students opened their books
  2. the students opened their laptops
  3. the students opened their exams
  4. the students opened their minds
  5. the students opened their refrigerators

…and use these to predict the next word when we see the phrase “the students opened their…”.

\[P(w|\text{students opened their})=\frac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})}\]

Note: n-gram models require the Markov assumption: the word at \(w_{t+1}\) depends only on the preceding \(n-1\) words.
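To make the estimator concrete, here is a minimal sketch in R that counts completions of the context “students opened their” in a toy corpus (the corpus and counts below are invented purely for illustration):

# A minimal sketch of estimating n-gram next-word probabilities from a toy corpus
corpus <- c("the students opened their books",
            "the students opened their laptops",
            "the students opened their books",
            "the students opened their exams")

# Tokenise the corpus into one long sequence of words
words <- unlist(strsplit(corpus, " "))

# The conditioning context and its length (n - 1 = 3 for a 4-gram model)
context <- c("students", "opened", "their")
n <- length(context)

# Positions at which the full context occurs
starts <- which(sapply(seq_len(length(words) - n), function(i) {
  all(words[i:(i + n - 1)] == context)
}))

# The words that follow each occurrence of the context
completions <- words[starts + n]

# Relative frequencies are the estimated conditional probabilities
prop.table(table(completions))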

Generative n-gram models

Suppose we are learning a 4-gram language model.

as the examiner started the clock, the students opened their ____

\[P(w|\text{students opened their})=\frac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})}\]

Suppose we have a large corpus of text:

  • “students opened their” occurred 1000 times
  • “students opened their books” occurred 400 times
  • “students opened their laptops” occurred 300 times
  • “students opened their exams” occurred 150 times
  • “students opened their minds” occurred 140 times
  • “students opened their refrigerators” occurred 10 times
word \(p(w|\text{students opened their})\)
books .4
laptops .3
exams .15
minds .14
refrigerators .01
  1. We can use the n-gram counts to construct a probability distribution for the next word in the sequence
  2. An obvious rule is to select the word with the highest probability as our prediction for the next word (“books”)
  3. In this example, however, the 4-gram model discards the earlier word “examiner”, which leads to the wrong prediction (“books” rather than “exams”).

Generative n-gram models

Weaknesses of n-gram language models:

  1. Sparsity

    • Training data may include very few, or no, occurrences of relevant phrases/words
      • What if “students opened their \(w\)” doesn’t occur in the training data?
        • The probability of word \(w\) is zero.
      • What if “students opened their” doesn’t occur in the training data?
        • We can’t calculate the probability for any word.
  2. Limited memory

    • n-gram models can only look at a fixed window of \(n-1\) previous words
    • They miss long-range dependencies
    • Increasing the size of the n-gram makes these sparsity problems worse
    • Creates a trade-off between model accuracy and sparsity
  3. No understanding of meaning

    • n-gram models rely on exact word matches
    • Cannot generalise across similar contexts or capture polysemy
    • “students opened their books” ≠ “pupils opened their textbooks”

\(\rightarrow\) n-gram models are good for clarifying the intuition behind a causal language model but are not very useful in practice

Generative large language models

Causal language modelling with Neural Models

Rather than estimating the probability of the next word from the relative frequency of n-gram phrases, we can compute these probabilities using a neural language model

Neural language models have several advantages over n-gram models. Such models…

  1. … can handle much longer word histories
  2. … can better generalise over contexts of similar words
  3. … are much more accurate at next-word prediction tasks

Training LLMs

  • To train a transformer as a language model, we use a self-supervision algorithm, in which at each time step \(t\), the model predicts the next word in the sequence
  • We use cross-entropy as the loss function, where

    • \(y_t[w]\) is the true label for the next word (0 for all words except the true next word; 1 for the true next word)
    • \(\hat{y}_t[w]\) is the predicted probability the model gives to word \(w\) for the next word in the sequence

\[ L_{CE} = - \sum_{w \in V} y_t[w]\log(\hat{y}_t[w]) \]

  • Since \(y_t[w]\) is zero for all words except the true next word (\(w_\text{true}\)), this simplifies to

\[ L_{CE} = -\log(\hat{y}_t[w_\text{true}]) \]

  • Intuition:

    • When the model predicts a high-probability for the true next word, the loss will be small
    • When the model predicts a low-probability for the true next word, the loss will be large
  • We train the model to minimize this loss function via backpropagation

    • The loss decreases when the model assigns higher probability to the correct next word
    • The model aims to maximize the probability of the correct next word
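A minimal numeric sketch of this loss calculation, using an invented predicted distribution over a small vocabulary:

# Toy predicted distribution over a small vocabulary (probabilities invented for illustration)
vocab <- c("books", "laptops", "exams", "minds", "refrigerators")
y_hat <- c(0.4, 0.3, 0.15, 0.14, 0.01)
names(y_hat) <- vocab

# One-hot vector encoding the true next word ("books")
y_true <- as.numeric(vocab == "books")

# Full cross-entropy: sum over the vocabulary of -y * log(y_hat)
loss_full <- -sum(y_true * log(y_hat))

# Simplified form: -log of the probability assigned to the true next word
loss_simple <- -log(y_hat["books"])

loss_full   # 0.916
loss_simple # identical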

Training Transformer LLMs

  1. Words/tokens are represented as embeddings which capture semantic similarities.
  2. Embeddings are passed through multiple hidden layers, which apply non-linear transformations to capture abstract features of the input.
  3. At each layer, the model uses self-attention to compute how much each word should “pay attention” to every other word in the sequence.
  4. The final transformer layer produces a probability distribution over the vocabulary \(V\).
  5. The predicted probability for the correct word is used to calculate the cross-entropy loss.
  6. The total loss is computed as the average cross-entropy loss across all tokens in the sequence.
  7. Using backpropagation and gradient descent, the model updates its weights to reduce the loss.

Autoregressive Decoding

Once we have trained a language model, we want to use it to generate text in response to a prompt.

Text generation works in an autoregressive way:

  1. The model is given a prompt (e.g. “The students opened their”)
  2. It uses this prompt to predict the most likely next word
  3. One word from the resulting probability distribution is added to the prompt, and the model predicts the next word again
  4. This continues one word at a time, until a stopping condition is reached (e.g. max length, stop token)

At each step, the model conditions on everything it has generated so far:

\[ P(w_{t+1} \mid w_1, w_2, \dots, w_t) \]

This means the model is not generating the full sentence all at once—it’s building it sequentially, based on what it’s already written.

Prompt: “The students opened their”

Steps:

\(w_{t+1} \sim p(w_{t+1}|\text{the}, \text{students}, \text{opened}, \text{their}) = \text{books}\)

\(w_{t+2} \sim p(w_{t+2}|\text{the}, \text{students}, \text{opened}, \text{their}, \text{books}) = \text{and}\)

\(w_{t+3} \sim p(w_{t+3}|\text{the}, \text{students}, \text{opened}, \text{their}, \text{books}, \text{and}) = \text{began}\)

etc

Output: “The students opened their books and began to revise for the exam.”

This sequential nature is important to understand when we talk about decoding methods like greedy decoding, sampling, and beam search.
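Schematically, the generation loop can be sketched as below. Here next_word_dist() is a hypothetical stand-in for a forward pass through a trained model, returning a toy distribution over a handful of continuation words (all names and values are invented for illustration):

# Hypothetical stand-in for the trained model: returns a toy next-word
# distribution conditional on the last word of the context
next_word_dist <- function(context) {
  continuations <- list(
    "their" = c(books = 0.4, laptops = 0.3, exams = 0.1),
    "books" = c(and = 0.6, "." = 0.4),
    "and"   = c(began = 0.7, left = 0.3),
    "began" = c("<eos>" = 1)
  )
  last <- context[length(context)]
  if (last %in% names(continuations)) continuations[[last]] else c("<eos>" = 1)
}

# Autoregressive generation: predict one word, append it, and repeat
generate <- function(prompt, max_tokens = 20, stop_token = "<eos>") {
  context <- prompt
  for (step in seq_len(max_tokens)) {
    probs <- next_word_dist(context)             # p(w_{t+1} | w_1, ..., w_t)
    next_word <- names(probs)[which.max(probs)]  # greedy choice (see decoding methods below)
    if (next_word == stop_token) break           # stopping condition
    context <- c(context, next_word)             # condition on everything generated so far
  }
  paste(context, collapse = " ")
}

generate(c("the", "students", "opened", "their"))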

Decoding methods

  • While the LLM provides probabilities over words in the vocabulary for the next word, there is no single approach for generating text from those probabilities.

  • The task of choosing a word to generate based on the model’s probabilities is called decoding.

  • A decoding method is a process that defines how the generated token sequence is derived from the probability estimates of the LLM.

Which word should come next?

The students opened their…

word \(p(w_{t+1})\)
books .4
laptops .3
exams .1
minds .08

Decoding methods: Greedy decoding

Greedy decoding: at each time step in the generation, the output is the word in the vocabulary with the highest probability

\[\hat{w}_t = \text{argmax}_{w\in V}P(w|\mathbf{w}_{<t})\]

Problem: Text ends up being very repetitive and generic.
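For a toy next-word distribution like the one in the table below, greedy decoding simply selects the single most probable word:

# Toy next-word distribution (probabilities as in the example table)
probs <- c(books = 0.4, laptops = 0.3, exams = 0.1, minds = 0.08)

# Greedy decoding: always choose the highest-probability word
names(probs)[which.max(probs)]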

Which word should come next?

The students opened their…

word \(p(w_{t+1})\)
books .4
laptops .3
exams .1
minds .08

Decoding methods: Random sampling

Random Sampling: at each time step in the generation, the output is randomly sampled from the probability distribution that arises from conditioning on previous words

\[w_t \sim P(w_t|\mathbf{w}_{<t})\]

Problem: The vocabulary contains many rare words which, although individually low-probability, together account for a substantial share of the probability mass. They are therefore sampled fairly often, producing strange text.
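The same toy distribution, extended with rare words as in the table below, illustrates the problem:

# Toy next-word distribution including many rare words (as in the example table)
probs <- c(books = 0.4, laptops = 0.3, exams = 0.1, minds = 0.08,
           refrigerators = 0.07, dreams = 0.02, cars = 0.011, pancakes = 0.007,
           galaxies = 0.005, kangaroos = 0.004, volcanoes = 0.002, submarines = 0.001)

# Random sampling: draw the next word in proportion to its probability
set.seed(123)
sample(names(probs), size = 1, prob = probs)

# Over many draws, the rare words collectively appear a non-trivial share of the time
table(sample(names(probs), size = 1000, replace = TRUE, prob = probs))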

Which word should come next?

The students opened their…

word \(p(w_{t+1})\)
books .4
laptops .3
exams .1
minds .08
refrigerators .07
dreams .02
cars .011
pancakes .007
galaxies .005
kangaroos .004
volcanoes .002
submarines .001

Decoding methods: Top-K or Top-P sampling

Top-K Sampling: Truncate to the top \(k\) words by probability. Renormalize the remaining probabilities and then use random sampling

Top-P Sampling: Truncate to the smallest set of top words whose cumulative probability exceeds a threshold, \(p\). Renormalize the remaining probabilities and then use random sampling.
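Both truncation rules can be sketched in a few lines of R: keep a subset of words, renormalise their probabilities, and then sample (the toy probabilities match the example tables):

probs <- c(books = 0.4, laptops = 0.3, exams = 0.1, minds = 0.08,
           refrigerators = 0.07, dreams = 0.02, cars = 0.011, pancakes = 0.007,
           galaxies = 0.005, kangaroos = 0.004, volcanoes = 0.002, submarines = 0.001)

# Top-K: keep the k most probable words and renormalise
top_k <- function(p, k) {
  kept <- sort(p, decreasing = TRUE)[1:k]
  kept / sum(kept)
}

# Top-P: keep the smallest set of top words whose cumulative probability reaches p
top_p <- function(p, thresh) {
  p_sorted <- sort(p, decreasing = TRUE)
  kept <- p_sorted[1:which(cumsum(p_sorted) >= thresh)[1]]
  kept / sum(kept)
}

top_k(probs, k = 3)                                            # books .5, laptops .375, exams .125
sample(names(top_p(probs, 0.8)), 1, prob = top_p(probs, 0.8))  # sample from the truncated set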

Which word should come next?

The students opened their…

word \(p(w_{t+1})\) (after truncation and renormalisation)
books .5
laptops .375
exams .125
minds (excluded)
refrigerators (excluded)
dreams (excluded)
cars (excluded)
pancakes (excluded)
galaxies (excluded)
kangaroos (excluded)
volcanoes (excluded)
submarines (excluded)

Decoding methods: Temperature sampling

Temperature Sampling: Smoothly reshape the next-word distribution, either concentrating probability on the most probable words or spreading it more evenly across less probable words.

The transformer model converts word logits into word probabilities using the softmax:

\[p(w_t) = \text{softmax}(\mu_{w})\]

In temperature sampling, we first divide the logits by a hyperparameter, \(\tau\), which we set before generation:

\[p(w_t) = \text{softmax}(\mu_{w}/\tau)\]

This influences whether model outputs are more diverse (higher values of \(\tau\)) or more predictable (lower values of \(\tau\)).
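A short sketch of the effect of \(\tau\) on the softmax, using invented logit values for a handful of candidate words:

# Softmax with a temperature parameter tau
softmax_temp <- function(logits, tau = 1) {
  exp(logits / tau) / sum(exp(logits / tau))
}

# Toy logits for a few candidate next words (values invented for illustration)
logits <- c(books = 2.0, laptops = 1.2, exams = 0.6, minds = 0.5, refrigerators = -0.3)

round(softmax_temp(logits, tau = 1), 2)    # baseline distribution
round(softmax_temp(logits, tau = 0.5), 2)  # sharper: probability concentrates on "books"
round(softmax_temp(logits, tau = 1.5), 2)  # flatter: less probable words gain probability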

Which word should come next?

The students opened their…

word \(p(w_{t}|\tau=1)\) \(p(w_{t}|\tau=.5)\) \(p(w_{t}|\tau=1.5)\)
books 0.39 0.73 0.26
laptops 0.17 0.15 0.15
exams 0.1 0.04 0.1
minds 0.09 0.04 0.1
refrigerators 0.04 0.01 0.05
dreams 0.04 0.01 0.05
cars 0.04 0.01 0.05
pancakes 0.03 0 0.05
galaxies 0.03 0 0.05
kangaroos 0.03 0 0.05
volcanoes 0.03 0 0.05
submarines 0.03 0 0.05

Decoding methods: Example

# Load packages and set API key
library(gemini.R)
google_key <- "XXXXXXXX"
setAPI(google_key)

# Define prompts for model
prompts <- c("What is the meaning of life",
             "What is the meaning of life? (Wrong answers only)",
             "Summarise the plot of Harry Potter in a paragraph.",
             "Explain the concept of Elon Musk.")

# Set temperature increments
temperatures <- seq(.1, 2, .1)

# Set up data.frame to store results
i <- 0
gemini_out <- data.frame(temperature = rep(NA, length(temperatures) * length(prompts)),
                  prompt = rep(NA, length(temperatures) * length(prompts)),
                  text = rep(NA, length(temperatures) * length(prompts)))

# Loop over prompts
for(p in prompts){
  print(p)
  # Loop over temperatures
  for(temp in temperatures){
    # Include a pause to avoid transgressing rate limits
    Sys.sleep(2.2)
    
    i <- i+1
    # Make call to gemini API and store results
    gemini_out$text[i] <- gemini(prompt = p, temperature = temp)
    gemini_out$prompt[i] <- p
    gemini_out$temperature[i] <- temp
    
  }
  
}
# Save
save(gemini_out, file = "gemini_out.Rdata")

Decoding methods: Example

library(dplyr) # for %>%, filter(), and select()

gemini_out %>%
  filter(prompt == "What is the meaning of life? (Wrong answers only)" & 
           temperature == 0.1) %>%
  select(text) %>% as.character()
[1] "The meaning of life is clearly to collect as many bottle caps as possible, in preparation for the inevitable post-apocalyptic barter economy. Bonus points if they're Nuka-Cola caps.\n"
gemini_out %>%
  filter(prompt == "What is the meaning of life? (Wrong answers only)" & 
           temperature == 2) %>%
  select(text) %>% as.character()
[1] "Okay, here are some *definitely* wrong answers to the meaning of life:\n\n*   To collect as many lint bunnies as possible.\n*   To perfectly align your sock stripes every single day.\n*   To achieve the ultimate high score on Candy Crush.\n*   To become a professional competitive thumb wrestler.\n*   To successfully teach a cat to play the banjo.\n*   To prove definitively that pineapple DOES belong on pizza.\n*   To write the definitive fanfic of your life and get J.K. Rowling to endorse it.\n*   To hoard enough bottle caps to survive the apocalypse (when bottled water will be extinct, of course).\n*   To discover the hidden meaning behind airplane peanuts.\n*   To trip as many people as you can while going through revolving doors\n*   To be the most viewed contributor to the Flat Earth Society (the wrongness in that alone!)\n*   To have one single YouTube video about putting googly eyes on things go viral, catapulting you into mega-wealth.\n*   To find all the missing socks from the dryer, arrange them into an aesthetically pleasing pyramid, and achieve enlightenment.\n*   To make sure you use up every last hotel complimentary miniature bottle of shampoo and soap ever made.\n* To find Waldo!\n"

Decoding methods: Example

We can quantify the effects of changing the temperature parameter by calculating the entropy of each of the LLM-generated texts.

\[H = -\sum_{w \in V} P(w) \log P(w) \]

where \(P(w)\) is the probability of word \(w\) in the generated text.

  • A low entropy score means the text is highly repetitive and deterministic.
  • A high entropy score suggests more diverse word choices.
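A rough sketch of this calculation, treating the relative frequency of each word in a generated text as \(P(w)\):

# Entropy of the empirical word distribution of a generated text
text_entropy <- function(text) {
  words <- unlist(strsplit(tolower(text), "[[:space:]]+"))
  p <- prop.table(table(words))
  -sum(p * log(p))
}

# Applied to the saved Gemini outputs from the earlier code block:
# sapply(gemini_out$text, text_entropy)
text_entropy("the students opened their books and their laptops")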

Post-training

Pre-training and Post-training

  • Pre-training: The initial phase of training a language model, where it learns by predicting the next word in huge volumes of unlabelled text

    • Pre-training allows the model to learn a broad knowledge of grammar, facts, and semantics
  • Post-training: Additional steps taken after the pre-training phase to adapt the model to specific tasks, such as following instructions or being helpful

    • Post-training allows the model to learn the types of responses that humans would like it to produce
    • Post-training helps turn predictive power into useful, aligned, and safe behaviour.

Post-training is a core part of the recent success of LLMs, particularly for chatbots.

The Post-Training Pipeline

After pre-training, LLMs typically go through several stages of post-training:

  1. Supervised Fine-Tuning
    • The model is trained on prompt-response pairs created by humans
    • Teaches the model how to follow instructions
  2. Reinforcement Learning from Human Feedback (RLHF)
    • The model generates multiple responses
    • Humans rank the responses
    • A reward model is trained on these rankings
    • The model is then fine-tuned using reinforcement learning to maximise the reward

Why Post-Training Matters for Social Science

  • Post-training shapes how models behave – including how they respond to prompts about politics, morality, or social norms

  • Key questions for researchers:

    • Whose preferences are encoded in human feedback?
    • How does post-training affect bias, trust, and transparency?
    • Do two models trained on the same data but different alignment objectives behave differently?
  • LLMs are not neutral. Post-training is where values, assumptions, and incentives enter the model pipeline.

Break

Please register for the Google Gemini Developer API before this afternoon’s seminars:

https://ai.google.dev/gemini-api/docs

Social science applications

Why Should Social Scientists Care about Generative Language Models?

  • Language modelling has become a benchmark test that helps us measure our progress on predicting language use

  • Language modelling is also now a subcomponent of many NLP tasks

    • Topic modelling
    • Document classification
    • Sentiment analysis
    • etc
  • Virtually all state-of-the-art natural language processing tools are based on language models of different types

  • However, the generative property of language models is only beginning to be explored in social science applications

Use Cases of Generative Language Models in Social Science

  1. Simulating Human Behaviour and Responses

    • Synthetic data generation: Create realistic simulations of human populations to test hypotheses, refine experimental designs, or conduct pilot studies efficiently.
  2. Adaptive Experimental Interventions

    • Personalized treatment texts: Generate tailored and context-aware messages that can dynamically adapt to individual characteristics.
    • Real-time engagement: Enable experiments to interactively adapt treatments or questions in response to participant behaviour.
  3. Investigating Latent Social Knowledge

    • Cultural and normative biases: Analyze generative outputs to uncover latent attitudes, stereotypes, and cultural norms encoded within LLMs.

The implication of these projects is that generative models may create entirely new opportunities for social science research.

Example 1: LLMs as Synthetic Human Samples

Research question: Can LLMs be used to simulate human samples in social science research?

  • LLMs are known to exhibit algorithmic biases – the tendency for models to replicate the racial, gender, economic, and other biases of the texts on which they are trained.

  • Can we use this property to generate text that resembles text produced by people with different characteristics?

  • Algorithmic fidelity is the property where, given basic human demographic background information, a model exhibits underlying patterns between concepts, ideas, and attitudes that mirror those recorded from humans with matching backgrounds.

Task One

  • Task: Ask GPT-3 to list four words about Democrats and Republicans, while pretending to be a Democrat or a Republican

    • Ask humans to evaluate these lists in terms of their content, their sentiment, their extremity, and whether they can predict the partisanship of the lists
  • Result: A very high level of consistency between evaluations of the human-generated and GPT-3-generated lists, in both content and tone.

Task Two

  • Task: Ask GPT-3 to generate probabilities of voting for a particular candidate, given a specific backstory.

    • E.g. calculate \(p(\text{trump}|\text{backstory})\) and \(p(\text{clinton}|\text{backstory})\)
  • Result: Very high correlation between voting behaviour reported by human respondents and that reported by GPT for those respondents.

How might we use these properties?

Researchers can leverage the insights gained from simulated, silicon samples to pilot different question wording, triage different types of measures, identify key relationships to evaluate more closely, and come up with analysis plans prior to collecting any data with human participants.

Though note that GPT’s ability to mimic human samples is limited:

  1. Less variation in responses than in real surveys
  2. Relationships between generated variables are often reproduced less accurately
  3. Distributions of synthetic responses are very sensitive to prompt wording
  4. Distributions of synthetic responses are very sensitive to the exact model used (e.g. GPT-3 vs. GPT-4)

Bisbee et al, Political Analysis, 2024

Example 2: Generating Adaptive Treatment Texts (I)

Research question: Can large language models be used to increase the persuasive effects of political microtargetting?

  • Microtargetted messages are those that try to direct messages in an individualised, personalised way to respondents with specific characteristics

  • The assumption behind microtargetting is that some people are more persuaded by some types of argument than others, so political actors can be more successful in persuasion by targetting their communications

  • Can we use LLM-generated arguments to generate effective personalised arguments?

Approach:

  • Gather data on demographic information about respondents

  • Prompt GPT-4 in real time to write persuasive texts on 4 political issues, of two types:

    1. Microtargetted messages, tailored to respondent characteristics
    2. Non-microtargetted messages
  • Measure support for a policy issue relative to a control group who see no message at all

Result:

  1. GPT-generated messages are persuasive, on average

    • Support increases by up to 12 percentage points, depending on the issue
  2. Micro-targetted messages are no more persuasive than non-targetted messages

    • A finding consistent with existing literature on political persuasion

Micro-targetting prompt

You are a political persuasion expert specializing in micro targeting techniques…Person X has the following attributes: [list attributes]. Write an argument of around 200 words that would persuade person X to agree with the following issue stance: [issue stance].

Non-micro-targetting prompt

You are a political persuasion expert… Write an argument of around 200 words that would persuade person X to agree with the following issue stance: [issue stance].

Example 3: Generating Adaptive Treatment Texts (II)

Research question: How do core elements of political discourse affect dialogue quality?

  • Online political discourse is often plagued by partisan animus, hostility, incivility, intolerance, and low quality exchange

  • Can we foster more productive dialogue by varying how people are engaged with in debate and deliberation?

Approach

  1. Ask survey respondents to write a social media post about a political issue with their position and the reason for their position.

  2. Get an LLM to generate counter-arguments on-the-fly, with reasoning tailored to the individual’s position and reasoning

    • Randomly vary four components of deliberative style in the LLM’s response:

      1. Disrespectful tone
      2. Partisanship
      3. Evidence-based argument
      4. Compromise
  3. Ask survey respondents to reply to the generated message and measure the quality of their replies

Results

  1. Arguments that are more evidence-based and signal a willingness to compromise receive higher quality responses

  2. Arguments that are more disrespectful receive lower quality responses

  3. Argument style also affects perceptions of the interlocutor (which, here, is an LLM!)

In political argument, you “get what you give”.

Challenges of Generative LLMs in Social Science Research

Generative LLMs offer powerful new tools for social science, but there are important limitations and challenges:

  1. Replicability concerns

    • LLMs (especially proprietary models like ChatGPT or Gemini) frequently change through updates and retraining
  2. Variation across prompts

    • All of the results above will depend hugely on exactly how you prompt the model
  3. Probabilistic outputs

    • LLMs generate text probabilistically, meaning results can vary across runs, even with identical prompts
  4. Ethical considerations

    • Models can produce biased, harmful, or misleading text
  5. Transparency

    • Commercial LLMs are “black boxes”, meaning researchers have limited insight into exactly why certain outputs are produced
  6. Language bias

    • LLMs typically perform best in already dominant languages; performance and reliability decline significantly in other languages
    • This risks exacerbating existing inequities in global research

Conclusion

Update your CV!

I have a comprehensive understanding of computational text analysis and practical expertise in applying advanced text-as-data methodologies to real-world social science research. I am proficient in preparing textual datasets, including web scraping and using APIs, and have hands-on experience with statistical and machine learning techniques such as topic modeling, supervised classification, word embeddings, and introductory skills with transformer-based models. Additionally, I am well-versed in evaluating the validity of text-based approaches and in designing research projects with text data. I have a strong analytical toolkit and practical experience in leveraging textual data effectively, making me well-prepared to apply these skills in academic, governmental, policy analysis, and industry settings.