10: Quick overview of Large Language Models

Jack Blumenau & Gloria Gennaro

Large Language Models

Dependencies in Language

  • In this course we have largely focused on bag-of-words models

    • Tend to be a good choice for a wide range of datasets and tasks in text analysis in the social sciences
    • Dictionaries; topic models; Naive Bayes; etc
  • Bag-of-word classifiers based on term frequencies can achieve good performance

  • Yet, when the interdependent nature of words becomes important, more advanced models that capture the dependencies in language can be helpful

Language modelling

  • We have seen several examples of language models, which we have taken to be probabilistic descriptions of word counts in documents

    • Naive Bayes: a distribution over words for each category
    • Topic models: a distribution over words for each topic; a distribution over topics for each document
  • In all instances, we have considered bag-of-words models – models that do not take word order or dependency into account

  • More advanced language models provide probabilistic descriptions for word sequences

  • For instance, given a sequence of words, a language model might try to predict the word that comes next

Language Models

Language modelling: the task of teaching an algorithm to predict/generate what comes next

the students opened their ____

the students opened their books (?)

the students opened their laptops (?)

the students opened their exams (?)

the students opened their minds (?)

More formally: given a sequence of words \(x^1, x^2, ..., x^t\), compute the probability distribution of the next word \(x^{t+1}\):

\[P\left(x^{t+1}|x^{t},...,x^{1}\right)\]

where \(x^{t+1}\) can be any word in the vocabulary \(V\).
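
Applying this prediction step repeatedly, a language model also assigns a probability to an entire sequence of words via the chain rule of probability:

\[P\left(x^{1},...,x^{T}\right) = \prod_{t=1}^{T}P\left(x^{t}|x^{t-1},...,x^{1}\right)\]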

Language Modelling Applications

Language models should be very familiar to you!

Why Should Social Scientists Care About Language Models?

  • Language modelling has become a benchmark test that helps us measure our progress on predicting language use

  • More relevantly to social scientists, language modelling is now a subcomponent of many NLP tasks, including those we have studied on this course

    • Topic modelling
    • Document classification
    • Sentiment analysis
    • etc
  • Virtually all state-of-the-art natural language processing tools are based on language models of different types

  • Social scientists are beginning to adopt these methods!

n-gram Language Models

Question: How might we learn a language model?

Old Answer: Use n-grams!

Idea: Collect statistics about how frequent different n-grams are…

  1. the students opened their books
  2. the students opened their laptops
  3. the students opened their exams
  4. the students opened their minds

…and use these to predict the next word when we see the phrase “the students opened their…”.

\[P(w|\text{students opened their})=\frac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})}\]

Note: n-gram models rely on the Markov assumption: the next word \(x^{t+1}\) depends only on the preceding \(n-1\) words!
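
Under this assumption, the conditional probability simplifies to one that depends only on the last \(n-1\) words, which can be estimated from counts:

\[P\left(x^{t+1}|x^{t},...,x^{1}\right) \approx P\left(x^{t+1}|x^{t},...,x^{t-n+2}\right) = \frac{\text{count}\left(x^{t-n+2},...,x^{t},x^{t+1}\right)}{\text{count}\left(x^{t-n+2},...,x^{t}\right)}\]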

n-gram Language Models

Suppose we are learning a 4-gram language model.

as the proctor started the clock, the students opened their ____

\[P(w|\text{students opened their})=\frac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})}\]

Example:

  • Suppose we have a large corpus of text

  • “students opened their” occurred 1000 times

  • “students opened their books” occurred 400 times

    • \(\rightarrow P(\text{books}|\text{students opened their}) = 0.4\)
  • “students opened their exams” occurred 100 times

    • \(\rightarrow P(\text{exams}|\text{students opened their}) = 0.1\)

In this example, discarding the word “proctor” results in the wrong prediction!
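
A minimal sketch of this counting approach in base R; the toy corpus below is an assumption for illustration:

# Toy corpus of sentences (assumed for illustration)
corpus <- c("the students opened their books",
            "the students opened their laptops",
            "the students opened their exams",
            "the students opened their books")

context <- "students opened their"
candidates <- c("books", "laptops", "exams", "minds")

# Count the context and each "context + w" continuation
# (one match per sentence, which is fine for this toy corpus)
n_context <- sum(grepl(context, corpus, fixed = TRUE))
n_continuation <- sapply(candidates, function(w)
  sum(grepl(paste(context, w), corpus, fixed = TRUE)))

# Estimated P(w | "students opened their")
n_continuation / n_context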

n-gram Language Models

The core problem with n-gram language models is sparsity.

  • What if “students opened their \(w\)” doesn’t occur in the data?

    • \(\text{count}(\text{students opened their } w) = 0\)
    • \(P(w|\text{students opened their})=\frac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})} = 0\)
    • The probability of word \(w\) is zero!
  • What if “students opened their” doesn’t occur in the data?

    • Then we can’t calculate the probability for any word!
  • Increasing the size of the n-gram makes these sparsity problems worse

    • if “students opened their” only occurs 1000 times, “as the proctor started the clock, the students opened their” will occur many fewer times!
    • Trade-off between model accuracy and sparsity
  • Increasing the size of the corpus helps with this problem a bit, but not much

\(\rightarrow\) n-gram models are good for clarifying the intuition behind a language model but are not very useful in practice

Neural Language Models

  • These are models that can capture dependencies between words without running into the sparsity problems that affect n-gram models

  • These models process sequences of inputs and predict sequences of outputs

    • Input: Each context word is associated with an embedding vector. The input vector is a concatenation of those vectors.

    • Output: probability distribution over the next word

  • The key innovation is that they are based on dense representations of words (embeddings), rather than sparse representations

    • Removes the sparsity problem!
    • Better treatment of out-of-vocabulary words
  • Each word in a sequence updates a set of parameters, which are then used to predict the final word in the sequence (a minimal sketch follows this list)

    • The model architecture is often an RNN, which allows for longer sequences and for words further away from the target word to have predictive power
    • The final softmax layer is the most computationally intensive
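
The sketch below illustrates the basic idea with a fixed-window feed-forward language model in base R (an RNN adds recurrence, which is omitted here); all dimensions and parameter values are assumptions chosen for illustration:

# Minimal sketch: fixed-window neural language model forward pass
set.seed(1)
V <- 50   # vocabulary size
d <- 8    # embedding dimension
n <- 3    # number of context words
h <- 16   # hidden layer size

E <- matrix(rnorm(V * d), V, d)          # embedding matrix: one row per word

context_ids <- c(4, 17, 23)              # indices of the n context words
x <- as.vector(t(E[context_ids, ]))      # input: concatenation of their embeddings

W1 <- matrix(rnorm(h * n * d), h, n * d) # input-to-hidden weights
b1 <- rnorm(h)
hidden <- tanh(W1 %*% x + b1)            # hidden representation of the context

W2 <- matrix(rnorm(V * h), V, h)         # hidden-to-output weights
b2 <- rnorm(V)
logits <- W2 %*% hidden + b2
p_next <- exp(logits) / sum(exp(logits)) # softmax: distribution over the next word
which.max(p_next)                        # index of the most probable next word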

Neural Language Models

Advantages

  1. Can process inputs of any length (not limited to a fixed window of 3 or 4 words)

  2. The prediction for \(x^{t+1}\) can use information from many steps back (at least in theory)

  3. The model size does not increase for longer input sequences

Disadvantages

  1. Computation is very slow

  2. In practice, predictions tend to be dominated by words close to the target word in the sequence (i.e. we still lack a way of capturing the importance of “proctor”)

Attention

Consider these two sentences:

As a leading firm in the ___ sector, we hire highly skilled software engineers.

As a leading firm in the ___ sector, we hire highly skilled petroleum engineers.

  • A human finds it easy to predict the missing words on the basis of the difference between “software” and “petroleum”.

  • Word-embedding methods like Word2Vec, which also predict words from their context, struggle here because they weight all words in the context window equally when constructing embeddings

  • Major breakthrough in modern NLP: train algorithms to also “pay attention” to the relevant features for prediction problems (Vaswani et al. 2023)

Attention

Words have a darker shading when they are given more weight in the prediction problem.

Attention heads are filters which, for each word, scan over every other word in the document and pick up predictive interactions
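
A minimal sketch of single-head (scaled dot-product) attention in base R; the tiny four-word “document” and all weight matrices are assumptions for illustration:

# Minimal sketch: single-head scaled dot-product attention
set.seed(1)
n_words <- 4
d <- 3
X <- matrix(rnorm(n_words * d), n_words, d)   # one (static) embedding per word

Wq <- matrix(rnorm(d * d), d, d)              # query projection
Wk <- matrix(rnorm(d * d), d, d)              # key projection
Wv <- matrix(rnorm(d * d), d, d)              # value projection

Q <- X %*% Wq
K <- X %*% Wk
V <- X %*% Wv

scores  <- Q %*% t(K) / sqrt(d)               # how much each word attends to every other word
weights <- exp(scores) / rowSums(exp(scores)) # row-wise softmax: attention weights
Z <- weights %*% V                            # contextual representation of each word

round(weights, 2)                             # larger weights ~ darker shading in the figure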

Transformers

The key innovation of transformer models is the introduction of attention into a neural-network architecture

  • Input: a sequence of words

  • Output: a prediction for what word comes next, and a sequence of contextual embeddings that represents the contextual meaning of each of the input words

  • Attention allows a network to directly extract and use information from arbitrarily large contexts.

  • At each layer, the transformer computes a representation of word \(i\) that combines information from the representation of \(i\) at the previous layer with information from the representations of the neighboring words \(\rightarrow\) contextualized embeddings
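
Schematically (the notation below is an assumption for illustration), the layer-\(\ell\) representation of word \(i\) is a function of all the layer-\((\ell-1)\) representations, with attention determining how much each neighbouring word contributes:

\[h_{i}^{(\ell)} = f\left(h_{i}^{(\ell-1)};\; h_{1}^{(\ell-1)},...,h_{n}^{(\ell-1)}\right)\]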

Transformer-based Language Models

Two main models:

  • Autoregressive models (e.g. GPT):

    • pretrained on the classic language-modelling task: guess the next token having read all the previous ones
    • during training, attention heads only view previous tokens, not subsequent tokens
    • ideal for text generation
  • Autoencoding models (e.g. BERT):

    • pretrained by dropping/shuffling input tokens and trying to reconstruct the original sequence
    • usually build bidirectional representations and get access to the full sequence
    • can be fine-tuned and achieve great results on many tasks, e.g. text classification

R packages

You need a Python installation and a way to call Python code from R (via the reticulate package). Here are some initial steps:

# For importing python code
library(reticulate)

# Specify the path to your Python installation
use_python("/usr/bin/python3") 

reticulate::py_install('transformers', pip = TRUE)

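# Import the Python modules into the R session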
transformer <- reticulate::import('transformers')
tf <- reticulate::import('tensorflow')
builtins <- import_builtins() #built in python methods

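# Download and load the pretrained BERT tokenizer and (TensorFlow) model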
tokenizer <- transformer$AutoTokenizer$from_pretrained('bert-base-uncased')
bert.model <- transformer$TFBertModel$from_pretrained("bert-base-uncased")
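
A minimal usage sketch, assuming the objects created above and that tensorflow is installed in the same Python environment:

# Tokenize a sentence and pass it through BERT to get contextual embeddings
inputs <- tokenizer("We hire highly skilled software engineers",
                    return_tensors = "tf")
outputs <- bert.model(inputs)

# One embedding per input token (dimensions: 1 x n_tokens x 768 for bert-base)
embeddings <- outputs$last_hidden_state$numpy()
dim(embeddings)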

Social Science Applications

A Model To Rule Them All

  • One of the major strengths of LLMs is that they can perform a wide variety of tasks using the same modelling infrastructure

  • Transformer infrastructure can be used for classical NLP tasks:

  • Transformer models give us an ability to generate text, not just measure latent concepts

  • Generative AI can be used for new applications, such as

Conclusion

  • Language models tell a probabilistic story about how texts are generated

  • They try to predict what comes next

  • Modern large language models are built on transformer infrastructure and use dense language representations

  • You can use them to solve many classic NLP tasks

  • You can now update your CV!

Final Q&A