10  Lanuage Models (IV): Generative Language Models in Social Science Research

10.1 Lecture slides

Open slides in a new window.

10.2 Introduction

The goal of today’s assignment is to help you to understand both the potential and the limits of generative LLMs as tools for social science research. Rather than using large language models to improve performance on tasks that we have already attempted using other methods – such as classification, scaling, topic modelling, etc – we will instead use them to generate text that can be used for downstream applications.

Recent research has demonstrated that large language models can be used as proxies for human populations in social science research, at least to the extent that when one prompts the model to take on the persona of an individual with a given set of background characteristics, the model will respond to survey prompts in a way that closely mimics the distribution of responses from real people with those characteristics.

We will use a large language model (LLM) to simulate open-ended survey responses from fictional individuals with different demographic profiles. You will prompt the model to generate a paragraph in the first person, describing the individual’s political beliefs. You will then analyze the generated text using the methods you have learned in previous weeks on the course.

10.3 Packages

library(tidyverse)
library(quanteda)
# install.packages("ellmer")
library(ellmer)
Warning: package 'ellmer' was built under R version 4.4.3
# install.packages("usethis")
library(usethis)

10.4 Gemini API

We will use Google’s Gemini model for this exercise.1 In order to use the Gemini API to make calls to the model, you will need to first register for access to the Gemini Developer API, which you can do by following this link (you will need a Google account). Once you have followed that link, you should see a screen like this:

1 Note that we could use essentially any LLM for this problem set. I have opted to use Gemini here because it is reasonably easy to generate the relevant API key, the Gemini class of models are competitive with other leading LLMs, and – crucially – it also comes with a generous free usage tier (with the downside that your data is used to improve the model).

Click “Get a Gemini API key”, then on the next screen click “Create API key” and follow the steps to generate your API key.

Once you have generated your API key, you will need to save it to the GOOGLE_API_KEY environment variable in your .Renviron file. For an easy way to find this file, use the following code:

usethis::edit_r_environ()

You then need to edit the .Renviron so that it looks like this:

Then save the file and restart R.2

2 Note that if you struggle to find/edit the .Renviron file, do not despair. You can provide your api key directly to the chat_gemini() function we use below (e.g. via chat_gemini(api_key = "your_api_key")) but this is not usually considered best practice because it exposes your API key directly in your scripts, which can accidentally lead to it being shared publicly (e.g., if you push your code to GitHub). Storing your API key securely in the .Renviron file keeps it private and ensures your scripts remain safe for sharing and collaboration.

10.5 Prompting the model

We will begin by getting Gemini to generate synthetic responses about a hypothetical person’s political opinions. To do this, we will use a new R wrapper for the Gemini API, which is available via the ellmer package. This wrapper provides a convenient interface for sending prompts to the model and retrieving its responses.

Below is an example of how to construct and send a prompt. The goal is to simulate a paragraph written in the first person by a fictional individual with a specific demographic profile. The model is instructed to avoid describing the person’s background and instead focus on their political beliefs, values, and the issues they see as most important in the UK today.

In this code, there are three main steps:

  1. First, we define the prompt, the text that we want to use to send to the LLM
  2. Second, we define a chat object using the chat_gemini function. Here we are telling the API which version of gemini we wish to use.
  3. Finally, we send the prompt to the API by using a chat() function which is attached to our my_chat object.
  1. Run the code below to generate a single example and inspect the response:
Reveal code
# Define the prompt to send to the model
prompt <- "Write a short paragraph in the first person, 
explaining your political beliefs. 

You are a 38-year-old man from the UK
with a highest qualification of postgraduate education. 
You voted for Labour in the last election. 

Do not describe yourself or your background -- 
focus only on your political beliefs.  

In particular, describe the values that guide your 
beliefs and the issues you think are most 
important facing the country today."

# Set up a chat object for interacting with the model
my_chat <- chat_google_gemini(model = "gemini-2.5-flash-lite")

# Sent the prompt to the model using the chat() function
output <- my_chat$chat(prompt)
I believe in a society where opportunity is not dictated by birthright but 
fostered through a strong social safety net and investment in public services. 
Fairness and equality are paramount to me; everyone deserves a chance to 
thrive, regardless of their background. This means ensuring access to excellent
education and healthcare, and a robust system of support for those facing 
hardship. I think the most pressing issues facing the UK are the deepening 
inequalities, the erosion of public services, and the urgent need for a just 
transition to a sustainable economy. These are not separate problems, but 
interconnected challenges that require bold, collective action to address.
  1. The ellmer chat() function retains the conversation’s context, which means that it will remember your previous interactions each time you use it for a new call. If instead we want the chat() function to forget previous interactions and start the conversation from scratch with each call to the LLM, we need to create a new chat object for each interaction.

    This is a bit annoying for our purposes, so instead we are going to create a function which will initialise a new chat object whenever we make a new call to Gemini.

run_interaction <- function(input) {
  # Create a new chat object for each interaction
  chat <- chat_google_gemini(
    model = "gemini-2.5-flash"
  )
  
  # Call chat() with your input
  response <- chat$chat(input)
  
  # Return the response
  return(response)
}

Use this function to send our prompt from above to this function instead. Note that the text will not in general be exactly the same as this is a separate call to the LLM and therefore different text may be generated due to the probabilistic nature of the text-generation process.

Reveal code
output <- run_interaction(prompt)
My political beliefs are anchored in a deep commitment to social justice, 
equality, and collective responsibility. I fundamentally believe that a 
compassionate society is one that ensures robust public services, cares for its
most vulnerable, and provides genuine opportunity for all, not just a 
privileged few. Currently, I see the escalating cost of living crisis, the 
underfunding of our NHS, and the existential threat of climate change as the 
most urgent challenges facing the UK. Addressing these demands not only 
economic fairness but also sustained investment in public infrastructure and a 
principled dedication to environmental stewardship for future generations.
  1. Adapt the prompt above, changing some of the characteristics of the individual. You might make them a different age, or give them a different voting history, or something else. How does this affect the generated text?
Reveal code
# Define the prompt to send to the model
prompt <- "Write a short paragraph in the first person, 
explaining your political beliefs. 

You are a 22-year-old woman from the UK
with a highest qualification of GCSEs. 
You voted for the Conservatives in the last election. 

Do not describe yourself or your background -- 
focus only on your political beliefs.  

In particular, describe the values that guide your 
beliefs and the issues you think are most 
important facing the country today."
   
# Generate output
output_2 <- run_interaction(prompt)
My political beliefs are really guided by a sense of personal responsibility 
and the importance of economic stability. I think people should be able to keep
more of what they earn, and that lower taxes and less government interference 
generally lead to more opportunity and growth for everyone. For the country, my
main concerns are definitely the economy and getting the cost of living under 
control. I also believe strongly in firm but fair immigration policies and 
making sure our public services are run as efficiently as possible, with every 
penny spent wisely. Tackling crime and ensuring our communities feel safe are 
also high priorities for me.

The LLM clearly seems to think that a 22-year old who voted Conservative and whose education only went to GCSE level would have quite different political beliefs to the 38-year old Labour voter from before. This makes sense, particularly given that past vote is likely to be strongly predictive of political attitudes.

10.6 Creating prompts programmatically

  1. Rather than manually editing the prompt each time we want to generate a response for a different synthetic sample, a more efficient approach would be to define a function which constructs the appropriate prompt from a given set of inputs.

    Create a function which takes four arguments:

    • age – the age of the hypothetical individual
    • gender – the gender of the hypothetical individual
    • education – the education of the hypothetical individual
    • vote – the vote of the hypothetical individual

    Use the paste0() function to construct the prompt such that the arguments defined above are put in the correct place in the surrounding prompt text. Your function should return a data.frame which includes the characteristics of the individual as well as the prompt you have created.

Reveal code
generate_prompt <- function(age, gender, education, vote){

  prompt <- paste0(
     "Write a short paragraph in the first person, explaining your political beliefs. ",
     "You are a ", age, "-year-old ", gender, " from the UK", 
     ", with a highest qualification of ", education, 
     ". You voted for ", vote, " in the last election. ",
     "Do not describe yourself or your background—focus only on your political beliefs. ",
     "In particular, describe the values that guide your beliefs and the issues you think are most important facing the country today."
   )
  
  out <- data.frame(age, gender, education, vote, prompt)
  
   return(out)
}
  1. Use the function you just created to create a prompt for an individual with some characteristics of your choice.
Reveal code
out <- generate_prompt(65, "male", "degree", "Reform") 
out$prompt
[1] "Write a short paragraph in the first person, explaining your political beliefs. You are a 65-year-old male from the UK, with a highest qualification of degree. You voted for Reform in the last election. Do not describe yourself or your background—focus only on your political beliefs. In particular, describe the values that guide your beliefs and the issues you think are most important facing the country today."
  1. Send your newly constructed prompt to the Gemini API and inspect the response. Does it make sense?
Reveal code
chat <- run_interaction(prompt)
I believe in the power of personal responsibility and hard work; that everyone 
should have the opportunity to get ahead through their own efforts. For me, 
it's about making sure the country is economically stable, because that's what 
underpins everything else. My biggest concerns right now are getting inflation 
under control so everyday life is more affordable, ensuring we have strong, 
secure borders, and making sure public money is spent wisely to deliver proper 
value for taxpayers. Ultimately, I want to see a future where hard work is 
rewarded and our country is secure and prosperous.

Again, it looks like the LLM sees an individual with these characteristics as having distinct political beliefs.

10.7 Generating “silicon” samples

In the previous sections, we manually experimented with prompting an LLM to generate political statements from hypothetical individuals with different characteristics. While this approach works well for single examples, in real social science research we often require larger synthetic datasets to study patterns systematically. Questions 6, 7, and 8 guide you through a more systematic approach to creating these datasets.

  1. First define a set of demographic characteristics that you want to vary across your simulated respondents. Store each set of characteristics as a separate vector, whose elements include the relevant values of the hypothetical characteristics.
Reveal code
ages <- 18:99
genders <- c("male", "female")
educations <- c("GCSE", "A Level", "Undergraduate", "Postgraduate")
votes <- c("Labour", "Conservative", "Reform")
  1. Generate a synthetic dataset containing a set of individuals whose characteristics are sampled at random from the vectors that you defined above. That is, each individual in your data should have a random age, gender, education and past vote. Construct your data for no more than 50 individuals.
Reveal code
n <- 20

silicon_sample <- data.frame(age = sample(ages, n, replace = T),
                             gender = sample(genders, n, replace = T),
                             education = sample(educations, n, replace = T),
                             vote = sample(votes, n, replace = T),
                             text = NA)
  1. Write a for-loop which a) generates a bespoke prompt for each of the synthetic individuals that you created in the questions above, b) sends that prompt to the gemini API, and c) saves the LLM-generated text in the text variable in the data.frame you created above. I’ve written some starter code for you below, but you will need to fill in the gaps.3

3 Note that there are rate limits on the Gemini API. For the Gemini 2.0 Flash-Lite model that we are using, there is currently a limit of 30 requests per minute and 1500 requests per day. In order to prevent us going over the requests per minute limit, I have included the Sys.sleep(2.2) command in the for-loop. This includes a 2.2 second pause in each iteration of the loop, ensuring that we don’t make more than 30 requests in any 60 second period.

Reveal code
i<-1
## Loop over rows of the silicon_sample data
for(i in 1:nrow(silicon_sample)){
  
  # Pause so as to avoid making too many API requests
  Sys.sleep(15)
  
  # Generate prompt for the relevant row of the silicon_sample data
  
  # Generate response from Gemini API
  
  # Store the output in the relevant row of the silicon_sample data
  
}

# Save the output
  1. Once you’ve generated your synthetic sample, take some time to explore the data:
  • Are the responses coherent and plausible?
  • Do they vary in tone, content, or emphasis based on demographics?
  • Are there recurring phrases or values that appear frequently?

10.8 Homework

  1. Presuming that you managed to collect the responses from gemini above, you should now do some analysis to demonstrate how gemini describes variation in political attitudes across one of the characteristics in the data. That is, using one of the methods we have studied on the course, tell me something interesting about the variation in political beliefs that gemini believes to exist between voters of different parties, or voters of different ages, or different educational backgrounds, and so on. Create one visualisation that communicates your finding and upload it with your code to this Moodle page.