Review an Existing Text-as-Data Application (25%)
Application of a Text-as-Data Method (75%)
Review an Existing Text-as-Data Application (30%)
Application of a Text-as-Data Method (70%)
Note: Points in italics are only required for PGT students.
Identify an interesting question
Search for data that can answer that question
Answer the question
Likely to lead to more interesting questions
Likely to lead to less time exploring different methods
Potentially large amount of time spent searching for/collecting data
Potential risk of finding nothing
Identify an interesting dataset
Explore that dataset using some of the tools we cover on the course
Construct a research question that you can answer using that data
Answer the question
Lower frustration
Potentially faster
Potentially less interesting research question/answers
Often more limited metadata
Be clear about the concept that you are trying to measure
Discuss the assumptions behind your chosen method
Provide some form of validation for your measure. E.g.
Demonstrate something interesting with the concept that you have measured
Journal articles
Good starting points:
Read news articles
Discuss potential applications with classmates
Good starting points:
But there are many more!
The Harvard Dataverse (https://dataverse.harvard.edu) is a data and code repository for many social science journals.
Many (though not all) papers will have links directly to a Dataverse page which you can use to find the data that was used in the paper
This is an excellent source of data for your projects!
Sometimes it can take a bit of searching through the files in each repository to figure out where the data is
Kaggle is a platform that hosts a wide variety of resources for quantitative text analysis, including a broad collection of text datasets (https://www.kaggle.com/datasets)
Many of these datasets are potentially interesting to social scientists, e.g.
Many of these datasets lack full documentation, particularly on important dimensions such as where the data came from, who provided it, and so on
API: Application Programming Interface — a way for two pieces of software to talk to each other
Your software can receive (and also send) data automatically through these services
Data is sent via http requests — the same way your browser does it
Most services have helping code (known as a wrapper) to construct http requests
Both the wrapper and the service itself are called APIs
http service also sometimes known as REST (Representational State Transfer)
(Diagram source: GeeksforGeeks)
APIs typically require you to register for an API key to allow access
Before you commit to using a given API, check what the rate limits are on its use
Make sure you register with the service in plenty of time to actually get the data!
Once registered, you will have access to some kind of key that will allow you to access the API
http requests

It is helpful to start paying attention to the structure of basic http requests.
For instance, let’s say we want to get some data from the TheyWorkForYou API.
A test request:
https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&num=1000&key=XXXXX
Parameters to the API are encoded in the URL
output = Which format do you want returned?
search = Return speeches with which words?
num = number requested
key = access key
The output of an API will typically not be in csv or Rdata format
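The same request can also be built programmatically rather than by pasting the URL together by hand. A minimal sketch using the httr package (the endpoint and parameter names follow the example above; "XXXXX" is a placeholder for your own key):

```r
library(httr)     # for GET()
library(jsonlite) # for fromJSON()

# httr assembles the query string (?output=...&search=...) for us
resp <- GET(
  "https://www.theyworkforyou.com/api/getDebates",
  query = list(
    output = "js",      # ask for JSON output
    search = "brexit",  # return speeches containing this word
    num    = 1000,      # number of results requested
    key    = "XXXXX"    # your personal API key
  )
)

# Parse the JSON body of the response into an R object
debates <- fromJSON(content(resp, as = "text"))
```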
Often, though not always, it will be in either JSON or XML
XML: eXtensible Markup Language
JSON : JavaScript Object Notation
If you have a choice, you probably want JSON
Both types of file are easily read into R
jsonlite and xml2 are the relevant packages
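For example, jsonlite converts JSON text directly into R objects (the field names here are made up for illustration):

```r
library(jsonlite)

json_string <- '{"speaker": "J. Smith", "words": 1200, "topics": ["brexit", "trade"]}'
parsed <- fromJSON(json_string)

parsed$speaker  # a character scalar: "J. Smith"
parsed$topics   # a character vector: "brexit", "trade"
```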
It’s not usually necessary to construct these kind of requests yourself
R, Python, and other programming languages have libraries to make it easier – but you have to find them!
I have provided a sample of APIs that have associated R packages on the next slide
The documentation for the API will describe the parameters that are available. Though normally in a way that is intensely frustrating.
There are many existing R packages that make it straightforward to retrieve data from an API:
| API | R package | Description |
|---|---|---|
| Twitter | `install.packages("rtweet")` | Twitter, small-scale use, no longer free! |
| Guardian Newspaper | `install.packages("guardianapi")` | Full Guardian archive, 1999-present |
| Wikipedia | `install.packages("WikipediR")` | Wikipedia data and knowledge graph |
| TheyWorkForYou | `install.packages("twfy")` | Speeches from the UK House of Commons and Lords |
| ProPublica Congress API | `install.packages("ProPublicaR")` | Data from the US Congress |
| Google Books Ngrams | `install.packages("ngramr")` | Ngrams in Google Books, 1500-present |
| Reddit | `install.packages("RedditExtractoR")` | Subreddits, users, urls, texts of posts |
Warning: I have not tested all of these!
We will use the Reddit API to search for subreddits on UK Politics in the past year
For this example, we are not collecting a large amount of data
In general, you need to create an authenticated client ID
Rate limits: currently 100 queries per minute (QPM) per OAuth client id
We will use library(RedditExtractoR)
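A minimal sketch of the search step, assuming RedditExtractoR's `find_subreddits()` function and a simple keyword query (the exact query string used for the results below is an assumption):

```r
library(RedditExtractoR)
library(dplyr)

# Search Reddit for subreddits matching a keyword
subred <- find_subreddits("uk politics")

# Inspect the structure of the result
glimpse(subred)
```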
Rows: 233
Columns: 7
$ id <chr> "483mxu", "4rfocy", "3g37x", "2r7v0", "jrrb9", "3ahdz", "6…
$ date_utc <chr> "2021-04-08", "2021-07-15", "2016-08-30", "2009-09-23", "2…
$ timestamp <dbl> 1617865911, 1626368162, 1472562586, 1253680171, 1527610526…
$ subreddit <chr> "Divisive_Babble", "SteamDeck", "brandonlawson", "Academic…
$ title <chr> "Divisive Babble", "Steam Deck", "The Search For Brandon L…
$ description <chr> "This is a forum for the discussion of politics, current a…
$ subscribers <dbl> 1164, 1034951, 5667, 58183, 91, 74, 29290, 9842, 1244195, …
Let’s clean the dataset and order those subreddits by number of subscribers.
subred <- subred %>%
# selects variables you want to explore
select(subreddit, title, description, subscribers) %>%
# creates new variables
mutate(subscribers_million = subscribers/1000000, subscribers = NULL) %>%
# arranges data from highest subscriber count
arrange(desc(subscribers_million))
head(subred[c("subreddit", "title", "subscribers_million")])

      subreddit     title subscribers_million
2qh1i AskReddit Ask Reddit... 57.322329
2qh13 worldnews World News 46.913243
2qjpg memes /r/Memes the original since 2008 35.533307
2qh55 food Food Photos on Reddit 24.381233
2qh4j europe Europe 11.460773
2cneq politics Politics 8.966519
Many are not about UK politics!
Let’s now try to extract only those that are truly about UK politics, by searching for the terms “UK politic^” or “British politic^” in the description.
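The filtering step can be sketched as follows (the exact pattern is an assumption; `grepl()` keeps rows whose description matches either phrase, ignoring case):

```r
library(dplyr)

uk_subred <- subred %>%
  filter(grepl("UK politic|British politic",  # match either phrase
               description,
               ignore.case = TRUE))
```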
subreddit title subscribers_million
2qhcv ukpolitics UK Politics 0.532235
30c1v LabourUK The British Labour Party 0.069528
tzpe1 UKPoliticalComedy UK Political Comedy 0.059908
33geh UK_Politics UK Politics 0.005360
2qo8i PoliticsUK UK Politics Discussion 0.003332
4th4dw AskUKPolitics AskUKPolitics 0.000642
The two largest non-partisan subreddits on British politics are UKPoliticalComedy and ukpolitics
Let’s have a look at these two:
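A sketch of the thread collection step, assuming RedditExtractoR's `find_thread_urls()` function (whose output has the seven columns shown in the glimpses that follow):

```r
library(RedditExtractoR)
library(dplyr)

# Collect thread listings for each subreddit over the past year
uk.comedy   <- find_thread_urls(subreddit = "UKPoliticalComedy", period = "year")
uk.politics <- find_thread_urls(subreddit = "ukpolitics", period = "year")

glimpse(uk.comedy)
glimpse(uk.politics)
```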
What is in this data?
Rows: 248
Columns: 7
$ date_utc <chr> "2021-11-10", "2023-01-27", "2022-10-03", "2021-03-20", "202…
$ timestamp <dbl> 1636530752, 1674848275, 1664830422, 1616259328, 1615137661, …
$ title <chr> "New Conservatives logo", "Time to light the fire under Mr Z…
$ text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "UKPoliticalComedy", "UKPoliticalComedy", "UKPoliticalComedy…
$ comments <dbl> 3, 1, 5, 7, 6, 0, 13, 8, 1, 0, 12, 7, 1, 8, 4, 11, 36, 1, 28…
$ url <chr> "https://www.reddit.com/r/UKPoliticalComedy/comments/qqp9c0/…
Rows: 228
Columns: 7
$ date_utc <chr> "2019-03-18", "2017-09-10", "2021-01-06", "2019-05-02", "201…
$ timestamp <dbl> 1552923461, 1505071469, 1609942152, 1556832034, 1575982428, …
$ title <chr> "BREAKING: Speaker rules out the government bringing back me…
$ text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "ukpolitics", "ukpolitics", "ukpolitics", "ukpolitics", "ukp…
$ comments <dbl> 1061, 871, 354, 170, 771, 129, 640, 1648, 681, 569, 252, 575…
$ url <chr> "https://www.reddit.com/r/ukpolitics/comments/b2k5i1/breakin…
We will work on the titles of those threads to measure sentiment with a dictionary-based approach
make.dfm <- function(data){
dfm <- data %>%
corpus(text_field = "title") %>% # treat the "title" column as documents
tokens(remove_punct = TRUE, # split into words, dropping punctuation,
remove_symbols = TRUE, # symbols like £ or @,
remove_url = TRUE) %>% # and URLs
dfm() %>% # convert to document-feature matrix (word counts)
dfm_remove(stopwords("en")) %>% # drop English stopwords (the, is, and...)
dfm_trim(min_termfreq = 3, # drop words appearing fewer than 3 times total
max_docfreq = .9, # and words in more than 90% of docs (too common)
docfreq_type = "prop") %>%
dfm_select(pattern = "\\b\\w{3,}\\b", # keep only words with 3+ characters
valuetype = "regex",
selection = "keep")
return(dfm)
}
comedy.dfm <- make.dfm(uk.comedy) # build DFM for comedy subreddit posts
politics.dfm <- make.dfm(uk.politics) # build DFM for politics subreddit posts

How many features in those DFMs?
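The size of each DFM can be checked with quanteda's `ndoc()` and `nfeat()` functions:

```r
library(quanteda)

ndoc(comedy.dfm)    # number of documents (thread titles)
nfeat(comedy.dfm)   # number of features (distinct words kept)

ndoc(politics.dfm)
nfeat(politics.dfm)
```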
Let’s measure sentiment based on the Lexicoder Sentiment Dictionary, available in quanteda as data_dictionary_LSD2015
dfm_lookup(comedy.dfm, dictionary = data_dictionary_LSD2015) %>% # check matches to Lexicoder sentiment dictionary (pos/neg words)
dfm_remove(c("neg_positive", "neg_negative")) %>% # drop negated categories (e.g. "not good" counted as neg_positive)
dfm_weight(scheme = "logave") %>% # weight by log-average to normalise for document length
convert("data.frame") %>% # convert DFM to a regular dataframe
mutate(doc_id=NULL, # drop the document ID column
positive = trunc(positive), # truncate decimals to whole numbers
negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>% # TRUE/FALSE: is the doc equally pos and neg?
colMeans(na.rm = TRUE) %>% # average each column across all documents
  print()

  negative   positive    neutral
0.01209677 0.11693548 0.87096774
dfm_lookup(politics.dfm, dictionary = data_dictionary_LSD2015)%>%
dfm_remove(c("neg_positive", "neg_negative")) %>%
dfm_weight(scheme = "logave") %>%
convert("data.frame") %>%
mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>%
  colMeans(na.rm = TRUE) %>%
  print()

  negative   positive    neutral
0.08771930 0.09649123 0.82456140
You could use those functions to explore:
The correlation between comment sentiment and upvotes/downvotes
Topics across subreddits
How conversations evolve depending on the topic
… and many other research questions!
If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com
Key steps in any web-scraping project:
1. Work out how the website is structured
2. Work out how links connect different pages
3. Isolate the information you care about on each page
4. Write a loop which connects steps 2 and 3, and saves the information you want from each page
5. Put it all into a nice and tidy data.frame
6. Feel like a superhero 🪄
(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)
Web-scraping can be illegal in some circumstances
Web-scraping is more likely to be illegal when…
It is harmful to the source, e.g.,
It gathers data that is under copyright, is subject to privacy restrictions, or is used for financial gain
Even if not illegal, web-scraping can be ethically dubious. Especially when…
it is edging towards being illegal
the data is otherwise available via an API
it does not respect restrictions specified by the host website (often specified in a robots.txt file)
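You can inspect a site's robots.txt directly in R — it always lives at the root of the domain. A minimal sketch (the domain here is illustrative):

```r
# Read the robots.txt file, one line per element of a character vector
robots <- readLines("https://www.ucl.ac.uk/robots.txt", warn = FALSE)

# Look for Disallow rules, which list paths scrapers should avoid
head(robots)
```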
We will scrape the research interests of members of faculty in the Department of Political Science at UCL
The departmental website has a list of faculty members
Each member of the department has a unique page
The research interests of the faculty member are stored on their unique page
Let’s look at an example…
To collect the information we want, we need to see how it is stored within the html code that underpins the website
Webpages include much more than what is immediately visible to visitors
Crucially, they include code which provides structure, style and functionality (which your browser interprets)
HTML provides structure
CSS provides style
JavaScript provides functionality
To implement a web-scraper, we have to work directly with the source code
To see the source code, use Ctrl + U or right click and select View/Show Page Source
We can read the source code of any website into R using the readLines() function.
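A minimal sketch (the URL of the departmental staff page is an assumption):

```r
# Read the raw html source, one line per element of a character vector
raw_html <- readLines("https://www.ucl.ac.uk/political-science/people",
                      warn = FALSE)
head(raw_html)
```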
The structured tags (<td>, <p>, <a>, <img>) are how content is organized:
<td> — a table cell containing each person’s information
<a href="https://profiles.ucl.ac.uk/..."> — a clickable link to each person’s profile page
<p> — paragraphs containing the research description
<img> — the person’s photo
This is helpful, but it is awkward to navigate the source code directly.
The read_html function in the rvest package allows us to read the HTML in a more structured format:
{html_document}
<html lang="en" dir="ltr" prefix="og: https://ogp.me/ns#">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="colour-scheme--gordon-glow">\n <a href="#main-content ...
We can then navigate through the HTML by searching for elements that have common elements (using html_elements()):
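For instance, one could select the profile links with a CSS attribute selector (the selector is an assumption; it matches <a> tags whose href contains profiles.ucl.ac.uk, and spp_page is assumed to be the result of read_html() on the staff page):

```r
library(rvest)

faculty_links <- spp_page %>%
  html_elements("a[href*='profiles.ucl.ac.uk']")

head(faculty_links, 5)
```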
{xml_nodeset (5)}
[1] <a href="https://profiles.ucl.ac.uk/9520-rod-abouharb" rel="nofollow"><st ...
[2] <a href="https://profiles.ucl.ac.uk/72700-valentina-amuso" rel="nofollow" ...
[3] <a href="https://profiles.ucl.ac.uk/86078-samer-anabtawi" rel="nofollow"> ...
[4] <a href="https://profiles.ucl.ac.uk/91233-phillip-ayoub" rel="nofollow">< ...
[5] <a href="https://profiles.ucl.ac.uk/1510-kristin-bakke" rel="nofollow"><s ...
The names of each faculty member are stored in the text associated with these elements:
The URL for each faculty member is stored in the href attribute of the elements:
[1] "https://profiles.ucl.ac.uk/9520-rod-abouharb"
[2] "https://profiles.ucl.ac.uk/72700-valentina-amuso"
[3] "https://profiles.ucl.ac.uk/86078-samer-anabtawi"
[4] "https://profiles.ucl.ac.uk/91233-phillip-ayoub"
[5] "https://profiles.ucl.ac.uk/1510-kristin-bakke"
[6] "https://profiles.ucl.ac.uk/101976-carlos-balcazar"
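The pieces above can be combined into a data frame. A sketch, assuming `faculty_links` is the xml_nodeset of profile links selected earlier; the `text` column is initialised as NA, to be filled in when we scrape each profile:

```r
library(rvest)

spp <- data.frame(
  name = html_text2(faculty_links),        # visible link text = the name
  url  = html_attr(faculty_links, "href"), # href attribute = profile URL
  text = NA                                # to be filled in later
)

head(spp)
```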
name url
1 Dr M. Rodwan Abouharb https://profiles.ucl.ac.uk/9520-rod-abouharb
2 Dr Valentina Amuso https://profiles.ucl.ac.uk/72700-valentina-amuso
3 Dr Samer Anabtawi https://profiles.ucl.ac.uk/86078-samer-anabtawi
4 Professor Phillip Ayoub https://profiles.ucl.ac.uk/91233-phillip-ayoub
5 Professor Kristin M Bakke https://profiles.ucl.ac.uk/1510-kristin-bakke
6 Dr Carlos Balcazar https://profiles.ucl.ac.uk/101976-carlos-balcazar
text
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
library(rvest)
library(stringr)
jack_cell <- spp_page %>% # from the main html
html_elements("table td:nth-child(2)") %>% # get the second column of every table row
# (col 1 = photo, col 2 = name + title + bio)
keep(~ grepl("Blumenau", # search for "Blumenau"
html_text2(.x))) # in the visible text of each cell
# keep() retains only matching cells
jack_cell

{xml_nodeset (1)}
[1] <td>\n<p><a href="https://profiles.ucl.ac.uk/62358-jack-blumenau" rel="no ...
[1] "Dr Jack Blumenau\nAssociate Professor of Political Science and Quantitative Research Methods\n\nDr Blumenau’s research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems."
We have the text for one person! How do we get this for all faculty members?
for loops

We can use a for loop to loop over the elements of our url variable
# Step 1: Get all info cells from the surname table
# Each row has two columns: (1) photo, (2) name + title + bio
# We grab the second column of every row
all_cells <- spp_page %>%
html_elements("table td:nth-child(2)")
for(i in 1:nrow(spp)){
# Step 2: Get the last name for person i from our data frame
last_name <- word(spp$name[i], -1) # word() extracts the last word
# e.g. "Dr Jack Blumenau" → "Blumenau"
# Step 3: Search all cells for one containing that last name
person_cell <- all_cells %>%
keep(~ grepl(last_name, # search for the last name
html_text2(.x), # in the text of each cell
#.x is a placeholder for the current element
#html_text2 extracts visible text
fixed = TRUE)) # exact match (no regex)
# Step 4: Save the text, or NA if no match found
if(length(person_cell) > 0){
spp$text[i] <- html_text2(person_cell[[1]]) # take first match
} else {
spp$text[i] <- NA # person not in table
}
}

[1] "Dr M. Rodwan Abouharb\nAssociate Professor in International Relations\n\nDr Abouharb’s research places particular emphasis on understanding how both domestic and international socio-economic processes affect the human security of citizens around the world."
Let’s use this data to estimate a topic model
Two questions in this application:
library(stm)
library(quanteda)
spp <- spp %>%
filter(!is.na(text) & text != "") # remove rows where the text is missing (NA)
# or empty ("") — these are people whose
# last name didn't match in the surname table
## Create dfm
spp_corpus <- spp %>%
corpus(text_field = "text")
spp_dfm <- spp_corpus %>%
tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE) %>%
dfm() %>%
dfm_remove(c(stopwords("en"), "book",
"journal", "professor",
"prof", "Prof.", "associate",
"teaching", "study","project","focus",
"focuses","focused", "interests", "particular",
"lecturer", "well", "including",
"include","dr","dr.","interested","works",
"studies","fellow","director", "emphasis", "within")) %>%
dfm_trim(min_termfreq = 4, # drop words that appear fewer than 4 times total
min_docfreq = 4) %>% # drop words that appear in fewer than 4 documents
dfm_trim(max_docfreq = .9, # drop words that appear in more than 90% of documents
docfreq_type = "prop") # interpret max_docfreq as a proportion (not a count)
# these are too common to be informative (e.g. "research")
## Estimate STM
stmOut <- stm(
documents = spp_dfm,
K = 12,
seed = 123,
verbose = FALSE #run silently
)
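Once estimated, the topics can be inspected with stm's `labelTopics()` function, which prints the highest-probability words for each topic:

```r
library(stm)

# Show the top words associated with each of the 12 topics
labelTopics(stmOut, n = 8)

# Plot the expected proportion of each topic across the corpus
plot(stmOut, type = "summary")
```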
#save(stmOut, file = "stmOut.Rdata")

Think about your research projects now!
There are several possible sources of data for these projects
Data collection is a major part of any research project – it is good to practice this step!
PUBL0099