Review an Existing Text-as-Data Application (25%)
Application of a Text-as-Data Method (75%)
Note: Points in italics are only required for PGT students.
Question-first approach:
- Identify an interesting question
- Search for data that can answer that question
- Answer the question

Trade-offs:
- Likely to lead to more interesting questions
- Likely to leave less time for exploring different methods
- Potentially large amount of time spent searching for/collecting data
- Potential risk of finding nothing

Data-first approach:
- Identify an interesting dataset
- Explore that dataset using some of the tools we cover on the course
- Construct a research question that you can answer using that data
- Answer the question

Trade-offs:
- Lower frustration
- Potentially faster
- Potentially less interesting research question/answers
- Often more limited metadata
Be clear about the concept that you are trying to measure
Discuss the assumptions behind your chosen method
Provide some form of validation for your measure, e.g. by comparing it against human judgments or an external benchmark
Demonstrate something interesting with the concept that you have measured
Good starting points:
- Read journal articles
- Read news articles
- Discuss potential applications with classmates
Good starting points for finding data:
The Harvard Dataverse (https://dataverse.harvard.edu) is a data and code repository for many social science journals.
Many (though not all) papers will have links directly to a Dataverse page which you can use to find the data that was used in the paper
This is an excellent source of data for your projects!
Sometimes it can take a bit of searching through the files in each repository to figure out where the data is
Kaggle is a platform that hosts a wide variety of resources for quantitative text analysis, including a broad collection of text datasets (https://www.kaggle.com/datasets)
Many of these datasets are potentially interesting to social scientists
Many of these datasets lack full documentation, particularly on important dimensions such as where the data came from, who provided it, and so on
API: Application Programming Interface — a way for two pieces of software to talk to each other
Your software can receive (and also send) data automatically through these services
Data is sent over http, the same way your browser requests web pages
Most services provide helper code (known as a wrapper) to construct http requests
Both the wrapper and the service itself are called APIs
http service also sometimes known as REST (REpresentational State Transfer)
APIs typically require you to register for an API key to allow access
Before you commit to using a given API, check what the rate limits are on its use
Make sure you register with the service in plenty of time to actually get the data!
Once registered, you will have access to some kind of key that will allow you to access the API
http requests

It is helpful to start paying attention to the structure of basic http requests.
For instance, let’s say we want to get some data from the TheyWorkForYou API.
A test request:
https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&num=1000&key=XXXXX
Parameters to the API are encoded in the URL
- output = which format do you want returned?
- search = return speeches with which words?
- num = number of results requested
- key = access key
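For illustration, the same request can be built programmatically; a minimal sketch with the httr package, using the parameters from the example above:

library(httr)

# Construct the http request; query parameters are appended to the URL automatically
resp <- GET("https://www.theyworkforyou.com/api/getDebates",
            query = list(output = "xml", search = "brexit",
                         num = 1000, key = "XXXXX"))

# The raw body of the response, as returned by the server
content(resp, as = "text")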
The output of an API will typically not be in csv or Rdata format. Often, though not always, it will be in either JSON or XML.
XML: eXtensible Markup Language
JSON : JavaScript Object Notation
If you have a choice, you probably want JSON
Both types of file are easily read into R
jsonlite and xml2 are the relevant packages
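For example, both formats can be parsed directly from a request URL; a sketch based on the example request above (output=js is TheyWorkForYou's JSON output option):

library(jsonlite)
library(xml2)

# Parse a JSON response into R lists/data frames
debates_json <- fromJSON("https://www.theyworkforyou.com/api/getDebates?output=js&search=brexit&key=XXXXX")

# Or parse an XML response into a navigable document
debates_xml <- read_xml("https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&key=XXXXX")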
It’s not usually necessary to construct these kinds of requests yourself
R, Python, and other programming languages have libraries to make it easier – but you have to find them!
I have provided a sample of APIs that have associated R packages on the next slide
The documentation for the API will describe the parameters that are available, though normally in a way that is intensely frustrating.
There are many existing R packages that make it straightforward to retrieve data from an API:
| API | R package | Description |
|---|---|---|
| Twitter | `install.packages("rtweet")` | Twitter, small-scale use, no longer free! |
| Guardian Newspaper | `install.packages("guardianapi")` | Full Guardian archive, 1999-present |
| Wikipedia | `install.packages("WikipediR")` | Wikipedia data and knowledge graph |
| TheyWorkForYou | `install.packages("twfy")` | Speeches from the UK House of Commons and Lords |
| ProPublica Congress API | `install.packages("ProPublicaR")` | Data from the US Congress |
| Google Books Ngrams | `install.packages("ngramr")` | Ngrams in Google Books, 1500-present |
| Reddit | `install.packages("RedditExtractoR")` | Subreddits, users, urls, texts of posts |
Warning: I have not tested all of these!
We will use the Reddit API to search for subreddits on UK Politics in the past year
For this example, we are not collecting a large amount of data
In general, you need to create an authenticated client ID
Rate limits: currently 100 queries per minute (QPM) per OAuth client id
We will use library(RedditExtractoR)
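The output below was likely produced with the package's subreddit search function; a minimal sketch (the exact keyword string is an assumption):

library(RedditExtractoR)
library(dplyr)

# Search for subreddits matching a keyword; returns one row per subreddit,
# with its creation date, description, and subscriber count
subred <- find_subreddits("UK politics")
glimpse(subred)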
Rows: 233
Columns: 7
$ id <chr> "483mxu", "4rfocy", "3g37x", "2r7v0", "jrrb9", "3ahdz", "6…
$ date_utc <chr> "2021-04-08", "2021-07-15", "2016-08-30", "2009-09-23", "2…
$ timestamp <dbl> 1617865911, 1626368162, 1472562586, 1253680171, 1527610526…
$ subreddit <chr> "Divisive_Babble", "SteamDeck", "brandonlawson", "Academic…
$ title <chr> "Divisive Babble", "Steam Deck", "The Search For Brandon L…
$ description <chr> "This is a forum for the discussion of politics, current a…
$ subscribers <dbl> 1164, 1034951, 5667, 58183, 91, 74, 29290, 9842, 1244195, …
Let’s clean the dataset and order those subreddits by number of subscribers.
subred <- subred %>%
# selects variables you want to explore
select(subreddit, title, description, subscribers) %>%
# creates new variables
mutate(subscribers_million = subscribers/1000000, subscribers = NULL) %>%
# arranges data from highest subscriber count
arrange(desc(subscribers_million))
head(subred[c("subreddit", "title", "subscribers_million")])

subreddit title subscribers_million
2qh1i AskReddit Ask Reddit... 57.322329
2qh13 worldnews World News 46.913243
2qjpg memes /r/Memes the original since 2008 35.533307
2qh55 food Food Photos on Reddit 24.381233
2qh4j europe Europe 11.460773
2cneq politics Politics 8.966519
Many are not about UK politics!
Let’s now try to extract only those that are truly about UK politics, by searching for the terms "UK politic" or "British politic" in the description.
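A minimal sketch of that filtering step, assuming the pattern below captures the intended terms:

# Keep only subreddits whose description mentions UK or British politics
uk.subred <- subred %>%
  filter(grepl("UK politic|British politic", description, ignore.case = TRUE))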
subreddit title subscribers_million
2qhcv ukpolitics UK Politics 0.532235
30c1v LabourUK The British Labour Party 0.069528
tzpe1 UKPoliticalComedy UK Political Comedy 0.059908
33geh UK_Politics UK Politics 0.005360
2qo8i PoliticsUK UK Politics Discussion 0.003332
4th4dw AskUKPolitics AskUKPolitics 0.000642
The two largest non-partisan subreddits on British politics are UKPoliticalComedy and ukpolitics
Let’s have a look at these two:
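The thread-level data shown below was presumably collected with find_thread_urls(); the sorting and period arguments here are assumptions:

# Fetch thread titles, dates, and comment counts from each subreddit
uk.comedy <- find_thread_urls(subreddit = "UKPoliticalComedy", sort_by = "top", period = "all")
uk.politics <- find_thread_urls(subreddit = "ukpolitics", sort_by = "top", period = "all")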
What is in this data?
Rows: 248
Columns: 7
$ date_utc <chr> "2021-11-10", "2023-01-27", "2022-10-03", "2021-03-20", "202…
$ timestamp <dbl> 1636530752, 1674848275, 1664830422, 1616259328, 1615137661, …
$ title <chr> "New Conservatives logo", "Time to light the fire under Mr Z…
$ text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "UKPoliticalComedy", "UKPoliticalComedy", "UKPoliticalComedy…
$ comments <dbl> 3, 1, 5, 7, 6, 0, 13, 8, 1, 0, 12, 7, 1, 8, 4, 11, 36, 1, 28…
$ url <chr> "https://www.reddit.com/r/UKPoliticalComedy/comments/qqp9c0/…
Rows: 228
Columns: 7
$ date_utc <chr> "2019-03-18", "2017-09-10", "2021-01-06", "2019-05-02", "201…
$ timestamp <dbl> 1552923461, 1505071469, 1609942152, 1556832034, 1575982428, …
$ title <chr> "BREAKING: Speaker rules out the government bringing back me…
$ text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "ukpolitics", "ukpolitics", "ukpolitics", "ukpolitics", "ukp…
$ comments <dbl> 1061, 871, 354, 170, 771, 129, 640, 1648, 681, 569, 252, 575…
$ url <chr> "https://www.reddit.com/r/ukpolitics/comments/b2k5i1/breakin…
We will work with the titles of those threads to measure sentiment using a dictionary-based approach
library(quanteda)

make.dfm <- function(data){
dfm <- data %>%
corpus(text_field = "title") %>%
tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_url = TRUE) %>%
dfm() %>%
dfm_remove(stopwords("en")) %>%
dfm_trim(min_termfreq = 3,
max_docfreq = .9,
docfreq_type = "prop") %>%
# New step: Keep words of at least 3 characters
dfm_select(pattern = "\\b\\w{3,}\\b", valuetype = "regex", selection = "keep")
return(dfm)
}
comedy.dfm <- make.dfm(uk.comedy)
politics.dfm <- make.dfm(uk.politics)

How many features in those DFMs?
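One way to check, using quanteda's nfeat() function:

# Count the number of features (columns) in each DFM
nfeat(comedy.dfm)
nfeat(politics.dfm)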
Let’s measure sentiment based on the Lexicoder Sentiment Dictionary, available in quanteda as data_dictionary_LSD2015
dfm_lookup(comedy.dfm, dictionary = data_dictionary_LSD2015) %>%
dfm_remove(c("neg_positive", "neg_negative")) %>%
dfm_weight(scheme = "logave") %>%
convert("data.frame") %>%
mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>%
colMeans(na.rm = TRUE) %>% print()

negative positive neutral
0.01209677 0.11693548 0.87096774
dfm_lookup(politics.dfm, dictionary = data_dictionary_LSD2015) %>%
dfm_remove(c("neg_positive", "neg_negative")) %>%
dfm_weight(scheme = "logave") %>%
convert("data.frame") %>%
mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>%
colMeans(na.rm = TRUE) %>% print()

negative positive neutral
0.08771930 0.09649123 0.82456140
You could use those functions to explore:
The correlation between comment sentiment and upvotes/downvotes
Topics across sub-reddits
How conversations evolve depending on the topic
… and many other research questions!
If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com
Key steps in any web-scraping project:
Work out how the website is structured
Work out how links connect different pages
Isolate the information you care about on each page
Write a loop that applies step 3 to each of the links from step 2, and saves the information you want from each page
Put it all into a nice and tidy data.frame
Feel like a superhero
(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)
Web-scraping can be illegal in some circumstances
Web-scraping is more likely to be illegal when…
It is harmful to the source, e.g. when a high volume of requests overloads the host's servers
It gathers data that is under copyright, subject to privacy restrictions, or used for financial gain
Even if not illegal, web-scraping can be ethically dubious. Especially when…
it is edging towards being illegal
the data is otherwise available via an API
it does not respect restrictions set by the host website (often specified in a robots.txt file; see the sketch below)
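A minimal sketch of such a check, using the robotstxt package (the URL is an assumption about where the departmental list lives):

library(robotstxt)

# Returns TRUE if the site's robots.txt permits automated access to this path
paths_allowed("https://www.ucl.ac.uk/political-science/people")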
We will scrape the research interests of members of faculty in the Department of Political Science at UCL
The departmental website has a list of faculty members
Each member of the department has a unique page
The research interests of the faculty member are stored on their unique page
Let’s look at an example…
To collect the information we want, we need to see how it is stored within the html code that underpins the website
Webpages include much more than what is immediately visible to visitors
Crucially, they include code which provides structure, style and functionality (which your browser interprets)
- HTML provides structure
- CSS provides style
- JavaScript provides functionality

To implement a web-scraper, we have to work directly with the source code.
To see the source code, use Ctrl + U or right click and select View/Show Page Source
We can read the source code of any website into R using the readLines() function.
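For instance, assuming the departmental list lives at the URL below:

# Read the raw source code, one line per element of a character vector
ucl_source <- readLines("https://www.ucl.ac.uk/political-science/people")
ucl_source[1:20]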
[1] "<!DOCTYPE html>"
[2] "<html lang=\"en\" dir=\"ltr\" prefix=\"og: https://ogp.me/ns#\">"
[3] " <head>"
[4] " <meta charset=\"utf-8\" />"
[5] "<script>window.privacyConfig = {"
[6] " \"privacyPolicySection\": {"
[7] " \"url\": \"https://www.ucl.ac.uk/legal-services/cookies\","
[8] " \"linkText\": \"UCL privacy policy\","
[9] " \"title\": \"Privacy Policy\","
[10] " \"description\": \"Learn more about how we handle your data and protect your privacy.\","
[11] " \"enabled\": true"
[12] " },"
[13] " \"whitelist\": ["
[14] " \"https://app.geckoform.com\","
[15] " \"https://cxppusa1formui01cdnsa01-endpoint.azureedge.net\","
[16] " \"https://assets-eur.mkt.dynamics.com\","
[17] " \"https://apply5.lumessetalentlink.com\","
[18] " \"https://api.reeled.online\""
[19] " ],"
[20] " \"reclassify\": {},"
[1] "<p><a href=\"https://profiles.ucl.ac.uk/44395-lauge-poulsen\" rel=\"nofollow\"><strong>Professor Lauge Poulsen</strong></a><br>Head of Department and Professor of International Relations and Law</p><p>Prof Poulsen works on the politics of international trade and investment, with a particular focus on international economic law.</p><p> </p>"
This is helpful, but it is awkward to navigate the source code directly.
The read_html function in the rvest package allows us to read the HTML in a more structured format:
We can then navigate through the HTML by searching for elements that share a common structure (using html_elements()):
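A sketch of both steps; the URL and the XPath selector (links whose text is wrapped in strong tags) are assumptions about how the page is structured:

library(rvest)

# Parse the page into a structured, navigable HTML document
ucl <- read_html("https://www.ucl.ac.uk/political-science/people")

# Select the link elements that wrap each faculty member's name
links <- html_elements(ucl, xpath = "//a[strong]")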
{xml_nodeset (6)}
[1] <a href="https://profiles.ucl.ac.uk/44395-lauge-poulsen" rel="nofollow">< ...
[2] <a href="https://profiles.ucl.ac.uk/9520-rod-abouharb"><strong>Dr Rodwan ...
[3] <a href="https://profiles.ucl.ac.uk/86078-samer-anabtawi"><strong>Dr Same ...
[4] <a href="https://profiles.ucl.ac.uk/91233-phillip-ayoub"><strong>Professo ...
[5] <a href="https://profiles.ucl.ac.uk/1510-kristin-bakke"><strong>Professor ...
[6] <a href="https://profiles.ucl.ac.uk/101976-carlos-balcazar"><strong>Dr Ca ...
The names of each faculty member are stored in the text associated with these elements:
[1] "<a href=\"https://profiles.ucl.ac.uk/85613-dan-honig\"><strong>Dr Daniel Honig</strong></a>"
The URL for each faculty member is stored in the href attribute of the elements:
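Continuing the sketch, html_attr() pulls out a named attribute:

# Extract the href attribute (the profile URL) from each link
html_attr(links, "href")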
[1] "https://profiles.ucl.ac.uk/44395-lauge-poulsen"
[2] "https://profiles.ucl.ac.uk/9520-rod-abouharb"
[3] "https://profiles.ucl.ac.uk/86078-samer-anabtawi"
[4] "https://profiles.ucl.ac.uk/91233-phillip-ayoub"
[5] "https://profiles.ucl.ac.uk/1510-kristin-bakke"
[6] "https://profiles.ucl.ac.uk/101976-carlos-balcazar"
name url
1 Professor Lauge Poulsen https://profiles.ucl.ac.uk/44395-lauge-poulsen
2 Dr Rodwan Abouharb https://profiles.ucl.ac.uk/9520-rod-abouharb
3 Dr Samer Anabtawi https://profiles.ucl.ac.uk/86078-samer-anabtawi
4 Professor Phillip Ayoub https://profiles.ucl.ac.uk/91233-phillip-ayoub
5 Professor Kristin Bakke https://profiles.ucl.ac.uk/1510-kristin-bakke
6 Dr Carlos Balcazar https://profiles.ucl.ac.uk/101976-carlos-balcazar
text
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
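Reading a single faculty member's page returns another html_document; a sketch:

# Read the profile page of the first faculty member
faculty_page <- read_html(spp$url[1])
faculty_page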
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body style="--logoHeight:70px;--logoVerticalMargins:7px;--homeBackground ...
We have the text for one person! How do we get this for all faculty members?
for loops

We can use a for loop to iterate over the elements of our url variable:
for(i in 1:nrow(spp)){
# Load page for faculty member i
faculty_member_page <- read_html(spp$url[i])
# Extract text from that page
faculty_member_text <- faculty_member_page %>%
html_nodes(xpath='//h2[contains(text(), "Research")]/following-sibling::p[1]') %>%
html_text() %>%
paste0(collapse = " ")
# Save text for faculty member i
spp$text[i] <- faculty_member_text
}

name
26 Dr Jack Blumenau
url
26 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-jack-blumenau
text
26 My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion.
name
91 Professor Lisa Vanhala
url
91 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/professor-lisa-vanhala
text
91 I am interested in the politics of climate change and the socio-legal study of human rights and equality issues. My current, ERC-funded project, the Politics and Governance of Climate Change Loss and Damage (CCLAD), explores attempts to govern the impacts of climate change we will not be able to adapt to at a global and national level. Relying on a political ethnographic approach, the project explores the role of norms, identities and the micro-level, everyday dynamics of global environmental governance.
name
87 Professor Jeffrey Howard
url
87 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/professor-jeffrey-howard
text
87 I currently direct the Digital Speech Lab, which hosts a range of research projects on the proper governance of online communications. Its purpose is to identify the fundamental principles that should guide the private and public regulation of online speech, and to trace those principles’ concrete implications in the face of difficult dilemmas about how best to respect free speech while preventing harm. The research team synthesizes expertise in political and moral philosophy, the philosophy of language, law and regulation, political science, and computer science. We engage a wide range of decisionmakers in industry, civil society, and policymaking. The Lab is funded by a UKRI Future Leaders Fellowship.
Let’s use this data to estimate a topic model
Two questions in this application:
library(stm)
## Create dfm
spp_corpus <- spp %>%
corpus(text_field = "text")
spp_dfm <- spp_corpus %>%
tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE) %>%
dfm() %>%
dfm_remove(c(stopwords("en"), "book", "journal", "professor",
"including", "include")) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 2) %>%
dfm_trim(max_docfreq = .7, docfreq_type = "prop")
## Estimate STM
stmOut <- stm(
documents = spp_dfm,
K = 12,
seed = 123,
verbose = FALSE
)
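# One common follow-up (an assumed next step, not shown in the original
# slides): inspect the most probable words for each of the 12 topics
labelTopics(stmOut, n = 10)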
#save(stmOut, file = "stmOut.Rdata")

Think about your research projects now!
There are several possible sources of data for these projects
Data collection is a major part of any research project – it is good to practice this step!