Review an Existing Text-as-Data Application (25%)
Application of a Text-as-Data Method (75%)
Note: Points in italics are only required for PGT students.
Identify an interesting question
Search for data that can answer that question
Answer the question
Likely to lead to more interesting questions
Likely to lead to less time exploring different methods
Potentially large amount of time spent searching for/collecting data
Potential risk of finding nothing
Identify an interesting dataset
Explore that dataset using some of the tools we cover on the course
Construct a research question that you can answer using that data
Answer the question
Lower frustration
Potentially faster
Potentially less interesting research question/answers
Often more limited metadata
Be clear about the concept that you are trying to measure
Discuss the assumptions behind your chosen method
Provide some form of validation for your measure. E.g.
Demonstrate something interesting with the concept that you have measured
Good starting points:
Journal articles
Read news articles
Discuss potential applications with classmates
Good starting points:
The Harvard Dataverse (https://dataverse.harvard.edu) is a data and code repository for many social science journals.
Many (though not all) papers will have links directly to a Dataverse page which you can use to find the data that was used in the paper
This is an excellent source of data for your projects!
Sometimes it can take a bit of searching through the files in each repository to figure out where the data is
Kaggle is a platform that hosts a wide variety of resources for quantitative text analysis, including a broad collection of text datasets (https://www.kaggle.com/datasets)
Many of these datasets are potentially interesting to social scientists, e.g.
Many of these datasets lack full documentation, particularly on important dimensions such as where the data came from, who provided it, and so on
API: Application Programming Interface — a way for two pieces of software to talk to each other
Your software can receive (and also send) data automatically through these services
Data is sent over http — the same way your browser does it
Most services have helper code (known as a wrapper) to construct http requests
Both the wrapper and the service itself are called APIs
http service also sometimes known as REST (REpresentational State Transfer)
APIs typically require you to register for an API key to allow access
Before you commit to using a given API, check what the rate limits are on its use
Make sure you register with the service in plenty of time to actually get the data!
Once registered, you will have access to some kind of key that will allow you to access the API
It is helpful to start paying attention to the structure of basic http requests.
For instance, let’s say we want to get some data from the TheyWorkForYou api.
A test request:
https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&num=1000&key=XXXXX
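Such a request can also be sent from R. A minimal sketch using the httr package (the key value is a placeholder, as above):

library(httr)
# Send a GET request to the getDebates endpoint; the query parameters are
# appended to the URL automatically
resp <- GET("https://www.theyworkforyou.com/api/getDebates",
            query = list(output = "xml", search = "brexit",
                         num = 1000, key = "XXXXX"))
status_code(resp)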
Parameters to the API are encoded in the URL
output = which format do you want returned?
search = return speeches with which words?
num = number of results requested
key = access key
The output of an API will typically not be in csv or Rdata format
Often, though not always, it will be in either JSON or XML
XML: eXtensible Markup Language
JSON : JavaScript Object Notation
If you have a choice, you probably want JSON
Both types of file are easily read into R
jsonlite and xml2 are the relevant packages
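For instance, a minimal sketch (the file names are hypothetical, for illustration only):

library(jsonlite)
# Parse a JSON response into R lists/data frames
debates_json <- fromJSON("debates.json")

library(xml2)
# Parse an XML response into a navigable document
debates_xml <- read_xml("debates.xml")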
It’s not usually necessary to construct these kind of requests yourself
R, Python, and other programming languages have libraries to make it easier – but you have to find them!
I have provided a sample of APIs that have associated R packages on the next slide
The documentation for the API will describe the parameters that are available. Though normally in a way that is intensely frustrating.
There are many existing R packages that make it straightforward to retrieve data from an API:
API | R package | Description
---|---|---
Twitter | install.packages("rtweet") | Twitter, small-scale use, no longer free!
Guardian Newspaper | install.packages("guardianapi") | Full Guardian archive, 1999-present
Wikipedia | install.packages("WikipediR") | Wikipedia data and knowledge graph
TheyWorkForYou | install.packages("twfy") | Speeches from the UK House of Commons and Lords
ProPublica Congress API | install.packages("ProPublicaR") | Data from the US Congress
Google Books Ngrams | install.packages("ngramr") | Ngrams in Google Books, 1500-present
Reddit | install.packages("RedditExtractoR") | Subreddits, users, urls, texts of posts
Warning: I have not tested all of these!
We will use the Reddit API to search for subreddits on UK Politics in the past year
For this example, we are not collecting a large amount of data
In general, you need to create an authenticated client id
Rate limits: currently 100 queries per minute (QPM) per OAuth client id
We will use library(RedditExtractoR)
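A minimal sketch (this assumes RedditExtractoR's find_subreddits(), which returns one row per matching subreddit; the keyword and object name follow the text above):

library(RedditExtractoR)
library(dplyr)

# Search for subreddits whose name or description matches the keywords
subred <- find_subreddits("uk politics")
glimpse(subred)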
Rows: 233
Columns: 7
$ id <chr> "6c6t7t", "2rgbp", "483mxu", "2vorv", "324zi", "bxukg", "2…
$ date_utc <chr> "2022-05-08", "2010-01-21", "2021-04-08", "2012-12-01", "2…
$ timestamp <dbl> 1652004172, 1264066642, 1617865911, 1354391253, 1402456586…
$ subreddit <chr> "DeppDelusion", "piratepartyofcanada", "Divisive_Babble", …
$ title <chr> "Snapping you out of the delusion that Johnny Depp is a vi…
$ description <chr> "For people who feel gaslighted by the mainstream opinion …
$ subscribers <dbl> 27001, 508, 1015, 6956, 2423058, 0, 23117, 879, 21440, 0, …
Let’s clean the dataset and order those subreddits by number of subscribers.
subred <- subred %>%
# selects variables you want to explore
select(subreddit, title, description, subscribers) %>%
# creates new variables
mutate(subscribers_million = subscribers/1000000, subscribers = NULL) %>%
# arranges data from highest subscriber count
arrange(desc(subscribers_million))
head(subred[c("subreddit", "title", "subscribers_million")])
subreddit title subscribers_million
2qh1i AskReddit Ask Reddit... 51.217430
2qh13 worldnews World News 43.993982
2qjpg memes /r/Memes the original since 2008 35.311800
2qh55 food Welcome to /r/Food on Reddit! 24.303083
2cneq politics Politics 8.725471
2qh4j europe Europe 8.403108
Many are not about UK politics!
Let’s now try to extract only those that are truly about UK politics, by searching for the terms “UK politic” or “British politic” in the description or title.
uk.subred <- subred %>%
filter(grepl("UK politic|British politic", description, ignore.case = TRUE) |
grepl("UK politic|British politic", title, ignore.case = TRUE))
head(uk.subred[c("subreddit", "title", "subscribers_million")])
subreddit title subscribers_million
2qhcv ukpolitics UK Politics 0.511380
30c1v LabourUK The British Labour Party 0.066979
tzpe1 UKPoliticalComedy UK Political Comedy 0.060137
33geh UK_Politics UK Politics 0.004905
2qo8i PoliticsUK UK Politics Discussion 0.002722
31c96 UKPoliticsDiscussion UK Politics Discussion 0.000548
The two largest non-partisan subreddits on British politics are ukpolitics and UKPoliticalComedy
Let’s have a look at these two:
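A minimal sketch (this assumes RedditExtractoR's find_thread_urls(); the sorting and period arguments are illustrative choices):

# Collect thread metadata (titles, dates, comment counts, urls) for each subreddit
uk.comedy <- find_thread_urls(subreddit = "UKPoliticalComedy",
                              sort_by = "top", period = "all")
uk.politics <- find_thread_urls(subreddit = "ukpolitics",
                                sort_by = "top", period = "all")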
What is in this data?
Rows: 1,001
Columns: 7
$ date_utc <chr> NA, "2020-12-01", "2020-09-08", "2020-08-10", "2020-07-30", …
$ timestamp <dbl> NA, 1606834722, 1599605761, 1597046698, 1596104216, 15949771…
$ title <chr> NA, "New from Starmzy - (TheIainDuncanSmiths on Twitter)", "…
$ text <chr> NA, "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> NA, "UKPoliticalComedy", "UKPoliticalComedy", "UKPoliticalCo…
$ comments <dbl> NA, 2, 1, 15, 8, 4, 0, 0, 1, 0, 0, 6, 0, 0, 8, 0, 1, 3, 3, 1…
$ url <chr> NA, "https://www.reddit.com/r/UKPoliticalComedy/comments/k4m…
Rows: 955
Columns: 7
$ date_utc <chr> "2022-01-10", "2020-12-16", "2020-11-19", "2019-09-30", "201…
$ timestamp <dbl> 1641774825, 1608100800, 1605782676, 1569828606, 1566291684, …
$ title <chr> "Downing Street has formally denied a FOI request asking for…
$ text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "ukpolitics", "ukpolitics", "ukpolitics", "ukpolitics", "ukp…
$ comments <dbl> 232, 475, 187, 358, 303, 182, 428, 1544, 347, 1850, 271, 56,…
$ url <chr> "https://www.reddit.com/r/ukpolitics/comments/s067l9/downing…
We will work on the titles of those threads to measure sentiment with a dictionary-based approach
make.dfm <- function(data){
dfm <- data %>%
corpus(text_field = "title") %>%
tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_url = TRUE) %>%
dfm() %>%
dfm_remove(stopwords("en")) %>%
dfm_trim(min_termfreq = 3,
max_docfreq = .9,
docfreq_type = "prop") %>%
# New step: Keep words of at least 3 characters
dfm_select(pattern = "\\b\\w{3,}\\b", valuetype = "regex", selection = "keep")
return(dfm)
}
comedy.dfm <- make.dfm(uk.comedy)
politics.dfm <- make.dfm(uk.politics)
How many features in those DFMs?
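One way to check, using quanteda's nfeat() (a minimal sketch):

# Number of features (columns) in each document-feature matrix
nfeat(comedy.dfm)
nfeat(politics.dfm)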
Let’s measure sentiment based on the Lexicoder Sentiment Dictionary, available in quanteda as data_dictionary_LSD2015
dfm_lookup(comedy.dfm, dictionary = data_dictionary_LSD2015)%>%
dfm_remove(c("neg_positive", "neg_negative")) %>%
dfm_weight(scheme = "logave") %>%
convert("data.frame") %>%
mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>%
colMeans(na.rm = TRUE) %>% print()
negative positive neutral
0.05694306 0.11888112 0.82817183
dfm_lookup(politics.dfm, dictionary = data_dictionary_LSD2015)%>%
dfm_remove(c("neg_positive", "neg_negative")) %>%
dfm_weight(scheme = "logave") %>%
convert("data.frame") %>%
mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>%
colMeans(na.rm = TRUE) %>% print()
negative positive neutral
0.3434555 0.2523560 0.5424084
You could use those functions to explore:
The correlation between comment sentiment and upvotes/downvotes
Topics across sub-reddits
How conversations evolve depending on the topic
… and many other research questions!
If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com
Key steps in any web-scraping project:
1. Work out how the website is structured
2. Work out how links connect different pages
3. Isolate the information you care about on each page
4. Write a loop which connects steps 3 and 2, and saves the information you want from each page
5. Put it all into a nice and tidy data.frame
6. Feel like a superhero
(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)
Web-scraping can be illegal in some circumstances
Web-scraping is more likely to be illegal when…
It is harmful to the source, e.g.,
It gathers data that is under copyright, has privacy restrictions, or is used for financial gain
Even if not illegal, web-scraping can be ethically dubious. Especially when…
it is edging towards being illegal
the data is otherwise available via an API
it does not respect restrictions specified by the host website (often specified in a robots.txt file; see the sketch below)
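One way to check a site's robots.txt from R is the robotstxt package (a minimal sketch; the package is an assumption, not something we use elsewhere on the course):

library(robotstxt)
# Returns TRUE if the site's robots.txt permits scraping this path
paths_allowed("https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff")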
We will scrape the research interests of members of faculty in the Department of Political Science at UCL
The departmental website has a list of faculty members
Each member of the department has a unique page
The research interests of the faculty member are stored on their unique page
Let’s look at an example…
To collect the information we want, we need to see how it is stored within the html code that underpins the website
Webpages include much more than what is immediately visible to visitors
Crucially, they include code which provides structure, style and functionality (which your browser interprets)
HTML provides structure
css provides style
JavaScript provides functionality
To implement a web-scraper, we have to work directly with the source code
To see the source code, use Ctrl + U, or right click and select View/Show Page Source
We can read the source code of any website into R using the readLines() function.
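A minimal sketch (the URL is the staff page that appears in the output below; the object name is ours):

# Read the raw source code, one line per element of a character vector
spp_html <- readLines("https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff")
head(spp_html, 20)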
[1] "<!DOCTYPE html>"
[2] "<!--[if IE 7]>"
[3] "<html lang=\"en\" class=\"lt-ie9 lt-ie8 no-js\"> <![endif]-->"
[4] "<!--[if IE 8]>"
[5] "<html lang=\"en\" class=\"lt-ie9 no-js\"> <![endif]-->"
[6] "<!--[if gt IE 8]><!-->"
[7] "<html lang=\"en\" class=\"no-js\"> <!--<![endif]-->"
[8] "<head>"
[9] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"/>"
[10] " <meta name=\"author\" content=\"UCL\"/>"
[11] " <meta property=\"og:profile_id\" content=\"uclofficial\"/>"
[12] " <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"
[13] "<link rel=\"shortcut icon\" href=\"https://www.ucl.ac.uk/political-science/sites/all/themes/indigo/favicon.ico\" type=\"image/vnd.microsoft.icon\" />"
[14] "<link rel=\"canonical\" href=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"
[15] "<meta name=\"ucl:faculty\" content=\"Social & Historical Sciences\" />"
[16] "<meta property=\"og:site_name\" content=\"Department of Political Science\" />"
[17] "<meta name=\"ucl:sanitized_org_unit\" content=\"Department of Political Science\" />"
[18] "<meta property=\"og:type\" content=\"website\" />"
[19] "<meta property=\"og:title\" content=\"Academic, Teaching, and Research Staff\" />"
[20] "<meta property=\"og:url\" content=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"
This is helpful, but it is awkward to navigate the source code directly.
The read_html function in the rvest package allows us to read the HTML in a more structured format:
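A minimal sketch (the object name spp_page is ours; it is reused in the snippets below):

library(rvest)
# Parse the page into an html_document that we can query
spp_page <- read_html("https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff")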
We can then navigate through the HTML by searching for elements that share common attributes (using html_elements()):
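A minimal sketch (the "a.nav-item" CSS selector is an assumption, based on the class visible in the extracted elements; the object name spp_faculty_elements matches the code used later):

# Extract all link elements with the "nav-item" class
spp_faculty_elements <- spp_page %>% html_elements("a.nav-item")
head(spp_faculty_elements)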
{xml_nodeset (6)}
[1] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[2] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[3] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[4] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[5] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[6] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
The names of each faculty member are stored in the text associated with these elements:
[1] "<a href=\"/political-science/people/academic-teaching-and-research-staff/dr-jared-j-finnegan\" class=\"nav-item\">Dr Jared Finnegan</a>"
The URL for each faculty member is stored in the href attribute of the elements:
# html_attr() retrieves the attributes associated with the elements that we extracted above
spp_urls <- spp_faculty_elements %>% html_attr("href")
head(spp_urls)
[1] "/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"
[2] "/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"
[3] "/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"
[4] "/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"
[6] "/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"
# paste0() joins strings together
spp_urls <- paste0("https://www.ucl.ac.uk", spp_urls)
head(spp_urls)
[1] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"
[2] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"
[3] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"
[4] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"
[6] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"
name
1 Andrew Scott
2 Bugra Susler
3 Dr Adam Harris
4 Dr Alexandra Hartman
5 Dr Amanda Hall
6 Dr Aparna Ravi
url
1 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott
2 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler
3 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-adam-harris
4 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman
5 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall
6 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi
text
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
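To see how this works for a single faculty member, a minimal sketch (the row index 26 is illustrative; it corresponds to the page whose text is shown below):

# Read the page for one faculty member
faculty_member_page <- read_html(spp$url[26])
faculty_member_page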
{html_document}
<html lang="en" class="no-js">
[1] <head>\n<meta name="viewport" content="width=device-width, initial-scale= ...
[2] <body class="html not-front not-logged-in no-sidebars page-node page-node ...
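We can then pull out the paragraph that follows the "Research" heading, using the same xpath expression that appears in the loop below:

# Extract the first paragraph after the "Research" <h2> heading
faculty_member_page %>%
  html_nodes(xpath = '//h2[contains(text(), "Research")]/following-sibling::p[1]') %>%
  html_text()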
[1] "My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion."
We have the text for one person! How do we get this for all faculty members?
We can use a for loop to loop over the elements of our url variable
for(i in 1:nrow(spp)){
# Load page for faculty member i
faculty_member_page <- read_html(spp$url[i])
# Extract text from that page
faculty_member_text <- faculty_member_page %>%
html_nodes(xpath='//h2[contains(text(), "Research")]/following-sibling::p[1]') %>%
html_text() %>%
paste0(collapse = " ")
# Save text for faculty member i
spp$text[i] <- faculty_member_text
}
name
26 Dr Jack Blumenau
url
26 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-jack-blumenau
text
26 My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion.
name
91 Professor Lisa Vanhala
url
91 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/professor-lisa-vanhala
text
91 I am interested in the politics of climate change and the socio-legal study of human rights and equality issues. My current, ERC-funded project, the Politics and Governance of Climate Change Loss and Damage (CCLAD), explores attempts to govern the impacts of climate change we will not be able to adapt to at a global and national level. Relying on a political ethnographic approach, the project explores the role of norms, identities and the micro-level, everyday dynamics of global environmental governance.
name
87 Professor Jeffrey Howard
url
87 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/professor-jeffrey-howard
text
87 I currently direct the Digital Speech Lab, which hosts a range of research projects on the proper governance of online communications. Its purpose is to identify the fundamental principles that should guide the private and public regulation of online speech, and to trace those principles’ concrete implications in the face of difficult dilemmas about how best to respect free speech while preventing harm. The research team synthesizes expertise in political and moral philosophy, the philosophy of language, law and regulation, political science, and computer science. We engage a wide range of decisionmakers in industry, civil society, and policymaking. The Lab is funded by a UKRI Future Leaders Fellowship.
Let’s use this data to estimate a topic model
Two questions in this application:
library(stm)
## Create dfm
spp_corpus <- spp %>%
corpus(text_field = "text")
spp_dfm <- spp_corpus %>%
tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE) %>%
dfm() %>%
dfm_remove(c(stopwords("en"), "book", "journal", "professor",
"including", "include")) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 2) %>%
dfm_trim(max_docfreq = .7, docfreq_type = "prop")
## Estimate STM
stmOut <- stm(
documents = spp_dfm,
K = 12,
seed = 123,
verbose = FALSE
)
#save(stmOut, file = "stmOut.Rdata")
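A minimal sketch for inspecting the estimated topics, using labelTopics() from the stm package:

# Show the highest-probability and FREX words for each of the 12 topics
labelTopics(stmOut, n = 8)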
Think about your research projects now!
There are several possible sources of data for these projects
Data collection is a major part of any research project – it is good to practice this step!
PUBL0099