Qualitative comments focused on three main themes:
Instruction on the final project
More support on math / coding
“Complete analysis case”
Review an Existing Text-as-Data Application (25%)
Application of a Text-as-Data Method (75%)
Identify an interesting question
Search for data that can answer that question
Answer the question
Likely to lead to more interesting questions
Likely to lead to less time exploring different methods
Potentially large amount of time spent searching for/collecting data
Potential risk of finding nothing
Identify an interesting dataset
Explore that dataset using some of the tools we cover on the course
Construct a research question that you can answer using that data
Answer the question
Lower frustration
Potentially faster
Potentially less interesting research question/answers
Often more limited metadata
Be clear about the concept that you are trying to measure
Discuss the assumptions behind your chosen method
Provide some form of validation for your measure. E.g.
Demonstrate something interesting with the concept that you have measured
Good starting points for finding a research question:
Read journal articles
Read news articles
Discuss potential applications with classmates
Good starting points for finding data:
The Harvard Dataverse (https://dataverse.harvard.edu) is a data and code repository for many social science journals.
Many (though not all) papers will have links directly to a Dataverse page which you can use to find the data that was used in the paper
This is an excellent source of data for your projects!
Sometimes it can take a bit of searching through the files in each repository to figure out where the data is
Kaggle is a platform that hosts a wide variety of resources for quantitative text analysis, including a broad collection of text datasets (https://www.kaggle.com/datasets)
Many of these datasets are potentially interesting to social scientists, e.g.
Many of these datasets lack full documentation, particularly on important dimensions such as where the data came from, who provided it, and so on
API: Application Programming Interface — a way for two pieces of software to talk to each other
Your software can receive (and also send) data automatically through these services
Data is sent over http — the same way your browser sends and receives data
Most services provide helper code (known as a wrapper) to construct http requests
Both the wrapper and the service itself are called APIs
http service also sometimes known as REST (REpresentational State Transfer)
APIs typically require you to register for an API key to allow access
Before you commit to using a given API, check what the rate limits are on its use
Make sure you register with the service in plenty of time to actually get the data!
Once registered, you will have access to some kind of key that will allow you to access the API
http requests
It is helpful to start paying attention to the structure of basic http requests.
For instance, let’s say we want to get some data from the TheyWorkForYou API.
A test request:
https://www.theyworkforyou.com/api/getDebates&output=xml&search=brexit&num=1000&key=XXXXX
Parameters to the API are encoded in the URL:
output = which format do you want returned?
search = return speeches with which words?
num = number of results requested
key = access key
The output of an API will typically not be in csv or Rdata format
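For illustration only, the same request could be sent from R with the httr package; the API key below is a placeholder, and the parameter values simply mirror the example URL above.

# Sketch: sending the TheyWorkForYou request from R with httr.
# "XXXXX" is a placeholder for your own API key.
library(httr)
response <- GET("https://www.theyworkforyou.com/api/getDebates",
                query = list(output = "xml", search = "brexit",
                             num = 1000, key = "XXXXX"))
content(response, as = "text")  # the raw XML returned by the API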
Often, though not always, it will be in either JSON or XML
XML: eXtensible Markup Language
JSON : JavaScript Object Notation
If you have a choice, you probably want JSON
Both types of file are easily read into R: jsonlite and xml2 are the relevant packages
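As a hedged illustration (the URLs are placeholders), each format is read with a single function call:

# Sketch: reading API responses into R; the URLs are placeholders.
library(jsonlite)
library(xml2)
json_data <- fromJSON("https://example.com/api/results.json")  # list/data frame
xml_data <- read_xml("https://example.com/api/results.xml")    # xml_document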
It’s not usually necessary to construct these kind of requests yourself
R, Python, and other programming languages have libraries to make it easier – but you have to find them!
I have provided a sample of APIs that have associated R packages on the next slide
The documentation for the API will describe the parameters that are available, though normally in a way that is intensely frustrating.
There are many existing R packages that make it straightforward to retrieve data from an API:
API | R package | Description |
---|---|---|
Twitter | install.packages("rtweet") | Twitter, small-scale use, no longer free! |
Guardian Newspaper | install.packages("guardianapi") | Full Guardian archive, 1999-present |
Wikipedia | install.packages("WikipediR") | Wikipedia data and knowledge graph |
TheyWorkForYou | install.packages("twfy") | Speeches from the UK House of Commons and Lords |
ProPublica Congress API | install.packages("ProPublicaR") | Data from the US Congress |
Google Books Ngrams | install.packages("ngramr") | Ngrams in Google Books, 1500-present |
Reddit | install.packages("RedditExtractoR") | Reddit threads and comments |
Warning: I have not tested all of these!
We will use the Reddit API to search for subreddits on UK Politics in the past year
For this example, we are not collecting a large amount of data
In general, you need to create an authenticated client ID
Rate limits: currently 100 queries per minute (QPM) per OAuth client id
We will use library(RedditExtractoR)
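The exact call is not shown on the slide; a plausible sketch that produces output like the one below (the search string is an assumption) is:

# Sketch: search for subreddits matching a keyword (search term assumed)
library(RedditExtractoR)
library(dplyr)
subred <- find_subreddits("uk politics")
glimpse(subred)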
Rows: 231
Columns: 7
$ id <chr> "27hnjr", "3ahdz", "6c6t7t", "2vorv", "2rgbp", "2zc4g", "3…
$ date_utc <chr> "2019-10-30", "2015-10-25", "2022-05-08", "2012-12-01", "2…
$ timestamp <dbl> 1572393839, 1445747908, 1652004172, 1354391253, 1264066642…
$ subreddit <chr> "HouseOfTheDragon", "MCBC", "DeppDelusion", "UrbanStudies"…
$ title <chr> "House of the Dragon", "Model Canadian Broadcasting Corpor…
$ description <chr> "This is a place for news and discussions relating to HBO'…
$ subscribers <dbl> 1054533, 67, 23214, 6665, 498, 80377, 859, 0, 0, 20444, 86…
Let’s clean the dataset and order those subreddits by number of subscribers.
subred <- subred %>%
# selects variables you want to explore
select(subreddit, title, description, subscribers) %>%
# creates new variables
mutate(subscribers_million = subscribers/1000000, subscribers = NULL) %>%
# arranges data from highest subscriber count
arrange(desc(subscribers_million))
head(subred[c("subreddit", "title", "subscribers_million")])
subreddit title subscribers_million
2qh1i AskReddit Ask Reddit... 44.940265
2qh13 worldnews World News 34.855445
2qjpg memes /r/Memes the original since 2008 29.655626
2cneq politics Politics 8.471319
2qh4j europe Europe 5.729439
2w844 NoStupidQuestions No such thing as stupid questions 4.326992
Many are not about UK politics!
Let’s now try to extract only those that are truly about UK politics, by searching for the terms “UK politic” or “British politic” in the description or title.
uk.subred <- subred %>%
filter(grepl("UK politic|British politic", description, ignore.case = TRUE) |
grepl("UK politic|British politic", title, ignore.case = TRUE))
head(uk.subred[c("subreddit", "title", "subscribers_million")])
subreddit title subscribers_million
2qhcv ukpolitics UK Politics 0.477389
30c1v LabourUK The British Labour Party 0.064691
tzpe1 UKPoliticalComedy UK Political Comedy 0.059778
33geh UK_Politics UK Politics 0.004740
2qo8i PoliticsUK UK Politics Discussion 0.002090
3euqz casualukpolitics casual UK politics 0.000474
The two largest non-partisan subreddits on British politics are ukpolitics and UKPoliticalComedy
Let’s have a look at these two:
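A sketch of how those threads can be collected with RedditExtractoR (the sort order is left at its default; the slides specify threads from the past year):

# Sketch: collect thread listings for the two subreddits over the past year
uk.comedy <- find_thread_urls(subreddit = "UKPoliticalComedy", period = "year")
uk.politics <- find_thread_urls(subreddit = "ukpolitics", period = "year")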
What is in this data?
Rows: 999
Columns: 7
$ date_utc <chr> "2022-02-08", "2021-11-10", "2021-09-24", "2021-09-02", "202…
$ timestamp <dbl> 1644362664, 1636564393, 1632497067, 1630616527, 1627022456, …
$ title <chr> "Palps new job", "Sums things up pretty well.", "I am so unp…
$ text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "UKPoliticalComedy", "UKPoliticalComedy", "UKPoliticalComedy…
$ comments <dbl> 0, 4, 0, 1, 3, 5, 6, 0, 0, 2, 3, 3, 4, 2, 2, 2, 1, 1, 3, 1, …
$ url <chr> "https://www.reddit.com/r/UKPoliticalComedy/comments/sny3f5/…
Rows: 955
Columns: 7
$ date_utc <chr> "2020-11-19", "2019-09-30", "2019-08-20", "2022-01-20", "202…
$ timestamp <dbl> 1605782676, 1569828606, 1566291684, 1642682212, 1600764671, …
$ title <chr> "Government finally admits only £3bn of money for green reco…
$ text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "ukpolitics", "ukpolitics", "ukpolitics", "ukpolitics", "ukp…
$ comments <dbl> 187, 360, 304, 182, 428, 1549, 347, 1866, 273, 56, 216, 620,…
$ url <chr> "https://www.reddit.com/r/ukpolitics/comments/jx0kaz/governm…
We will work with the titles of those threads to measure sentiment with a dictionary-based approach
library(quanteda)

make.dfm <- function(data){
dfm <- data %>%
corpus(text_field = "title") %>%
tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_url = TRUE) %>%
dfm() %>%
dfm_remove(stopwords("en")) %>%
dfm_trim(min_termfreq = 3,
max_docfreq = .9,
docfreq_type = "prop") %>%
# New step: Keep words of at least 3 characters
dfm_select(pattern = "\\b\\w{3,}\\b", valuetype = "regex", selection = "keep")
return(dfm)
}
comedy.dfm <- make.dfm(uk.comedy)
politics.dfm <- make.dfm(uk.politics)
How many features in those DFMs?
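One way to check, using quanteda’s nfeat() (not shown on the original slide):

# Number of features (columns) in each document-feature matrix
nfeat(comedy.dfm)
nfeat(politics.dfm)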
Let’s measure sentiment based on the Lexicoder Sentiment Dictionary, available in quanteda as data_dictionary_LSD2015
dfm_lookup(comedy.dfm, dictionary = data_dictionary_LSD2015)%>%
dfm_remove(c("neg_positive", "neg_negative")) %>%
dfm_weight(scheme = "logave") %>%
convert("data.frame") %>%
mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>%
colMeans(na.rm = TRUE) %>% print()
negative positive neutral
0.05205205 0.12212212 0.82982983
dfm_lookup(politics.dfm, dictionary = data_dictionary_LSD2015)%>%
dfm_remove(c("neg_positive", "neg_negative")) %>%
dfm_weight(scheme = "logave") %>%
convert("data.frame") %>%
mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>%
colMeans(na.rm = TRUE) %>% print()
negative positive neutral
0.3434555 0.2502618 0.5403141
You could use those functions to explore:
The correlation between comment sentiment and upvotes/downvotes
Topics across sub-reddits
How conversations evolve depending on the topic
… and many other research questions!
If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com
Key steps in any web-scraping project:
Work out how the website is structured
Work out how links connect different pages
Isolate the information you care about on each page
Write a loop which applies step 3 to each of the pages identified in step 2, and saves the information you want from each page
Put it all into a nice and tidy data.frame
Feel like a superhero
(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)
Web-scraping can be illegal in some circumstances
Web-scraping is more likely to be illegal when…
It is harmful to the source, e.g.,
It gathers data that is under copyright, has privacy restrictions, or is used for financial gain
Even if not illegal, web-scraping can be ethically dubious. Especially when…
it is edging towards being illegal
the data is otherwise available via an API
it does not respect restrictions set by the host website (often specified in a robots.txt file)
We will scrape the research interests of members of faculty in the Department of Political Science at UCL
The departmental website has a list of faculty members
Each member of the department has a unique page
The research interests of the faculty member are stored on their unique page
Let’s look at an example…
To collect the information we want, we need to see how it is stored within the html code that underpins the website
Webpages include much more than what is immediately visible to visitors
Crucially, they include code which provides structure, style and functionality (which your browser interprets)
HTML provides structure
CSS provides style
JavaScript provides functionality
To implement a web-scraper, we have to work directly with the source code
To see the source code, use Ctrl + U
or right click and select View/Show Page Source
We can read the source code of any website into R using the readLines() function.
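A sketch of the call that likely produced the output below; the URL is the canonical link visible in the page source.

# Read the raw source of the departmental staff page, one line per element
page_source <- readLines("https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff")
head(page_source, 20)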
[1] "<!DOCTYPE html>"
[2] "<!--[if IE 7]>"
[3] "<html lang=\"en\" class=\"lt-ie9 lt-ie8 no-js\"> <![endif]-->"
[4] "<!--[if IE 8]>"
[5] "<html lang=\"en\" class=\"lt-ie9 no-js\"> <![endif]-->"
[6] "<!--[if gt IE 8]><!-->"
[7] "<html lang=\"en\" class=\"no-js\"> <!--<![endif]-->"
[8] "<head>"
[9] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"/>"
[10] " <meta name=\"author\" content=\"UCL\"/>"
[11] " <meta property=\"og:profile_id\" content=\"uclofficial\"/>"
[12] " <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"
[13] "<link rel=\"shortcut icon\" href=\"https://www.ucl.ac.uk/political-science/sites/all/themes/indigo/favicon.ico\" type=\"image/vnd.microsoft.icon\" />"
[14] "<link rel=\"canonical\" href=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"
[15] "<meta name=\"ucl:faculty\" content=\"Social & Historical Sciences\" />"
[16] "<meta property=\"og:site_name\" content=\"Department of Political Science\" />"
[17] "<meta name=\"ucl:sanitized_org_unit\" content=\"Department of Political Science\" />"
[18] "<meta property=\"og:type\" content=\"website\" />"
[19] "<meta property=\"og:title\" content=\"Academic, Teaching, and Research Staff\" />"
[20] "<meta property=\"og:url\" content=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"
This is helpful, but it is awkward to navigate the source code directly.
The read_html function in the rvest package allows us to read the HTML in a more structured format:
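A sketch of that step (the object name spp_page is an assumption):

# Parse the staff page into a structured HTML document
library(rvest)
spp_page <- read_html("https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff")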
We can then navigate through the HTML by searching for elements that share common tags or classes (using html_elements()):
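The exact selector used on the slides is not shown; a plausible sketch, using the “nav-item” class visible on the <a> elements in the output below:

# Select the link elements for individual faculty pages
spp_faculty_elements <- spp_page %>% html_elements("a.nav-item")
head(spp_faculty_elements)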
{xml_nodeset (6)}
[1] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[2] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[3] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[4] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[5] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[6] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
The names of each faculty member are stored in the text associated with these elements:
[1] "<a href=\"/political-science/people/academic-teaching-and-research-staff/dr-jeremy-bowles\" class=\"nav-item\">Dr Jeremy Bowles</a>"
The URL for each faculty member is stored in the href
attribute of the elements:
# html_attr() retrieves the attributes associated with the elements that we extracted above
spp_urls <- spp_faculty_elements %>% html_attr("href")
head(spp_urls)
[1] "/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"
[2] "/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"
[3] "/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"
[4] "/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"
[6] "/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"
# paste0() joins strings together
spp_urls <- paste0("https://www.ucl.ac.uk", spp_urls)
head(spp_urls)
[1] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"
[2] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"
[3] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"
[4] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"
[6] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"
name
1 Andrew Scott
2 Bugra Susler
3 Dr Adam Harris
4 Dr Alexandra Hartman
5 Dr Amanda Hall
6 Dr Aparna Ravi
url
1 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott
2 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler
3 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-adam-harris
4 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman
5 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall
6 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi
text
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
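A sketch of the call that likely produced the output below, reading the page of the first faculty member:

# Parse the personal page of one faculty member
faculty_member_page <- read_html(spp$url[1])
faculty_member_page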
{html_document}
<html lang="en" class="no-js">
[1] <head>\n<meta name="viewport" content="width=device-width, initial-scale= ...
[2] <body class="html not-front not-logged-in no-sidebars page-node page-node ...
[1] "My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion."
We have the text for one person! How do we get this for all faculty members?
for loops
We can use a for loop to loop over the elements of our url variable
for(i in 1:nrow(spp)){
# Load page for faculty member i
faculty_member_page <- read_html(spp$url[i])
# Extract text from that page
faculty_member_text <- faculty_member_page %>%
html_nodes(xpath='//h2[contains(text(), "Research")]/following-sibling::p[1]') %>%
html_text() %>%
paste0(collapse = " ")
# Save text for faculty member i
spp$text[i] <- faculty_member_text
}
name
35 Dr Kalina Zhekova
url
35 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-kalina-zhekova
text
35 My current research focuses on Russian foreign policy and military interventions in the post-Soviet space and the Middle East, and Russian collective (mis-)conceptions of state sovereignty and relations with the West. I am interested in interpretivist and poststructuralist approaches to Russian politics, and I examine the development of external and internal threat constructions, notions of inter-ethnic identities and their mobilisation in the process of policymaking, war and violence. I apply this approach to the study of Russian armed interventions in Syria, Georgia and the war in Ukraine.
What should we do with this data?
As a preview of one of the topics we will cover after reading week, we will use this data to estimate a topic model
A topic model describes a collection of documents in terms of a distinct number of topics
Each document in the model is described as a mixture of corpus-wide topics
A topic is a probability distribution over words in the vocabulary
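A minimal sketch of what that might look like, using the stm package; the model specification (including K = 5) is an illustrative assumption, not the one used in the course:

# Sketch only: a topic model of the scraped research-interest texts
library(quanteda)
library(stm)
spp_dfm <- corpus(spp, text_field = "text") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  dfm_remove(stopwords("en"))
# convert() drops any empty documents (faculty with no research text)
spp_stm <- convert(spp_dfm, to = "stm")
spp_topic_model <- stm(documents = spp_stm$documents,
                       vocab = spp_stm$vocab,
                       K = 5, verbose = FALSE)
labelTopics(spp_topic_model)  # most probable words for each topic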
Two questions in this application:
Think about your research projects now!
There are several possible sources of data for these projects
Data collection is a major part of any research project – it is good to practice this step!
Today we will learn to use the Guardian Newspaper API via the guardianapi package. There is also a web-scraping task for those of you who would like to try!
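As a hedged sketch (the query term and dates are placeholders, and the call assumes you have registered for a key and stored it as described in the package documentation), the core function is gu_content():

# Sketch only: retrieve Guardian articles matching a query for a date range
library(guardianapi)
brexit_articles <- gu_content(query = "brexit",
                              from_date = "2024-01-01",
                              to_date = "2024-01-31")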
PUBL0099