Review an Existing Text-as-Data Application (25%)
Application of a Text-as-Data Method (75%)
Review an Existing Text-as-Data Application (30%)
Application of a Text-as-Data Method (70%)
Note: Points in italics are only required for PGT students.
Identify an interesting question
Search for data that can answer that question
Answer the question
Likely to lead to more interesting questions
Likely to lead to less time exploring different methods
Potentially large amount of time spent searching for/collecting data
Potential risk of finding nothing
Identify an interesting dataset
Explore that dataset using some of the tools we cover on the course
Construct a research question that you can answer using that data
Answer the question
Lower frustration
Potentially faster
Potentially less interesting research question/answers
Often more limited metadata
Be clear about the concept that you are trying to measure
Discuss the assumptions behind your chosen method
Provide some form of validation for your measure. E.g.
Demonstrate something interesting with the concept that you have measured
Journal articles
Good starting points:
Read news articles
Discuss potential applications with classmates
Good starting points:
But there are many more!
The Harvard Dataverse (https://dataverse.harvard.edu) is a data and code repository for many social science journals.
Many (though not all) papers will have links directly to a Dataverse page which you can use to find the data that was used in the paper
This is an excellent source of data for your projects!
Sometimes it can take a bit of searching through the files in each repository to figure out where the data is
Kaggle is a platform that hosts a wide variety of resources for quantitative text analysis, including a broad collection of text datasets (https://www.kaggle.com/datasets)
Many of these datasets are potentially interesting to social scientists, e.g.
Many of these datasets lack full documentation, particularly on important dimensions such as where the data came from, who provided it, and so on
API: Application Programming Interface — a way for two pieces of software to talk to each other
Your software can receive (and also send) data automatically through these services
Data is sent via http requests — the same way your browser does it
Most services have helping code (known as a wrapper) to construct http requests
Both the wrapper and the service itself are called APIs
http service also sometimes known as REST (Representational State Transfer)
(Diagram source: GeeksforGeeks)
APIs typically require you to register for an API key to allow access
Before you commit to using a given API, check what the rate limits are on its use
Make sure you register with the service in plenty of time to actually get the data!
Once registered, you will have access to some kind of key that will allow you to access the API
http requests

It is helpful to start paying attention to the structure of basic http requests.
For instance, let’s say we want to get some data from the TheyWorkForYou API.
A test request:
https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&num=1000&key=XXXXX
Parameters to the API are encoded in the URL
output = Which format do you want returned?
search = Return speeches with which words?
num = number requested
key = access key
The output of an API will typically not be in csv or Rdata format
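The same request can also be built programmatically rather than by pasting the URL together by hand. A minimal sketch using the httr package (the endpoint and parameter names follow the example above; "XXXXX" is a placeholder for your own key):

```r
library(httr)     # for GET()
library(jsonlite) # for fromJSON()

# httr assembles the query string (?output=...&search=...) for us
resp <- GET(
  "https://www.theyworkforyou.com/api/getDebates",
  query = list(
    output = "js",      # ask for JSON output
    search = "brexit",  # return speeches containing this word
    num    = 1000,      # number of results requested
    key    = "XXXXX"    # your personal API key
  )
)

# Parse the JSON body of the response into an R object
debates <- fromJSON(content(resp, as = "text"))
```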
Often, though not always, it will be in either JSON or XML
XML: eXtensible Markup Language
JSON : JavaScript Object Notation
If you have a choice, you probably want JSON
Both types of file are easily read into R
jsonlite and xml2 are the relevant packages
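For example, jsonlite converts JSON text directly into R objects (the field names here are made up for illustration):

```r
library(jsonlite)

json_string <- '{"speaker": "J. Smith", "words": 1200, "topics": ["brexit", "trade"]}'
parsed <- fromJSON(json_string)

parsed$speaker  # a character scalar: "J. Smith"
parsed$topics   # a character vector: "brexit", "trade"
```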
It’s not usually necessary to construct these kind of requests yourself
R, Python, and other programming languages have libraries to make it easier – but you have to find them!
I have provided a sample of APIs that have associated R packages on the next slide
The documentation for the API will describe the parameters that are available. Though normally in a way that is intensely frustrating.
There are many existing R packages that make it straightforward to retrieve data from an API:
| API | R package | Description |
|---|---|---|
| Twitter | `install.packages("rtweet")` | Twitter, small-scale use, no longer free! |
| Guardian Newspaper | `install.packages("guardianapi")` | Full Guardian archive, 1999-present |
| Wikipedia | `install.packages("WikipediR")` | Wikipedia data and knowledge graph |
| TheyWorkForYou | `install.packages("twfy")` | Speeches from the UK House of Commons and Lords |
| ProPublica Congress API | `install.packages("ProPublicaR")` | Data from the US Congress |
| Google Books Ngrams | `install.packages("ngramr")` | Ngrams in Google Books, 1500-present |
| Reddit | `install.packages("RedditExtractoR")` | Subreddits, users, urls, texts of posts |
Warning: I have not tested all of these!
We will use the Reddit API to search for subreddits on UK Politics in the past year
For this example, we are not collecting a large amount of data
In general, you need to create an authenticated client ID
Rate limits: currently 100 queries per minute (QPM) per OAuth client id
We will use library(RedditExtractoR)
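A minimal sketch of the search step, assuming RedditExtractoR's `find_subreddits()` function and a simple keyword query (the exact query string used for the results below is an assumption):

```r
library(RedditExtractoR)
library(dplyr)

# Search Reddit for subreddits matching a keyword
subred <- find_subreddits("uk politics")

# Inspect the structure of the result
glimpse(subred)
```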
Rows: 233
Columns: 7
$ id <chr> "483mxu", "4rfocy", "3g37x", "2r7v0", "jrrb9", "3ahdz", "6…
$ date_utc <chr> "2021-04-08", "2021-07-15", "2016-08-30", "2009-09-23", "2…
$ timestamp <dbl> 1617865911, 1626368162, 1472562586, 1253680171, 1527610526…
$ subreddit <chr> "Divisive_Babble", "SteamDeck", "brandonlawson", "Academic…
$ title <chr> "Divisive Babble", "Steam Deck", "The Search For Brandon L…
$ description <chr> "This is a forum for the discussion of politics, current a…
$ subscribers <dbl> 1164, 1034951, 5667, 58183, 91, 74, 29290, 9842, 1244195, …
Let’s clean the dataset and order those subreddits by number of subscribers.
subred <- subred %>%
# selects variables you want to explore
select(subreddit, title, description, subscribers) %>%
# creates new variables
mutate(subscribers_million = subscribers/1000000, subscribers = NULL) %>%
# arranges data from highest subscriber count
arrange(desc(subscribers_million))
head(subred[c("subreddit", "title", "subscribers_million")])

      subreddit     title subscribers_million
2qh1i AskReddit Ask Reddit... 57.322329
2qh13 worldnews World News 46.913243
2qjpg memes /r/Memes the original since 2008 35.533307
2qh55 food Food Photos on Reddit 24.381233
2qh4j europe Europe 11.460773
2cneq politics Politics 8.966519
Many are not about UK politics!
Let’s now try to extract only those that are truly about UK politics, by searching for the terms “UK politic^” or “British politic^” in the description.
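The filtering step can be sketched as follows (the exact pattern is an assumption; `grepl()` keeps rows whose description matches either phrase, ignoring case):

```r
library(dplyr)

uk_subred <- subred %>%
  filter(grepl("UK politic|British politic",  # match either phrase
               description,
               ignore.case = TRUE))
```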
subreddit title subscribers_million
2qhcv ukpolitics UK Politics 0.532235
30c1v LabourUK The British Labour Party 0.069528
tzpe1 UKPoliticalComedy UK Political Comedy 0.059908
33geh UK_Politics UK Politics 0.005360
2qo8i PoliticsUK UK Politics Discussion 0.003332
4th4dw AskUKPolitics AskUKPolitics 0.000642
The two largest non-partisan subreddits on British politics are UKPoliticalComedy and ukpolitics
Let’s have a look at these two:
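A sketch of the thread collection step, assuming RedditExtractoR's `find_thread_urls()` function (whose output has the seven columns shown in the glimpses that follow):

```r
library(RedditExtractoR)
library(dplyr)

# Collect thread listings for each subreddit over the past year
uk.comedy   <- find_thread_urls(subreddit = "UKPoliticalComedy", period = "year")
uk.politics <- find_thread_urls(subreddit = "ukpolitics", period = "year")

glimpse(uk.comedy)
glimpse(uk.politics)
```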
What is in this data?
Rows: 248
Columns: 7
$ date_utc <chr> "2021-11-10", "2023-01-27", "2022-10-03", "2021-03-20", "202…
$ timestamp <dbl> 1636530752, 1674848275, 1664830422, 1616259328, 1615137661, …
$ title <chr> "New Conservatives logo", "Time to light the fire under Mr Z…
$ text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "UKPoliticalComedy", "UKPoliticalComedy", "UKPoliticalComedy…
$ comments <dbl> 3, 1, 5, 7, 6, 0, 13, 8, 1, 0, 12, 7, 1, 8, 4, 11, 36, 1, 28…
$ url <chr> "https://www.reddit.com/r/UKPoliticalComedy/comments/qqp9c0/…
Rows: 228
Columns: 7
$ date_utc <chr> "2019-03-18", "2017-09-10", "2021-01-06", "2019-05-02", "201…
$ timestamp <dbl> 1552923461, 1505071469, 1609942152, 1556832034, 1575982428, …
$ title <chr> "BREAKING: Speaker rules out the government bringing back me…
$ text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "ukpolitics", "ukpolitics", "ukpolitics", "ukpolitics", "ukp…
$ comments <dbl> 1061, 871, 354, 170, 771, 129, 640, 1648, 681, 569, 252, 575…
$ url <chr> "https://www.reddit.com/r/ukpolitics/comments/b2k5i1/breakin…
We will work on the titles of those threads to measure sentiment with a dictionary-based approach
make.dfm <- function(data){
dfm <- data %>%
corpus(text_field = "title") %>% # treat the "title" column as documents
tokens(remove_punct = TRUE, # split into words, dropping punctuation,
remove_symbols = TRUE, # symbols like £ or @,
remove_url = TRUE) %>% # and URLs
dfm() %>% # convert to document-feature matrix (word counts)
dfm_remove(stopwords("en")) %>% # drop English stopwords (the, is, and...)
dfm_trim(min_termfreq = 3, # drop words appearing fewer than 3 times total
max_docfreq = .9, # and words in more than 90% of docs (too common)
docfreq_type = "prop") %>%
dfm_select(pattern = "\\b\\w{3,}\\b", # keep only words with 3+ characters
valuetype = "regex",
selection = "keep")
return(dfm)
}
comedy.dfm <- make.dfm(uk.comedy) # build DFM for comedy subreddit posts
politics.dfm <- make.dfm(uk.politics) # build DFM for politics subreddit posts

How many features in those DFMs?
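The size of each DFM can be checked with quanteda's `ndoc()` and `nfeat()` functions:

```r
library(quanteda)

ndoc(comedy.dfm)    # number of documents (thread titles)
nfeat(comedy.dfm)   # number of features (distinct words kept)

ndoc(politics.dfm)
nfeat(politics.dfm)
```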
Let’s measure sentiment based on the Lexicoder Sentiment Dictionary, available in quanteda as data_dictionary_LSD2015
dfm_lookup(comedy.dfm, dictionary = data_dictionary_LSD2015) %>% # check matches to Lexicoder sentiment dictionary (pos/neg words)
dfm_remove(c("neg_positive", "neg_negative")) %>% # drop negated categories (e.g. "not good" counted as neg_positive)
dfm_weight(scheme = "logave") %>% # weight by log-average to normalise for document length
convert("data.frame") %>% # convert DFM to a regular dataframe
mutate(doc_id=NULL, # drop the document ID column
positive = trunc(positive), # truncate decimals to whole numbers
negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>% # TRUE/FALSE: is the doc equally pos and neg?
colMeans(na.rm = TRUE) %>% # average each column across all documents
  print()

  negative   positive    neutral
0.01209677 0.11693548 0.87096774
dfm_lookup(politics.dfm, dictionary = data_dictionary_LSD2015)%>%
dfm_remove(c("neg_positive", "neg_negative")) %>%
dfm_weight(scheme = "logave") %>%
convert("data.frame") %>%
mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>%
mutate(neutral = positive == negative) %>%
  colMeans(na.rm = TRUE) %>%
  print()

  negative   positive    neutral
0.08771930 0.09649123 0.82456140
You could use those functions to explore:
The correlation between comment sentiment and upvotes/downvotes
Topics across subreddits
How conversations evolve depending on the topic
… and many other research questions!
If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com
Key steps in any web-scraping project:
1. Work out how the website is structured
2. Work out how links connect different pages
3. Isolate the information you care about on each page
4. Write a loop which connects steps 2 and 3, and saves the information you want from each page
5. Put it all into a nice and tidy data.frame
6. Feel like a superhero 🪄
(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)
Web-scraping can be illegal in some circumstances
Web-scraping is more likely to be illegal when…
It is harmful to the source, e.g.,
It gathers data that is under copyright, is subject to privacy restrictions, or is used for financial gain
Even if not illegal, web-scraping can be ethically dubious. Especially when…
it is edging towards being illegal
the data is otherwise available via an API
it does not respect restrictions specified by the host website (often specified in a robots.txt file)
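You can inspect a site's robots.txt directly in R — it always lives at the root of the domain. A minimal sketch (the domain here is illustrative):

```r
# Read the robots.txt file, one line per element of a character vector
robots <- readLines("https://www.ucl.ac.uk/robots.txt", warn = FALSE)

# Look for Disallow rules, which list paths scrapers should avoid
head(robots)
```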
We will scrape the research interests of members of faculty in the Department of Political Science at UCL
The departmental website has a list of faculty members
Each member of the department has a unique page
The research interests of the faculty member are stored on their unique page
Let’s look at an example…
To collect the information we want, we need to see how it is stored within the html code that underpins the website
Webpages include much more than what is immediately visible to visitors
Crucially, they include code which provides structure, style and functionality (which your browser interprets)
HTML provides structure
CSS provides style
JavaScript provides functionality
To implement a web-scraper, we have to work directly with the source code
To see the source code, use Ctrl + U or right click and select View/Show Page Source
We can read the source code of any website into R using the readLines() function.
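A minimal sketch (the URL of the departmental staff page is an assumption):

```r
# Read the raw html source, one line per element of a character vector
raw_html <- readLines("https://www.ucl.ac.uk/political-science/people",
                      warn = FALSE)
head(raw_html)
```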
The structured tags (<td>, <p>, <a>, <img>) are how content is organized:
<td> — a table cell containing each person’s information
<a href="https://profiles.ucl.ac.uk/..."> — a clickable link to each person’s profile page
<p> — paragraphs containing the research description
<img> — the person’s photo
This is helpful, but it is awkward to navigate the source code directly.
The read_html function in the rvest package allows us to read the HTML in a more structured format:
{html_document}
<html lang="en" dir="ltr" prefix="og: https://ogp.me/ns#">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="colour-scheme--gordon-glow">\n <a href="#main-content ...
We can then navigate through the HTML by searching for elements that have common elements (using html_elements()):
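For instance, one could select the profile links with a CSS attribute selector (the selector is an assumption; it matches <a> tags whose href contains profiles.ucl.ac.uk, and spp_page is assumed to be the result of read_html() on the staff page):

```r
library(rvest)

faculty_links <- spp_page %>%
  html_elements("a[href*='profiles.ucl.ac.uk']")

head(faculty_links, 5)
```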
{xml_nodeset (5)}
[1] <a href="https://profiles.ucl.ac.uk/9520-rod-abouharb" rel="nofollow"><st ...
[2] <a href="https://profiles.ucl.ac.uk/72700-valentina-amuso" rel="nofollow" ...
[3] <a href="https://profiles.ucl.ac.uk/86078-samer-anabtawi" rel="nofollow"> ...
[4] <a href="https://profiles.ucl.ac.uk/91233-phillip-ayoub" rel="nofollow">< ...
[5] <a href="https://profiles.ucl.ac.uk/1510-kristin-bakke" rel="nofollow"><s ...
The names of each faculty member are stored in the text associated with these elements:
The URL for each faculty member is stored in the href attribute of the elements:
[1] "https://profiles.ucl.ac.uk/9520-rod-abouharb"
[2] "https://profiles.ucl.ac.uk/72700-valentina-amuso"
[3] "https://profiles.ucl.ac.uk/86078-samer-anabtawi"
[4] "https://profiles.ucl.ac.uk/91233-phillip-ayoub"
[5] "https://profiles.ucl.ac.uk/1510-kristin-bakke"
[6] "https://profiles.ucl.ac.uk/101976-carlos-balcazar"
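The pieces above can be combined into a data frame. A sketch, assuming `faculty_links` is the xml_nodeset of profile links selected earlier; the `text` column is initialised as NA, to be filled in when we scrape each profile:

```r
library(rvest)

spp <- data.frame(
  name = html_text2(faculty_links),        # visible link text = the name
  url  = html_attr(faculty_links, "href"), # href attribute = profile URL
  text = NA                                # to be filled in later
)

head(spp)
```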
name url
1 Dr M. Rodwan Abouharb https://profiles.ucl.ac.uk/9520-rod-abouharb
2 Dr Valentina Amuso https://profiles.ucl.ac.uk/72700-valentina-amuso
3 Dr Samer Anabtawi https://profiles.ucl.ac.uk/86078-samer-anabtawi
4 Professor Phillip Ayoub https://profiles.ucl.ac.uk/91233-phillip-ayoub
5 Professor Kristin M Bakke https://profiles.ucl.ac.uk/1510-kristin-bakke
6 Dr Carlos Balcazar https://profiles.ucl.ac.uk/101976-carlos-balcazar
text
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
library(rvest)
library(stringr)
jack_cell <- spp_page %>% # from the main html
html_elements("table td:nth-child(2)") %>% # get the second column of every table row
# (col 1 = photo, col 2 = name + title + bio)
keep(~ grepl("Blumenau", # search for "Blumenau"
html_text2(.x))) # in the visible text of each cell
# keep() retains only matching cells
jack_cell

{xml_nodeset (1)}
[1] <td>\n<p><a href="https://profiles.ucl.ac.uk/62358-jack-blumenau" rel="no ...
[1] "Dr Jack Blumenau\nAssociate Professor of Political Science and Quantitative Research Methods\n\nDr Blumenau’s research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems."
We have the text for one person! How do we get this for all faculty members?
for loops

We can use a for loop to loop over the elements of our url variable
# Step 1: Get all info cells from the surname table
# Each row has two columns: (1) photo, (2) name + title + bio
# We grab the second column of every row
all_cells <- spp_page %>%
html_elements("table td:nth-child(2)")
for(i in 1:nrow(spp)){
# Step 2: Get the last name for person i from our data frame
last_name <- word(spp$name[i], -1) # word() extracts the last word
# e.g. "Dr Jack Blumenau" → "Blumenau"
# Step 3: Search all cells for one containing that last name
person_cell <- all_cells %>%
keep(~ grepl(last_name, # search for the last name
html_text2(.x), # in the text of each cell
#.x is a placeholder for the current element
#html_text2 extracts visible text
fixed = TRUE)) # exact match (no regex)
# Step 4: Save the text, or NA if no match found
if(length(person_cell) > 0){
spp$text[i] <- html_text2(person_cell[[1]]) # take first match
} else {
spp$text[i] <- NA # person not in table
}
}

[1] "Dr M. Rodwan Abouharb\nAssociate Professor in International Relations\n\nDr Abouharb’s research places particular emphasis on understanding how both domestic and international socio-economic processes affect the human security of citizens around the world."
Let’s use this data to estimate a topic model
Two questions in this application:
library(stm)
library(quanteda)
spp <- spp %>%
filter(!is.na(text) & text != "") # remove rows where the text is missing (NA)
# or empty ("") — these are people whose
# last name didn't match in the surname table
## Create dfm
spp_corpus <- spp %>%
corpus(text_field = "text")
spp_dfm <- spp_corpus %>%
tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE) %>%
dfm() %>%
dfm_remove(c(stopwords("en"), "book",
"journal", "professor",
"prof", "Prof.", "associate",
"teaching", "study","project","focus",
"focuses","focused", "interests", "particular",
"lecturer", "well", "including",
"include","dr","dr.","interested","works",
"studies","fellow","director", "emphasis", "within")) %>%
dfm_trim(min_termfreq = 4, # drop words that appear fewer than 4 times total
min_docfreq = 4) %>% # drop words that appear in fewer than 4 documents
dfm_trim(max_docfreq = .9, # drop words that appear in more than 90% of documents
docfreq_type = "prop") # interpret max_docfreq as a proportion (not a count)
# these are too common to be informative (e.g. "research")
## Estimate STM
stmOut <- stm(
documents = spp_dfm,
K = 12,
seed = 123,
verbose = FALSE #run silently
)
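Once estimated, the topics can be inspected with stm's `labelTopics()` function, which prints the highest-probability words for each topic:

```r
library(stm)

# Show the top words associated with each of the 12 topics
labelTopics(stmOut, n = 8)

# Plot the expected proportion of each topic across the corpus
plot(stmOut, type = "summary")
```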
#save(stmOut, file = "stmOut.Rdata")

Think about your research projects now!
There are several possible sources of data for these projects
Data collection is a major part of any research project – it is good to practice this step!
PUBL0099