5: Designing Projects and Collecting Text Data

Jack Blumenau

Research Project Advice

Components

  1. Review an Existing Text-as-Data Application (25%)

    • Any paper that uses quantitative text analysis of the type we study in the course
    • Must be from the social sciences (political science, public policy, economics, sociology, psychology, etc)
    • 1,000 words
    • Goal: Critical review of a single article
  2. Application of a Text-as-Data Method (75%)

    • Any method that we have studied on the course
    • Must be answering a social science research question
    • 2,000 words
    • Goals: Demonstrate understanding of method, and ability to apply, validate and interpret

Criteria

  1. Review an Existing Text-as-Data Application (25%)

    • Accurately describe the method and key implementation details
    • Discuss assumptions of method and whether they are met in this application
    • Assessment of strengths and weaknesses of the approach
    • Propose one alternative strategy for this application
  2. Application of a Text-as-Data Method (75%)

    • Clear and answerable research question
    • Accurate description of selected method
    • Discussion of assumptions behind the method
    • Description/Use of a dataset that we haven’t used on the course
    • Implementation of chosen method, with clear description/justification of analysis choices
    • Accurate interpretation of method output
    • Some attempt to validate chosen approach
    • Discussion of strengths and weaknesses of chosen approach
    • Some credit for originality and ambition of project

Note: Points in italics are only required for PGT students.

Questions First, Data Second

  1. Identify an interesting question

  2. Search for data that can answer that question

  3. Answer the question

Advantages:

  • Likely to lead to more interesting questions

    • A good research project answers an interesting question, and in this model you think about that first
  • Likely to lead to less time exploring different methods

    • Your question will guide you to a particular analysis
    • If you need to measure topics to answer your question, then use a topic model. If you need to measure complexity, then use a readability metric, etc.

Disadvantages:

  • Potentially large amount of time spent searching for/collecting data

    • The data you need may not exist in an easily accessible form
    • You will need to spend time and effort collecting it, whether by downloading it from various sources, querying an API, or web-scraping
  • Potential risk of finding nothing

    • The data you need may not exist at all!
    • Then you will need to find another question

Data First, Questions Second

  1. Identify an interesting dataset

  2. Explore that dataset using some of the tools we cover on the course

  3. Construct a research question that you can answer using that data

  4. Answer the question

Advantages:

  • Lower frustration

    • You will not spend lots of effort thinking about projects that have no hope!
  • Potentially faster

    • You will not spend lots of time trying to find data only to discover that it doesn’t exist

Disadvantages:

  • Potentially less interesting research question/answers

    • e.g. if the only data you can find is a BBC food archive, then you might have to write a silly paper all about the classification of curry recipes…
  • Often more limited metadata

    • The texts are themselves rarely sufficient for a compelling analysis. We often want metadata so that we can describe variation in some quantity of interest
    • Does your data include information about the authors of the documents? The dates they were produced? Etc.

Principles of Good Research Projects

  1. Be clear about the concept that you are trying to measure

    • Why is this an interesting concept?
    • Why is doing it this way an improvement on existing approaches?
  2. Discuss the assumptions behind your chosen method

    • What needs to be true in order for your measure to produce reasonable results?
  3. Provide some form of validation for your measure (see the sketch after this list). E.g.

    • Hand-code a sample of documents and show that your quantitative text measure is associated with the human coding
    • Show that your measure passes basic face-validity checks (does it correlate with things in sensible ways, are the top-scoring texts sensible, etc)
  4. Demonstrate something interesting with the concept that you have measured

    • Finding some texts and implementing a method on the course is not sufficient – you need to answer some kind of social-science research question
    • Make use of metadata!
    • Show how the quantities you are estimating vary across groups/over time etc.
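
For instance, a minimal sketch of the hand-coding validation (all object names here are hypothetical: scores holds your text-based measure for each document, and hand_codes your manual labels for a random sample of the same documents):

# Draw a random validation sample and check that the automated
# measure is associated with the human coding
set.seed(123)
validation_ids <- sample(seq_along(scores), 100)
cor(scores[validation_ids], hand_codes, use = "complete.obs")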

Where To Find Good Research Questions

Thinking about the World

  • Read news articles

  • Discuss potential applications with classmates

  • Good starting points:

    • The Economist
    • The Financial Times
    • Podcasts

Using Existing Datasets

Dataverse

  • The Harvard Dataverse (https://dataverse.harvard.edu) is a data and code repository for many social science journals.

  • Many (though not all) papers will have links directly to a Dataverse page which you can use to find the data that was used in the paper

  • This is an excellent source of data for your projects!

  • Sometimes it can take a bit of searching through the files in each repository to figure out where the data is

Kaggle

  • Kaggle is a platform that hosts a wide variety of resources for quantitative text analysis, including a broad collection of text datasets (https://www.kaggle.com/datasets)

  • Many of these datasets are potentially interesting to social scientists, e.g.

    • New York Times Comments
    • Tweets about COVID-19
    • Corpus of Academic Papers relating to COVID19
    • Donald Trump’s Rally Speeches
    • etc
  • Many of these datasets lack full documentation, particularly on important dimensions such as where the data came from, who provided it, and so on

APIs

  • API: Application Programming Interface — a way for two pieces of software to talk to each other

  • Your software can receive (and also send) data automatically through these services

  • Data is sent (and received) via http requests, the same way your browser does it

  • Most services have helper code (known as a wrapper) to construct http requests

  • Both the wrapper and the service itself are called APIs

  • An http service of this kind is also sometimes known as REST (REpresentational State Transfer); a minimal example of querying one from R follows
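
Many public APIs return JSON, which can be fetched and parsed in a single step. A minimal sketch using the jsonlite package (the endpoint here is hypothetical):

library(jsonlite)

# fromJSON() sends the http request and parses the JSON response
# into R objects (lists and data frames) in one step
result <- fromJSON("https://api.example.com/v1/items?limit=10")
str(result)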

API registration and authentication

  • APIs typically require you to register for an API key to allow access

    • Many are not free, at least for large-scale use
  • Before you commit to using a given API, check what the rate limits are on its use

    • Limits on total number of requests for a given user
    • Limits on the total number of requests in a given day/minute/hour etc
  • Make sure you register with the service in plenty of time to actually get the data!

  • Once registered, you will have access to some kind of key that will allow you to access the API (a sketch of how to store it safely follows)
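
A good habit is to keep the key out of your scripts entirely. A minimal sketch (the variable name MY_API_KEY is hypothetical):

# Store the key in an environment variable, e.g. by adding the line
#   MY_API_KEY=abc123
# to your ~/.Renviron file, then read it at runtime:
api_key <- Sys.getenv("MY_API_KEY")
stopifnot(nzchar(api_key))   # fail early if the key is missing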

http requests

It is helpful to start paying attention to the structure of basic http requests.

For instance, let’s say we want to get some data from the TheyWorkForYou API.

A test request:

https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&num=1000&key=XXXXX

  • Parameters to the API are encoded in the URL

    • output = Which format do you want returned?
    • search = Return speeches with which words?
    • num = number requested
    • key = access key
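
You can also build such requests programmatically rather than pasting parameters into a URL by hand. A hedged sketch using the httr package, assuming your key is stored in a (hypothetical) TWFY_API_KEY environment variable:

library(httr)

resp <- GET("https://www.theyworkforyou.com/api/getDebates",
            query = list(output = "xml",
                         search = "brexit",
                         num    = 1000,
                         key    = Sys.getenv("TWFY_API_KEY")))
stop_for_status(resp)                     # error if the request failed
debates_xml <- content(resp, as = "text")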

API Output

  • The output of an API will typically not be in csv or Rdata format

  • Often, though not always, it will be in either JSON or XML

    • XML: eXtensible Markup Language

    • JSON : JavaScript Object Notation

  • If you have a choice, you probably want JSON

  • Both types of file are easily read into R

  • jsonlite and xml2 are the relevant packages
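
A minimal sketch of reading each format (the file and node names here are hypothetical):

library(jsonlite)
library(xml2)

# JSON: fromJSON() parses nested objects into R lists and data frames
dat <- fromJSON("response.json")

# XML: read_xml() parses the document; xml_find_all() extracts the
# nodes matching an XPath expression
doc <- read_xml("response.xml")
xml_find_all(doc, "//speech")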

API packages

  • It’s not usually necessary to construct these kinds of requests yourself

  • R, Python, and other programming languages have libraries to make it easier – but you have to find them!

  • I have provided a sample of APIs that have associated R packages on the next slide

  • The documentation for the API will describe the parameters that are available, though normally in a way that is intensely frustrating

Sample of APIs

There are many existing R packages that make it straightforward to retrieve data from an API:

API                      R package                             Description
Twitter                  install.packages("rtweet")            Twitter; small-scale use; no longer free!
Guardian Newspaper       install.packages("guardianapi")       Full Guardian archive, 1999-present
Wikipedia                install.packages("WikipediR")         Wikipedia data and knowledge graph
TheyWorkForYou           install.packages("twfy")              Speeches from the UK House of Commons and Lords
ProPublica Congress API  install.packages("ProPublicaR")       Data from the US Congress
Google Books Ngrams      install.packages("ngramr")            Ngrams in Google Books, 1500-present
Reddit                   install.packages("RedditExtractoR")   Subreddits, users, URLs, texts of posts

Warning: I have not tested all of these!

API demonstration

Reddit API

  • We will use the Reddit API to search for subreddits on UK Politics in the past year

  • For this example, we are not collecting a large amount of data

  • In general, you need to create an authenticated client ID

  • Rate limits: currently 100 queries per minute (QPM) per OAuth client id

  • We will use library(RedditExtractoR)
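
If you do end up making many queries, a simple way to stay within the rate limit is to pause between requests. A minimal sketch, where queries is a hypothetical character vector of search terms:

results <- lapply(queries, function(q) {
  Sys.sleep(60 / 100)   # pause so that we stay under 100 queries per minute
  find_subreddits(q)
})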

Reddit API Application

library(RedditExtractoR)
library(tidyverse)   # for glimpse() and the pipe used below

subred <- find_subreddits("UK politics")
glimpse(subred)
Rows: 233
Columns: 7
$ id          <chr> "6c6t7t", "2rgbp", "483mxu", "2vorv", "324zi", "bxukg", "2…
$ date_utc    <chr> "2022-05-08", "2010-01-21", "2021-04-08", "2012-12-01", "2…
$ timestamp   <dbl> 1652004172, 1264066642, 1617865911, 1354391253, 1402456586…
$ subreddit   <chr> "DeppDelusion", "piratepartyofcanada", "Divisive_Babble", …
$ title       <chr> "Snapping you out of the delusion that Johnny Depp is a vi…
$ description <chr> "For people who feel gaslighted by the mainstream opinion …
$ subscribers <dbl> 27001, 508, 1015, 6956, 2423058, 0, 23117, 879, 21440, 0, …

Reddit API Application

Let’s clean the dataset and order those subreddits by number of subscribers.

subred <- subred %>%
  
  # selects variables you want to explore
  select(subreddit, title, description, subscribers) %>% 
  
  # creates new variables
  mutate(subscribers_million = subscribers/1000000, subscribers = NULL) %>% 
  
  # arranges data from highest subscriber count
  arrange(desc(subscribers_million)) 

head(subred[c("subreddit", "title", "subscribers_million")])
      subreddit                            title subscribers_million
2qh1i AskReddit                    Ask Reddit...           51.217430
2qh13 worldnews                       World News           43.993982
2qjpg     memes /r/Memes the original since 2008           35.311800
2qh55      food    Welcome to /r/Food on Reddit!           24.303083
2cneq  politics                         Politics            8.725471
2qh4j    europe                           Europe            8.403108

Reddit API Application

Many are not about UK politics!

Let’s now try to extract only those that are truly about UK politics, by searching for the terms “UK politic” or “British politic” in the title or description.

uk.subred <- subred %>%
  filter(grepl("UK politic|British politic", description, ignore.case = TRUE) |
  grepl("UK politic|British politic", title, ignore.case = TRUE))

head(uk.subred[c("subreddit", "title", "subscribers_million")])
                 subreddit                    title subscribers_million
2qhcv           ukpolitics              UK Politics            0.511380
30c1v             LabourUK The British Labour Party            0.066979
tzpe1    UKPoliticalComedy      UK Political Comedy            0.060137
33geh          UK_Politics              UK Politics            0.004905
2qo8i           PoliticsUK   UK Politics Discussion            0.002722
31c96 UKPoliticsDiscussion   UK Politics Discussion            0.000548

Reddit API Application

The two largest non-partisan subreddits on British politics are ukpolitics and UKPoliticalComedy

Let’s have a look at these two:

uk.comedy <- find_thread_urls(subreddit = 'UKPoliticalComedy', sort_by = 'top', period = 'all')
uk.politics <- find_thread_urls(subreddit = 'ukpolitics', sort_by = 'top', period = 'all')

What is in this data?

glimpse(uk.comedy) 
Rows: 1,001
Columns: 7
$ date_utc  <chr> NA, "2020-12-01", "2020-09-08", "2020-08-10", "2020-07-30", …
$ timestamp <dbl> NA, 1606834722, 1599605761, 1597046698, 1596104216, 15949771…
$ title     <chr> NA, "New from Starmzy - (TheIainDuncanSmiths on Twitter)", "…
$ text      <chr> NA, "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> NA, "UKPoliticalComedy", "UKPoliticalComedy", "UKPoliticalCo…
$ comments  <dbl> NA, 2, 1, 15, 8, 4, 0, 0, 1, 0, 0, 6, 0, 0, 8, 0, 1, 3, 3, 1…
$ url       <chr> NA, "https://www.reddit.com/r/UKPoliticalComedy/comments/k4m…
glimpse(uk.politics) 
Rows: 955
Columns: 7
$ date_utc  <chr> "2022-01-10", "2020-12-16", "2020-11-19", "2019-09-30", "201…
$ timestamp <dbl> 1641774825, 1608100800, 1605782676, 1569828606, 1566291684, …
$ title     <chr> "Downing Street has formally denied a FOI request asking for…
$ text      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "ukpolitics", "ukpolitics", "ukpolitics", "ukpolitics", "ukp…
$ comments  <dbl> 232, 475, 187, 358, 303, 182, 428, 1544, 347, 1850, 271, 56,…
$ url       <chr> "https://www.reddit.com/r/ukpolitics/comments/s067l9/downing…

Reddit API Application

We will work with the titles of these threads to measure sentiment using a dictionary-based approach

library(quanteda)   # for corpus(), tokens(), dfm(), etc.

make.dfm <- function(data){
  dfm <- data %>%
    corpus(text_field = "title") %>%
    tokens(remove_punct = TRUE,
           remove_symbols = TRUE,
           remove_url = TRUE) %>%
    dfm() %>%
    dfm_remove(stopwords("en")) %>%
    dfm_trim(min_termfreq = 3,
             max_docfreq = .9,
             docfreq_type = "prop") %>%

    # New step: Keep words of at least 3 characters
    dfm_select(pattern = "\\b\\w{3,}\\b", valuetype = "regex", selection = "keep")

  return(dfm)
}


comedy.dfm <- make.dfm(uk.comedy)
politics.dfm <- make.dfm(uk.politics)

How many features in those DFMs?

dim(comedy.dfm)
[1] 1001  293
dim(politics.dfm)
[1]  955 1078

Reddit API Application

Let’s measure sentiment based on the Lexicoder Sentiment Dictionary, available in quanteda as data_dictionary_LSD2015

dfm_lookup(comedy.dfm, dictionary = data_dictionary_LSD2015) %>%
  dfm_remove(c("neg_positive", "neg_negative")) %>%
  dfm_weight(scheme = "logave")  %>%
  convert("data.frame") %>%
  mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>% 
  mutate(neutral = positive == negative) %>% 
  colMeans(na.rm = TRUE) %>% print()
  negative   positive    neutral 
0.05694306 0.11888112 0.82817183 
dfm_lookup(politics.dfm, dictionary = data_dictionary_LSD2015) %>%
  dfm_remove(c("neg_positive", "neg_negative")) %>%
  dfm_weight(scheme = "logave")  %>%
  convert("data.frame") %>%
  mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>% 
  mutate(neutral = positive == negative) %>% 
  colMeans(na.rm = TRUE) %>% print()
 negative  positive   neutral 
0.3434555 0.2523560 0.5424084 

Other functions

# Get thread content, for given URLs
get_thread_content()
# Get information on a particular user, for given list of users
get_user_content()

You could use those functions to explore:

  • The correlation between comment sentiment and upvotes/downvotes

  • Topics across sub-reddits

  • How conversations evolve depending on the topic

  • … and many other research questions!
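
As a hedged sketch of the first of these ideas, assuming (as in recent versions of RedditExtractoR) that get_thread_content() returns a list whose comments data frame contains comment (text) and score (net upvotes) columns:

# Fetch the comments for a handful of threads
thread_data <- get_thread_content(uk.politics$url[1:10])
comments <- thread_data$comments

# Score each comment with the same sentiment dictionary as above
comment_sent <- comments %>%
  corpus(text_field = "comment") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  dfm_lookup(dictionary = data_dictionary_LSD2015) %>%
  convert("data.frame") %>%
  mutate(net_sentiment = positive - negative)

# Correlate comment sentiment with comment scores
cor(comment_sent$net_sentiment, comments$score, use = "complete.obs")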

Break & Q&A


If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com

Web-scraping

Web scraping overview

Key steps in any web-scraping project:

  1. Work out how the website is structured

  2. Work out how links connect different pages

  3. Isolate the information you care about on each page

  4. Write a loop which applies step 3 to each of the pages you identified in step 2, and saves the information you want from each page

  5. Put it all into a nice and tidy data.frame

  6. Feel like a superhero

(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)

Web-scraping Demonstration

Web-scraping Demonstration

  • We will scrape the research interests of members of faculty in the Department of Political Science at UCL

  • The departmental website has a list of faculty members

  • Each member of the department has a unique page

  • The research interests of the faculty member are stored on their unique page

  • Let’s look at an example…

Source code

  • To collect the information we want, we need to see how it is stored within the html code that underpins the website

  • Webpages include much more than what is immediately visible to visitors

  • Crucially, they include code which provides structure, style and functionality (which your browser interprets)

    • HTML provides structure
    • CSS provides style
    • JavaScript provides functionality
  • To implement a web-scraper, we have to work directly with the source code

    • Identifying the information on each page that we want to extract
    • Identifying links between pages that help us to navigate the site programmatically

To see the source code, use Ctrl + U or right click and select View/Show Page Source

Load initial page

We can read the source code of any website into R using the readLines() function.

library(tidyverse)

spp_home <- "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff"

spp_html <- readLines(spp_home)
spp_html[1:20]
 [1] "<!DOCTYPE html>"                                                                                                                                      
 [2] "<!--[if IE 7]>"                                                                                                                                       
 [3] "<html lang=\"en\" class=\"lt-ie9 lt-ie8 no-js\"> <![endif]-->"                                                                                        
 [4] "<!--[if IE 8]>"                                                                                                                                       
 [5] "<html lang=\"en\" class=\"lt-ie9 no-js\"> <![endif]-->"                                                                                               
 [6] "<!--[if gt IE 8]><!-->"                                                                                                                               
 [7] "<html lang=\"en\" class=\"no-js\"> <!--<![endif]-->"                                                                                                  
 [8] "<head>"                                                                                                                                               
 [9] "  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"/>"                                                                        
[10] "  <meta name=\"author\" content=\"UCL\"/>"                                                                                                            
[11] "  <meta property=\"og:profile_id\" content=\"uclofficial\"/>"                                                                                         
[12] "  <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"                                                                          
[13] "<link rel=\"shortcut icon\" href=\"https://www.ucl.ac.uk/political-science/sites/all/themes/indigo/favicon.ico\" type=\"image/vnd.microsoft.icon\" />"
[14] "<link rel=\"canonical\" href=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"                              
[15] "<meta name=\"ucl:faculty\" content=\"Social &amp; Historical Sciences\" />"                                                                           
[16] "<meta property=\"og:site_name\" content=\"Department of Political Science\" />"                                                                       
[17] "<meta name=\"ucl:sanitized_org_unit\" content=\"Department of Political Science\" />"                                                                 
[18] "<meta property=\"og:type\" content=\"website\" />"                                                                                                    
[19] "<meta property=\"og:title\" content=\"Academic, Teaching, and Research Staff\" />"                                                                    
[20] "<meta property=\"og:url\" content=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"                         
spp_html[grep("Professor Ben", spp_html)[1]]
[1] "<li><a href=\"/political-science/people/academic-teaching-and-research-staff/professor-benjamin-lauderdale\" class=\"nav-item\">Professor Benjamin Lauderdale</a></li>"

This is helpful, but it is awkward to navigate the source code directly.

Parse HTML

The read_html function in the rvest package allows us to read the HTML in a more structured format:

library(rvest)

spp <- read_html(spp_home)

spp
{html_document}
<html lang="en" class="no-js">
[1] <head>\n<meta name="viewport" content="width=device-width, initial-scale= ...
[2] <body class="html not-front not-logged-in no-sidebars page-node page-node ...

Retrieve names for each member of faculty

We can then navigate through the HTML by searching for elements that match a common CSS selector (using html_elements()):

spp_faculty_elements <- spp %>% html_elements("a[class='nav-item']") 

head(spp_faculty_elements)
{xml_nodeset (6)}
[1] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[2] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[3] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[4] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[5] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[6] <a href="/political-science/people/academic-teaching-and-research-staff/d ...

The names of each faculty member are stored in the text associated with these elements:

[1] "<a href=\"/political-science/people/academic-teaching-and-research-staff/dr-jared-j-finnegan\" class=\"nav-item\">Dr Jared Finnegan</a>"

We can extract these names using the html_text() function:

spp_faculty_names <- spp_faculty_elements %>% html_text()
head(spp_faculty_names)
[1] "Andrew Scott"         "Bugra Susler"         "Dr Adam Harris"      
[4] "Dr Alexandra Hartman" "Dr Amanda Hall"       "Dr Aparna Ravi"      

Retrieve URL for each member of faculty

The URL for each faculty member is stored in the href attribute of the elements:

# html_attr() retrieves the attributes associated with the elements that we extracted above
spp_urls <- spp_faculty_elements %>% html_attr("href") 

head(spp_urls)
[1] "/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"     
[2] "/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"     
[3] "/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"      
[4] "/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"      
[6] "/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"      
# paste0() joins strings together
spp_urls <- paste0("https://www.ucl.ac.uk", spp_urls)

head(spp_urls)
[1] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"     
[2] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"     
[3] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"      
[4] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"      
[6] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"      

Storage

spp <- data.frame(name = spp_faculty_names, 
                  url = spp_urls, 
                  text = NA)

head(spp)
                  name
1         Andrew Scott
2         Bugra Susler
3       Dr Adam Harris
4 Dr Alexandra Hartman
5       Dr Amanda Hall
6       Dr Aparna Ravi
                                                                                                       url
1      https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott
2      https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler
3       https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-adam-harris
4 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman
5       https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall
6       https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi
  text
1   NA
2   NA
3   NA
4   NA
5   NA
6   NA

Retrieve unique page for each faculty member

jack_page <- read_html(spp$url[26]) 
jack_page
{html_document}
<html lang="en" class="no-js">
[1] <head>\n<meta name="viewport" content="width=device-width, initial-scale= ...
[2] <body class="html not-front not-logged-in no-sidebars page-node page-node ...

Retrieve unique page for each faculty member

jack_text <- jack_page %>% 
  # XPath: select the first <p> element following an <h2> heading
  # whose text contains "Research"
  html_nodes(xpath='//h2[contains(text(), "Research")]/following-sibling::p[1]') %>%
  html_text()
print(jack_text)
[1] "My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion."

We have the text for one person! How do we get this for all faculty members?

for loops

  • A for loop is a control structure in programming that allows repeating a set of operations multiple times.
  • It works by iterating over a sequence of elements (such as a vector or a list) and executing a block of code for each element in the sequence.
  • In R, the syntax for a for loop is as follows:
for (variable in sequence) {
  # code to be executed for each element in the sequence
}
  • Example:
for (i in 1:10) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

for loops

We can use a for loop to loop over the elements of our url variable

for(i in 1:nrow(spp)){

  # Load page for faculty member i
  faculty_member_page <- read_html(spp$url[i]) 
  
  # Extract text from that page
  faculty_member_text <- faculty_member_page %>%
                            html_nodes(xpath='//h2[contains(text(), "Research")]/following-sibling::p[1]') %>%
                            html_text() %>%
                            paste0(collapse = " ")
  
  # Save text for faculty member i
  spp$text[i] <- faculty_member_text
  
}

Output

spp[24,]
               name
26 Dr Jack Blumenau
                                                                                                    url
26 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-jack-blumenau
                                                                                                                                                                                                                                                                                                                                                                                                                                                 text
26 My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion.
spp[76,]
                     name
91 Professor Lisa Vanhala
                                                                                                          url
91 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/professor-lisa-vanhala
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               text
91 I am interested in the politics of climate change and the socio-legal study of human rights and equality issues. My current, ERC-funded project, the Politics and Governance of Climate Change Loss and Damage (CCLAD), explores attempts to govern the impacts of climate change we will not be able to adapt to at a global and national level. Relying on a political ethnographic approach, the project explores the role of norms, identities and the micro-level, everyday dynamics of global environmental governance.   
spp[74,]
                       name
87 Professor Jeffrey Howard
                                                                                                            url
87 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/professor-jeffrey-howard
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        text
87 I currently direct the Digital Speech Lab, which hosts a range of research projects on the proper governance of online communications. Its purpose is to identify the fundamental principles that should guide the private and public regulation of online speech, and to trace those principles’ concrete implications in the face of difficult dilemmas about how best to respect free speech while preventing harm. The research team synthesizes expertise in political and moral philosophy, the philosophy of language, law and regulation, political science, and computer science. We engage a wide range of decisionmakers in industry, civil society, and policymaking. The Lab is funded by a UKRI Future Leaders Fellowship. 

Topic Model for Departmental Research Interests

  • Let’s use this data to estimate a topic model

  • Two questions in this application:

    • What are the topics that feature in the staff research profiles?
    • Which staff members are most highly associated with each topic?

Topic Model for Departmental Research Interests

library(stm)

## Create dfm
spp_corpus <- spp %>% 
  corpus(text_field = "text")

spp_dfm <- spp_corpus %>%
  tokens(remove_punct = TRUE, 
         remove_symbols = TRUE, 
         remove_numbers = TRUE, 
         remove_url = TRUE) %>%
  dfm() %>%
  dfm_remove(c(stopwords("en"), "book", "journal", "professor",
               "including", "include")) %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 2) %>%
  dfm_trim(max_docfreq = .7, docfreq_type = "prop") 

## Estimate STM
stmOut <- stm(
            documents = spp_dfm,
            K = 12,
            seed = 123,
            verbose = FALSE
            )
         
#save(stmOut, file = "stmOut.Rdata")

Topic Model for Departmental Research Interests

plot(stmOut,
     labeltype = "frex")
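
To connect the estimates back to the two questions above, one might inspect the topic labels and the estimated document-topic proportions (stmOut$theta). A minimal sketch, assuming no documents were dropped during preprocessing, so that the rows of theta still align with the rows of spp:

# FREX words summarising each topic
labelTopics(stmOut, n = 5)

# Staff members whose profiles load most heavily on, say, topic 1
spp$name[order(stmOut$theta[, 1], decreasing = TRUE)[1:3]]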

Conclusion

Summing Up

  1. Think about your research projects now!

    • Start by identifying a paper that you might review
    • Think about a substantive question you would like to answer
    • Look for data that would help you to answer the question
  2. There are several possible sources of data for these projects

    • Existing data sources
    • Data collected via an API
    • Data collected via web-scraping
  3. Data collection is a major part of any research project – it is good to practice this step!