5: Designing Projects and Collecting Text Data

Jack Blumenau & Gloria Gennaro

Early Feedback Results

Early Feedback Results

Qualitative comments focused on three main themes:

  1. Instruction on the final project

    • We will talk about this today
  2. More support on math / coding

    • I will try to spend more time on the math and coding
    • I encourage you to come to office hours if this is not enough
  3. “Complete analysis case”

    • Text analysis projects have many components; we are building towards that objective
    • The readings are good examples of projects that you should be able to complete at the end of this course

Research Project Advice

Components

  1. Review an Existing Text-as-Data Application (25%)

    • Any paper that uses quantitative text analysis of the type we study in the course
    • Must be from the social sciences (political science, public policy, economics, sociology, psychology, etc)
    • 1,000 words
    • Goal: Critical review of a single article
  2. Application of a Text-as-Data Method (75%)

    • Any method that we have studied on the course
    • Must be answering a social science research question
    • 2,000 words
    • Goals: Demonstrate understanding of method, and ability to apply, validate and interpret

Criteria

  1. Review an Existing Text-as-Data Application (25%)

    • Accurately describe the method and key implementation details
    • Discuss assumptions of method and whether they are met in this application
    • Assessment of strengths and weaknesses of the approach
    • Propose one alternative strategy for this application
  2. Application of a Text-as-Data Method (75%)

    • Clear and answerable research question
    • Accurate description of selected method
    • Discussion of assumptions behind the method
    • Use of a dataset that we haven’t used on the course
    • Implementation of chosen method, with clear description/justification of analysis choices
    • Accurate interpretation of method output
    • Some attempt to validate chosen approach
    • Discussion of strengths and weaknesses of chosen approach
    • Some credit for originality and ambition of project

Questions First, Data Second

  1. Identify an interesting question

  2. Search for data that can answer that question

  3. Answer the question

Advantages:

  • Likely to lead to more interesting questions

    • A good research project answers an interesting question, and in this model you think about that first
  • Likely to lead to less time exploring different methods

    • Your question will guide you to a particular analysis
    • If you need to measure topics to answer your question, then use a topic model. If you need to measure complexity, then use a readability metric, etc.

Disadvantages:

  • Potentially large amount of time spent searching for/collecting data

    • The data you need may not exist in an easily accessible form
    • You will need to spend time and effort collecting it, either by downloading it from various sources, retrieving it from an API, or via web-scraping
  • Potential risk of finding nothing

    • The data you need may not exist at all!
    • Then you will need to find another question

Data First, Questions Second

  1. Identify an interesting dataset

  2. Explore that dataset using some of the tools we cover on the course

  3. Construct a research question that you can answer using that data

  4. Answer the question

Advantages:

  • Lower frustration

    • You will not spend lots of effort thinking about projects that have no hope!
  • Potentially faster

    • You will not spend lots of time trying to find data only to discover that it doesn’t exist

Disadvantages:

  • Potentially less interesting research question/answers

    • e.g. if the only data you can find is a BBC food archive, then you might have to write a silly paper all about the classification of curry recipes…
  • Often more limited metadata

    • The texts themselves are rarely sufficient for a compelling analysis. We often want metadata so that we can describe variation in some quantity of interest
    • Does your data include information about the authors of the documents? The dates they were produced? Etc.

Principles of Good Research Projects

  1. Be clear about the concept that you are trying to measure

    • Why is this an interesting concept?
    • Why is doing it this way an improvement on existing approaches?
  2. Discuss the assumptions behind your chosen method

    • What needs to be true in order for your measure to produce reasonable results?
  3. Provide some form of validation for your measure. E.g.

    • Hand-code a sample of documents and show that your quantitative text measure is associated with the human coding
    • Show that your measure passes basic face-validity checks (does it correlate with things in sensible ways, are the top-scoring texts sensible, etc.); a minimal sketch follows this list
  4. Demonstrate something interesting with the concept that you have measured

    • Finding some texts and implementing a method on the course is not sufficient – you need to answer some kind of social-science research question
    • Make use of metadata!
    • Show how the quantities you are estimating vary across groups/over time etc.
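For instance, a minimal sketch of the hand-coding and face-validity checks described in point 3. The object names here (scores, hand_codes, texts) are hypothetical placeholders for your own data:

# Suppose 'scores' holds your text-based measure for a sample of documents,
# and 'hand_codes' holds human codings of the same documents
cor(scores, hand_codes, use = "complete.obs")

# A simple face-validity check: inspect the highest-scoring texts
head(texts[order(scores, decreasing = TRUE)])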

Where To Find Good Research Questions

Where To Find Good Research Questions

Thinking about the World

  • Read news articles

  • Discuss potential applications with classmates

  • Good starting points:

    • The Economist
    • The Financial Times
    • Podcasts

Using Existing Datasets

Dataverse

  • The Harvard Dataverse (https://dataverse.harvard.edu) is a data and code repository for many social science journals.

  • Many (though not all) papers will have links directly to a Dataverse page which you can use to find the data that was used in the paper

  • This is an excellent source of data for your projects!

  • Sometimes it can take a bit of searching through the files in each repository to figure out where the data is

Kaggle

  • Kaggle is a platform that hosts a wide variety of resources for quantitative text analysis, including a broad collection of text datasets (https://www.kaggle.com/datasets)

  • Many of these datasets are potentially interesting to social scientists, e.g.

    • New York Times Comments
    • Tweets about COVID-19
    • Corpus of Academic Papers relating to COVID19
    • Donald Trump’s Rally Speeches
    • etc
  • Many of these datasets lack full documentation, particularly on important dimensions such as where the data came from, who provided it, and so on

APIs

APIs

  • API: Application Programming Interface — a way for two pieces of software to talk to each other

  • Your software can receive (and also send) data automatically through these services

  • Data is sent via http requests — the same way your browser does it

  • Most services have helper code (known as a wrapper) to construct http requests

  • Both the wrapper and the service itself are called APIs

  • The http service is also sometimes known as REST (REpresentational State Transfer)

API registration and authentication

  • APIs typically require you to register for an API key to allow access

    • Many are not free, at least for large-scale use
  • Before you commit to using a given API, check what the rate limits are on its use

    • Limits on total number of requests for a given user
    • Limits on the total number of requests in a given day/minute/hour etc
  • Make sure you register with the service in plenty of time to actually get the data!

  • Once registered, you will have access to some kind of key that will allow you to access the API
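A common pattern (a sketch, not a requirement) is to store the key outside your script, for instance in your .Renviron file, and read it with Sys.getenv(). The variable name THEYWORKFORYOU_KEY below is just an illustrative placeholder:

# In your ~/.Renviron file (not in your script), add a line such as:
# THEYWORKFORYOU_KEY=XXXXX

# Then read the key into R without hard-coding it in your code:
api_key <- Sys.getenv("THEYWORKFORYOU_KEY")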

http requests

It is helpful to start paying attention to the structure of basic http requests.

For instance, let’s say we want to get some data from the TheyWorkForYou API.

A test request:

https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&num=1000&key=XXXXX

  • Parameters to the API are encoded in the URL

    • output = Which format do you want returned?
    • search = Return speeches with which words?
    • num = number requested
    • key = access key
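The same request can also be built programmatically. A minimal sketch using the httr package, with the key read from the placeholder environment variable introduced above:

library(httr)

# Build the request; httr encodes the query parameters into the URL for us
resp <- GET("https://www.theyworkforyou.com/api/getDebates",
            query = list(output = "xml",
                         search = "brexit",
                         num = 1000,
                         key = Sys.getenv("THEYWORKFORYOU_KEY")))

# Extract the response body as text, ready for parsing
debates_raw <- content(resp, as = "text")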

API Output

  • The output of an API will typically not be in csv or Rdata format

  • Often, though not always, it will be in either JSON or XML

    • XML: eXtensible Markup Language

    • JSON : JavaScript Object Notation

  • If you have a choice, you probably want JSON

  • Both types of file are easily read into R

  • jsonlite and xml2 are the relevant packages
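For instance, a minimal sketch of reading each format, assuming the API response has been saved to a local file (the file names are placeholders):

library(jsonlite)
library(xml2)

# JSON: parsed into nested R lists/data frames
results_json <- fromJSON("response.json")

# XML: parsed into a document you can query with xml_find_all() and xml_text()
results_xml <- read_xml("response.xml")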

API packages

  • It’s not usually necessary to construct these kinds of requests yourself

  • R, Python, and other programming languages have libraries to make it easier – but you have to find them!

  • I have provided a sample of APIs that have associated R packages on the next slide

  • The documentation for the API will describe the parameters that are available. Though normally in a way that is intensely frustrating.

Sample of APIs

There are many existing R packages that make it straightforward to retrieve data from an API:

API                       R package                              Description
Twitter                   install.packages("rtweet")             Twitter, small-scale use, no longer free!
Guardian Newspaper        install.packages("guardianapi")        Full Guardian archive, 1999-present
Wikipedia                 install.packages("WikipediR")          Wikipedia data and knowledge graph
TheyWorkForYou            install.packages("twfy")               Speeches from the UK House of Commons and Lords
ProPublica Congress API   install.packages("ProPublicaR")        Data from the US Congress
Google Books Ngrams       install.packages("ngramr")             Ngrams in Google Books, 1500-present
Reddit                    install.packages("RedditExtractoR")    Threads, comments, and user data from Reddit

Warning: I have not tested all of these!

API demonstration

Reddit API

  • We will use the Reddit API to search for subreddits on UK Politics in the past year

  • For this example, we are not collecting a large amount of data

  • In general, you need to create an authenticated client ID

  • Rate limits: currently 100 queries per minute (QPM) per OAuth client id

  • We will use library(RedditExtractoR)

Reddit API Application

library(RedditExtractoR) 

subred <- find_subreddits("UK politics")
glimpse(subred)
Rows: 231
Columns: 7
$ id          <chr> "27hnjr", "3ahdz", "6c6t7t", "2vorv", "2rgbp", "2zc4g", "3…
$ date_utc    <chr> "2019-10-30", "2015-10-25", "2022-05-08", "2012-12-01", "2…
$ timestamp   <dbl> 1572393839, 1445747908, 1652004172, 1354391253, 1264066642…
$ subreddit   <chr> "HouseOfTheDragon", "MCBC", "DeppDelusion", "UrbanStudies"…
$ title       <chr> "House of the Dragon", "Model Canadian Broadcasting Corpor…
$ description <chr> "This is a place for news and discussions relating to HBO'…
$ subscribers <dbl> 1054533, 67, 23214, 6665, 498, 80377, 859, 0, 0, 20444, 86…

Reddit API Application

Let’s clean the dataset and order those subreddits by number of subscribers.

subred <- subred %>%
  
  # selects variables you want to explore
  select(subreddit, title, description, subscribers) %>% 
  
  # creates new variables
  mutate(subscribers_million = subscribers/1000000, subscribers = NULL) %>% 
  
  # arranges data from highest subscriber count
  arrange(desc(subscribers_million)) 

head(subred[c("subreddit", "title", "subscribers_million")])
              subreddit                             title subscribers_million
2qh1i         AskReddit                     Ask Reddit...           44.940265
2qh13         worldnews                        World News           34.855445
2qjpg             memes  /r/Memes the original since 2008           29.655626
2cneq          politics                          Politics            8.471319
2qh4j            europe                            Europe            5.729439
2w844 NoStupidQuestions No such thing as stupid questions            4.326992

Reddit API Application

Many are not about UK politics!

Let’s now try to extract only those that are truly about UK politics, by searching for the terms “UK politic” or “British politic” (which also match “politics”, “political”, etc.) in the title or description.

uk.subred <- subred %>%
  filter(grepl("UK politic|British politic", description, ignore.case = TRUE) |
  grepl("UK politic|British politic", title, ignore.case = TRUE))

head(uk.subred[c("subreddit", "title", "subscribers_million")])
              subreddit                    title subscribers_million
2qhcv        ukpolitics              UK Politics            0.477389
30c1v          LabourUK The British Labour Party            0.064691
tzpe1 UKPoliticalComedy      UK Political Comedy            0.059778
33geh       UK_Politics              UK Politics            0.004740
2qo8i        PoliticsUK   UK Politics Discussion            0.002090
3euqz  casualukpolitics       casual UK politics            0.000474

Reddit API Application

The two largest non-partisan subreddits about British politics are ukpolitics and UKPoliticalComedy.

Let’s have a look at these two:

uk.comedy <- find_thread_urls(subreddit = 'UKPoliticalComedy', sort_by = 'top', period = 'all')
uk.politics <- find_thread_urls(subreddit = 'ukpolitics', sort_by = 'top', period = 'all')

What is in this data?

glimpse(uk.comedy) 
Rows: 999
Columns: 7
$ date_utc  <chr> "2022-02-08", "2021-11-10", "2021-09-24", "2021-09-02", "202…
$ timestamp <dbl> 1644362664, 1636564393, 1632497067, 1630616527, 1627022456, …
$ title     <chr> "Palps new job", "Sums things up pretty well.", "I am so unp…
$ text      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "UKPoliticalComedy", "UKPoliticalComedy", "UKPoliticalComedy…
$ comments  <dbl> 0, 4, 0, 1, 3, 5, 6, 0, 0, 2, 3, 3, 4, 2, 2, 2, 1, 1, 3, 1, …
$ url       <chr> "https://www.reddit.com/r/UKPoliticalComedy/comments/sny3f5/…
glimpse(uk.politics) 
Rows: 955
Columns: 7
$ date_utc  <chr> "2020-11-19", "2019-09-30", "2019-08-20", "2022-01-20", "202…
$ timestamp <dbl> 1605782676, 1569828606, 1566291684, 1642682212, 1600764671, …
$ title     <chr> "Government finally admits only £3bn of money for green reco…
$ text      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "ukpolitics", "ukpolitics", "ukpolitics", "ukpolitics", "ukp…
$ comments  <dbl> 187, 360, 304, 182, 428, 1549, 347, 1866, 273, 56, 216, 620,…
$ url       <chr> "https://www.reddit.com/r/ukpolitics/comments/jx0kaz/governm…

Reddit API Application

We will work with the titles of these threads to measure sentiment using a dictionary-based approach

make.dfm <- function(data){
    dfm <- data %>%
        corpus(text_field = "title") %>%
        tokens(remove_punct = TRUE,
               remove_symbols = TRUE,
               remove_url = TRUE) %>%
        dfm() %>%
        dfm_remove(stopwords("en")) %>%
        dfm_trim(min_termfreq = 3,
                 max_docfreq = .9,
                 docfreq_type = "prop") %>%
    
        # New step: Keep words of at least 3 characters
        dfm_select(pattern = "\\b\\w{3,}\\b", valuetype = "regex", selection = "keep")

  return(dfm)
}


comedy.dfm <- make.dfm(uk.comedy)
politics.dfm <- make.dfm(uk.politics)

How many features in those DFMs?

dim(comedy.dfm)
[1] 999 298
dim(politics.dfm)
[1]  955 1075

Reddit API Application

Let’s measure sentiment based on the Lexicoder Sentiment Dictionary, available in quanteda as data_dictionary_LSD2015

dfm_lookup(comedy.dfm, dictionary = data_dictionary_LSD2015)%>% 
  dfm_remove(c("neg_positive", "neg_negative")) %>%
  dfm_weight(scheme = "logave")  %>%
  convert("data.frame") %>%
  mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>% 
  mutate(neutral = positive == negative) %>% 
  colMeans(na.rm = TRUE) %>% print()
  negative   positive    neutral 
0.05205205 0.12212212 0.82982983 
dfm_lookup(politics.dfm, dictionary = data_dictionary_LSD2015)%>% 
  dfm_remove(c("neg_positive", "neg_negative")) %>%
  dfm_weight(scheme = "logave")  %>%
  convert("data.frame") %>%
  mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>% 
  mutate(neutral = positive == negative) %>% 
  colMeans(na.rm = TRUE) %>% print()
 negative  positive   neutral 
0.3434555 0.2502618 0.5403141 

Other functions

# Get thread content, for given URLs
get_thread_content()
# Get information on a particular user, for given list of users
get_user_content()
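For example, a sketch (not run here) of applying these functions to the thread URLs we collected above, assuming the current RedditExtractoR interface; the username below is a placeholder:

# Thread-level and comment-level data for the first few ukpolitics threads
politics_content <- get_thread_content(uk.politics$url[1:5])
glimpse(politics_content$comments)

# Information about a given set of users
user_info <- get_user_content(c("some_username"))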

You could use those functions to explore:

  • The correlation between comment sentiment and upvotes/downvotes

  • Topics across sub-reddits

  • How conversations evolve depending on the topic

  • … and many other research questions!

Break & Q&A


If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com
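As a preview of the seminar task, a minimal sketch of how the guardianapi package can be used once you have registered, assuming the gu_content() interface from the package documentation. The query and dates below are placeholders; gu_api_key() will prompt you to store your key:

library(guardianapi)

# Store your API key when prompted
gu_api_key()

# Retrieve articles matching a query over a date range (placeholder values)
brexit_articles <- gu_content(query = "brexit",
                              from_date = "2023-01-01",
                              to_date = "2023-06-30")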

Web-scraping

Web scraping overview

Key steps in any web-scraping project:

  1. Work out how the website is structured

  2. Work out how links connect different pages

  3. Isolate the information you care about on each page

  4. Write a loop which applies step 3 to each page identified in step 2, and saves the information you want from each page

  5. Put it all into a nice and tidy data.frame

  6. Feel like a superhero

(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)

Web-scraping Demonstration

Web-scraping Demonstration

  • We will scrape the research interests of members of faculty in the Department of Political Science at UCL

  • The departmental website has a list of faculty members

  • Each member of the department has a unique page

  • The research interests of the faculty member are stored on their unique page

  • Let’s look at an example…

Source code

  • To collect the information we want, we need to see how it is stored within the html code that underpins the website

  • Webpages include much more than what is immediately visible to visitors

  • Crucially, they include code which provides structure, style and functionality (which your browser interprets)

    • HTML provides structure
    • CSS provides style
    • JavaScript provides functionality
  • To implement a web-scraper, we have to work directly with the source code

    • Identifying the information on each page that we want to extract
    • Identifying links between pages that help us navigate the site programmatically

To see the source code, use Ctrl + U or right click and select View/Show Page Source

Load initial page

We can read the source code of any website into R using the readLines() function.

library(tidyverse)

spp_home <- "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff"

spp_html <- readLines(spp_home)
spp_html[1:20]
 [1] "<!DOCTYPE html>"                                                                                                                                      
 [2] "<!--[if IE 7]>"                                                                                                                                       
 [3] "<html lang=\"en\" class=\"lt-ie9 lt-ie8 no-js\"> <![endif]-->"                                                                                        
 [4] "<!--[if IE 8]>"                                                                                                                                       
 [5] "<html lang=\"en\" class=\"lt-ie9 no-js\"> <![endif]-->"                                                                                               
 [6] "<!--[if gt IE 8]><!-->"                                                                                                                               
 [7] "<html lang=\"en\" class=\"no-js\"> <!--<![endif]-->"                                                                                                  
 [8] "<head>"                                                                                                                                               
 [9] "  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"/>"                                                                        
[10] "  <meta name=\"author\" content=\"UCL\"/>"                                                                                                            
[11] "  <meta property=\"og:profile_id\" content=\"uclofficial\"/>"                                                                                         
[12] "  <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"                                                                          
[13] "<link rel=\"shortcut icon\" href=\"https://www.ucl.ac.uk/political-science/sites/all/themes/indigo/favicon.ico\" type=\"image/vnd.microsoft.icon\" />"
[14] "<link rel=\"canonical\" href=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"                              
[15] "<meta name=\"ucl:faculty\" content=\"Social &amp; Historical Sciences\" />"                                                                           
[16] "<meta property=\"og:site_name\" content=\"Department of Political Science\" />"                                                                       
[17] "<meta name=\"ucl:sanitized_org_unit\" content=\"Department of Political Science\" />"                                                                 
[18] "<meta property=\"og:type\" content=\"website\" />"                                                                                                    
[19] "<meta property=\"og:title\" content=\"Academic, Teaching, and Research Staff\" />"                                                                    
[20] "<meta property=\"og:url\" content=\"https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff\" />"                         
spp_html[grep("Professor Ben", spp_html)[1]]
[1] "<li><a href=\"/political-science/people/academic-teaching-and-research-staff/professor-benjamin-lauderdale\" class=\"nav-item\">Professor Benjamin Lauderdale</a></li>"

This is helpful, but it is awkward to navigate the source code directly.

Parse HTML

The read_html function in the rvest package allows us to read the HTML in a more structured format:

library(rvest)

spp <- read_html(spp_home)

spp
{html_document}
<html lang="en" class="no-js">
[1] <head>\n<meta name="viewport" content="width=device-width, initial-scale= ...
[2] <body class="html not-front not-logged-in no-sidebars page-node page-node ...

Retrieve names for each member of faculty

We can then navigate through the HTML by searching for elements that share common attributes (using html_elements()):

spp_faculty_elements <- spp %>% html_elements("a[class='nav-item']") 

head(spp_faculty_elements)
{xml_nodeset (6)}
[1] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[2] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[3] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[4] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[5] <a href="/political-science/people/academic-teaching-and-research-staff/d ...
[6] <a href="/political-science/people/academic-teaching-and-research-staff/d ...

The names of each faculty member are stored in the text associated with these elements:

[1] "<a href=\"/political-science/people/academic-teaching-and-research-staff/dr-jeremy-bowles\" class=\"nav-item\">Dr Jeremy Bowles</a>"

We can extract these names using the html_text() function:

spp_faculty_names <- spp_faculty_elements %>% html_text()
head(spp_faculty_names)
[1] "Andrew Scott"         "Bugra Susler"         "Dr Adam Harris"      
[4] "Dr Alexandra Hartman" "Dr Amanda Hall"       "Dr Aparna Ravi"      

Retrieve URL for each member of faculty

The URL for each faculty member is stored in the href attribute of the elements:

# html_attr() retrieves the attributes associated with the elements that we extracted above
spp_urls <- spp_faculty_elements %>% html_attr("href") 

head(spp_urls)
[1] "/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"     
[2] "/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"     
[3] "/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"      
[4] "/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"      
[6] "/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"      
# paste0() joins strings together
spp_urls <- paste0("https://www.ucl.ac.uk", spp_urls)

head(spp_urls)
[1] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott"     
[2] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler"     
[3] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-adam-harris"      
[4] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman"
[5] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall"      
[6] "https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi"      

Storage

spp <- data.frame(name = spp_faculty_names, 
                  url = spp_urls, 
                  text = NA)

head(spp)
                  name
1         Andrew Scott
2         Bugra Susler
3       Dr Adam Harris
4 Dr Alexandra Hartman
5       Dr Amanda Hall
6       Dr Aparna Ravi
                                                                                                       url
1      https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-andrew-scott
2      https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-bugra-susler
3       https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-adam-harris
4 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-alexandra-hartman
5       https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-amanda-hall
6       https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-aparna-ravi
  text
1   NA
2   NA
3   NA
4   NA
5   NA
6   NA

Retrieve unique page for each faculty member

jack_page <- read_html(spp$url[24]) 
jack_page
{html_document}
<html lang="en" class="no-js">
[1] <head>\n<meta name="viewport" content="width=device-width, initial-scale= ...
[2] <body class="html not-front not-logged-in no-sidebars page-node page-node ...

Retrieve unique page for each faculty member

jack_text <- jack_page %>% 
  html_nodes(xpath='//h2[contains(text(), "Research")]/following-sibling::p[1]') %>%
  html_text()
print(jack_text)
[1] "My research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems. In my research, I employ creative research designs in which I develop and apply state-of-the-art quantitative methods to answer important questions in the fields of legislative politics, electoral politics, and public opinion."

We have the text for one person! How do we get this for all faculty members?

for loops

  • A for loop is a control structure in programming that allows repeating a set of operations multiple times.
  • It works by iterating over a sequence of elements (such as a vector or a list) and executing a block of code for each element in the sequence.
  • In R, the syntax for a for loop is as follows:
for (variable in sequence) {
  # code to be executed for each element in the sequence
}
  • Example:
for (i in 1:10) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

for loops

We can use a for loop to loop over the elements of our url variable

for(i in 1:nrow(spp)){

  # Load page for faculty member i
  faculty_member_page <- read_html(spp$url[i]) 
  
  # Extract text from that page
  faculty_member_text <- faculty_member_page %>%
                            html_nodes(xpath='//h2[contains(text(), "Research")]/following-sibling::p[1]') %>%
                            html_text() %>%
                            paste0(collapse = " ")
  
  # Save text for faculty member i
  spp$text[i] <- faculty_member_text
  
}
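When looping over many pages like this, it is polite (and reduces the risk of being blocked) to pause briefly between requests. A minimal addition, placed at the end of the loop body:

for(i in 1:nrow(spp)){
  
  # ... same scraping code as above ...
  
  # Pause for one second between requests to avoid overloading the server
  Sys.sleep(1)
}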

Output

spp[24,]
spp[35,]
                name
35 Dr Kalina Zhekova
                                                                                                     url
35 https://www.ucl.ac.uk/political-science/people/academic-teaching-and-research-staff/dr-kalina-zhekova
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     text
35 My current research focuses on Russian foreign policy and military interventions in the post-Soviet space and the Middle East, and Russian collective (mis-)conceptions of state sovereignty and relations with the West. I am interested in interpretivist and poststructuralist approaches to Russian politics, and I examine the development of external and internal threat constructions, notions of inter-ethnic identities and their mobilisation in the process of policymaking, war and violence. I apply this approach to the study of Russian armed interventions in Syria, Georgia and the war in Ukraine.

Topic Model for Departmental Research Interests

  • What should we do with this data?

  • As a preview of one of the topics we will cover after reading week, we will use this data to estimate a topic model

  • A topic model describes a collection of documents in terms of a distinct number of topics

  • Each document in the model is described as a mixture of corpus-wide topics

  • A topic is a probability distribution over words in the vocabulary

  • Two questions in this application:

    • What are the topics that feature in the staff research profiles?
    • Which staff members are most highly associated with each topic? (a rough sketch follows)
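A rough sketch of how such a model could be fitted to the scraped research descriptions, using quanteda together with the topicmodels package. The number of topics, K = 5, is an arbitrary choice for illustration, and in practice you would first drop any empty documents:

library(quanteda)
library(topicmodels)

# Build a document-feature matrix from the research descriptions
spp_dfm <- corpus(spp, text_field = "text") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  dfm_remove(stopwords("en"))

# Fit an LDA topic model with 5 topics
spp_lda <- LDA(convert(spp_dfm, to = "topicmodels"), k = 5)

# Top words for each topic, and the most likely topic for each staff member
terms(spp_lda, 10)
head(topics(spp_lda))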

Topic Model for Departmental Research Interests


Conclusion

Summing Up

  1. Think about your research projects now!

    • Start by identifying a paper that you might review
    • Think about a substantive question you would like to answer
    • Look for data that would help you to answer the question
  2. There are several possible sources of data for these projects

    • Existing datasources
    • Data collected via an API
    • Data collected via web-scraping
  3. Data collection is a major part of any research project – it is good to practice this step!

Seminars

Today we will learn to use the Guardian Newspaper API via the guardianapi package. There is also a web-scraping task for those of you who would like to try!