5: Designing Projects and Collecting Text Data

Ashrakat Elshehawy

Research Project Advice

Components

  1. Review an Existing Text-as-Data Application (25%)

    • Any paper that uses quantitative text analysis of the type we study in the course
    • Must be from the social sciences (political science, public policy, economics, sociology, psychology, etc)
    • 1,000 words
    • Goal: Critical review of a single article
  2. Application of a Text-as-Data Method (75%)

    • Any method that we have studied on the course
    • Must be answering a social science research question
    • 2,000 words
    • Goals: Demonstrate understanding of method, and ability to apply, validate and interpret

Criteria

  1. Review an Existing Text-as-Data Application (30%)

    • Accurately describe the method and key implementation details
    • Discuss assumptions of method and whether they are met in this application
    • Assessment of strengths and weaknesses of the approach
    • Propose one alternative strategy for this application
  2. Application of a Text-as-Data Method (70%)

    • Clear and answerable research question
    • Accurate description of selected method
    • Discussion of assumptions behind the method
    • Description/Use of a dataset that we haven’t used on the course
    • Implementation of chosen method, with clear description/justification of analysis choices
    • Accurate interpretation of method output
    • Some attempt to validate the chosen approach (for UG students, suggestions of validation approaches are sufficient)
    • Discussion of strengths and weaknesses of chosen approach
    • Some credit for originality and ambition of project

Note: Points in italics are only required for PGT students.

Questions First, Data Second

  1. Identify an interesting question

  2. Search for data that can answer that question

  3. Answer the question

Advantages:

  • Likely to lead to more interesting questions

    • A good research project answers an interesting question, and in this model you think about that first
  • Likely to lead to less time exploring different methods

    • Your question will guide you to a particular analysis
    • If you need to measure topics to answer your question, then use a topic model. If you need to measure complexity, then use a readability metric, etc.

Disadvantages:

  • Potentially large amount of time spent searching for/collecting data

    • The data you need may not exist in an easily accessible form
    • You will need to spend time and effort collecting it, whether by downloading it from various sources, querying an API, or web-scraping
  • Potential risk of finding nothing

    • The data you need may not exist at all!
    • Then you will need to find another question

Data First, Questions Second

  1. Identify an interesting dataset

  2. Explore that dataset using some of the tools we cover on the course

  3. Construct a research question that you can answer using that data

  4. Answer the question

Advantages:

  • Lower frustration

    • You will not spend lots of effort thinking about projects that have no hope!
  • Potentially faster

    • You will not spend lots of time trying to find data only to discover that it doesn’t exist

Disadvantages:

  • Potentially less interesting research question/answers

    • e.g. if the only data you can find is a BBC food archive, then you might have to write a silly paper all about the classification of curry recipes…
  • Often more limited metadata

    • The texts are themselves rarely sufficient for a compelling analysis. We often want metadata so that we can describe variation in some quantity of interest
    • Does your data include information about the authors of the documents? The dates they were produced? Etc.

Principles of Good Research Projects

  1. Be clear about the concept that you are trying to measure

    • Why is this an interesting concept?
    • Why is doing it this way an improvement on existing approaches?
  2. Discuss the assumptions behind your chosen method

    • What needs to be true in order for your measure to produce reasonable results?
  3. Provide some form of validation for your measure. E.g.

    • Hand-code a sample of documents and show that your quantitative text measure is associated with the human coding
    • Show that your measure passes basic face-validity checks (does it correlate with things in sensible ways, are the top-scoring texts sensible, etc)
  4. Demonstrate something interesting with the concept that you have measured

    • Finding some texts and implementing a method on the course is not sufficient – you need to answer some kind of social-science research question
    • Make use of metadata!
    • Show how the quantities you are estimating vary across groups/over time etc.
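The validation idea in point 3 can be sketched in a few lines of base R. Everything below is simulated for illustration: the hand codes and the "automated" scores are made-up data, not output from any real method.

```r
# Hypothetical validation sketch: compare an automated text measure
# against hand-coded labels for a small sample of documents.
set.seed(1)

hand_codes <- c(1, 1, 0, 0, 1, 0, 1, 0, 0, 1)    # human coding: 1 = positive
text_scores <- hand_codes + rnorm(10, sd = 0.3)  # noisy automated measure

# Check 1: the automated measure should correlate with the hand codes
validation_cor <- cor(text_scores, hand_codes)

# Check 2: agreement rate after thresholding the continuous measure
predicted <- as.integer(text_scores > 0.5)
agreement <- mean(predicted == hand_codes)
```

In a real project the hand codes would come from coding a random sample of your corpus, and you would report both the correlation and a confusion table.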

Where To Find Good Research Questions

Thinking about the World

  • Read news articles

  • Discuss potential applications with classmates

  • Good starting points:

    • The Economist
    • The Financial Times
    • Podcasts

Interesting Applications

But there are many more!

Using Existing Datasets

Dataverse

  • The Harvard Dataverse (https://dataverse.harvard.edu) is a data and code repository for many social science journals.

  • Many (though not all) papers will have links directly to a Dataverse page which you can use to find the data that was used in the paper

  • This is an excellent source of data for your projects!

  • Sometimes it can take a bit of searching through the files in each repository to figure out where the data is

Kaggle

  • Kaggle is a platform that hosts a wide variety of resources for quantitative text analysis, including a broad collection of text datasets (https://www.kaggle.com/datasets)

  • Many of these datasets are potentially interesting to social scientists, e.g.

    • New York Times Comments
    • Tweets about COVID-19
    • Corpus of Academic Papers relating to COVID-19
    • Donald Trump’s Rally Speeches
    • etc
  • Many of these datasets lack full documentation, particularly on important dimensions such as where the data came from, who provided it, and so on

APIs

  • API: Application Programming Interface — a way for two pieces of software to talk to each other

  • Your software can receive (and also send) data automatically through these services

  • Data is sent via http requests — the same way your browser does it

  • Most services have helper code (known as a wrapper) to construct http requests

  • Both the wrapper and the service itself are called APIs

  • http service also sometimes known as REST (Representational State Transfer)

Source: GeeksforGeeks

API registration and authentication

  • APIs typically require you to register for an API key to allow access

    • Many are not free, at least for large-scale use
  • Before you commit to using a given API, check what the rate limits are on its use

    • Limits on total number of requests for a given user
    • Limits on the total number of requests in a given day/minute/hour etc
  • Make sure you register with the service in plenty of time to actually get the data!

  • Once registered, you will have access to some kind of key that will allow you to access the API
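Respecting rate limits usually just means pausing between requests. Below is a minimal base-R throttling sketch: `throttled_fetch` and `dummy_api` are hypothetical names, and the dummy function stands in for whatever real API call you would make.

```r
# A minimal rate-limiting helper. `fetch_fun` stands in for your actual
# API call; here it is a dummy function that just echoes its input.
throttled_fetch <- function(ids, fetch_fun, requests_per_min = 60) {
  delay <- 60 / requests_per_min
  results <- vector("list", length(ids))
  for (i in seq_along(ids)) {
    results[[i]] <- fetch_fun(ids[i])
    if (i < length(ids)) Sys.sleep(delay)  # pause between requests
  }
  results
}

# Dummy "API call" for illustration only
dummy_api <- function(id) paste0("response_for_", id)
out <- throttled_fetch(1:3, dummy_api, requests_per_min = 6000)
```

Set `requests_per_min` to something comfortably below the documented limit of your API.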

http requests

It is helpful to start paying attention to the structure of basic http requests.

For instance, let’s say we want to get some data from the TheyWorkForYou api.

A test request:

https://www.theyworkforyou.com/api/getDebates?output=xml&search=brexit&num=1000&key=XXXXX

  • Parameters to the API are encoded in the URL

    • output = Which format do you want returned?
    • search = Return speeches with which words?
    • num = number requested
    • key = access key
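Rather than typing URLs like this by hand, you can assemble them in base R. This is a sketch using the parameters above; the key is a placeholder that you would replace after registering.

```r
# Constructing the request URL programmatically
base_url <- "https://www.theyworkforyou.com/api/getDebates"
params <- c(output = "xml", search = "brexit", num = "1000", key = "XXXXX")

# Join each name=value pair with "&", and attach to the base URL with "?"
query <- paste(names(params), params, sep = "=", collapse = "&")
request_url <- paste0(base_url, "?", query)
```

Building the URL from a named vector makes it easy to change the search term or output format in one place.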

API Output

  • The output of an API will typically not be in csv or Rdata format

  • Often, though not always, it will be in either JSON or XML

    • XML: eXtensible Markup Language

    • JSON : JavaScript Object Notation

  • If you have a choice, you probably want JSON

  • Both types of file are easily read into R

  • jsonlite and xml2 are the relevant packages
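For instance, jsonlite turns a JSON array of objects directly into a data.frame. The JSON string below is a toy example, not real API output:

```r
# Parsing a small JSON string with jsonlite
library(jsonlite)

json_string <- '[{"speaker": "A", "text": "First speech"},
                 {"speaker": "B", "text": "Second speech"}]'
speeches <- fromJSON(json_string)  # returns a data.frame with one row per object
```

In practice you would pass the text returned by the API (or its URL) to `fromJSON()` in the same way.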

API packages

  • It’s not usually necessary to construct these kinds of requests yourself

  • R, Python, and other programming languages have libraries to make it easier – but you have to find them!

  • I have provided a sample of APIs that have associated R packages on the next slide

  • The documentation for the API will describe the parameters that are available. Though normally in a way that is intensely frustrating.

Sample of APIs

There are many existing R packages that make it straightforward to retrieve data from an API:

API R package Description
Twitter install.packages("rtweet") Twitter, small-scale use, no longer free!
Guardian Newspaper install.packages("guardianapi") Full Guardian archive, 1999-present
Wikipedia install.packages("WikipediR") Wikipedia data and knowledge graph
TheyWorkForYou install.packages("twfy") Speeches from the UK House of Commons and Lords
ProPublica Congress API install.packages("ProPublicaR") Data from the US Congress
Google Books Ngrams install.packages("ngramr") Ngrams in Google Books, 1500-present
Reddit install.packages("RedditExtractoR") Subreddits, users, urls, texts of posts

Warning: I have not tested all of these!

API demonstration

Reddit API

  • We will use the Reddit API to search for subreddits on UK Politics in the past year

  • For this example, we are not collecting a large amount of data

  • In general, you need to create an authenticated client ID

  • Rate limits: currently 100 queries per minute (QPM) per OAuth client id

  • We will use library(RedditExtractoR)

Reddit API Application

library(RedditExtractoR) 

subred <- find_subreddits("UK politics")
glimpse(subred)
Rows: 233
Columns: 7
$ id          <chr> "483mxu", "4rfocy", "3g37x", "2r7v0", "jrrb9", "3ahdz", "6…
$ date_utc    <chr> "2021-04-08", "2021-07-15", "2016-08-30", "2009-09-23", "2…
$ timestamp   <dbl> 1617865911, 1626368162, 1472562586, 1253680171, 1527610526…
$ subreddit   <chr> "Divisive_Babble", "SteamDeck", "brandonlawson", "Academic…
$ title       <chr> "Divisive Babble", "Steam Deck", "The Search For Brandon L…
$ description <chr> "This is a forum for the discussion of politics, current a…
$ subscribers <dbl> 1164, 1034951, 5667, 58183, 91, 74, 29290, 9842, 1244195, …

Reddit API Application

Let’s clean the dataset and order those subreddits by number of subscribers.

subred <- subred %>%
  
  # selects variables you want to explore
  select(subreddit, title, description, subscribers) %>% 
  
  # creates new variables
  mutate(subscribers_million = subscribers/1000000, subscribers = NULL) %>% 
  
  # arranges data from highest subscriber count
  arrange(desc(subscribers_million)) 

head(subred[c("subreddit", "title", "subscribers_million")])
      subreddit                            title subscribers_million
2qh1i AskReddit                    Ask Reddit...           57.322329
2qh13 worldnews                       World News           46.913243
2qjpg     memes /r/Memes the original since 2008           35.533307
2qh55      food            Food Photos on Reddit           24.381233
2qh4j    europe                           Europe           11.460773
2cneq  politics                         Politics            8.966519

Reddit API Application

Many are not about UK politics!

Let’s now try to extract only those that are truly about UK politics, by searching for the prefixes “UK politic” or “British politic” in the description or title.

uk.subred <- subred %>%
  filter(grepl("UK politic|British politic", description, ignore.case = TRUE) |
  grepl("UK politic|British politic", title, ignore.case = TRUE))

head(uk.subred[c("subreddit", "title", "subscribers_million")])
               subreddit                    title subscribers_million
2qhcv         ukpolitics              UK Politics            0.532235
30c1v           LabourUK The British Labour Party            0.069528
tzpe1  UKPoliticalComedy      UK Political Comedy            0.059908
33geh        UK_Politics              UK Politics            0.005360
2qo8i         PoliticsUK   UK Politics Discussion            0.003332
4th4dw     AskUKPolitics            AskUKPolitics            0.000642

Reddit API Application

The two largest non-partisan subreddits about British politics are UKPoliticalComedy and ukpolitics

Let’s have a look at these two:

uk.comedy <- find_thread_urls(subreddit = 'UKPoliticalComedy', sort_by = 'top', period = 'all')
uk.politics <- find_thread_urls(subreddit = 'ukpolitics', sort_by = 'top', period = 'all')

What is in this data?

glimpse(uk.comedy) 
Rows: 248
Columns: 7
$ date_utc  <chr> "2021-11-10", "2023-01-27", "2022-10-03", "2021-03-20", "202…
$ timestamp <dbl> 1636530752, 1674848275, 1664830422, 1616259328, 1615137661, …
$ title     <chr> "New Conservatives logo", "Time to light the fire under Mr Z…
$ text      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "UKPoliticalComedy", "UKPoliticalComedy", "UKPoliticalComedy…
$ comments  <dbl> 3, 1, 5, 7, 6, 0, 13, 8, 1, 0, 12, 7, 1, 8, 4, 11, 36, 1, 28…
$ url       <chr> "https://www.reddit.com/r/UKPoliticalComedy/comments/qqp9c0/…
glimpse(uk.politics) 
Rows: 228
Columns: 7
$ date_utc  <chr> "2019-03-18", "2017-09-10", "2021-01-06", "2019-05-02", "201…
$ timestamp <dbl> 1552923461, 1505071469, 1609942152, 1556832034, 1575982428, …
$ title     <chr> "BREAKING: Speaker rules out the government bringing back me…
$ text      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ subreddit <chr> "ukpolitics", "ukpolitics", "ukpolitics", "ukpolitics", "ukp…
$ comments  <dbl> 1061, 871, 354, 170, 771, 129, 640, 1648, 681, 569, 252, 575…
$ url       <chr> "https://www.reddit.com/r/ukpolitics/comments/b2k5i1/breakin…

Reddit API Application

We will work with the titles of these threads to measure sentiment with a dictionary-based approach

make.dfm <- function(data){
    dfm <- data %>%
        corpus(text_field = "title") %>%       # treat the "title" column as documents
        tokens(remove_punct = TRUE,            # split into words, dropping punctuation,
               remove_symbols = TRUE,          # symbols like £ or @,
               remove_url = TRUE) %>%          # and URLs
        dfm() %>%                              # convert to document-feature matrix (word counts)
        dfm_remove(stopwords("en")) %>%        # drop English stopwords (the, is, and...)
        dfm_trim(min_termfreq = 3,             # drop words appearing fewer than 3 times total
                 max_docfreq = .9,             # and words in more than 90% of docs (too common)
                 docfreq_type = "prop") %>%
        dfm_select(pattern = "\\b\\w{3,}\\b", # keep only words with 3+ characters
                   valuetype = "regex",
                   selection = "keep")
   return(dfm)
}

comedy.dfm <- make.dfm(uk.comedy)    # build DFM for comedy subreddit posts
politics.dfm <- make.dfm(uk.politics) # build DFM for politics subreddit posts

How many features in those DFMs?

dim(comedy.dfm)
[1] 248  38
dim(politics.dfm)
[1] 228 179

Reddit API Application

Let’s measure sentiment based on the Lexicoder Sentiment Dictionary, available in quanteda as data_dictionary_LSD2015

dfm_lookup(comedy.dfm, dictionary = data_dictionary_LSD2015) %>%  # check matches to Lexicoder sentiment dictionary (pos/neg words)
  dfm_remove(c("neg_positive", "neg_negative")) %>%               # drop negated categories (e.g. "not good" counted as neg_positive)
  dfm_weight(scheme = "logave") %>%                                # weight by log-average to normalise for document length
  convert("data.frame") %>%                                        # convert DFM to a regular dataframe
  mutate(doc_id=NULL,                                              # drop the document ID column
         positive = trunc(positive),                               # truncate decimals to whole numbers
         negative = trunc(negative)) %>%
  mutate(neutral = positive == negative) %>%                       # TRUE/FALSE: is the doc equally pos and neg?
  colMeans(na.rm = TRUE) %>%                                       # average each column across all documents
  print()
  negative   positive    neutral 
0.01209677 0.11693548 0.87096774 
dfm_lookup(politics.dfm, dictionary = data_dictionary_LSD2015)%>% 
  dfm_remove(c("neg_positive", "neg_negative")) %>%
  dfm_weight(scheme = "logave")  %>%
  convert("data.frame") %>%
  mutate(doc_id=NULL, positive = trunc(positive), negative = trunc(negative)) %>% 
  mutate(neutral = positive == negative) %>% 
  colMeans(na.rm = TRUE) %>% print()
  negative   positive    neutral 
0.08771930 0.09649123 0.82456140 

Other functions

# Get thread content, for given URLs
get_thread_content()
# Get information on a particular user, for given list of users
get_user_content()

You could use those functions to explore:

  • The correlation between comment sentiment and upvotes/downvotes

  • Topics across sub-reddits

  • How conversations evolve depending on the topic

  • … and many other research questions!

Break & Q&A


If you haven’t already done so, please register now to use the Guardian Newspaper API: https://open-platform.theguardian.com

Web-scraping

Web scraping overview

Key steps in any web-scraping project:

  1. Work out how the website is structured

  2. Work out how links connect different pages

  3. Isolate the information you care about on each page

  4. Write a loop which connects 3 to 2, and saves the information you want from each page

  5. Put it all into a nice and tidy data.frame

  6. Feel like a superhero 🪄

(This is missing the steps in which you scream at your computer because you can’t figure out how to do steps 1-5.)
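The steps above can be sketched in miniature without touching the network. Everything below is fabricated: the "index page" and "person pages" are hard-coded strings standing in for downloaded HTML, and the regexes are a stand-in for a proper parser like rvest.

```r
# Step 1: fake "pages" stored as strings, rather than live web requests
index_page <- '<a href="/person/1">Alice</a> <a href="/person/2">Bob</a>'
person_pages <- list(
  "/person/1" = "<p class='interests'>Elections</p>",
  "/person/2" = "<p class='interests'>Conflict</p>"
)

# Steps 2-3: extract the links from the index page
links <- regmatches(index_page,
                    gregexpr('href="([^"]+)"', index_page))[[1]]
links <- gsub('href="|"', "", links)

# Step 4: loop over the links, isolating the information we care about
interests <- character(length(links))
for (i in seq_along(links)) {
  page <- person_pages[[links[i]]]
  interests[i] <- gsub("<[^>]+>", "", page)  # strip the HTML tags
}

# Step 5: a nice and tidy data.frame
scraped <- data.frame(url = links, interests = interests)
```

A real scraper replaces the hard-coded strings with `read_html()` calls, but the loop structure is the same.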

Web-scraping Demonstration

  • We will scrape the research interests of members of faculty in the Department of Political Science at UCL

  • The departmental website has a list of faculty members

  • Each member of the department has a unique page

  • The research interests of the faculty member are stored on their unique page

  • Let’s look at an example…

Source code

  • To collect the information we want, we need to see how it is stored within the html code that underpins the website

  • Webpages include much more than what is immediately visible to visitors

  • Crucially, they include code which provides structure, style and functionality (which your browser interprets)

    • HTML provides structure
    • css provides style
    • JavaScript provides functionality
  • To implement a web-scraper, we have to work directly with the source code

    • Identifying the information on each page that we want to extract
    • Identifying links between pages that help us to navigate the page programmatically

To see the source code, use Ctrl + U or right click and select View/Show Page Source

Load initial page

We can read the source code of any website into R using the readLines() function.

library(tidyverse)


spp_home <- "https://www.ucl.ac.uk/social-historical-sciences/political-science/people/academic-teaching-and-research-staff"
spp_html <- readLines(spp_home)
spp_html[1647]
[1] ""
spp_html[grep("Lauge Poulsen", spp_html)[1]]
[1] "                  <img loading=\"lazy\" width=\"675\" height=\"422\" src=\"/social-historical-sciences/sites/social_historical_sciences/files/styles/wysiwyg_mobile/public/lauge_poulsen.png.jpg?itok=LP6KQw9e\" alt=\"Lauge Poulsen\">"

The structured tags (<td>, <p>, <a>, <img>) are how content is organized:

  • <td> — a table cell containing each person’s information
  • <a href="https://profiles.ucl.ac.uk/..."> — a clickable link to each person’s profile page
  • <p> — paragraphs containing the research description
  • <img> — the person’s photo

This is helpful, but it is awkward to navigate the source code directly.
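As a quick illustration of that awkwardness, pulling even one attribute out of raw HTML with base-R regular expressions is doable but fiddly (the tag below is a shortened version of the `<img>` line above):

```r
# Extracting the alt text from an <img> tag with a regular expression
img_line <- '<img loading="lazy" src="lauge_poulsen.png.jpg" alt="Lauge Poulsen">'
alt_text <- sub('.*alt="([^"]*)".*', "\\1", img_line)
```

This works for one tag, but breaks down quickly across a whole page, which is why we turn to a proper HTML parser next.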

Parse HTML

The read_html function in the rvest package allows us to read the HTML in a more structured format:

# install.packages("rvest")
library(rvest)

spp_page <- read_html(spp_home)
spp_page
{html_document}
<html lang="en" dir="ltr" prefix="og: https://ogp.me/ns#">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="colour-scheme--gordon-glow">\n        <a href="#main-content ...

Retrieve urls and names for each member of faculty

We can then navigate through the HTML by searching for elements that have common elements (using html_elements()):

spp_faculty_elements <- spp_page %>% #the main html
  html_elements("table a[href*='profiles.ucl.ac.uk']")
spp_faculty_elements[1:5]
{xml_nodeset (5)}
[1] <a href="https://profiles.ucl.ac.uk/9520-rod-abouharb" rel="nofollow"><st ...
[2] <a href="https://profiles.ucl.ac.uk/72700-valentina-amuso" rel="nofollow" ...
[3] <a href="https://profiles.ucl.ac.uk/86078-samer-anabtawi" rel="nofollow"> ...
[4] <a href="https://profiles.ucl.ac.uk/91233-phillip-ayoub" rel="nofollow">< ...
[5] <a href="https://profiles.ucl.ac.uk/1510-kristin-bakke" rel="nofollow"><s ...

The names of each faculty member are stored in the text associated with these elements:

as.character(spp_faculty_elements[27]) %>% print()
[1] "<a href=\"https://profiles.ucl.ac.uk/91231-gloria-gennaro\" rel=\"nofollow\"><strong>Dr Gloria Gennaro</strong></a>"

We can extract these names using the html_text2() function:

spp_faculty_names <- spp_faculty_elements %>% html_text2()
head(spp_faculty_names)
[1] "Dr M. Rodwan Abouharb"     "Dr Valentina Amuso"       
[3] "Dr Samer Anabtawi"         "Professor Phillip Ayoub"  
[5] "Professor Kristin M Bakke" "Dr Carlos Balcazar"       

Retrieve URL for each member of faculty

The URL for each faculty member is stored in the href attribute of the elements:

spp_urls <- spp_faculty_elements %>%
  html_attr("href")                                    # extract the href attribute from each link
                                                       # these are already full URLs like
                                                       # "https://profiles.ucl.ac.uk/62358-jack-blumenau"
head(spp_urls)
[1] "https://profiles.ucl.ac.uk/9520-rod-abouharb"     
[2] "https://profiles.ucl.ac.uk/72700-valentina-amuso" 
[3] "https://profiles.ucl.ac.uk/86078-samer-anabtawi"  
[4] "https://profiles.ucl.ac.uk/91233-phillip-ayoub"   
[5] "https://profiles.ucl.ac.uk/1510-kristin-bakke"    
[6] "https://profiles.ucl.ac.uk/101976-carlos-balcazar"

Storage

spp <- data.frame(name = spp_faculty_names, 
                  url = spp_urls, 
                  text = NA)

head(spp)
                       name                                               url
1     Dr M. Rodwan Abouharb      https://profiles.ucl.ac.uk/9520-rod-abouharb
2        Dr Valentina Amuso  https://profiles.ucl.ac.uk/72700-valentina-amuso
3         Dr Samer Anabtawi   https://profiles.ucl.ac.uk/86078-samer-anabtawi
4   Professor Phillip Ayoub    https://profiles.ucl.ac.uk/91233-phillip-ayoub
5 Professor Kristin M Bakke     https://profiles.ucl.ac.uk/1510-kristin-bakke
6        Dr Carlos Balcazar https://profiles.ucl.ac.uk/101976-carlos-balcazar
  text
1   NA
2   NA
3   NA
4   NA
5   NA
6   NA

Retrieve unique research paragraph for each faculty member

Retrieve unique page for each faculty member

library(rvest)
library(stringr)


jack_cell <- spp_page %>% # from the main html
  html_elements("table td:nth-child(2)") %>%   # get the second column of every table row
                                                 # (col 1 = photo, col 2 = name + title + bio)
  keep(~ grepl("Blumenau",                      # search for "Blumenau"
               html_text2(.x)))                  # in the visible text of each cell
                                                 # keep() retains only matching cells
jack_cell
{xml_nodeset (1)}
[1] <td>\n<p><a href="https://profiles.ucl.ac.uk/62358-jack-blumenau" rel="no ...
jack_full <- jack_cell %>% html_text2()
jack_full
[1] "Dr Jack Blumenau\nAssociate Professor of Political Science and Quantitative Research Methods\n\nDr Blumenau’s research addresses questions about what voters want, how politicians act, and how these preferences and behaviours interact to affect electoral outcomes and political representation in democratic systems."

We have the text for one person! How do we get this for all faculty members?

for loops

  • A for loop is a control structure in programming that allows repeating a set of operations multiple times.
  • It works by iterating over a sequence of elements (such as a vector or a list) and executing a block of code for each element in the sequence.
  • In R, the syntax for a for loop is as follows:
for (variable in sequence) {
  # code to be executed for each element in the sequence
}
  • Example:
for (i in 1:10) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

for loops

We can use a for loop to loop over the elements of our url variable

# Step 1: Get all info cells from the surname table
# Each row has two columns: (1) photo, (2) name + title + bio
# We grab the second column of every row
all_cells <- spp_page %>%
  html_elements("table td:nth-child(2)")

for(i in 1:nrow(spp)){

  # Step 2: Get the last name for person i from our data frame
  last_name <- word(spp$name[i], -1)       # word() extracts the last word
                                            # e.g. "Dr Jack Blumenau" → "Blumenau"

  # Step 3: Search all cells for one containing that last name
  person_cell <- all_cells %>%
    keep(~ grepl(last_name,                 # search for the last name
                 html_text2(.x),            # in the text of each cell
                 #.x is a placeholder for the current element
                 #html_text2 extracts visible text
                 fixed = TRUE))             # exact match (no regex)

  # Step 4: Save the text, or NA if no match found
  if(length(person_cell) > 0){
    spp$text[i] <- html_text2(person_cell[[1]])  # take first match
  } else {
    spp$text[i] <- NA                             # person not in table
  }
}

Output

spp[1,3]
[1] "Dr M. Rodwan Abouharb\nAssociate Professor in International Relations\n\nDr Abouharb’s research places particular emphasis on understanding how both domestic and international socio-economic processes affect the human security of citizens around the world."
spp[40,3]
[1] "Professor Benjamin Lauderdale\nProfessor of Political Science\n\nProf Lauderdale’s research is focused on developing new designs for highly multidimensional survey experiments that enable us to better measure key concepts relevant to public opinion and political behaviour."
spp[10,3]
[1] "Dr Jeremy Bowles\nLecturer in Comparative Politics\n\nDr Bowles works on the political economy of development, mostly focused on sub-Saharan Africa."

Topic Model for Departmental Research Interests

  • Let’s use this data to estimate a topic model

  • Two questions in this application:

    • What are the topics that feature in the staff research profiles?
    • Which staff members are most highly associated with each topic?

Topic Model for Departmental Research Interests

library(stm)
library(quanteda)

spp <- spp %>% 
  filter(!is.na(text) & text != "")   # remove rows where the text is missing (NA)
                                       # or empty ("") — these are people whose
                                       # last name didn't match in the surname table


## Create dfm
spp_corpus <- spp %>% 
  corpus(text_field = "text")

spp_dfm <- spp_corpus %>%
  tokens(remove_punct = TRUE, 
         remove_symbols = TRUE, 
         remove_numbers = TRUE, 
         remove_url = TRUE) %>%
  dfm() %>%
  dfm_remove(c(stopwords("en"), "book",
               "journal", "professor",
               "prof", "Prof.", "associate",
               "teaching", "study","project","focus",
               "focuses","focused", "interests", "particular",
               "lecturer", "well", "including", 
               "include","dr","dr.","interested","works",
               "studies","fellow","director", "emphasis", "within")) %>%
dfm_trim(min_termfreq = 4,              # drop words that appear fewer than 4 times total
         min_docfreq = 4) %>%            # drop words that appear in fewer than 4 documents
dfm_trim(max_docfreq = .9,              # drop words that appear in more than 90% of documents
         docfreq_type = "prop")          # interpret max_docfreq as a proportion (not a count)
                                          # these are too common to be informative (e.g. "research")

## Estimate STM
stmOut <- stm(
            documents = spp_dfm,
            K = 12,
            seed = 123,
            verbose = FALSE #run silently
            )
         
#save(stmOut, file = "stmOut.Rdata")

Topic Model for Departmental Research Interests

plot(stmOut,
     labeltype = "frex")

Conclusion

Summing Up

  1. Think about your research projects now!

    • Start by identifying a paper that you might review
    • Think about a substantive question you would like to answer
    • Look for data that would help you to answer the question
  2. There are several possible sources of data for these projects

    • Existing datasources
    • Data collected via an API
    • Data collected via web-scraping
  3. Data collection is a major part of any research project – it is good to practice this step!