1  Text As Data

1.1 Introduction to Quanteda

This exercise is designed to get you working with the quanteda package and some other associated packages. The focus will be on exploring the package, getting some texts into the corpus object format, learning how to convert texts into document-feature matrices, and performing descriptive analyses on this data.

If you did not take PUBL0055 last term, or if you are struggling to remember R from all the way back in December, then you should work through the exercises on the R refresher page before completing this assignment.

1.1.1 Data

The Presidential Inaugural Corpus: inaugural.csv

This data includes the texts of 59 US presidential inaugural addresses from 1789 to the present. It also includes the following variables:

Variable    Description
Year        Year of inaugural address
President   President’s last name
FirstName   President’s first name (and possibly middle initial)
Party       Name of the President’s political party
text        Text of the inaugural address

Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following command:

inaugural <- read_csv("inaugural.csv") # read_csv() comes from the readr package, loaded below via the tidyverse

1.1.2 Packages

You will need to install and load the following packages before beginning the assignment.

Run the following lines of code to install these packages. Note that you only need to install packages once on your machine. Once they are installed, you can delete these lines of code.

install.packages("quanteda") # Main quanteda package
install.packages("quanteda.textplots") # Package with helpful plotting functions
install.packages("quanteda.textstats") # Package with helpful statistical functions
install.packages("tidyverse") # Set of packages helpful for data manipulation

With these packages installed on your computer, you now need to load them so that the functions stored in the packages are available to you in this R session. You will need to run the following lines each time you want to use functions from these packages.

library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(tidyverse)

1.2 Creating a corpus

1.2.1 Getting Help

Coding in R can be a frustrating experience. Fortunately, there are many ways of finding help when you are stuck.

  • The website http://quanteda.io has extensive documentation.

  • You can view the help file for any function by typing ?function_name

  • You can use the example() function for any function in the package to run the examples and see how the function works.

Start by reading the help file for the corpus() function.

Reveal code
?corpus
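For instance, running example(corpus) executes the example code from the corpus() help file and prints its output:

example(corpus) # run the examples from the ?corpus help file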

1.2.2 Making a corpus and corpus structure

A corpus object is the foundation for all the analysis we will be doing in quanteda. The first thing to do when you load some text data into R is to convert it using the corpus() function.

  1. The simplest way to create a corpus is to use a set of texts already present in R’s global environment. In our case, we previously loaded the inaugural.csv file and stored it as the inaugural object. Let’s have a look at this object to see what it contains. Use the head() function applied to the inaugural object and report the output. Which variable includes the texts of the inaugural addresses?
Reveal code
head(inaugural)
# A tibble: 6 × 5
   Year President  FirstName Party                 text                         
  <dbl> <chr>      <chr>     <chr>                 <chr>                        
1  1789 Washington George    none                  "Fellow-Citizens of the Sena…
2  1793 Washington George    none                  "Fellow citizens, I am again…
3  1797 Adams      John      Federalist            "When it was first perceived…
4  1801 Jefferson  Thomas    Democratic-Republican "Friends and Fellow Citizens…
5  1805 Jefferson  Thomas    Democratic-Republican "Proceeding, fellow citizens…
6  1809 Madison    James     Democratic-Republican "Unwilling to depart from ex…

The output tells us that this is a “tibble” (which is just a special type of data.frame) and we can see the first six lines of the data. The column labelled text contains the texts of the inaugural addresses.

  2. Use the corpus() function on this set of texts to create a new corpus. The first argument to corpus() should be the inaugural object. You will also need to set the text_field argument to "text" so that quanteda knows which variable contains the texts we are interested in.
Reveal code
inaugural_corpus <- corpus(inaugural, text_field = "text")
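As a quick sanity check (optional), the ndoc() function reports the number of documents in a corpus; here it should return 59, one document per address:

ndoc(inaugural_corpus) # number of documents: should be 59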
  3. Once you have constructed this corpus, use the summary() method to see a brief description of the corpus. Which inaugural address was the longest in terms of the number of sentences?
Reveal code
summary(inaugural_corpus)
Corpus consisting of 59 documents, showing 59 documents:

   Text Types Tokens Sentences Year  President       FirstName                 Party
  text1   625   1537        23 1789 Washington          George                  none
  text2    96    147         4 1793 Washington          George                  none
  text3   826   2577        37 1797      Adams            John            Federalist
  text4   717   1923        41 1801  Jefferson          Thomas Democratic-Republican
  text5   804   2380        45 1805  Jefferson          Thomas Democratic-Republican
  text6   535   1261        21 1809    Madison           James Democratic-Republican
  text7   541   1302        33 1813    Madison           James Democratic-Republican
  text8  1040   3677       121 1817     Monroe           James Democratic-Republican
  text9  1259   4886       131 1821     Monroe           James Democratic-Republican
 text10  1003   3147        74 1825      Adams     John Quincy Democratic-Republican
 text11   517   1208        25 1829    Jackson          Andrew            Democratic
 text12   499   1267        29 1833    Jackson          Andrew            Democratic
 text13  1315   4158        95 1837  Van Buren          Martin            Democratic
 text14  1898   9123       210 1841   Harrison   William Henry                  Whig
 text15  1334   5186       153 1845       Polk      James Knox                  Whig
 text16   496   1178        22 1849     Taylor         Zachary                  Whig
 text17  1165   3636       104 1853     Pierce        Franklin            Democratic
 text18   945   3083        89 1857   Buchanan           James            Democratic
 text19  1075   3999       135 1861    Lincoln         Abraham            Republican
 text20   360    775        26 1865    Lincoln         Abraham            Republican
 text21   485   1229        40 1869      Grant      Ulysses S.            Republican
 text22   552   1472        43 1873      Grant      Ulysses S.            Republican
 text23   831   2707        59 1877      Hayes   Rutherford B.            Republican
 text24  1021   3209       111 1881   Garfield        James A.            Republican
 text25   676   1816        44 1885  Cleveland          Grover            Democratic
 text26  1352   4721       157 1889   Harrison        Benjamin            Republican
 text27   821   2125        58 1893  Cleveland          Grover            Democratic
 text28  1232   4353       130 1897   McKinley         William            Republican
 text29   854   2437       100 1901   McKinley         William            Republican
 text30   404   1079        33 1905  Roosevelt        Theodore            Republican
 text31  1437   5821       158 1909       Taft  William Howard            Republican
 text32   658   1882        68 1913     Wilson         Woodrow            Democratic
 text33   549   1652        59 1917     Wilson         Woodrow            Democratic
 text34  1169   3719       148 1921    Harding       Warren G.            Republican
 text35  1220   4440       196 1925   Coolidge          Calvin            Republican
 text36  1090   3860       158 1929     Hoover         Herbert            Republican
 text37   743   2057        85 1933  Roosevelt     Franklin D.            Democratic
 text38   725   1989        96 1937  Roosevelt     Franklin D.            Democratic
 text39   526   1519        68 1941  Roosevelt     Franklin D.            Democratic
 text40   275    633        27 1945  Roosevelt     Franklin D.            Democratic
 text41   781   2504       116 1949     Truman        Harry S.            Democratic
 text42   900   2743       119 1953 Eisenhower       Dwight D.            Republican
 text43   621   1907        92 1957 Eisenhower       Dwight D.            Republican
 text44   566   1541        52 1961    Kennedy         John F.            Democratic
 text45   568   1710        93 1965    Johnson   Lyndon Baines            Democratic
 text46   743   2416       103 1969      Nixon Richard Milhous            Republican
 text47   544   1995        68 1973      Nixon Richard Milhous            Republican
 text48   527   1369        52 1977     Carter           Jimmy            Democratic
 text49   902   2780       129 1981     Reagan          Ronald            Republican
 text50   925   2909       123 1985     Reagan          Ronald            Republican
 text51   795   2673       141 1989       Bush          George            Republican
 text52   642   1833        81 1993    Clinton            Bill            Democratic
 text53   773   2436       111 1997    Clinton            Bill            Democratic
 text54   621   1806        97 2001       Bush       George W.            Republican
 text55   772   2312        99 2005       Bush       George W.            Republican
 text56   938   2689       110 2009      Obama          Barack            Democratic
 text57   814   2317        88 2013      Obama          Barack            Democratic
 text58   582   1660        88 2017      Trump       Donald J.            Republican
 text59   812   2766       216 2021      Biden       Joseph R.            Democratic

Joe Biden’s 2021 address had the largest number of sentences (216).

  4. Note that although we specified text_field = "text" when constructing the corpus, we have not removed the metadata associated with the texts. To access the other variables, we can use the docvars() function applied to the corpus object that we created above. Try this now.
Reveal code
head(docvars(inaugural_corpus))
  Year  President FirstName                 Party
1 1789 Washington    George                  none
2 1793 Washington    George                  none
3 1797      Adams      John            Federalist
4 1801  Jefferson    Thomas Democratic-Republican
5 1805  Jefferson    Thomas Democratic-Republican
6 1809    Madison     James Democratic-Republican
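Note that docvars() can also be used for assignment, which is a convenient way to add new document-level variables. As an optional aside (the variable name here is just for illustration), we could flag addresses delivered from 1945 onwards:

docvars(inaugural_corpus, "PostWar") <- docvars(inaugural_corpus)$Year >= 1945 # new logical docvar
table(docvars(inaugural_corpus)$PostWar) # count pre- and post-1945 addresses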

1.3 Tokenizing texts

In order to count word frequencies, we first need to split the text into words (or longer phrases) through a process known as tokenization. Look at the documentation for quanteda’s tokens() function.

  1. Use the tokens() function on the corpus object that we created above, and examine the results.
Reveal code
inaugural_tokens <- tokens(inaugural_corpus)
  2. Experiment with some of the arguments of the tokens() function, such as remove_punct and remove_numbers.
Reveal code
inaugural_tokens <- tokens(inaugural_corpus, remove_punct = TRUE, remove_numbers = TRUE)
  3. Try tokenizing inaugural_corpus into sentences, using tokens(x, what = "sentence").
Reveal code
inaugural_sentences <- tokens(inaugural_corpus, what = "sentence")
inaugural_sentences[1:2]
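Because each token in this object is a whole sentence, applying ntoken() to it reproduces the sentence counts reported by summary() above:

head(ntoken(inaugural_sentences)) # number of sentences in each address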

1.4 Creating a dfm()

Document-feature matrices are the standard way of representing text as quantitative data. Fortunately, it is very simple to convert the tokens objects in quanteda into dfms.

  1. Create a document-feature matrix, using dfm applied to the tokenized object that you created above. First, read the documentation using ?dfm to see the available options. Once you have created the dfm, use the topfeatures() function to inspect the top 20 most frequently occurring features in the dfm. What kinds of words do you see?
Reveal code
inaugural_dfm <- dfm(inaugural_tokens)
inaugural_dfm
Document-feature matrix of: 59 documents, 9,351 features (91.85% sparse) and 4 docvars.
       features
docs    fellow-citizens  of the senate and house representatives among
  text1               1  71 116      1  48     2               2     1
  text2               0  11  13      0   2     0               0     0
  text3               3 140 163      1 130     0               2     4
  text4               2 104 130      0  81     0               0     1
  text5               0 101 143      0  93     0               0     7
  text6               1  69 104      0  43     0               0     0
       features
docs    vicissitudes incident
  text1            1        1
  text2            0        0
  text3            0        0
  text4            0        0
  text5            0        0
  text6            0        0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,341 more features ]
topfeatures(inaugural_dfm, 20)
  the    of   and    to    in     a   our    we  that    be    is    it   for 
10183  7180  5406  4591  2827  2292  2224  1827  1813  1502  1491  1398  1230 
   by  have which   not  with    as  will 
 1091  1031  1007   980   970   966   944 

Mostly stopwords: very common function words such as “the”, “of”, and “and”, which carry little substantive meaning.

  2. Experiment with different dfm_* functions, such as dfm_wordstem(), dfm_remove() and dfm_trim(). These functions allow you to reduce the size of the dfm following its construction. How does the number of features in your dfm change as you apply these functions to the dfm object you created in the question above?

Hint: You can use the dim() function to see the number of rows and columns in your dfms.

Reveal code
dim(inaugural_dfm)
[1]   59 9351
dim(dfm_wordstem(inaugural_dfm))
[1]   59 5508
dim(dfm_remove(inaugural_dfm, pattern = c("of", "the", "and")))
[1]   59 9348
dim(dfm_trim(inaugural_dfm, min_termfreq = 5, min_docfreq = 0.01, termfreq_type = "count", docfreq_type = "prop"))
[1]   59 2738
  3. Use the dfm_remove() function to remove English-language stopwords from this data. You can get a list of English stopwords by using stopwords("english").
Reveal code
inaugural_dfm_nostops <- dfm_remove(inaugural_dfm, pattern = stopwords("en"))
inaugural_dfm_nostops
Document-feature matrix of: 59 documents, 9,213 features (92.66% sparse) and 4 docvars.
       features
docs    fellow-citizens senate house representatives among vicissitudes
  text1               1      1     2               2     1            1
  text2               0      0     0               0     0            0
  text3               3      1     0               2     4            0
  text4               2      0     0               0     1            0
  text5               0      0     0               0     7            0
  text6               1      0     0               0     0            0
       features
docs    incident life event filled
  text1        1    1     2      1
  text2        0    0     0      0
  text3        0    2     0      0
  text4        0    1     0      0
  text5        0    2     0      0
  text6        0    1     0      1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,203 more features ]

1.5 Subsetting a Corpus

You can easily use quanteda to subset a corpus. There is a corpus_subset() method defined for a corpus, which works just like R’s normal subset() command. For instance, if you want a wordcloud of just Obama’s two inaugural addresses, you would need to subset the corpus first:

obama_corpus <- corpus_subset(inaugural_corpus, President == "Obama")
obama_tokens <- tokens(obama_corpus)
obama_dfm <- dfm(obama_tokens)
textplot_wordcloud(obama_dfm)

  1. Try producing the same plot as above, but without the stopwords and without punctuation.

Hint: To remove stopwords, use dfm_remove(). To remove punctuation, pass remove_punct = TRUE to the tokens() function.

Reveal code
obama_tokens <- tokens(obama_corpus, remove_punct = TRUE)
obama_dfm <- dfm(obama_tokens)
obama_dfm <- dfm_remove(obama_dfm, pattern = stopwords("en"))
textplot_wordcloud(obama_dfm)
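textplot_wordcloud() takes several arguments for tuning the plot. For example (optional), max_words caps how many words are drawn:

textplot_wordcloud(obama_dfm, max_words = 100) # plot at most the 100 most frequent words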

1.6 Pipes

In the lecture we learned about the %>% “pipe” operator, which allows us to chain together different functions so that the output of one function gets passed directly as input to another function. We can use these pipes to simplify our code and to make it somewhat easier to read.

For instance, we could join together the corpus-construction and tokenisation steps that we did separately above using a pipe:

inaugural_tokens <- inaugural %>% # Take the original data object
  corpus(text_field = "text") %>% # ...convert to a corpus
  tokens(remove_punct = TRUE) #...and then tokenize

inaugural_tokens[1]
Tokens consisting of 1 document and 4 docvars.
text1 :
 [1] "Fellow-Citizens" "of"              "the"             "Senate"         
 [5] "and"             "of"              "the"             "House"          
 [9] "of"              "Representatives" "Among"           "the"            
[ ... and 1,418 more ]
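As an aside, recent versions of R (4.1 and later) also include a native pipe operator, |>, which works in much the same way for chains like this one:

inaugural_tokens <- inaugural |> # the same chain using the native pipe
  corpus(text_field = "text") |>
  tokens(remove_punct = TRUE)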
  1. Write some code using the %>% operator that does the following: a) creates a corpus; b) tokenizes the texts; c) creates a dfm; d) removes stopwords; and e) reports the top features of the resulting dfm.
Reveal code
inaugural %>% # Take the original data object
  corpus(text_field = "text") %>% # ...convert to a corpus
  tokens(remove_punct = TRUE) %>% #... tokenize
  dfm() %>% #...convert to a dfm
  dfm_remove(pattern = stopwords("english")) %>% # ...remove stopwords
  topfeatures() # ...report top features
    people government         us        can       must       upon      great 
       584        564        505        487        376        371        344 
       may     states      world 
       343        334        319 

1.6.1 Descriptive statistics

  1. Use the ntoken() and ntype() functions on the inaugural_corpus object. Create a plot showing the relationship between the quantities that you calculate from these functions.
Reveal code
inaugural_ntokens <- ntoken(inaugural_corpus)
inaugural_ntypes <- ntype(inaugural_corpus)

plot(inaugural_ntokens,
     inaugural_ntypes,
     xlab = "N tokens",
     ylab = "N types")

  2. One simple measure of lexical diversity is the type-to-token ratio: \(\frac{\text{N types}}{\text{N tokens}}\). Calculate this ratio for each of the inaugural addresses using the objects that you created in the question above. Plot this measure against the Year variable associated with each of the texts.

Hint: Remember that you can access the metadata associated with each of the texts in the corpus using the docvars() function.

Reveal code
inaugural_ttr <- inaugural_ntypes/inaugural_ntokens

plot(docvars(inaugural_corpus)$Year,
     inaugural_ttr,
     xlab = "Year",
     ylab = "Type-Token Ratio",
     type = "b",
     bty = "n")
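As an optional cross-check, the quanteda.textstats package includes textstat_lexdiv(), which can compute the TTR (and other lexical diversity measures) directly from a tokens object:

inaugural_lexdiv <- textstat_lexdiv(tokens(inaugural_corpus), measure = "TTR") # TTR per document
head(inaugural_lexdiv)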

1.7 Key-Words-In-Context

quanteda provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Use the kwic() function (for “keywords-in-context”) to explore how a specific word or phrase is used in this corpus (use the word-based tokenization that you implemented above). You can look at the help file (?kwic) to see the arguments that the function takes.

  1. Use the kwic() function to see how the word “terror” is used.

Hint: By default, kwic gives exact matches for a given pattern. What if we wanted to see words like “terrorism” and “terrorist” rather than exactly “terror”? We can use the wildcard character * to expand our search by appending it to the end of the pattern we are using to search. For example, we could use "terror*".

Reveal code
kwic(inaugural_tokens, "terror", 3)
Keyword-in-context with 8 matches.
  [text3, 1190]                   or violence by | terror | intrigue or venality
   [text37, 99] nameless unreasoning unjustified | terror | which paralyzes needed
  [text39, 258]                  by a fatalistic | terror | we proved that
  [text44, 761]             uncertain balance of | terror | that stays the
  [text49, 700]               Americans from the | terror | of runaway living
  [text53, 921]                the fanaticism of | terror | And they torment
 [text53, 1454]           strong defense against | terror | and destruction Our
 [text56, 1442]                 aims by inducing | terror | and slaughtering innocents
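Following the hint above, appending the * wildcard broadens the search to related word forms:

kwic(inaugural_tokens, "terror*", 3) # also matches forms such as "terrorism" and "terrorist"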
  2. Try substituting your own search terms into the kwic() function.
Reveal code
head(kwic(inaugural_tokens, "america", 3))
Keyword-in-context with 6 matches.                                                                           
   [text2, 59]      people of united | America | Previous to the           
   [text3, 14]     middle course for | America | remained between unlimited
  [text3, 385]         the people of | America | were not abandoned        
 [text3, 1272]         the people of | America | have exhibited to         
 [text3, 1791] aboriginal nations of | America | and a disposition         
 [text3, 1929]         the people of | America | and the internal          
head(kwic(inaugural_tokens, "democracy", 3))
Keyword-in-context with 6 matches.
 [text10, 1424] a confederated representative | democracy | were a government
  [text14, 497]                    to that of | democracy | If such is
 [text14, 1474]       a simple representative | democracy | or republic and
 [text14, 6900]                   the name of | democracy | they speak warning
 [text14, 7289]                of devotion to | democracy | The foregoing remarks
  [text34, 970]      temple of representative | democracy | to be not
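kwic() can also match multi-word sequences if the pattern is wrapped in phrase(). For example (optional):

head(kwic(inaugural_tokens, phrase("united states"), 3)) # match the two words as a sequence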

1.8 Homework

The file sdg_goals_targets.csv contains the text of the UN Sustainable Development Goals and Targets that we saw in the lecture. Load this data into your R session, create a corpus using the long_description variable, tokenize the corpus, and create a dfm. Use the dfm to create a wordcloud.

sdg <- read_csv("sdg_goals_targets.csv") # load the SDG data first

sdg_dfm <- sdg %>%
  corpus(text_field = "long_description") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  dfm_remove(pattern = stopwords("english"))

textplot_wordcloud(sdg_dfm)

Upload your plot to this Moodle page.