1.1 Introduction to Quanteda
This exercise is designed to get you working with the quanteda package and some other associated packages. The focus will be on exploring the package, getting some texts into the corpus
object format, learning how to convert texts into document-feature matrices, and performing descriptive analyses on this data.
If you did not take PUBL0055 last term, or if you are struggling to remember R from all the way back in December, then you should work through the exercises on the R refresher page before completing this assignment.
1.1.1 Data
The Presidential Inaugural Corpus – inaugural.csv
This data includes the texts of 59 US presidential inaugural addresses from 1789 to the present. It also includes the following variables:
| Variable | Description |
|---|---|
| Year | Year of inaugural address |
| President | President’s last name |
| FirstName | President’s first name (and possibly middle initial) |
| Party | Name of the President’s political party |
| text | Text of the inaugural address |
Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following commands:
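The load commands referenced here were not preserved in this copy of the page. A minimal sketch, assuming the file is saved in your working directory (read_csv() from the readr/tidyverse package returns the tibble shown later; base R's read.csv() would also work):

```r
# Load the inaugural addresses into an object called `inaugural`
# (adjust the file path to wherever you saved inaugural.csv)
inaugural <- readr::read_csv("inaugural.csv")
```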
1.1.2 Packages
You will need to install and load the following packages before beginning the assignment.
Run the following lines of code to install these packages. Note that you only need to install packages once on your machine. Once they are installed, you can delete these lines of code.
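The installation code was not preserved in this copy. A sketch, assuming the packages this exercise relies on are quanteda, quanteda.textplots (for wordclouds), and tidyverse (for pipes and data loading):

```r
# Run these once per machine, then delete them
install.packages("quanteda")
install.packages("quanteda.textplots")
install.packages("tidyverse")
```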
With these packages installed onto your computer, you now need to load them so that the functions stored in the packages are available to you in this R session. You will need to run the following lines each time you want to use functions from these packages.
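A sketch of the corresponding loading step, under the same assumption about which packages are needed:

```r
# Run these at the start of each R session
library(quanteda)
library(quanteda.textplots)
library(tidyverse)
```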
1.2 Creating a corpus
1.2.1 Getting Help
Coding in R can be a frustrating experience. Fortunately, there are many ways of finding help when you are stuck.
The website http://quanteda.io has extensive documentation.
You can view the help file for any function by typing ?function_name.
You can use the example() function for any function in the package to run the examples and see how the function works.
Start by reading the help file for the corpus() function.
1.2.2 Making a corpus and corpus structure
A corpus object is the foundation for all the analysis we will be doing in quanteda. The first thing to do when you load some text data into R is to convert it using the corpus() function.
- The simplest way to create a corpus is to use a set of texts already present in R’s global environment. In our case, we previously loaded the inaugural.csv file and stored it as the inaugural object. Let’s have a look at this object to see what it contains. Use the head() function applied to the inaugural object and report the output. Which variable includes the texts of the inaugural addresses?
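The code for this step is not preserved in this copy; given the output below, it was presumably:

```r
head(inaugural)
```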
# A tibble: 6 × 5
Year President FirstName Party text
<dbl> <chr> <chr> <chr> <chr>
1 1789 Washington George none "Fellow-Citizens of the Sena…
2 1793 Washington George none "Fellow citizens, I am again…
3 1797 Adams John Federalist "When it was first perceived…
4 1801 Jefferson Thomas Democratic-Republican "Friends and Fellow Citizens…
5 1805 Jefferson Thomas Democratic-Republican "Proceeding, fellow citizens…
6 1809 Madison James Democratic-Republican "Unwilling to depart from ex…
The output tells us that this is a “tibble” (which is just a special type of data.frame) and we can see the first six lines of the data. The column labelled text contains the texts of the inaugural addresses.
- Use the corpus() function on this set of texts to create a new corpus. The first argument to corpus() should be the inaugural object. You will also need to set the text_field argument to be equal to "text" so that quanteda knows that the text we are interested in is saved in that variable.
- Once you have constructed this corpus, use the summary() method to see a brief description of the corpus. Which inaugural address was the longest in terms of the number of sentences?
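The code for this step is not preserved in this copy. A sketch, assuming the corpus is stored as inaugural_corpus (the name used in the subsetting example later in this exercise):

```r
inaugural_corpus <- corpus(inaugural, text_field = "text")
summary(inaugural_corpus, n = 59)  # n = 59 prints a row for every document
```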
Corpus consisting of 59 documents, showing 59 documents:
Text Types Tokens Sentences Year President FirstName
text1 625 1537 23 1789 Washington George
text2 96 147 4 1793 Washington George
text3 826 2577 37 1797 Adams John
text4 717 1923 41 1801 Jefferson Thomas
text5 804 2380 45 1805 Jefferson Thomas
text6 535 1261 21 1809 Madison James
text7 541 1302 33 1813 Madison James
text8 1040 3677 121 1817 Monroe James
text9 1259 4886 131 1821 Monroe James
text10 1003 3147 74 1825 Adams John Quincy
text11 517 1208 25 1829 Jackson Andrew
text12 499 1267 29 1833 Jackson Andrew
text13 1315 4158 95 1837 Van Buren Martin
text14 1898 9123 210 1841 Harrison William Henry
text15 1334 5186 153 1845 Polk James Knox
text16 496 1178 22 1849 Taylor Zachary
text17 1165 3636 104 1853 Pierce Franklin
text18 945 3083 89 1857 Buchanan James
text19 1075 3999 135 1861 Lincoln Abraham
text20 360 775 26 1865 Lincoln Abraham
text21 485 1229 40 1869 Grant Ulysses S.
text22 552 1472 43 1873 Grant Ulysses S.
text23 831 2707 59 1877 Hayes Rutherford B.
text24 1021 3209 111 1881 Garfield James A.
text25 676 1816 44 1885 Cleveland Grover
text26 1352 4721 157 1889 Harrison Benjamin
text27 821 2125 58 1893 Cleveland Grover
text28 1232 4353 130 1897 McKinley William
text29 854 2437 100 1901 McKinley William
text30 404 1079 33 1905 Roosevelt Theodore
text31 1437 5821 158 1909 Taft William Howard
text32 658 1882 68 1913 Wilson Woodrow
text33 549 1652 59 1917 Wilson Woodrow
text34 1169 3719 148 1921 Harding Warren G.
text35 1220 4440 196 1925 Coolidge Calvin
text36 1090 3860 158 1929 Hoover Herbert
text37 743 2057 85 1933 Roosevelt Franklin D.
text38 725 1989 96 1937 Roosevelt Franklin D.
text39 526 1519 68 1941 Roosevelt Franklin D.
text40 275 633 27 1945 Roosevelt Franklin D.
text41 781 2504 116 1949 Truman Harry S.
text42 900 2743 119 1953 Eisenhower Dwight D.
text43 621 1907 92 1957 Eisenhower Dwight D.
text44 566 1541 52 1961 Kennedy John F.
text45 568 1710 93 1965 Johnson Lyndon Baines
text46 743 2416 103 1969 Nixon Richard Milhous
text47 544 1995 68 1973 Nixon Richard Milhous
text48 527 1369 52 1977 Carter Jimmy
text49 902 2780 129 1981 Reagan Ronald
text50 925 2909 123 1985 Reagan Ronald
text51 795 2673 141 1989 Bush George
text52 642 1833 81 1993 Clinton Bill
text53 773 2436 111 1997 Clinton Bill
text54 621 1806 97 2001 Bush George W.
text55 772 2312 99 2005 Bush George W.
text56 938 2689 110 2009 Obama Barack
text57 814 2317 88 2013 Obama Barack
text58 582 1660 88 2017 Trump Donald J.
text59 812 2766 216 2021 Biden Joseph R.
Party
none
none
Federalist
Democratic-Republican
Democratic-Republican
Democratic-Republican
Democratic-Republican
Democratic-Republican
Democratic-Republican
Democratic-Republican
Democratic
Democratic
Democratic
Whig
Whig
Whig
Democratic
Democratic
Republican
Republican
Republican
Republican
Republican
Republican
Democratic
Republican
Democratic
Republican
Republican
Republican
Republican
Democratic
Democratic
Republican
Republican
Republican
Democratic
Democratic
Democratic
Democratic
Democratic
Republican
Republican
Democratic
Democratic
Republican
Republican
Democratic
Republican
Republican
Republican
Democratic
Democratic
Republican
Republican
Democratic
Democratic
Republican
Democratic
Joe Biden’s 2021 address had the largest number of sentences (216).
- Note that although we specified text_field = "text" when constructing the corpus, we have not removed the metadata associated with the texts. To access the other variables, we can use the docvars() function applied to the corpus object that we created above. Try this now.
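A sketch of this step, again assuming the corpus object is called inaugural_corpus:

```r
# The document-level metadata (Year, President, FirstName, Party)
head(docvars(inaugural_corpus))
```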
1.3 Tokenizing texts
In order to count word frequencies, we first need to split the text into words (or longer phrases) through a process known as tokenization. Look at the documentation for quanteda
’s tokens()
function.
- Use the tokens() command on the corpus object that we created above, and examine the results.
- Experiment with some of the arguments of the tokens() function, such as remove_punct and remove_numbers.
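The code for these experiments is not preserved in this copy. A sketch, assuming the corpus object is called inaugural_corpus:

```r
# Default tokenization keeps punctuation and numbers
inaugural_tokens <- tokens(inaugural_corpus)

# Dropping punctuation and numbers shrinks the token counts
inaugural_tokens_clean <- tokens(inaugural_corpus,
                                 remove_punct = TRUE,
                                 remove_numbers = TRUE)
```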
- Try tokenizing data_corpus_inaugural into sentences, using tokens(x, what = "sentence").
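A sketch of the sentence-level tokenization described in the bullet above:

```r
inaugural_sentences <- tokens(data_corpus_inaugural, what = "sentence")
inaugural_sentences[1]  # the sentences of the first address
```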
1.4 Creating a dfm()
Document-feature matrices are the standard way of representing text as quantitative data. Fortunately, it is very simple to convert the tokens objects in quanteda into dfms.
- Create a document-feature matrix, using dfm() applied to the tokenized object that you created above. First, read the documentation using ?dfm to see the available options. Once you have created the dfm, use the topfeatures() function to inspect the top 20 most frequently occurring features in the dfm. What kinds of words do you see?
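The code for this step is not preserved in this copy. A sketch, assuming the word-level tokens object is called inaugural_tokens:

```r
inaugural_dfm <- dfm(inaugural_tokens)
inaugural_dfm
topfeatures(inaugural_dfm, n = 20)
```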
Document-feature matrix of: 59 documents, 9,351 features (91.85% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and house representatives among
text1 1 71 116 1 48 2 2 1
text2 0 11 13 0 2 0 0 0
text3 3 140 163 1 130 0 2 4
text4 2 104 130 0 81 0 0 1
text5 0 101 143 0 93 0 0 7
text6 1 69 104 0 43 0 0 0
features
docs vicissitudes incident
text1 1 1
text2 0 0
text3 0 0
text4 0 0
text5 0 0
text6 0 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,341 more features ]
the of and to in a our we that be is it for
10183 7180 5406 4591 2827 2292 2224 1827 1813 1502 1491 1398 1230
by have which not with as will
1091 1031 1007 980 970 966 944
Mostly stop words.
- Experiment with different dfm_* functions, such as dfm_wordstem(), dfm_remove() and dfm_trim(). These functions allow you to reduce the size of the dfm following its construction. How does the number of features in your dfm change as you apply these functions to the dfm object you created in the question above?
Hint: You can use the dim() function to see the number of rows and columns in your dfms.
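A sketch of these experiments, assuming the dfm is stored as inaugural_dfm:

```r
dim(inaugural_dfm)                                    # rows = documents, columns = features
dim(dfm_wordstem(inaugural_dfm))                      # stemming merges e.g. "nation"/"nations"
dim(dfm_remove(inaugural_dfm, stopwords("english")))  # drops stopword features
dim(dfm_trim(inaugural_dfm, min_termfreq = 5))        # drops features occurring fewer than 5 times
```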
- Use the dfm_remove() function to remove English-language stopwords from this data. You can get a list of English stopwords by using stopwords("english").
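The code for this step is not preserved in this copy; a sketch, assuming the dfm is stored as inaugural_dfm:

```r
inaugural_dfm_nostop <- dfm_remove(inaugural_dfm, pattern = stopwords("english"))
inaugural_dfm_nostop
```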
Document-feature matrix of: 59 documents, 9,213 features (92.66% sparse) and 4 docvars.
features
docs fellow-citizens senate house representatives among vicissitudes
text1 1 1 2 2 1 1
text2 0 0 0 0 0 0
text3 3 1 0 2 4 0
text4 2 0 0 0 1 0
text5 0 0 0 0 7 0
text6 1 0 0 0 0 0
features
docs incident life event filled
text1 1 1 2 1
text2 0 0 0 0
text3 0 2 0 0
text4 0 1 0 0
text5 0 2 0 0
text6 0 1 0 1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,203 more features ]
1.5 Subsetting a corpus
You can easily use quanteda to subset a corpus. There is a corpus_subset() method defined for a corpus, which works just like R’s normal subset() command. For instance, if you want a wordcloud of just Obama’s two inaugural addresses, you would need to subset the corpus first:
obama_corpus <- corpus_subset(inaugural_corpus, President == "Obama")
obama_tokens <- tokens(obama_corpus)
obama_dfm <- dfm(obama_tokens)
textplot_wordcloud(obama_dfm)
- Try producing the same plot as above, but without the stopwords and without punctuation.
Hint: To remove stopwords, use dfm_remove(). To remove punctuation, pass remove_punct = TRUE to the tokens() function.
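A sketch combining the hints above with the subsetting code from this section:

```r
obama_corpus <- corpus_subset(inaugural_corpus, President == "Obama")
obama_tokens <- tokens(obama_corpus, remove_punct = TRUE)
obama_dfm <- dfm_remove(dfm(obama_tokens), stopwords("english"))
textplot_wordcloud(obama_dfm)
```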
1.6 Pipes
In the lecture we learned about the %>%
“pipe” operator, which allows us to chain together different functions so that the output of one function gets passed directly as input to another function. We can use these pipes to simplify our code and to make it somewhat easier to read.
For instance, we could join together the corpus-construction and tokenisation steps that we did separately above using a pipe:
inaugural_tokens <- inaugural %>% # Take the original data object
corpus(text_field = "text") %>% # ...convert to a corpus
tokens(remove_punct = TRUE) #...and then tokenize
inaugural_tokens[1]
Tokens consisting of 1 document and 4 docvars.
text1 :
[1] "Fellow-Citizens" "of" "the" "Senate"
[5] "and" "of" "the" "House"
[9] "of" "Representatives" "Among" "the"
[ ... and 1,418 more ]
- Write some code using the %>% operator that does the following: a) creates a corpus; b) tokenizes the texts; c) creates a dfm; d) removes stopwords; and e) reports the top features of the resulting dfm.
inaugural %>% # Take the original data object
corpus(text_field = "text") %>% # ...convert to a corpus
tokens(remove_punct = TRUE) %>% #... tokenize
dfm() %>% #...convert to a dfm
dfm_remove(pattern = stopwords("english")) %>% # ...remove stopwords
topfeatures() # ...report top features
people government us can must upon great
584 564 505 487 376 371 344
may states world
343 334 319
1.6.1 Descriptive statistics
- Use the ntoken() and ntype() functions on the inaugural_corpus object. Create a plot showing the relationship between the quantities that you calculate from these functions.
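The code for this step is not preserved in this copy. A sketch, applying the functions to the tokens object created earlier (which avoids re-tokenizing the corpus):

```r
n_tokens <- ntoken(inaugural_tokens)
n_types <- ntype(inaugural_tokens)
plot(n_tokens, n_types,
     xlab = "Number of tokens", ylab = "Number of types")
```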
- One simple measure of lexical diversity is the type-to-token ratio: \(\frac{\text{N types}}{\text{N tokens}}\). Calculate this ratio for each of the inaugural addresses using the objects that you created in the question above. Plot this measure against the Year variable associated with each of the texts.
Hint: Remember that you can access the metadata associated with each of the texts in the corpus using the docvars() function.
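A sketch, assuming the objects from the previous question:

```r
ttr <- ntype(inaugural_tokens) / ntoken(inaugural_tokens)
plot(docvars(inaugural_corpus)$Year, ttr,
     xlab = "Year", ylab = "Type-to-token ratio")
```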
1.7 Key-Words-In-Context
quanteda
provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Use the kwic()
function (for “keywords-in-context”) to explore how a specific word or phrase is used in this corpus (use the word-based tokenization that you implemented above). You can look at the help file (?kwic
) to see the arguments that the function takes.
- Use the kwic() function to see how the word “terror” is used.
Hint: By default, kwic() gives exact matches for a given pattern. What if we wanted to see words like “terrorism” and “terrorist” rather than exactly “terror”? We can use the wildcard character * to expand our search by appending it to the end of the pattern we are using to search. For example, we could use "terror*".
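The code for this step is not preserved in this copy; a sketch, assuming the tokens object is called inaugural_tokens:

```r
kwic(inaugural_tokens, pattern = "terror")
# To also match "terrorism", "terrorist", etc.:
kwic(inaugural_tokens, pattern = "terror*")
```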
Keyword-in-context with 8 matches.
 [text3, 1190] or violence by | terror | intrigue or venality
 [text37, 99] nameless unreasoning unjustified | terror | which paralyzes needed
 [text39, 258] by a fatalistic | terror | we proved that
 [text44, 761] uncertain balance of | terror | that stays the
 [text49, 700] Americans from the | terror | of runaway living
 [text53, 921] the fanaticism of | terror | And they torment
 [text53, 1454] strong defense against | terror | and destruction Our
 [text56, 1442] aims by inducing | terror | and slaughtering innocents
- Try substituting your own search terms to the kwic() function.
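Given the output below, the hidden code was presumably something like:

```r
kwic(inaugural_tokens, pattern = "america")
kwic(inaugural_tokens, pattern = "democracy")
```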
Keyword-in-context with 6 matches.
[text2, 59] people of united | America | Previous to the
[text3, 14] middle course for | America | remained between unlimited
[text3, 385] the people of | America | were not abandoned
[text3, 1272] the people of | America | have exhibited to
[text3, 1791] aboriginal nations of | America | and a disposition
[text3, 1929] the people of | America | and the internal
Keyword-in-context with 6 matches.
 [text10, 1424] a confederated representative | democracy | were a government
 [text14, 497] to that of | democracy | If such is
 [text14, 1474] a simple representative | democracy | or republic and
 [text14, 6900] the name of | democracy | they speak warning
 [text14, 7289] of devotion to | democracy | The foregoing remarks
 [text34, 970] temple of representative | democracy | to be not
1.8 Homework
The file sdg_goals_targets.csv contains the text of the UN Sustainable Development Goals and Targets that we saw in the lecture. Load this data into your R session, create a corpus using the long_description variable, tokenize the corpus, and create a dfm. Use the dfm to create a wordcloud.
Upload your plot to this Moodle page.
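One possible approach, sketched with the pipe operator from section 1.6 (the file path and object names here are assumptions, and you should experiment with the options yourself):

```r
sdg <- read.csv("sdg_goals_targets.csv")

sdg_dfm <- sdg %>%
  corpus(text_field = "long_description") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  dfm_remove(stopwords("english"))

textplot_wordcloud(sdg_dfm)
```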