5  Designing Research Projects and Collecting Text Data

5.1 Using an API to Collect Text Data

An Application Programming Interface (API) is often the most convenient way to collect well-structured data from the web. Here we work through an example of retrieving data from the Guardian newspaper's API with the guardianapi package and analysing it with quanteda.

5.2 Packages

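You will need to load the following packages before beginning the assignment:

library(guardianapi)
library(quanteda)
library(dplyr) # assumed here: provides %>%, glimpse() and tibble(), which the code below relies on

# Run the following code if you cannot load the guardianapi package:
# devtools::install_github("evanodell/guardianapi")
# Or, alternatively:
# install.packages("guardianapi")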

5.2.1 Authentication

In theory,[1] the Guardian API requires only minimal effort to access. To use it, you need to register for a developer API key, which you can do by visiting this webpage. You will need to select the “Register developer key” option on that page:

  • [1] In my experience, the interfaces of both the APIs themselves and the R packages designed to access them are not terribly stable. This is an indirect way of saying: if this code doesn’t work for you, and you cannot access the API in class today, don’t blame me! Instead, think of it as a lesson in the potential dangers you might face by using tools like this in your research process. Kevin Munger has a nice cautionary tale about this here.

  • Once you have clicked that button, you will be taken to a page where you can add details of your project.

    Fill in the details on that form as follows:

    You should receive an email asking you to confirm your UCL email address, and then a second email which contains your API key. It will be a long string like "g257t1jk-df09-4a0c-8ae5-101010d94e428". Make sure you save this key somewhere!

    You can then authenticate with the API by using the gu_api_key() function:

    gu_api_key()

    When you run that function, you will see the following message appear in your R console.

    Please enter your API key and press enter:

    Paste the API key that you received via email into the console and you should see the following message:

    Updating gu.API.key session variable...

    You should now be able to use the API functions that are available in the guardianapi package! We will cover some of these functions below.
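Pasting the key into the console at the start of every session gets tedious, and a key typed into a script is easy to share by accident. One common alternative, sketched here as an assumption rather than something the course requires, is to keep the key in an environment variable (for example in your .Renviron file) and read it with base R. The variable name GU_API_KEY and the helper get_guardian_key() are illustrative choices. The guardianapi documentation describes a check_env argument to gu_api_key() that looks for a variable of this name, but check your installed version before relying on it.

```r
# Hypothetical helper (not part of guardianapi): read the API key from an
# environment variable instead of hard-coding it in a script.
get_guardian_key <- function(var = "GU_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No key found: add ", var, "=<your key> to your .Renviron file and restart R.")
  }
  key
}

# Illustration only: set a placeholder value for the current session
Sys.setenv(GU_API_KEY = "not-a-real-key")
nchar(get_guardian_key())
```

The point of the stop() call is that a missing key fails loudly at the start of the session, rather than as a confusing error from the API later on.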

    5.2.2 Retrieve some newspaper articles

    We will start by using the gu_content() function to retrieve some data from the API. This function takes a number of arguments, some of the more important ones are listed in the table below:

    Arguments to the gu_content() function.

    Argument | Description
    query | A string containing the search query. Today, you can choose a simple query which will retrieve any newspaper article published in the Guardian that contains that term.
    from_date | The start date of our search. This argument should be a character string of the form "YYYY-MM-DD". We will use "2021-01-01" today, so that we gather articles published on or after 1st January 2021.
    to_date | The end date of our search. We will use "2021-12-31", so as to collect articles up to 31st December 2021.
    production_office | The Guardian operates in several countries and this argument allows us to specify which version of the Guardian we would like to collect data from. We will set this to "UK" so that we collect news stories published in the UK.

    1. Execute the gu_content() function using the arguments as specified in the table above. There are two very important things to remember about this step:

      1. Don’t forget to save the output of this function to an object!
      2. Don’t run this function more times than you need to (hopefully just once). Each time you run the function you are making repeated calls to the Guardian API and if you use it too many times you will exceed your rate limit for the day and will have to wait until tomorrow for it to reset.
    # You should only run this function once so as to not repeatedly make calls to the API
    gu_out <- gu_content(query = "YOUR_SEARCH_TERM_GOES_HERE", 
                         from_date = "2021-01-01", 
                         to_date = "2021-12-31",
                         production_office = "UK")

    I used the term "china" for the query argument, but you can select whatever search term you like.

    1. Save the object you have created.
    save(gu_out, file = "gu_out.Rdata")

    You can then load the data file (if you need to) using the load() function as usual:

    load(file = "gu_out.Rdata")
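The advice to call the API only once can also be enforced in code. The following is a minimal sketch of a "fetch once, then reuse" pattern. The helper name fetch_once() is hypothetical, and it uses saveRDS()/readRDS() rather than the save()/load() pair shown above, which is a swapped-in choice so that the function can return the object directly.

```r
# Hypothetical helper: run `fetch()` only if no cached copy exists on disk.
fetch_once <- function(path, fetch) {
  if (file.exists(path)) {
    readRDS(path)          # reuse the saved result, no API call
  } else {
    out <- fetch()         # the only place the (expensive) call happens
    saveRDS(out, path)
    out
  }
}

# Intended use with the API call above (not run here):
# gu_out <- fetch_once("gu_out.rds", function() {
#   gu_content(query = "china", from_date = "2021-01-01",
#              to_date = "2021-12-31", production_office = "UK")
# })

# Offline demonstration with a stand-in for the API call
demo_path <- tempfile(fileext = ".rds")
first  <- fetch_once(demo_path, function() data.frame(x = 1:3))
second <- fetch_once(demo_path, function() stop("should not be called"))
identical(first, second)
```

Re-running the script then costs nothing against your rate limit. Delete the cached file if you want to repeat the query with different arguments.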
    1. Inspect the output.
    dplyr::glimpse(gu_out)
    Rows: 5,009
    Columns: 46
    $ id                               <chr> "media/2021/dec/21/china-deletes-soci…
    $ type                             <chr> "article", "article", "article", "art…
    $ section_id                       <chr> "media", "books", "world", "world", "…
    $ section_name                     <chr> "Media", "Books", "World news", "Worl…
    $ web_publication_date             <dttm> 2021-12-21 01:00:00, 2021-11-28 01:0…
    $ web_title                        <chr> "China deletes social media accounts …
    $ web_url                          <chr> "https://www.theguardian.com/media/20…
    $ api_url                          <chr> "https://content.guardianapis.com/med…
    $ tags                             <list> [<data.frame[7 x 12]>], [<data.frame…
    $ is_hosted                        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA…
    $ pillar_id                        <chr> "pillar/news", "pillar/arts", "pillar…
    $ pillar_name                      <chr> "News", "Arts", "News", "News", "News…
    $ headline                         <chr> "China deletes social media accounts …
    $ standfirst                       <chr> "Huang Wei, known by username Viya, w…
    $ trail_text                       <chr> "Huang Wei, known by username Viya, w…
    $ byline                           <chr> "Vincent Ni", "Isabel Hilton", "Vince…
    $ main                             <chr> "<figure class=\"element element-imag…
    $ body                             <chr> "<p>China has deleted social media ac…
    $ wordcount                        <chr> "442", "1376", "455", "300", "995", "…
    $ first_publication_date           <dttm> 2021-12-21 01:00:00, 2021-11-28 01:0…
    $ is_inappropriate_for_sponsorship <chr> "false", "false", "false", "false", "…
    $ is_premoderated                  <chr> "false", "false", "false", "false", "…
    $ last_modified                    <chr> "2021-12-21T13:44:51Z", "2021-11-28T0…
    $ production_office                <chr> "UK", "UK", "UK", "UK", "AUS", "AUS",…
    $ publication                      <chr> "theguardian.com", "The Observer", "T…
    $ short_url                        <chr> "https://www.theguardian.com/p/k3tpd"…
    $ should_hide_adverts              <chr> "false", "false", "false", "false", "…
    $ show_in_related_content          <chr> "true", "true", "true", "true", "true…
    $ thumbnail                        <chr> "https://media.guim.co.uk/8f16b195e33…
    $ legally_sensitive                <chr> "false", "false", "false", "false", "…
    $ lang                             <chr> "en", "en", "en", "en", "en", "en", "…
    $ is_live                          <chr> "true", "true", "true", "true", "true…
    $ body_text                        <chr> "China has deleted social media accou…
    $ char_count                       <chr> "2790", "8338", "2779", "1877", "6403…
    $ should_hide_reader_revenue       <chr> "false", "false", "false", "false", "…
    $ show_affiliate_links             <chr> "false", "false", "false", "false", "…
    $ byline_html                      <chr> "<a href=\"profile/vincent-ni\">Vince…
    $ newspaper_page_number            <chr> NA, "43", "37", "30", "29", NA, NA, N…
    $ newspaper_edition_date           <date> NA, 2021-11-28, 2021-11-24, 2021-12-…
    $ sensitive                        <chr> NA, "true", NA, NA, NA, NA, NA, NA, N…
    $ comment_close_date               <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
    $ commentable                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ display_hint                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ live_blogging_now                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ star_rating                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ contributor_bio                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

    We have successfully retrieved 5,009 articles published in 2021 that contain the term "china".

    We get a lot of information about these articles! In addition to the text of the article (body_text), we also get the title of each article (web_title), the date of publication (first_publication_date), the section of the newspaper in which the article appeared (section_name), the author of the article (byline), as well as many other pieces of potentially useful metadata.
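Note that several of the columns in the output above, such as wordcount, arrive as character strings rather than numbers. Here is a small sketch of how the metadata might be put to work. The data frame toy_out and its values are invented stand-ins for gu_out; only the column names are taken from the output above.

```r
# Toy stand-in for gu_out with three of the columns shown above (values invented)
toy_out <- data.frame(
  section_name = c("World news", "Books", "World news"),
  wordcount = c("442", "1376", "455"),
  web_publication_date = as.POSIXct(
    c("2021-12-21 01:00:00", "2021-11-28 01:00:00", "2021-11-24 01:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# wordcount is stored as character, so convert it before doing arithmetic
toy_out$wordcount <- as.numeric(toy_out$wordcount)

# Number of articles per section
table(toy_out$section_name)

# Mean article length by section
tapply(toy_out$wordcount, toy_out$section_name, mean)

# Number of articles per month
table(format(toy_out$web_publication_date, "%Y-%m"))
```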

    5.2.3 Analyse the Guardian Texts

    1. Convert the texts of these articles (use the body_text variable) into a quanteda dfm object. How many features are in your dfm?
    # Convert to corpus
    gu_corpus <- corpus(gu_out, text_field = "body_text")
    # Tokenize
    gu_tokens <- gu_corpus %>% 
      tokens(remove_punct = TRUE,
             remove_symbols = TRUE,
             remove_url = TRUE)
    # Convert to DFM
    gu_dfm <- gu_tokens %>% 
      dfm() %>%
      dfm_remove(stopwords("en")) %>%
      dfm_trim(max_docfreq = .9,
               docfreq_type = "prop",
               min_termfreq = 10)
    1. What are the most common words in this corpus of articles?
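The last two questions (how many features, and which words are most common) can be answered with quanteda's nfeat() and topfeatures() functions. The sketch below uses three invented one-line documents in place of the Guardian articles, so the object names toy_texts and toy_dfm are illustrative. With your own data you would apply the same two calls to gu_dfm.

```r
library(quanteda)

# Three invented documents standing in for the article texts
toy_texts <- c(d1 = "China deletes social media accounts",
               d2 = "China trade talks resume",
               d3 = "China trade grows")

# Same pipeline as above, without the trimming step (the toy corpus is too small for it)
toy_dfm <- dfm(tokens(toy_texts, remove_punct = TRUE))
toy_dfm <- dfm_remove(toy_dfm, stopwords("en"))

# Number of features (columns) in the dfm
nfeat(toy_dfm)

# The most frequent features, in descending order of frequency
topfeatures(toy_dfm, n = 3)
```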
    1. Using the methods we covered in the first four weeks of the course, find out something interesting about the data you have collected. Produce at least one plot and write a paragraph of text explaining what you have discovered.

    5.2.4 Homework

    Write a paragraph which describes a potential research project for this course. The paragraph should include the research question you intend to answer, the data that you would require to answer that question, and the method that you would apply for that purpose. You should indicate if you have already located the data needed, and how you intend to collect it for use in your project (can you just download it? Do you need to use an API? Do you need to do some web-scraping?). Upload your paragraph to this Moodle page.

    5.3 Web-Scraping

    Warning: Collecting data from the web (“web scraping”) is usually really annoying. There is no single function that will give you the information you need from a webpage. Instead, you must carefully and painfully write code that will give you what you want. If that sounds OK, then continue on with this problem set. If it doesn’t, stop here, and do something else.

    5.3.1 Packages

    You will need to load the following libraries to complete this part of the assignment (you may need to use install.packages() first):

    1. rvest is a nice package which helps you to scrape information from web pages.

    2. xml2 is a package which includes functions that can make it (somewhat) easier to navigate through html data that is loaded into R.

    library(rvest)
    library(xml2)

    5.3.2 Overview

    Throughout this course, the modal structure of a problem set has been that I give you a nice, clean, rectangular data.frame or tibble, which you use for the application of some fancy method. Here, I am going to walk through an example of getting the horrible, messy, and oddly-shaped data from a webpage, and turning it into a data.frame or tibble that is usable.

    Since no two websites are the same, web scraping requires you to identify the relevant parts of the html code that lies behind websites. The goal here is to parse the HTML into usable data. Generally speaking, there are three main steps for webscraping:

    1. Access a web page from R
    2. Tell R where to “look” on the page
    3. Manipulate the data in a usable format within R.
    4. (We don’t speak about step 4 so much, but it normally includes smacking your head against your desk, wondering where things went wrong and generally questioning all your life choices. But we won’t dwell on that here.)
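The first three steps can be sketched in miniature before we turn to a real page. The example below is an illustration under assumptions: instead of downloading anything it uses rvest's minimal_html() to build a tiny page from a string, and the table contents and the class name "members" are invented.

```r
library(rvest)

# Step 1: access a page. A real scrape would use read_html() on a URL;
# minimal_html() builds a small page from a string so that this runs offline.
page <- minimal_html('
  <p>Some text we do not care about</p>
  <table class="members">
    <tr><th>Name</th><th>Party</th></tr>
    <tr><td>Member A</td><td>Red</td></tr>
    <tr><td>Member B</td><td>Blue</td></tr>
  </table>')

# Step 2: tell R where to look (here, the table with class "members")
tbl_node <- html_element(page, "table.members")

# Step 3: turn what we found into a rectangular object
members <- html_table(tbl_node)
members
```

Real pages are rarely this tidy, which is why most of the effort in what follows goes into step 2.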

    We are going to set ourselves a typical data science-type task in which we are going to scrape some data about politicians from their wiki pages. In particular, our task is to establish which universities were most popular amongst the crop of UK MPs who served in the House of Commons between 2017 and 2019. It is often useful to define in advance what the exact goal of the data collection task is. For us, we would like to finish with a data.frame or tibble that consists of one observation for each MP, and two variables: the MP’s name, and where they went to university.

    5.3.3 Step 1: Scrape a list of current MPs

    First, we need to know which MPs were in parliament in this period. A bit of googling shows that this wiki page gives us what we need. Scroll down a little, and you will see that there is a table where each row is an MP. It looks like this:

    The nice thing about this is that an html table like this should be reasonably easy to work with. We will need to be able to work with the underlying html code of the wiki page in what follows, so you will need to be able to see the source code of the website. If you don’t know how to look at the source code, follow the relevant instructions on this page for the browser that you are using.

    When you have figured that out, you should be able to see something that looks a bit like this:

    As you can see, html is horrible to look at. In R, we can read in the html code by using the read_html function from the rvest package:

    # Read in the raw html code of the wiki page
    mps_list_page <- read_html("https://en.wikipedia.org/wiki/List_of_United_Kingdom_MPs_by_seniority_(2017–2019)")

    read_html returns an XML document (to check, try running class(mps_list_page)), which makes navigating the different parts of the website (somewhat) easier.

    Now that we have the html code in R, we need to find the parts of the webpage that contain the table. Scroll through the source code that you should have open in your browser to see if you can find the parts of the code that contain the table we are interested in.

    On line 1154, you should see something like <table class="wikitable collapsible sortable" style="text-align: center; font-size: 85%; line-height: 14px;">. This marks the beginning of the table that we are interested in, and we can ask rvest to extract that table from our mps_list_page object by using the html_elements function.

    # Extract table of MPs
    mp_table <- html_elements(mps_list_page, 
                              css = "table[class='wikitable collapsible sortable']")

    Here, the string we pass to the css argument tells rvest that we would like to grab the table from the object mps_list_page that has the class wikitable collapsible sortable. The object we have created (mp_table) is itself an XML object, which is good, because we will need to navigate through that table to get the information we need.
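It is worth knowing that the selector used above matches the class attribute as one exact string. The following sketch, on an invented two-table page built with minimal_html(), contrasts that with selecting on individual classes, which is an alternative and not what the code above does.

```r
library(rvest)

# An invented page with two tables whose class attributes differ slightly
page <- minimal_html('
  <table class="wikitable collapsible sortable"><tr><td>MPs</td></tr></table>
  <table class="wikitable sortable"><tr><td>Something else</td></tr></table>')

# Exact match on the whole class attribute string
exact <- html_elements(page, css = "table[class='wikitable collapsible sortable']")
length(exact)

# Match on individual classes: any table that has both classes, in any order
by_class <- html_elements(page, css = "table.wikitable.sortable")
length(by_class)
```

The exact-string form is stricter, which helps when a page contains several similar tables, but it will silently return nothing if the page's authors reorder or rename the classes.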

    Now, within that table, we would like to extract two pieces of information for each MP: their name, and the link to their own individual Wikipedia page. Looking back at the html source code, you should be able to see that each MP’s entry in the table is contained within its own separate <span> tag, and the information we are after is further nested within an <a> tag. For example, line 1250 includes the following:

    Yes, Bottomley is a funny name.

    We would like to extract all of these entries from the table, and we can do so by again using html_elements and using the appropriate css expression, which here is "span a", because the information we want is included in the a tag which itself is nested within the span tag.

    # Extract MP names and urls
    mp_table_entries <- html_elements(mp_table, "span a")
    {xml_nodeset (655)}
     [1] <a href="/wiki/Kenneth_Clarke" title="Kenneth Clarke">Kenneth Clarke</a>
     [2] <a href="/wiki/Dennis_Skinner" title="Dennis Skinner">Dennis Skinner</a>
     [3] <a href="/wiki/Peter_Bottomley" title="Peter Bottomley">Sir Peter Bottom ...
     [4] <a href="/wiki/Geoffrey_Robinson_(politician)" title="Geoffrey Robinson  ...
     [5] <a href="/wiki/Barry_Sheerman" title="Barry Sheerman">Barry Sheerman</a>
     [6] <a href="/wiki/Frank_Field_(British_politician)" title="Frank Field (Bri ...
     [7] <a href="/wiki/Harriet_Harman" title="Harriet Harman">Harriet Harman</a>
     [8] <a href="/wiki/Kevin_Barron" title="Kevin Barron">Sir Kevin Barron</a>
     [9] <a href="/wiki/Edward_Leigh" title="Edward Leigh">Sir Edward Leigh</a>
    [10] <a href="/wiki/Nick_Brown" title="Nick Brown">Nick Brown</a>
    [11] <a href="/wiki/Jeremy_Corbyn" title="Jeremy Corbyn">Jeremy Corbyn</a>
    [12] <a href="/wiki/David_Amess" title="David Amess">Sir David Amess</a>
    [13] <a href="/wiki/Roger_Gale" title="Roger Gale">Sir Roger Gale</a>
    [14] <a href="/wiki/Nicholas_Soames" title="Nicholas Soames">Sir Nicholas Soa ...
    [15] <a href="/wiki/Margaret_Beckett" title="Margaret Beckett">Dame Margaret  ...
    [16] <a href="/wiki/Bill_Cash" title="Bill Cash">Sir Bill Cash</a>
    [17] <a href="/wiki/Ann_Clwyd" title="Ann Clwyd">Ann Clwyd</a>
    [18] <a href="/wiki/Patrick_McLoughlin" title="Patrick McLoughlin">Sir Patric ...
    [19] <a href="/wiki/George_Howarth" title="George Howarth">Sir George Howarth ...
    [20] <a href="/wiki/John_Redwood" title="John Redwood">Sir John Redwood</a>

    Finally, now that we have the entry for each MP, it is very simple to extract the name of the MP and the URL to their wikipedia page:

    # html_text returns the text between the tags (here, the MPs' names)
    mp_names <- html_text(mp_table_entries) 
    # html_attr returns the named attribute of each tag. Here we ask for "href", which gives us the link to each MP's own wiki page
    mp_hrefs <- html_attr(mp_table_entries, 
                          name = "href") 
    # Combine into a tibble
    # (tibble() has no stringsAsFactors argument, so this simply adds a column of that name, as the output below shows)
    mps <- tibble(name = mp_names, url = mp_hrefs, university = NA, stringsAsFactors = FALSE)
    head(mps)
    # A tibble: 6 × 4
      name                url                            university stringsAsFactors
      <chr>               <chr>                          <lgl>      <lgl>           
    1 Kenneth Clarke      /wiki/Kenneth_Clarke           NA         FALSE           
    2 Dennis Skinner      /wiki/Dennis_Skinner           NA         FALSE           
    3 Sir Peter Bottomley /wiki/Peter_Bottomley          NA         FALSE           
    4 Geoffrey Robinson   /wiki/Geoffrey_Robinson_(poli… NA         FALSE           
    5 Barry Sheerman      /wiki/Barry_Sheerman           NA         FALSE           
    6 Frank Field         /wiki/Frank_Field_(British_po… NA         FALSE           

    OK, OK, so those urls are not quite complete. We need to attach “https://en.wikipedia.org” to the front of them first. We can do that using the paste0() function:

    mps$url <- paste0("https://en.wikipedia.org", mps$url)
    # A tibble: 6 × 4
      name                url                            university stringsAsFactors
      <chr>               <chr>                          <lgl>      <lgl>           
    1 Kenneth Clarke      https://en.wikipedia.org/wiki… NA         FALSE           
    2 Dennis Skinner      https://en.wikipedia.org/wiki… NA         FALSE           
    3 Sir Peter Bottomley https://en.wikipedia.org/wiki… NA         FALSE           
    4 Geoffrey Robinson   https://en.wikipedia.org/wiki… NA         FALSE           
    5 Barry Sheerman      https://en.wikipedia.org/wiki… NA         FALSE           
    6 Frank Field         https://en.wikipedia.org/wiki… NA         FALSE           

    That’s better. Though, wait, how many observations are there in our data.frame?

    dim(mps)

    [1] 655   4

    655? But there are only 650 MPs in the House of Commons! Oh, I know why, it’s because some MPs will have left/died/been caught in a scandal and therefore have been replaced…
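As an aside on the paste0() step above: gluing strings together works here because every link in the table starts with "/wiki/". A slightly more general option, shown as a sketch and not as what the code above does, is the url_absolute() function from xml2, which resolves a relative link against a base URL. The two hrefs are taken from the table above.

```r
library(xml2)

# Two of the relative links extracted from the table
hrefs <- c("/wiki/Kenneth_Clarke", "/wiki/Dennis_Skinner")

# Resolve each relative link against the site's base URL
full_urls <- url_absolute(hrefs, base = "https://en.wikipedia.org")
full_urls
```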

    Are you still here? Well done! We have something! We have…a list of MPs’ names! But we don’t have anything else. In particular, we still do not know where these people went to university. To find that, we have to move on to step 2.

    5.3.4 Step 2: Scrape the wiki page for each MP

    Let’s look at the page for the first MP in our list: https://en.wikipedia.org/wiki/Kenneth_Clarke. Scroll down the page, looking at the panel on the right-hand side. At the bottom of the panel, you will see this:

    The bottom line gives Clarke’s alma mater, which in this case is one of the Cambridge colleges. That is the information we are after. If we look at the html source code for this page, we can see that the alma mater line of the panel is enclosed in another <a> tag:

    Now that we know this, we can call in the html using read_html again:

    mp_text <- read_html(mps$url[1])

    And then we can use html_elements and html_text to extract the name of the university. Here we use a somewhat more complicated argument to find the information we are looking for. The xpath argument tells rvest to look for an a tag with a title of "Alma mater", and then to take the first a tag that comes after it. This is because the name of the university is actually stored in that subsequent a tag.

    mp_university <- html_elements(mp_text, 
                                   xpath = "//a[@title='Alma mater']/following::a[1]") %>%
      html_text()
    mp_university
    [1] "Gonville and Caius College, Cambridge"

    Regardless of whether you followed that last bit: it works! We now know where Kenneth Clarke went to university. Finally, we can assign the university that he went to to the mps tibble that we created earlier:

    mps$university[1] <- mp_university
    # A tibble: 6 × 4
      name                url                            university stringsAsFactors
      <chr>               <chr>                          <chr>      <lgl>           
    1 Kenneth Clarke      https://en.wikipedia.org/wiki… Gonville … FALSE           
    2 Dennis Skinner      https://en.wikipedia.org/wiki… <NA>       FALSE           
    3 Sir Peter Bottomley https://en.wikipedia.org/wiki… <NA>       FALSE           
    4 Geoffrey Robinson   https://en.wikipedia.org/wiki… <NA>       FALSE           
    5 Barry Sheerman      https://en.wikipedia.org/wiki… <NA>       FALSE           
    6 Frank Field         https://en.wikipedia.org/wiki… <NA>       FALSE           
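To see what the XPath expression used for Kenneth Clarke's page is doing, it can help to try it on a page small enough to read in full. The sketch below builds two invented infobox fragments with minimal_html(), one with an "Alma mater" link and one without. The college name and hrefs are made up for illustration.

```r
library(rvest)

# An invented infobox that records an alma mater
with_uni <- minimal_html('
  <table class="infobox">
    <tr><th><a href="/wiki/Alma_mater" title="Alma mater">Alma mater</a></th>
        <td><a href="/wiki/Example_College">Example College, Cambridge</a></td></tr>
  </table>')

# An invented infobox that does not
without_uni <- minimal_html('
  <table class="infobox"><tr><th>Born</th><td>1940</td></tr></table>')

# Find the "Alma mater" link, then take the first <a> tag that follows it
alma_xpath <- "//a[@title='Alma mater']/following::a[1]"

found   <- html_text(html_elements(with_uni, xpath = alma_xpath))
missing <- html_text(html_elements(without_uni, xpath = alma_xpath))

found
length(missing)
```

When the page has no matching tag, the result is a zero-length character vector and not NA.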

    5.3.5 Scraping exercises

    1. Figure out how to collect this university information for all of the other MPs in the data. You will need to write a for-loop, which iterates over the URLs in the data.frame we just constructed and pulls out the relevant information from each MP’s wiki page. You will find very quickly that web-scraping is a messy business, and your loop will probably fail. You might want to use the stop, next, try and if functions to help avoid problems.

    A for-loop is pretty easy to set up given the code provided above. We just need to loop over each row of the mps object, read in the html, find the university, and assign it to the relevant cell in the data.frame. E.g.

    for(i in 1:nrow(mps)){
      cat(".")
      mp_text <- read_html(mps$url[i])
      mp_university <- html_elements(mp_text, 
                                     xpath = "//a[@title='Alma mater']/following::a[1]") %>%
        html_text()
      mps$university[i] <- mp_university
    }

    Here, cat('.') is just a piece of convenience code that prints a dot to the console on every iteration of the loop. This helps us to see that the loop is still making progress and that R hasn’t crashed. It’s also quite satisfying to know that every time a dot appears, you have collected some new data.

    However, if you try running that code, you’ll see that it will cut out after a short while with an error.

    The main difficulty with this exercise is that there are essentially an infinite number of ways in which data scraping can go wrong. Here, the main problem is that some of the MPs do not actually have any information recorded in their wiki profiles about the university that they attended. Look at the page for Ronnie Campbell for example. Never went to university, but certainly looks like a happy chap.

    Because of that, we need to build in some code into the loop that says ‘OK, if you can’t find any information about this MP’s university, just code it as NA.’ I’ve added a line that does this to the loop.

    for(i in 1:nrow(mps)){
      cat(".")
      mp_text <- read_html(mps$url[i])
      mp_university <- xml_text(xml_find_all(mp_text, xpath = "//a[@title='Alma mater']/following::a[1]"))
      # If no university was found, record a missing value instead
      if(length(mp_university)==0) mp_university <- NA
      mps$university[i] <- mp_university
    }

    Now the loop runs without breaking! Hooray!

    (It is worth noting that this is a very simple example. In the typical web-scraping exercise, you should expect considerably more frustration than you have encountered here. :) Enjoy!)
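The exercise mentions try and next as tools for keeping a loop alive. A missing university is one failure mode; another is that a page fails to load at all. The following is a minimal sketch of handling both. The function fake_scrape() is a hypothetical stand-in for the read_html() and xpath steps, and the example.org URLs are invented, so that the loop can be run without touching the web.

```r
# Hypothetical stand-in for "read the page and extract the university":
# it errors for one URL and finds nothing for another.
fake_scrape <- function(url) {
  if (grepl("broken", url)) stop("HTTP error 404")
  if (grepl("none", url)) return(character(0))
  "Some University"
}

urls <- c("https://example.org/ok",
          "https://example.org/broken",
          "https://example.org/none")

# Start with missing values everywhere, then fill in what we manage to find
university <- rep(NA_character_, length(urls))

for (i in seq_along(urls)) {
  result <- try(fake_scrape(urls[i]), silent = TRUE)
  if (inherits(result, "try-error")) next   # page failed to load: leave as NA
  if (length(result) == 0) next             # nothing found: leave as NA
  university[i] <- result[1]
  # In a real scrape, a short pause here (e.g. Sys.sleep(1)) is a courtesy to the server
}

university
```

Because the results vector starts out as all NA, any iteration that is skipped simply leaves its entry missing, and the loop carries on with the next URL.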

    1. Which was the modal university for the current set of UK MPs?

    There are a number of ways of finding this out, for example:

    sort(table(mps$university), decreasing = TRUE)[1]
    London School of Economics 

    So, LSE is the most popular university for MPs? That seems…unlikely… And indeed it is. Remember the Kenneth Clarke example: wiki lists the college he attended in Cambridge, not just the university. Maybe lots of MPs went to Cambridge, but they all just went to different colleges? Let’s check:

    unique(grep("Cambridge", mps$university, value = TRUE))

     [1] "Gonville and Caius College, Cambridge"
     [2] "Trinity College, Cambridge"           
     [3] "Clare College, Cambridge"             
     [4] "Newnham College, Cambridge"           
     [5] "Sidney Sussex College, Cambridge"     
     [6] "Pembroke College, Cambridge"          
     [7] "Fitzwilliam College, Cambridge"       
     [8] "Corpus Christi College, Cambridge"    
     [9] "Emmanuel College, Cambridge"          
    [10] "Christ's College, Cambridge"          
    [11] "St John's College, Cambridge"         
    [12] "Jesus College, Cambridge"             
    [13] "Magdalene College, Cambridge"         
    [14] "Downing College, Cambridge"           
    [15] "Robinson College, Cambridge"          
    [16] "St Catharine's College,Cambridge"     
    [17] "King's College, Cambridge"            
    [18] "Girton College, Cambridge"            
    [19] "Queens' College, Cambridge"           
    [20] "Peterhouse, Cambridge"                
    [21] "University of Cambridge"              
    [22] "Corpus Christi College,Cambridge"     
    [23] "Pembroke College,Cambridge"           
    [24] "Trinity Hall, Cambridge"              
    [25] "Selwyn College, Cambridge"            

    Oh dear. Maybe it is the same for Oxford?

    unique(grep("Oxford", mps$university, value = TRUE))

     [1] "Lincoln College, Oxford"                 
     [2] "Magdalen College, Oxford"                
     [3] "St John's College, Oxford"               
     [4] "St Edmund Hall, Oxford"                  
     [5] "Balliol College, Oxford"                 
     [6] "University College, Oxford"              
     [7] "St Hugh's College, Oxford"               
     [8] "Pembroke College, Oxford"                
     [9] "New College, Oxford"                     
    [10] "Oxford Polytechnic"                      
    [11] "Exeter College, Oxford"                  
    [12] "Lady Margaret Hall, Oxford"              
    [13] "Brasenose College, Oxford"               
    [14] "Merton College, Oxford"                  
    [15] "Somerville College, Oxford"              
    [16] "St Hilda's College, Oxford"              
    [17] "Corpus Christi College, Oxford"          
    [18] "Keble College, Oxford"                   
    [19] "Jesus College, Oxford"                   
    [20] "Trinity College, Oxford"                 
    [21] "Mansfield College, Oxford"               
    [22] "St Benet's Hall, Oxford"                 
    [23] "Christ Church, Oxford"                   
    [24] "Hertford College, Oxford"                
    [25] "Oxford Brookes"                          
    [26] "University College, University of Oxford"
    [27] "Wadham College, Oxford"                  
    [28] "St Anne's College, Oxford"               
    [29] "Greyfriars, Oxford"                      
    [30] "Oriel College, Oxford"                   
    [31] "University of Oxford"                    
    [32] "St. Hilda's College, Oxford"             
    [33] "St Catherine's College, Oxford"          


    Right, so we need to do some recoding. Let’s create a new variable that we can use to simplify the universities coding:

    mps$university_new <- mps$university
    mps$university_new[grep("Cambridge",mps$university)] <- "Cambridge"
    mps$university_new[grep("Oxford",mps$university)] <- "Oxford"
    mps$university_new[grep("London School of Economics",mps$university)] <- "LSE"
    head(sort(table(mps$university_new), decreasing = T))
                     Oxford               Cambridge                     LSE 
                         85                      46                      17 
    University of Edinburgh      University of Hull   University of Glasgow 
                         12                      12                      11 

    Looks like the Oxbridge connection is still pretty strong!
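One caveat about the recoding above: the pattern "Oxford" also matches "Oxford Brookes" and "Oxford Polytechnic", both of which appear in the list of Oxford entries earlier, and which are separate institutions from the University of Oxford. A minimal sketch of a stricter recode follows. It assumes, based only on the entries listed above, that the collegiate entries all end in "Oxford" or "Cambridge". The vector universities is a small invented sample, not the scraped data.

```r
# A small sample of entries of the kind listed above (NA = no university found)
universities <- c("Trinity College, Cambridge",
                  "St Catharine's College,Cambridge",
                  "Balliol College, Oxford",
                  "University College, University of Oxford",
                  "Oxford Brookes",
                  "Oxford Polytechnic",
                  "London School of Economics",
                  NA)

recoded <- universities
# "$" anchors the match to the end of the string, so "Oxford Brookes" is left alone
recoded[grepl("Cambridge$", universities)] <- "Cambridge"
recoded[grepl("Oxford$", universities)] <- "Oxford"
recoded[grepl("London School of Economics", universities)] <- "LSE"

table(recoded, useNA = "ifany")
```

Whether this rule is right for the full data is something to check by eye, for example by listing the entries that match the looser pattern but not the stricter one.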

    1. Go back to the scraping code and see if you can add some more variables to the tibble. Can you scrape the MPs’ party affiliations? Can you scrape their date of birth? Doing so will require you to look carefully at the html source code, and work out the appropriate xpath expression to use. For guidance on xpath, see here.
    mps$university <- NA
    mps$party <- NA
    mps$birthday <- NA
    for(i in 1:nrow(mps)){
      mp_text <- read_html(mps$url[i])
      mp_university <- html_elements(mp_text, xpath = "//a[@title='Alma mater']/following::a[1]") %>% 
        html_text()
      mp_party <- html_elements(mp_text, xpath = "//tr/th[text()='Political party']/following::a[1]") %>% 
        html_text()
      mp_birthday <- html_elements(mp_text, xpath = "//span[@class='bday']") %>% 
        html_text()
      # Code anything we could not find as missing
      if(length(mp_university)==0) mp_university <- NA
      if(length(mp_party)==0) mp_party <- NA
      if(length(mp_birthday)==0) mp_birthday <- NA
      mps$university[i] <- mp_university
      mps$party[i] <- mp_party
      mps$birthday[i] <- mp_birthday
    }
    head(mps)
    # A tibble: 6 × 7
      name           url   university stringsAsFactors university_new party birthday
      <chr>          <chr> <chr>      <lgl>            <chr>          <chr> <chr>   
    1 Kenneth Clarke http… Gonville … FALSE            Cambridge      Cons… 1940-07…
    2 Dennis Skinner http… Ruskin Co… FALSE            Ruskin College Labo… 1932-02…
    3 Sir Peter Bot… http… Trinity C… FALSE            Cambridge      Cons… 1944-07…
    4 Geoffrey Robi… http… Clare Col… FALSE            Cambridge      Labo… 1938-05…
    5 Barry Sheerman http… London Sc… FALSE            LSE            Labo… 1940-08…
    6 Frank Field    http… Universit… FALSE            University of… cros… 1942-07…
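The birthday column in the output above is stored as character strings of the form "YYYY-MM-DD". Here is a short sketch of turning such strings into something you can calculate with. The data frame toy_mps and its values are invented, and the reference date is the day of the 2017 general election (8 June 2017).

```r
# Invented stand-in for the name and birthday columns of mps
toy_mps <- data.frame(name = c("MP A", "MP B", "MP C"),
                      birthday = c("1940-07-02", "1932-02-11", NA),
                      stringsAsFactors = FALSE)

# Convert the character strings to Date objects (NA stays NA)
toy_mps$birthday <- as.Date(toy_mps$birthday)

# Approximate age in whole years on the day of the 2017 general election
election_day <- as.Date("2017-06-08")
toy_mps$age <- floor(as.numeric(election_day - toy_mps$birthday) / 365.25)

toy_mps
```

Dividing by 365.25 is a common approximation. It can be off by a day around birthdays, which rarely matters for descriptive work like this.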
    1. If you got this far, well done! In your homework upload you should tell me how many MPs went to the LSE and I will shower you with praise.