3  Supervised Learning for Text

3.1 Classifying NHS reviews

How satisfied are citizens with the healthcare services their governments provide? One interesting source of data on this question comes in the form of text, as users of healthcare services are often able to provide comments and feedback on their experiences of accessing government-provided care. In this seminar, we will use Naive Bayes models to predict whether patient reviews of doctors’ surgeries in the UK are positive or negative. To do so, we will use a corpus of reviews that patients have left about the UK’s National Health Service (NHS) facilities near where they live.

3.1.1 Data

We will use one data source for today’s assignment.

  1. NHS Patient Reviews: nhs_reviews.Rdata

This file contains 2000 reviews of NHS doctors’ surgeries across the UK. The data contains the following variables:

Variables in the nhs_reviews data.
Variable Description
review_title The title of the patient’s review
review_text The text of the patient’s review
star_rating The star rating (out of five) that the patient gave
review_positive A categorical indicator equal to "Positive" if the patient gave 3 stars or more in their review, and "Negative" if they gave 1 or 2 stars
review_date The date of the review
gp_response A categorical variable which measures whether the doctors’ surgery provided a response to the patient’s review ("Responded") or has not yet provided a response ("Has not responded")

Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following command:

load("nhs_reviews.Rdata")

As ever, you can take a quick look at the variables in the data by using the glimpse() function from the tidyverse package:

glimpse(nhs_reviews)
Rows: 2,000
Columns: 6
$ review_title    <chr> "Great service", "Friendly helpful staff", "Excellent …
$ review_text     <chr> "Phoned up to book a appointment receptionist very hel…
$ star_rating     <dbl> 4, 5, 5, 1, 1, 5, 5, 5, 5, 1, 3, 5, 2, 5, 2, 5, 1, 1, …
$ review_positive <fct> Positive, Positive, Positive, Negative, Negative, Posi…
$ review_date     <date> 2021-10-13, 2021-07-26, 2021-09-18, 2021-06-19, 2021-…
$ gp_response     <chr> "Has not responded", "Responded", "Has not responded",…

3.1.2 Packages

You will need to load the following packages before beginning the assignment:

library(tidyverse)
library(quanteda)
# Run the following code if you cannot load the quanteda.textmodels package
# install.packages("quanteda/quanteda.textmodels")
library(quanteda.textmodels)
# Run the following code if you cannot load the caret package
# install.packages("caret")
library(caret)

3.1.3 Random number generation

This seminar requires you to randomly generate a training set and a test set. Any procedure that involves random number generation can cause your results to fluctuate each time you run the code. Here, for example, rerunning the code would allocate different observations to the training and test sets.

To make the results fully replicable, you will therefore first need to use the set.seed() function, which takes as an argument a single integer value. Once you have specified a seed, R will generate the same sequence of random numbers every time you execute your code.

set.seed(12345)

I have used 12345 here, but you can choose any value you like. If you choose a different number, then your results will be somewhat different to mine. However, so long as you execute this line of code whenever you run your R scripts, you will get the same results each time!
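
As a quick illustration of what the seed does, running the same sampling command after resetting the seed reproduces exactly the same draws (the particular numbers drawn are unimportant here):

set.seed(12345)
sample(1:10, size = 3)

set.seed(12345)
sample(1:10, size = 3) # identical to the three numbers drawn above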

3.2 Training and Test Sets

  1. Create a new variable in the nhs_reviews data that indicates whether an observation is a training set or test set observation. Ensure that approximately 80% of your observations are in the training set and 20% are in the test set. Label this variable train.
nhs_reviews$train <- sample(x = c(TRUE, FALSE), 
                            size = nrow(nhs_reviews), 
                            replace = TRUE, 
                            prob = c(.8, .2))

The sample() function randomly samples from the elements provided to the x argument. Here, we are sampling from a vector which has only two elements: TRUE and FALSE. The size argument specifies the number of items to choose at random from x, and here we are asking to sample the same number of elements as there are observations in nhs_reviews data. The replace argument specifies whether we want to sample from x with or without replacement (we want to sample with replacement here), and the prob argument takes a vector which gives the probability with which we would like to sample each value of x.
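
If you would like a quick check that the split is roughly 80/20 (the exact counts will vary with your seed), you can tabulate the new variable:

table(nhs_reviews$train)
prop.table(table(nhs_reviews$train))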

  2. Convert the nhs_reviews data into a corpus and then into a DFM. Make some feature selection decisions.
# Convert to a corpus
nhs_reviews_corpus <- corpus(nhs_reviews, text_field = "review_text")

# Convert to a DFM
nhs_reviews_dfm <- nhs_reviews_corpus %>% 
  tokens() %>%
  dfm() %>%
  dfm_remove(stopwords("en"))

Remember that there is no single “right” way to make the feature selection decisions. Here, I have simply used a unigram representation and have removed stop words. You may make different decisions!
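
As a purely illustrative sketch of an alternative set of choices (the object name nhs_reviews_dfm_alt is just for this example), you could remove punctuation, keep unigrams and bigrams, and trim very rare features:

# An alternative, illustrative set of feature selection decisions
nhs_reviews_dfm_alt <- nhs_reviews_corpus %>% 
  tokens(remove_punct = TRUE) %>% 
  tokens_remove(stopwords("en")) %>% 
  tokens_ngrams(n = 1:2) %>% 
  dfm() %>% 
  dfm_trim(min_docfreq = 5)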

  3. Use the dfm_subset() function to subset your DFM into a training DFM and a test DFM. Use the dim() function to make sure that they have the numbers of rows and columns that you expect.
# Subset to training set observations
nhs_reviews_dfm_train <- dfm_subset(nhs_reviews_dfm, train)

# Subset to test set observations
nhs_reviews_dfm_test <- dfm_subset(nhs_reviews_dfm, !train)

Here I use the logical operator ! which reverses the TRUE and FALSE values of the train vector. So, the elements of train that are TRUE become FALSE when preceded by ! and the elements that are FALSE become TRUE.
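
For example:

!c(TRUE, FALSE, TRUE) # returns FALSE TRUE FALSE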

dim(nhs_reviews_dfm_train)
[1] 1578 7906
dim(nhs_reviews_dfm_test)
[1]  422 7906

Our DFMs contain different numbers of documents (as they should, given that we specified above that we wanted 80% of our data to be training data and 20% to be test data) but they contain the same number of features (again, as they should, since we will want to create predictions for our test-set observations).
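
If you want to verify this directly, you can check that the two DFMs share exactly the same feature set:

identical(featnames(nhs_reviews_dfm_train), featnames(nhs_reviews_dfm_test)) # TRUE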

3.3 Dictionaries

  1. Construct a simple dictionary using the dictionary() function which includes some words that you think are likely to capture positive sentiment in the patient reviews. Apply the dictionary to the training data DFM that you created above and create a binary variable which is equal to Positive when the relevant text includes a word from your dictionary and Negative otherwise.
## Create the dictionary

positive_dictionary <- dictionary(list(positive = c("great", "excellent", "fantastic", "thank")))

# Apply the dictionary to the DFM
nhs_dictionary_dfm_train <- dfm_lookup(x = nhs_reviews_dfm_train,
                                      dictionary = positive_dictionary)

# Create a logical (TRUE/FALSE) vector with one element for each training set text
predicted_positive <- as.numeric(nhs_dictionary_dfm_train[,1]) > 0

# Convert the logical vector to a character vector, with TRUE entries equal to "Positive"
# and FALSE entries equal to "Negative", and assign it to the training set data
nhs_dictionary_dfm_train$predicted_positive_dictionary <- ifelse(predicted_positive, "Positive", "Negative")

  2. Use the table() function to create a confusion matrix which compares the predicted positive reviews (from your dictionary measure) to the true coding for positive reviews (which is stored in the review_positive variable). Save the output of the table() function as a new object.
confusion_dictionary <- table(predicted_classification = nhs_dictionary_dfm_train$predicted_positive_dictionary,
                              true_classification = nhs_dictionary_dfm_train$review_positive)

confusion_dictionary
                        true_classification
predicted_classification Negative Positive
                Negative      542      569
                Positive       36      421

  3. Use the confusionMatrix() function from the caret package to calculate a variety of performance statistics about your dictionary classifier.1 What is the accuracy of the predictions? What is the sensitivity? What is the specificity? Interpret these numbers. What do they tell us about the performance of the dictionary for classifying positive patient reviews?

1 You should also set the positive argument equal to "Positive" to tell R the level of the outcome that corresponds to a “positive” result.

confusion_dictionary_statistics <- confusionMatrix(confusion_dictionary, 
                                                   positive = "Positive")

confusion_dictionary_statistics
Confusion Matrix and Statistics

                        true_classification
predicted_classification Negative Positive
                Negative      542      569
                Positive       36      421
                                          
               Accuracy : 0.6142          
                 95% CI : (0.5895, 0.6383)
    No Information Rate : 0.6314          
    P-Value [Acc > NIR] : 0.9247          
                                          
                  Kappa : 0.3045          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.4253          
            Specificity : 0.9377          
         Pos Pred Value : 0.9212          
         Neg Pred Value : 0.4878          
             Prevalence : 0.6314          
         Detection Rate : 0.2685          
   Detection Prevalence : 0.2915          
      Balanced Accuracy : 0.6815          
                                          
       'Positive' Class : Positive        
                                          

Our dictionary produces predictions which have an accuracy of 61%. Although the specificity is high – 94% – this is because the dictionary classifies the vast majority of reviews as not-positive, and so correctly assigns that label to most of the truly not-positive reviews. The sensitivity is much lower, at only 43%. That is, we are correctly classifying fewer than half of the truly positive reviews. This makes sense because the dictionary we are using above is very short and so is likely to be missing many words that would indicate positivity.
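
If it helps to see where these numbers come from, you can reproduce them by hand from the confusion matrix (a quick check rather than anything you are required to do):

tp <- confusion_dictionary["Positive", "Positive"] # true positives
tn <- confusion_dictionary["Negative", "Negative"] # true negatives
fp <- confusion_dictionary["Positive", "Negative"] # false positives
fn <- confusion_dictionary["Negative", "Positive"] # false negatives

(tp + tn) / sum(confusion_dictionary) # accuracy, approximately 0.61
tp / (tp + fn)                        # sensitivity, approximately 0.43
tn / (tn + fp)                        # specificity, approximately 0.94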

We could try to improve these scores by changing the list of dictionary words and trying again, but instead we will now use a Naive Bayes classification model to learn which words are most strongly associated with positive and negative patient reviews.
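
If you did want to experiment further with the dictionary approach, one option – sketched here with a word list that is purely illustrative – would be to expand the set of positive terms and then repeat the dfm_lookup() and confusionMatrix() steps above:

positive_dictionary_extended <- dictionary(list(
  positive = c("great", "excellent", "fantastic", "thank", "helpful",
               "friendly", "caring", "amazing", "brilliant", "wonderful")))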

3.4 Naive Bayes

  1. Use the training DFM that you created earlier to estimate a Naive Bayes model for the review_positive outcome using the textmodel_nb() function. This function takes three main arguments:
Argument Description
x The DFM on which the Naive Bayes model is trained.
y An outcome vector which we are trying to predict (here, review_positive).
prior The form of prior distribution for the categories of the outcome, y. Most typically we will set this to be "docfreq" which means that the prior probability for each category will be equal to the relative proportions of those classes in the training data.
nb_train <- textmodel_nb(x = nhs_reviews_dfm_train, 
                         y = nhs_reviews_dfm_train$review_positive,
                         prior = "docfreq")

  2. Examine the class-conditional word probabilities from your estimated Naive Bayes model.2 You can access these by using the coef() function on the estimated model object. Which words have the highest probabilities of occurring under each class?3

2 You will find the sort() function helpful here.

3 Note that your list of words might look different to the list here if you used different feature selection decisions at the beginning of the assignment. In particular, if you did not remove stopwords then you will find that both classes have high probability of those words (as they should, given that they are indeed common across all patient review texts).

head(sort(coef(nb_train)[,"Positive"], decreasing = T), 20)
          .           ,       staff    practice     surgery appointment 
0.075730102 0.037033087 0.009983574 0.009215607 0.008916953 0.007594343 
     always          gp     service           !        time     helpful 
0.006058408 0.006037076 0.005994411 0.005887749 0.005759754 0.005717089 
      thank      doctor        care        well     doctors    friendly 
0.005461100 0.004650468 0.004543806 0.004479809 0.004181155 0.004074493 
  reception         get 
0.003669177 0.003605180 
head(sort(coef(nb_train)[,"Negative"], decreasing = T), 20)
           .            ,  appointment          get         call      surgery 
 0.067154285  0.032439121  0.012286207  0.011866354  0.009015778  0.008662218 
           !         told        phone     practice         time           gp 
 0.008087683  0.007579440  0.007468953  0.006143103  0.006032616  0.005767446 
      doctor          day         back          one          can         just 
 0.004905644  0.004552084  0.004331109  0.004242719  0.004132231  0.004132231 
appointments            ? 
 0.004088036  0.004043841 

These words largely make sense. In both positive and negative reviews, patients are likely to use language related to healthcare – “staff”, “practice”, “surgery”, “appointment”, “doctor”, etc – but the language model for the “Positive” category also puts higher probabilities on many positively valenced words such as “thank”, “helpful”, and “friendly”. By contrast, the “Negative” category puts high probabilities on words that are likely to be linked to unsatisfactory service, such as “told”, “phone”, “day”, and “!”.
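
One further way to explore this – a rough heuristic rather than a formal test – is to look at the ratio of the two class-conditional probabilities, which highlights the features that most strongly distinguish the classes:

nb_probs <- coef(nb_train)
prob_ratio <- nb_probs[, "Positive"] / nb_probs[, "Negative"]

head(sort(prob_ratio, decreasing = TRUE), 10)  # most distinctively "Positive" features
head(sort(prob_ratio, decreasing = FALSE), 10) # most distinctively "Negative" features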

  3. Estimate the predicted probability of a positive review from the Naive Bayes model. Use these probabilities to examine the review_title variable: which patient review titles are most likely to be positive reviews? Which are most likely to be negative? Do these make sense?
nhs_reviews_dfm_train$positive_nb_probability <- predict(nb_train, type = "probability")[,2]

nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = T)[1:10]]
 [1] "Good GP Practice"                    "Efficient and compassionate service"
 [3] "Fantastic practice"                  "Caring and professional service"    
 [5] "Nurse was amazing"                   "Fantastic Practice"                 
 [7] "Consistently good care"              "Exemplary Practice"                 
 [9] "Fantastic practice"                  "Fantastic Practice"                 
nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = F)[1:10]]
 [1] "Shambles"                                    
 [2] "Unacceptable practice"                       
 [3] "Impossible to get a service"                 
 [4] "Unprofessional, Poor and Dangerous"          
 [5] "Impossible to get an appointment!!!"         
 [6] "Honestly, you’re better off not having a GP" 
 [7] "Disappointed with appointments."             
 [8] "Not so great practice!!"                     
 [9] "Very bad practice and arrogant receptionists"
[10] "Extremely rude receptionist"                 

Yes! It is very clear from these lists that the model is doing a reasonably good job of distinguishing between positive and negative reviews. To find out how good a job it is doing, we need to do more than a face validity check.

  4. Use your fitted Naive Bayes model to predict the category (positive or negative) for each observation in the training data. Then use the table() function to create a confusion matrix which compares the predicted positive reviews to the true coding for positive reviews (which is stored in the review_positive variable). As with the dictionary analysis, save the output of the table() function as a new object.
## Training set accuracy

nhs_reviews_dfm_train$predicted_classification_nb <- predict(nb_train, type = "class")

confusion_train <- table(predicted_classification = nhs_reviews_dfm_train$predicted_classification_nb, 
                         true_classification = nhs_reviews_dfm_train$review_positive)

confusion_train
                        true_classification
predicted_classification Negative Positive
                Negative      565       58
                Positive       13      932

  5. Calculate the accuracy, sensitivity and specificity of your classifier using the confusionMatrix() function. Compare these scores to those that you obtained from the dictionary analysis. Which performs best?
confusion_train_statistics <- confusionMatrix(confusion_train, 
                                              positive = "Positive")

confusion_train_statistics
Confusion Matrix and Statistics

                        true_classification
predicted_classification Negative Positive
                Negative      565       58
                Positive       13      932
                                          
               Accuracy : 0.9547          
                 95% CI : (0.9432, 0.9645)
    No Information Rate : 0.6314          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9043          
                                          
 Mcnemar's Test P-Value : 1.772e-07       
                                          
            Sensitivity : 0.9414          
            Specificity : 0.9775          
         Pos Pred Value : 0.9862          
         Neg Pred Value : 0.9069          
             Prevalence : 0.6314          
         Detection Rate : 0.5944          
   Detection Prevalence : 0.6027          
      Balanced Accuracy : 0.9595          
                                          
       'Positive' Class : Positive        
                                          

The Naive Bayes model clearly outperforms the dictionary analysis. The sensitivity of the classifier is now 94%, and the accuracy is 95%. We are doing a much better job at detecting sentiment in the training set using the Naive Bayes approach.

  6. Repeat the assessment of classifier accuracy but now for the test set observations. How does the classifier perform on these data compared to its performance on the training data?
## Test set accuracy

nhs_reviews_dfm_test$predicted_classification_nb <- predict(nb_train, 
                                                  newdata = nhs_reviews_dfm_test, 
                                                  type = "class")

confusion_test <- table(predicted_classification = nhs_reviews_dfm_test$predicted_classification_nb, 
                        true_classification = nhs_reviews_dfm_test$review_positive)

confusion_test_statistics <- confusionMatrix(confusion_test, 
                                             positive = "Positive")

confusion_test_statistics
Confusion Matrix and Statistics

                        true_classification
predicted_classification Negative Positive
                Negative      170       35
                Positive        6      207
                                          
               Accuracy : 0.9019          
                 95% CI : (0.8693, 0.9287)
    No Information Rate : 0.5789          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8032          
                                          
 Mcnemar's Test P-Value : 1.226e-05       
                                          
            Sensitivity : 0.8554          
            Specificity : 0.9659          
         Pos Pred Value : 0.9718          
         Neg Pred Value : 0.8293          
             Prevalence : 0.5789          
         Detection Rate : 0.4952          
   Detection Prevalence : 0.5096          
      Balanced Accuracy : 0.9106          
                                          
       'Positive' Class : Positive        
                                          

The accuracy on the test set is 90%, with a sensitivity of 86% and a specificity of 97%. The performance on the test set is a fraction worse than on the training set, though it is still significantly better than the dictionary analysis.
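
One convenient way to put these numbers side by side – useful as a template for the homework table below, though the exact layout is up to you – is to extract the relevant statistics from the confusionMatrix() objects:

model_comparison <- data.frame(
  model = c("Dictionary (train)", "Naive Bayes (train)", "Naive Bayes (test)"),
  accuracy = c(confusion_dictionary_statistics$overall["Accuracy"],
               confusion_train_statistics$overall["Accuracy"],
               confusion_test_statistics$overall["Accuracy"]),
  sensitivity = c(confusion_dictionary_statistics$byClass["Sensitivity"],
                  confusion_train_statistics$byClass["Sensitivity"],
                  confusion_test_statistics$byClass["Sensitivity"]),
  specificity = c(confusion_dictionary_statistics$byClass["Specificity"],
                  confusion_train_statistics$byClass["Specificity"],
                  confusion_test_statistics$byClass["Specificity"]))

model_comparison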

3.5 Homework

  1. Create two new DFMs from the nhs_reviews data, each time making different feature selection decisions (e.g. trimming, n-grams, stopword removal). Train two new Naive Bayes models using these DFMs and compare their performance on the test set data to the performance of the model that you created above.

    Create a table comparing the results of the models you have estimated. Report the feature selection decisions you made for each model. Which performs best in terms of accuracy, sensitivity and specificity?

Upload your table to this Moodle page.

  2. (Hard question) Cross-validation

In this question you should implement k-fold cross-validation to assess the accuracy, sensitivity, and specificity of one of the Naive Bayes models that you estimated above.

  a. I have provided some starter code below for a function that will take a vector of logical values as input, which should be named held_out. That vector should take on TRUE values for observations in the held-out set and FALSE for observations in the training set for each fold of the cross-validation.

    Your job is to fill in the Your code goes here! sections of the code. To do so, you can essentially copy the material on slide 58 of this week’s lecture, adapting it to the current task where appropriate.

get_performance_scores <- function(held_out){
  
  # Set up train and test sets for this fold
  dfm_train <- dfm_subset(nhs_reviews_dfm, !held_out)
  dfm_test <- dfm_subset(nhs_reviews_dfm, held_out)
  
  # Train model on everything except held-out fold
  
      # Your code goes here!
  
  # Predict for held-out fold
  
      # Your code goes here!
  
  # Calculate accuracy, specificity, sensitivity
  
      # Your code goes here!

  # Save results in a data.frame
  
      # Your code goes here!
  
  return(output)
  
}
  b. Once you have completed this function, you will then need to create a vector to represent the \(k\) folds on which you are going to test the data in the cross-validation. To do so, you will need to use the sample() function to produce a vector that has the same number of elements as there are rows in the nhs_reviews_dfm object. Again, you can see how to do this in the lecture slides (there is also a sketch after this list).
  c. Finally, you will need to apply the get_performance_scores() function to each fold. To do so, you may find the lapply() function helpful. This function takes a vector as an input, applies a function to that vector, and returns a list object which includes the results of the function for each value of the vector that you input. Again, see the lecture for example code.
  d. Report your result!
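
As a minimal sketch of parts (b) and (c) – assuming five folds, and assuming that the data frame returned by your completed get_performance_scores() function contains only numeric columns – the pieces fit together roughly as follows:

# Assign each document in the DFM to one of k = 5 folds at random
k <- 5
folds <- sample(1:k, size = ndoc(nhs_reviews_dfm), replace = TRUE)

# Apply the function once per fold (folds == i is TRUE for the held-out documents)
cv_results <- lapply(1:k, function(i) get_performance_scores(folds == i))

# Stack the per-fold results and average over the folds
cv_results <- bind_rows(cv_results)
colMeans(cv_results)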