3.1 Lecture slides
3.2 Classifying NHS reviews
How satisfied are citizens with the healthcare services their governments provide? One interesting source of data on this question comes in the form of text, as users of healthcare services are often able to provide comments and feedback on their experiences of government-provided healthcare. In this seminar, we will use Naive Bayes models to predict whether patient reviews of doctors’ surgeries in the UK are positive or negative. To do so, we will use a corpus of patient reviews of the UK’s National Health Service (NHS) facilities near where they live.
3.2.1 Packages
You will need to load the following packages before beginning the assignment:
library(tidyverse)
library(quanteda)
# Run the following code if you cannot load the quanteda.textmodels package
# ONLY WORKS FOR WINDOWS USERS; MAC USERS SHOULD FOLLOW THE INSTRUCTIONS IN SECTION 3.2.2 BELOW
# remotes::install_github("quanteda/quanteda.textmodels")
library(quanteda.textmodels)
# Run the following code if you cannot load the caret package
# install.packages("caret")
library(caret)
3.2.2 Instructions for Mac users having problems loading quanteda.textmodels
For Mac users, there is currently an issue with installing the quanteda.textmodels package using either the install.packages() function or the remotes::install_github() function. Instead, you will need to use the following (somewhat laborious) workaround.
Install gfortran on your computer, finding the appropriate version at the following link: https://github.com/fxcoudert/gfortran-for-macOS/releases
Run these lines in R:
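These lines create the .R folder and an empty Makevars file in your home directory, which you will edit in the next step. Something along the following lines should work (a sketch; your exact code may differ):

# Create the ~/.R folder (if it does not already exist) and an empty Makevars file inside it
dir.create("~/.R", showWarnings = FALSE)
file.create("~/.R/Makevars")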
- Locate the Makevars file. To locate the relevant file, go to your home directory via Finder (you can open Finder and then press ⇧⌘H). Then, if you cannot see the .R folder, you first need to press ⇧⌘. (shift-command-full stop), which will reveal the hidden files on your Mac. Locate the .R folder, open it, and inside should be the Makevars file. Once you have done that, you need to copy-paste these lines into that file and save it:
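The lines you paste in tell R where to find the gfortran compiler you installed in the first step. Assuming gfortran was installed to its default location of /usr/local/gfortran, they will typically look something like the entries below; if you prefer, you can write the same lines from R rather than editing the hidden file by hand. The paths here are an assumption and may need adjusting for your installation.

# Write typical gfortran settings to ~/.R/Makevars
# (the /usr/local/gfortran paths assume the default install location)
writeLines(c("FC = /usr/local/gfortran/bin/gfortran",
             "F77 = /usr/local/gfortran/bin/gfortran",
             "FLIBS = -L/usr/local/gfortran/lib"),
           con = "~/.R/Makevars")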
Now restart R
Then run this:
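With gfortran configured, you should now be able to install the package from source. Something like the following should work (a hedged sketch; the exact install command may differ):

# Install quanteda.textmodels from source using the compiler settings above
install.packages("quanteda.textmodels", type = "source")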
3.2.3 Data
We will use one data source for today’s assignment.
- NHS Patient Reviews – nhs_reviews.Rdata
This file contains 2000 reviews of NHS doctors’ surgeries across the UK. The data contains the following variables:
Variable | Description
---|---
review_title | The title of the patient’s review
review_text | The text of the patient’s review
star_rating | The star rating (out of five) that the patient gave
review_positive | A categorical indicator equal to "Positive" if the patient gave 3 stars or more in their review, and "Negative" if they gave 1 or 2 stars
review_date | The date of the review
gp_response | A categorical variable which measures whether the doctors’ surgery provided a response to the patient’s review ("Responded") or has not yet provided a response ("Has not responded")
Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following command:
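# Load the data (this creates the nhs_reviews object); adjust the path to wherever you saved the file
load(file = "nhs_reviews.Rdata")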
As ever, you can take a quick look at the variables in the data by using the glimpse()
function from the tidyverse
package:
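# Quick overview of the variables in the data
glimpse(nhs_reviews)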
Rows: 2,000
Columns: 6
$ review_title <chr> "Great service", "Friendly helpful staff", "Excellent …
$ review_text <chr> "Phoned up to book a appointment receptionist very hel…
$ star_rating <dbl> 4, 5, 5, 1, 1, 5, 5, 5, 5, 1, 3, 5, 2, 5, 2, 5, 1, 1, …
$ review_positive <fct> Positive, Positive, Positive, Negative, Negative, Posi…
$ review_date <date> 2021-10-13, 2021-07-26, 2021-09-18, 2021-06-19, 2021-…
$ gp_response <chr> "Has not responded", "Responded", "Has not responded",…
3.2.4 Random number generation
This seminar requires you to randomly generate a training set and a test set. Any procedure which requires random number generation can cause the results you get to fluctuate each time you run the code. In this case, rerunning the code will result in different observations being allocated into the test and training sets, for example.
To make the results fully replicable, you will therefore first need to use the set.seed()
function, which takes as an argument a single integer value. Once you have specified a seed, R will generate the same sequence of random numbers every time you execute your code.
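# Set the seed so that the random training/test split below is replicable
set.seed(12345)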
I have used 12345
here, but you can choose any value you like. If you choose a different number, then your results will be somewhat different to mine. However, so long as you execute this line of code whenever you run your R scripts, you will get the same results each time!
3.3 Training and Test Sets
- Create a new variable in the nhs_reviews data that indicates whether an observation is a training-set or test-set observation. Ensure that approximately 80% of your observations are in the training set and 20% are in the test set. Label this variable as train.
Reveal code
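One way to do this (a sketch consistent with the explanation below; your own code may differ slightly):

# Randomly assign roughly 80% of observations to the training set (TRUE)
# and 20% to the test set (FALSE)
nhs_reviews$train <- sample(x = c(TRUE, FALSE),
                            size = nrow(nhs_reviews),
                            replace = TRUE,
                            prob = c(0.8, 0.2))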
The sample() function randomly samples from the elements provided to the x argument. Here, we are sampling from a vector which has only two elements: TRUE and FALSE. The size argument specifies the number of items to choose at random from x, and here we are asking to sample the same number of elements as there are observations in the nhs_reviews data. The replace argument specifies whether we want to sample from x with or without replacement (we want to sample with replacement here), and the prob argument takes a vector which gives the probability with which we would like to sample each value of x.
- Convert the
nhs_reviews
data into a corpus and then into a DFM. Make some feature selection decisions.
Reveal code
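A sketch of one possible approach, assuming the review text is stored in the review_text column and using the default English stop-word list:

# Convert the data frame into a quanteda corpus, using the review text as the document text
nhs_reviews_corpus <- corpus(nhs_reviews, text_field = "review_text")

# Tokenise, remove English stop words, and convert to a document-feature matrix
nhs_reviews_dfm <- nhs_reviews_corpus %>%
  tokens() %>%
  tokens_remove(stopwords("en")) %>%
  dfm()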
Remember that there is no single “right” way to make the feature selection decisions. Here, I have simply used a unigram representation and have removed stop words. You may make different decisions!
- Use the dfm_subset() function to subset your DFM into a training DFM and a test DFM. Use the dim() function to make sure that they have the numbers of rows and columns that you expect.
Reveal code
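Using the object names from above:

# Subset the DFM into training and test DFMs using the train variable
nhs_reviews_dfm_train <- dfm_subset(nhs_reviews_dfm, train)
nhs_reviews_dfm_test <- dfm_subset(nhs_reviews_dfm, !train)

# Check the dimensions of each
dim(nhs_reviews_dfm_train)
dim(nhs_reviews_dfm_test)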
Here I use the logical operator !, which reverses the TRUE and FALSE values of the train vector. So, the elements of train that are TRUE become FALSE when preceded by !, and the elements that are FALSE become TRUE.
Our DFMs contain different numbers of documents (as they should, given that we specified above that we wanted 80% of our data to be training data and 20% to be test data) but they contain the same number of features (again, as they should, since we will want to create predictions for our test-set observations).
3.4 Dictionaries
- Construct a simple dictionary using the dictionary() function which includes some words that you think are likely to capture positive sentiment in the patient reviews. Apply the dictionary to the training data DFM that you created above and create a binary variable which is equal to "Positive" when the relevant text includes a word from your dictionary and "Negative" otherwise.
Reveal code
## Create the dictionary
positive_dictionary <- dictionary(list(positive = c("great", "excellent", "fantastic", "thank")))
# Apply the dictionary to the DFM
nhs_dictionary_dfm_train <- dfm_lookup(x = nhs_reviews_dfm_train,
dictionary = positive_dictionary)
# Create a logical (TRUE/FALSE) vector with one element for each training set text
predicted_positive <- as.numeric(nhs_dictionary_dfm_train[,1]) > 0
# Convert the logical vector to a character vector, with TRUE entries equal to "Positive" and FALSE entries equal to "Negative", and assign it as a document variable on the dictionary DFM (which contains the training-set documents)
nhs_dictionary_dfm_train$predicted_positive_dictionary <- ifelse(predicted_positive, "Positive", "Negative")
- Use the table() function to create a confusion matrix which compares the predicted positive reviews (from your dictionary measure) to the true coding for positive reviews (which is stored in the review_positive variable). Save the output of the table() function as a new object.
Reveal code
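A sketch, using the dictionary-based predictions created above:

# Cross-tabulate the dictionary-based predictions against the true labels
confusion_dictionary <- table(predicted_classification = nhs_dictionary_dfm_train$predicted_positive_dictionary,
                              true_classification = nhs_dictionary_dfm_train$review_positive)
confusion_dictionary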
- Use the confusionMatrix() function from the caret package to calculate a variety of performance statistics about your dictionary classifier.1 What is the accuracy of the predictions? What is the sensitivity? What is the specificity? Interpret these numbers. What do they tell us about the performance of the dictionary for classifying positive patient reviews?
1 You should also set the positive
argument equal to "Positive"
to tell R the level of the outcome that corresponds to a “positive” result.
Reveal code
confusion_dictionary_statistics <- confusionMatrix(confusion_dictionary,
positive = "Positive")
confusion_dictionary_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 542 569
Positive 36 421
Accuracy : 0.6142
95% CI : (0.5895, 0.6383)
No Information Rate : 0.6314
P-Value [Acc > NIR] : 0.9247
Kappa : 0.3045
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.4253
Specificity : 0.9377
Pos Pred Value : 0.9212
Neg Pred Value : 0.4878
Prevalence : 0.6314
Detection Rate : 0.2685
Detection Prevalence : 0.2915
Balanced Accuracy : 0.6815
'Positive' Class : Positive
Our dictionary produces predictions which have an accuracy of 61%. Although the specificity is high – 94% – this is because the dictionary is classifying the vast majority of reviews as not-positive, and so is correctly assigning that label to the true not-positive reviews. The sensitivity is much lower, at only 43%. That is, we are correctly classifying less than half of the truly positive reviews. This makes sense because the dictionary we are using above is very short and so is likely to be missing a lot of words that would indicate positivity.
We could try to improve these scores by changing the list of dictionary words and then trying again, but instead we will now use a Naive Bayes classification model to learn which words are most strongly associated with positive and negative patient reviews.
3.5 Naive Bayes
- Use the training DFM that you created earlier to estimate a Naive Bayes model for the review_positive outcome using the textmodel_nb() function. This function takes three main arguments:
Argument | Description
---|---
x | The DFM on which the Naive Bayes model is trained.
y | An outcome vector which we are trying to predict (here, review_positive).
prior | The form of prior distribution for the categories of the outcome, y. Most typically we will set this to be "docfreq", which means that the prior probability for each category will be equal to the relative proportions of those classes in the training data.
Reveal code
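Putting these together:

# Train the Naive Bayes model on the training DFM
nb_train <- textmodel_nb(x = nhs_reviews_dfm_train,
                         y = nhs_reviews_dfm_train$review_positive,
                         prior = "docfreq")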
- Examine the class-conditional word probabilities from your estimated Naive Bayes model.2 You can access these by using the
coef()
function on the estimated model object. Which words have the highest probabilities of occurring under each class?3
2 You will find the sort()
function helpful here.
3 Note that your list of words might look different to the list here if you used different feature selection decisions at the beginning of the assignment. In particular, if you did not remove stopwords then you will find that both classes assign high probability to those words (as they should, given that stopwords are indeed common across all patient review texts).
Reveal code
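One way to do this, assuming that coef() returns a matrix of word probabilities with one column per class ("Negative" and "Positive"):

# Extract the class-conditional word probabilities
nb_word_probs <- coef(nb_train)

# The 20 most probable words in positive reviews...
head(sort(nb_word_probs[, "Positive"], decreasing = TRUE), 20)

# ...and the 20 most probable words in negative reviews
head(sort(nb_word_probs[, "Negative"], decreasing = TRUE), 20)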
. , staff practice surgery appointment
0.075829687 0.037050475 0.009988262 0.009241276 0.008921140 0.007619251
always gp service ! time helpful
0.006061253 0.006061253 0.005997225 0.005890513 0.005762459 0.005719774
thank doctor care well doctors friendly
0.005463664 0.004652652 0.004545940 0.004481912 0.004183118 0.004076406
reception get
0.003670900 0.003606872
. , appointment get call surgery
0.067243057 0.032460640 0.012294357 0.011874226 0.009021758 0.008667964
! told phone practice time gp
0.008093048 0.007584468 0.007473908 0.006147178 0.006036618 0.005771272
doctor day back one can just
0.004908898 0.004555103 0.004333982 0.004245533 0.004134973 0.004134973
appointments ?
0.004112860 0.004046524
These words largely make sense. In both positive and negative reviews, patients are likely to use language related to healthcare – “staff”, “practice”, “surgery”, “appointment”, “doctor”, etc. – but the language model for the “Positive” category also puts higher probabilities on many positively-valenced words such as “thank”, “helpful”, and “friendly”. By contrast, the “Negative” category puts high probabilities on words that are likely to be linked to unsatisfactory service, such as “told”, “phone”, “time”, “day”, and “!”.
- Estimate the predicted probability of a positive review from the Naive Bayes model. Use these probabilities to examine the
review_title
variable: which patient review titles are most likely to be positive reviews? Which are most likely to be negative? Do these make sense?
Reveal code
nhs_reviews_dfm_train$positive_nb_probability <- predict(nb_train, type = "probability")[,2]
nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = T)[1:10]]
[1] "Good GP Practice" "Efficient and compassionate service"
[3] "Fantastic practice" "Caring and professional service"
[5] "Nurse was amazing" "Fantastic Practice"
[7] "Consistently good care" "Exemplary Practice"
[9] "Fantastic practice" "Fantastic Practice"
nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = F)[1:10]]
[1] "Shambles"
[2] "Unacceptable practice"
[3] "Impossible to get a service"
[4] "Unprofessional, Poor and Dangerous"
[5] "Impossible to get an appointment!!!"
[6] "Honestly, you’re better off not having a GP"
[7] "Disappointed with appointments."
[8] "Not so great practice!!"
[9] "Very bad practice and arrogant receptionists"
[10] "Extremely rude receptionist"
Yes! It is very clear from these lists that the model is doing a reasonably good job of distinguishing between positive and negative reviews. To find out how good a job it is doing, we need to do more than a face validity check.
- Use your fitted Naive Bayes model to predict the category (positive or negative) for each observation in the training data. Then use the table() function to create a confusion matrix which compares the predicted positive reviews to the true coding for positive reviews (which is stored in the review_positive variable). As with the dictionary analysis, save the output of the table() function as a new object.
Reveal code
## Training set accuracy
nhs_reviews_dfm_train$predicted_classification_nb <- predict(nb_train, type = "class")
confusion_train <- table(predicted_classification = nhs_reviews_dfm_train$predicted_classification_nb,
true_classification = nhs_reviews_dfm_train$review_positive)
confusion_train
true_classification
predicted_classification Negative Positive
Negative 565 59
Positive 13 931
- Calculate the accuracy, sensitivity and specificity of your classifier using the
confusionMatrix()
function. Compare these scores to those that you obtained from the dictionary analysis. Which performs best?
Reveal code
confusion_train_statistics <- confusionMatrix(confusion_train,
positive = "Positive")
confusion_train_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 565 59
Positive 13 931
Accuracy : 0.9541
95% CI : (0.9425, 0.9639)
No Information Rate : 0.6314
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.903
Mcnemar's Test P-Value : 1.137e-07
Sensitivity : 0.9404
Specificity : 0.9775
Pos Pred Value : 0.9862
Neg Pred Value : 0.9054
Prevalence : 0.6314
Detection Rate : 0.5938
Detection Prevalence : 0.6020
Balanced Accuracy : 0.9590
'Positive' Class : Positive
The Naive Bayes model clearly outperforms the dictionary analysis. The sensitivity of the classifier is now 94%, and the accuracy is 95%. We are doing a much better job at detecting sentiment in the training set using the Naive Bayes approach.
- Repeat the assessment of classifier accuracy but now for the test set observations. How does the classifier perform on these data compared to its performance on the training data?
Reveal code
## Test set accuracy
nhs_reviews_dfm_test$predicted_classification_nb <- predict(nb_train,
newdata = nhs_reviews_dfm_test,
type = "class")
confusion_test <- table(predicted_classification = nhs_reviews_dfm_test$predicted_classification_nb,
true_classification = nhs_reviews_dfm_test$review_positive)
confusion_test_statistics <- confusionMatrix(confusion_test,
positive = "Positive")
confusion_test_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 170 35
Positive 6 207
Accuracy : 0.9019
95% CI : (0.8693, 0.9287)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8032
Mcnemar's Test P-Value : 1.226e-05
Sensitivity : 0.8554
Specificity : 0.9659
Pos Pred Value : 0.9718
Neg Pred Value : 0.8293
Prevalence : 0.5789
Detection Rate : 0.4952
Detection Prevalence : 0.5096
Balanced Accuracy : 0.9106
'Positive' Class : Positive
The accuracy on the test set is 90%, with a sensitivity of 86% and a specificity of 97%. The performance on the test set is slightly worse than on the training set, though it is still substantially better than the dictionary analysis.
3.6 Homework
Create two new DFMs from the nhs_reviews data, each time making different feature selection decisions (e.g. trimming, n-grams, stopword removal, etc.). Train two new Naive Bayes models using these DFMs and compare their performance on the test-set data to the performance of the model that you created above. Create a table comparing the results of the models you have estimated. Report the feature selection decisions you made for each model. Which performs best in terms of accuracy, sensitivity and specificity?
Upload your table to this Moodle page.
Reveal code
## No punctuation DFM
# Convert to a DFM
nhs_reviews_dfm_nopunct <- nhs_reviews_corpus %>%
tokens(remove_punct = T) %>%
dfm()
# Subset to training set observations
nhs_reviews_dfm_nopunct_train <- dfm_subset(nhs_reviews_dfm_nopunct, train)
# Subset to test set observations
nhs_reviews_dfm_nopunct_test <- dfm_subset(nhs_reviews_dfm_nopunct, !train)
# Train a new Naive Bayes model
nb_train_nopunct <- textmodel_nb(x = nhs_reviews_dfm_nopunct_train,
y = nhs_reviews_dfm_nopunct_train$review_positive,
prior = "docfreq")
## Test set accuracy
nhs_reviews_dfm_nopunct_test$predicted_classification <- predict(nb_train_nopunct,
newdata = nhs_reviews_dfm_nopunct_test,
type = "class")
confusion_test_nopunct <- table(predicted_classification = nhs_reviews_dfm_nopunct_test$predicted_classification,
true_classification = nhs_reviews_dfm_nopunct_test$review_positive)
confusion_test_nopunct_statistics <- confusionMatrix(confusion_test_nopunct, positive = "Positive")
confusion_test_nopunct_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 169 34
Positive 7 208
Accuracy : 0.9019
95% CI : (0.8693, 0.9287)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8029
Mcnemar's Test P-Value : 4.896e-05
Sensitivity : 0.8595
Specificity : 0.9602
Pos Pred Value : 0.9674
Neg Pred Value : 0.8325
Prevalence : 0.5789
Detection Rate : 0.4976
Detection Prevalence : 0.5144
Balanced Accuracy : 0.9099
'Positive' Class : Positive
## N-gram model
# Convert to a DFM
nhs_reviews_dfm_ngram <- nhs_reviews_corpus %>%
tokens() %>%
tokens_ngrams(1:3) %>%
dfm()
# Subset to training set observations
nhs_reviews_dfm_ngram_train <- dfm_subset(nhs_reviews_dfm_ngram, train)
# Subset to test set observations
nhs_reviews_dfm_ngram_test <- dfm_subset(nhs_reviews_dfm_ngram, !train)
# Train a new Naive Bayes model
nb_train_ngram <- textmodel_nb(x = nhs_reviews_dfm_ngram_train,
y = nhs_reviews_dfm_ngram_train$review_positive,
prior = "docfreq")
## Test set accuracy
nhs_reviews_dfm_ngram_test$predicted_classification <- predict(nb_train_ngram,
newdata = nhs_reviews_dfm_ngram_test,
type = "class")
confusion_test_ngram <- table(predicted_classification = nhs_reviews_dfm_ngram_test$predicted_classification,
true_classification = nhs_reviews_dfm_ngram_test$review_positive)
confusion_test_ngram_statistics <- confusionMatrix(confusion_test_ngram, positive = "Positive")
confusion_test_ngram_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 169 35
Positive 7 207
Accuracy : 0.8995
95% CI : (0.8666, 0.9266)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7983
Mcnemar's Test P-Value : 3.097e-05
Sensitivity : 0.8554
Specificity : 0.9602
Pos Pred Value : 0.9673
Neg Pred Value : 0.8284
Prevalence : 0.5789
Detection Rate : 0.4952
Detection Prevalence : 0.5120
Balanced Accuracy : 0.9078
'Positive' Class : Positive
accuracy <- c(confusion_test_statistics$overall[1],
confusion_test_nopunct_statistics$overall[1],
confusion_test_ngram_statistics$overall[1])
sensitivity <- c(confusion_test_statistics$byClass[1],
confusion_test_nopunct_statistics$byClass[1],
confusion_test_ngram_statistics$byClass[1])
specificity <- c(confusion_test_statistics$byClass[2],
confusion_test_nopunct_statistics$byClass[2],
confusion_test_ngram_statistics$byClass[2])
model_names <- c("Unigram", "No punctuation", "N-gram")
results <- data.frame(model_names,
accuracy,
sensitivity,
specificity)
results
model_names accuracy sensitivity specificity
1 Unigram 0.9019139 0.8553719 0.9659091
2 No punctuation 0.9019139 0.8595041 0.9602273
3 N-gram 0.8995215 0.8553719 0.9602273
There is very little difference between the different feature selection approaches. This is largely because predicting positivity is a very straightforward task! The words indicating positivity are very clear, and so we get a strong signal from this data. Note that this will not always be the case, especially when we are trying to classify a more nuanced concept or a set of more fine-grained categories.
- (Hard question) Cross-validation
In this question you should implement k-fold cross-validation to assess the accuracy, sensitivity, and specificity of one of the Naive Bayes models that you estimated above.
I have provided some starter code below for a function that will take a vector of logical values as input, which should be named held_out. That vector should take on TRUE values for observations in the held-out set and FALSE values for observations in the training set for each fold of the cross-validation. Your job is to fill in the "Your code goes here!" sections of the code. To do so, you can essentially copy the material on slide 57 of this week’s lecture, making modifications to the existing code where appropriate.
get_performance_scores <- function(held_out){
# Set up train and test sets for this fold
dfm_train <- dfm_subset(nhs_reviews_dfm, !held_out)
dfm_test <- dfm_subset(nhs_reviews_dfm, held_out)
# Train model on everything except held-out fold
# Your code goes here!
# Predict for held-out fold
# Your code goes here!
# Calculate accuracy, specificity, sensitivity
# Your code goes here!
# Save results in a data.frame
return(output)
}
get_performance_scores <- function(held_out){
# Set up train and test sets for this fold
dfm_train <- dfm_subset(nhs_reviews_dfm, !held_out)
dfm_test <- dfm_subset(nhs_reviews_dfm, held_out)
# Train model on everything except held-out fold
nb_train <- textmodel_nb(x = dfm_train,
y = dfm_train$review_positive,
prior = "docfreq")
# Predict for held-out fold
dfm_test$predicted_classification <- predict(nb_train,
newdata = dfm_test,
type = "class")
# Calculate accuracy, specificity, sensitivity
confusion_nb <- table(predicted_classification = dfm_test$predicted_classification,
true_classification = dfm_test$review_positive)
confusion_nb_statistics <- confusionMatrix(confusion_nb, positive = "Positive")
accuracy <- confusion_nb_statistics$overall[1]
sensitivity <- confusion_nb_statistics$byClass[1]
specificity <- confusion_nb_statistics$byClass[2]
return(data.frame(accuracy, sensitivity, specificity))
}
- Once you have completed this function, you will then need to create a vector to represent the \(k\) folds on which you are going to test the data in the cross-validation. To do so, you will need to use the sample() function to produce a vector that has the same number of elements as there are rows in the nhs_reviews_dfm object. Again, you can see how to do this in the lecture slides.
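A sketch, assuming five folds (the number of folds, k, is your choice):

# Number of cross-validation folds (an arbitrary choice for illustration)
k <- 5

# Randomly assign each document in the DFM to one of the k folds
folds <- sample(1:k, size = ndoc(nhs_reviews_dfm), replace = TRUE)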
- Finally, you will need to apply the get_performance_scores() function to each fold. To do so, you may find the lapply() function helpful. This function takes a vector as an input, applies a function to that vector, and returns a list object which includes the results of the function for each value of the vector that you input. Again, see the lecture for example code.
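Continuing the sketch above:

# Apply the function once per fold, holding out a different fold each time
cv_output <- lapply(1:k, function(i) get_performance_scores(held_out = folds == i))

# Combine the per-fold results and average them across folds
cv_results <- do.call(rbind, cv_output)
colMeans(cv_results)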
- Report your result!