3.1 Lecture slides
3.2 Classifying NHS reviews
How satisfied are citizens with the healthcare services their governments provide? One interesting source of data on this question comes in the form of text, as users of healthcare services are often able to provide comments and feedback on their experiences of government-provided healthcare. In this seminar, we will use Naive Bayes models to predict whether patient reviews of doctors’ surgeries in the UK are positive or negative. To do so, we will use a corpus of patient reviews of the UK’s National Health Service (NHS) facilities near where they live.
3.2.1 Packages
You will need to load the following packages before beginning the assignment:
library(tidyverse)
library(quanteda)
# Run the following code if you cannot load the quanteda.textmodels package
# ONLY WORKS FOR WINDOWS USERS; MAC USERS SHOULD FOLLOW THE INSTRUCTIONS IN SECTION 3.2.2 BELOW
# remotes::install_github("quanteda/quanteda.textmodels")
library(quanteda.textmodels)
# Run the following code if you cannot load the caret package
# install.packages("caret")
library(caret)
3.2.2 Instructions for Mac users having problems loading quanteda.textmodels
For Mac users, there is currently an issue with installing the quanteda.textmodels package using either the install.packages() function or the remotes::install_github() function. Instead, you will need to use the following (somewhat laborious) workaround.
Install gfortran on your computer, finding the appropriate version at the following link: https://github.com/fxcoudert/gfortran-for-macOS/releases
Run these lines in R:
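These lines create the .R folder and an empty Makevars file in your home directory, which you will edit in the next step. Something along the following lines should work (a sketch; your exact code may differ):

# Create the ~/.R folder (if it does not already exist) and an empty Makevars file inside it
dir.create("~/.R", showWarnings = FALSE)
file.create("~/.R/Makevars")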
- Locate the Makevars file. To locate the relevant file, go to your home directory via Finder (you can open Finder and then press ⇧⌘H). Then, if you cannot see the .R folder, you first need to press ⇧⌘. (shift-command-full stop), which will reveal the hidden files on your Mac. Locate the .R folder, open it, and inside should be the Makevars file. Once you have done that, you need to copy-paste these lines into that file and save it:
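The lines you paste in tell R where to find the gfortran compiler you installed in the first step. Assuming gfortran was installed to its default location of /usr/local/gfortran, they will typically look something like the entries below; if you prefer, you can write the same lines from R rather than editing the hidden file by hand. The paths here are an assumption and may need adjusting for your installation.

# Write typical gfortran settings to ~/.R/Makevars
# (the /usr/local/gfortran paths assume the default install location)
writeLines(c("FC = /usr/local/gfortran/bin/gfortran",
             "F77 = /usr/local/gfortran/bin/gfortran",
             "FLIBS = -L/usr/local/gfortran/lib"),
           con = "~/.R/Makevars")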
Now restart R
Then run this:
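With gfortran configured, you should now be able to install the package from source. Something like the following should work (a hedged sketch; the exact install command may differ):

# Install quanteda.textmodels from source using the compiler settings above
install.packages("quanteda.textmodels", type = "source")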
3.2.3 Data
We will use one data source for today’s assignment.
- NHS Patient Reviews – nhs_reviews.Rdata
This file contains 2000 reviews of NHS doctors’ surgeries across the UK. The data contains the following variables:
Variable | Description
---|---
review_title | The title of the patient’s review
review_text | The text of the patient’s review
star_rating | The star rating (out of five) that the patient gave
review_positive | A categorical indicator equal to "Positive" if the patient gave 3 stars or more in their review, and "Negative" if they gave 1 or 2 stars
review_date | The date of the review
gp_response | A categorical variable which measures whether the doctors’ surgery provided a response to the patient’s review ("Responded") or has not yet provided a response ("Has not responded")
Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following command:
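# Load the data (this creates the nhs_reviews object); adjust the path to wherever you saved the file
load(file = "nhs_reviews.Rdata")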
As ever, you can take a quick look at the variables in the data by using the glimpse()
function from the tidyverse
package:
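# Quick overview of the variables in the data
glimpse(nhs_reviews)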
Rows: 2,000
Columns: 6
$ review_title <chr> "Great service", "Friendly helpful staff", "Excellent …
$ review_text <chr> "Phoned up to book a appointment receptionist very hel…
$ star_rating <dbl> 4, 5, 5, 1, 1, 5, 5, 5, 5, 1, 3, 5, 2, 5, 2, 5, 1, 1, …
$ review_positive <fct> Positive, Positive, Positive, Negative, Negative, Posi…
$ review_date <date> 2021-10-13, 2021-07-26, 2021-09-18, 2021-06-19, 2021-…
$ gp_response <chr> "Has not responded", "Responded", "Has not responded",…
3.2.4 Random number generation
This seminar requires you to randomly generate a training set and a test set. Any procedure which requires random number generation can cause the results you get to fluctuate each time you run the code. In this case, rerunning the code will result in different observations being allocated into the test and training sets, for example.
To make the results fully replicable, you will therefore first need to use the set.seed()
function, which takes as an argument a single integer value. Once you have specified a seed, R will generate the same sequence of random numbers every time you execute your code.
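# Set the seed so that the random training/test split below is replicable
set.seed(12345)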
I have used 12345
here, but you can choose any value you like. If you choose a different number, then your results will be somewhat different to mine. However, so long as you execute this line of code whenever you run your R scripts, you will get the same results each time!
3.3 Training and Test Sets
- Create a new variable in the nhs_reviews data that indicates whether an observation is a training-set or test-set observation. Ensure that approximately 80% of your observations are in the training set and 20% are in the test set. Label this variable as train.
Reveal code
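One way to do this (a sketch consistent with the explanation below; your own code may differ slightly):

# Randomly assign roughly 80% of observations to the training set (TRUE)
# and 20% to the test set (FALSE)
nhs_reviews$train <- sample(x = c(TRUE, FALSE),
                            size = nrow(nhs_reviews),
                            replace = TRUE,
                            prob = c(0.8, 0.2))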
The sample() function randomly samples from the elements provided to the x argument. Here, we are sampling from a vector which has only two elements: TRUE and FALSE. The size argument specifies the number of items to choose at random from x, and here we are asking to sample the same number of elements as there are observations in the nhs_reviews data. The replace argument specifies whether we want to sample from x with or without replacement (we want to sample with replacement here), and the prob argument takes a vector which gives the probability with which we would like to sample each value of x.
- Convert the
nhs_reviews
data into a corpus and then into a DFM. Make some feature selection decisions.
Reveal code
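A sketch of one possible approach, assuming the review text is stored in the review_text column and using the default English stop-word list:

# Convert the data frame into a quanteda corpus, using the review text as the document text
nhs_reviews_corpus <- corpus(nhs_reviews, text_field = "review_text")

# Tokenise, remove English stop words, and convert to a document-feature matrix
nhs_reviews_dfm <- nhs_reviews_corpus %>%
  tokens() %>%
  tokens_remove(stopwords("en")) %>%
  dfm()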
Remember that there is no single “right” way to make the feature selection decisions. Here, I have simply used a unigram representation and have removed stop words. You may make different decisions!
- Use the dfm_subset() function to subset your DFM into a training DFM and a test DFM. Use the dim() function to make sure that they have the numbers of rows and columns that you expect.
Reveal code
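Using the object names from above:

# Subset the DFM into training and test DFMs using the train variable
nhs_reviews_dfm_train <- dfm_subset(nhs_reviews_dfm, train)
nhs_reviews_dfm_test <- dfm_subset(nhs_reviews_dfm, !train)

# Check the dimensions of each
dim(nhs_reviews_dfm_train)
dim(nhs_reviews_dfm_test)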
Here I use the logical operator !, which reverses the TRUE and FALSE values of the train vector. So, the elements of train that are TRUE become FALSE when preceded by !, and the elements that are FALSE become TRUE.
Our DFMs contain different numbers of documents (as they should, given that we specified above that we wanted 80% of our data to be training data and 20% to be test data) but they contain the same number of features (again, as they should, since we will want to create predictions for our test-set observations).
3.4 Dictionaries
- Construct a simple dictionary using the dictionary() function which includes some words that you think are likely to capture positive sentiment in the patient reviews. Apply the dictionary to the training data DFM that you created above and create a binary variable which is equal to "Positive" when the relevant text includes a word from your dictionary and "Negative" otherwise.
Reveal code
## Create the dictionary
positive_dictionary <- dictionary(list(positive = c("great", "excellent", "fantastic", "thank")))
# Apply the dictionary to the DFM
nhs_dictionary_dfm_train <- dfm_lookup(x = nhs_reviews_dfm_train,
dictionary = positive_dictionary)
# Create a logical (TRUE/FALSE) vector with one element for each training set text
predicted_positive <- as.numeric(nhs_dictionary_dfm_train[,1]) > 0
# Convert the logical vector to a character vector, with TRUE entries equal to "Positive" and FALSE entries equal to "Negative", and assign it as a document variable on the dictionary DFM (which contains the training-set documents)
nhs_dictionary_dfm_train$predicted_positive_dictionary <- ifelse(predicted_positive, "Positive", "Negative")
- Use the table() function to create a confusion matrix which compares the predicted positive reviews (from your dictionary measure) to the true coding for positive reviews (which is stored in the review_positive variable). Save the output of the table() function as a new object.
Reveal code
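A sketch, using the dictionary-based predictions created above:

# Cross-tabulate the dictionary-based predictions against the true labels
confusion_dictionary <- table(predicted_classification = nhs_dictionary_dfm_train$predicted_positive_dictionary,
                              true_classification = nhs_dictionary_dfm_train$review_positive)
confusion_dictionary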
- Use the confusionMatrix() function from the caret package to calculate a variety of performance statistics about your dictionary classifier.1 What is the accuracy of the predictions? What is the sensitivity? What is the specificity? Interpret these numbers. What do they tell us about the performance of the dictionary for classifying positive patient reviews?
1 You should also set the positive
argument equal to "Positive"
to tell R the level of the outcome that corresponds to a “positive” result.
Reveal code
confusion_dictionary_statistics <- confusionMatrix(confusion_dictionary,
positive = "Positive")
confusion_dictionary_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 542 569
Positive 36 421
Accuracy : 0.6142
95% CI : (0.5895, 0.6383)
No Information Rate : 0.6314
P-Value [Acc > NIR] : 0.9247
Kappa : 0.3045
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.4253
Specificity : 0.9377
Pos Pred Value : 0.9212
Neg Pred Value : 0.4878
Prevalence : 0.6314
Detection Rate : 0.2685
Detection Prevalence : 0.2915
Balanced Accuracy : 0.6815
'Positive' Class : Positive
Our dictionary produces predictions which have an accuracy of 61%. Although the specificity is high – 94% – this is because the dictionary is classifying the vast majority of reviews as not-positive, and so is correctly assigning that label to the true not-positive reviews. The sensitivity is much lower, at only 43%. That is, we are correctly classifying less than half of the truly positive reviews. This makes sense because the dictionary we are using above is very short and so is likely to be missing a lot of words that would indicate positivity.
We could try to improve these scores by changing the list of dictionary words and then trying again, but instead we will now use a Naive Bayes classification model to learn which words are most strongly associated with positive and negative patient reviews.
3.5 Naive Bayes
- Use the training DFM that you created earlier to estimate a Naive Bayes model for the review_positive outcome using the textmodel_nb() function. This function takes three main arguments:
Argument | Description
---|---
x | The DFM on which the Naive Bayes model is trained.
y | An outcome vector which we are trying to predict (here, review_positive).
prior | The form of prior distribution for the categories of the outcome, y. Most typically we will set this to be "docfreq", which means that the prior probability for each category will be equal to the relative proportions of those classes in the training data.
Reveal code
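Putting these together:

# Train the Naive Bayes model on the training DFM
nb_train <- textmodel_nb(x = nhs_reviews_dfm_train,
                         y = nhs_reviews_dfm_train$review_positive,
                         prior = "docfreq")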
- Examine the class-conditional word probabilities from your estimated Naive Bayes model.2 You can access these by using the
coef()
function on the estimated model object. Which words have the highest probabilities of occurring under each class?3
2 You will find the sort()
function helpful here.
3 Note that your list of words might look different to the list here if you used different feature selection decisions at the beginning of the assignment. In particular, if you did not remove stopwords then you will find that both classes assign high probability to those words (as they should, given that stopwords are indeed common across all patient review texts).
Reveal code
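One way to do this, assuming that coef() returns a matrix of word probabilities with one column per class ("Negative" and "Positive"):

# Extract the class-conditional word probabilities
nb_word_probs <- coef(nb_train)

# The 20 most probable words in positive reviews...
head(sort(nb_word_probs[, "Positive"], decreasing = TRUE), 20)

# ...and the 20 most probable words in negative reviews
head(sort(nb_word_probs[, "Negative"], decreasing = TRUE), 20)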
. , staff practice surgery appointment
0.075829687 0.037050475 0.009988262 0.009241276 0.008921140 0.007619251
always gp service ! time helpful
0.006061253 0.006061253 0.005997225 0.005890513 0.005762459 0.005719774
thank doctor care well doctors friendly
0.005463664 0.004652652 0.004545940 0.004481912 0.004183118 0.004076406
reception get
0.003670900 0.003606872
. , appointment get call surgery
0.067243057 0.032460640 0.012294357 0.011874226 0.009021758 0.008667964
! told phone practice time gp
0.008093048 0.007584468 0.007473908 0.006147178 0.006036618 0.005771272
doctor day back one can just
0.004908898 0.004555103 0.004333982 0.004245533 0.004134973 0.004134973
appointments ?
0.004112860 0.004046524
These words largely make sense. In both positive and negative reviews, patients are likely to use language related to healthcare – “staff”, “practice”, “surgery”, “appointment”, “doctor”, etc. – but the language model for the “Positive” category also puts higher probabilities on many positively-valenced words such as “thank”, “helpful”, and “friendly”. By contrast, the “Negative” category puts high probabilities on words that are likely to be linked to unsatisfactory service, such as “told”, “phone”, “time”, “day”, and “!”.
- Estimate the predicted probability of a positive review from the Naive Bayes model. Use these probabilities to examine the
review_title
variable: which patient review titles are most likely to be positive reviews? Which are most likely to be negative? Do these make sense?
Reveal code
nhs_reviews_dfm_train$positive_nb_probability <- predict(nb_train, type = "probability")[,2]
nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = T)[1:10]]
[1] "Good GP Practice" "Efficient and compassionate service"
[3] "Fantastic practice" "Caring and professional service"
[5] "Nurse was amazing" "Fantastic Practice"
[7] "Consistently good care" "Exemplary Practice"
[9] "Fantastic practice" "Fantastic Practice"
nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = F)[1:10]]
[1] "Shambles"
[2] "Unacceptable practice"
[3] "Impossible to get a service"
[4] "Unprofessional, Poor and Dangerous"
[5] "Impossible to get an appointment!!!"
[6] "Honestly, you’re better off not having a GP"
[7] "Disappointed with appointments."
[8] "Not so great practice!!"
[9] "Very bad practice and arrogant receptionists"
[10] "Extremely rude receptionist"
Yes! It is very clear from these lists that the model is doing a reasonably good job of distinguishing between positive and negative reviews. To find out how good a job it is doing, we need to do more than a face validity check.
- Use your fitted Naive Bayes model to predict the category (positive or negative) for each observation in the training data. Then use the table() function to create a confusion matrix which compares the predicted positive reviews to the true coding for positive reviews (which is stored in the review_positive variable). As with the dictionary analysis, save the output of the table() function as a new object.
Reveal code
## Training set accuracy
nhs_reviews_dfm_train$predicted_classification_nb <- predict(nb_train, type = "class")
confusion_train <- table(predicted_classification = nhs_reviews_dfm_train$predicted_classification_nb,
true_classification = nhs_reviews_dfm_train$review_positive)
confusion_train
true_classification
predicted_classification Negative Positive
Negative 565 59
Positive 13 931
- Calculate the accuracy, sensitivity and specificity of your classifier using the
confusionMatrix()
function. Compare these scores to those that you obtained from the dictionary analysis. Which performs best?
Reveal code
confusion_train_statistics <- confusionMatrix(confusion_train,
positive = "Positive")
confusion_train_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 565 59
Positive 13 931
Accuracy : 0.9541
95% CI : (0.9425, 0.9639)
No Information Rate : 0.6314
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.903
Mcnemar's Test P-Value : 1.137e-07
Sensitivity : 0.9404
Specificity : 0.9775
Pos Pred Value : 0.9862
Neg Pred Value : 0.9054
Prevalence : 0.6314
Detection Rate : 0.5938
Detection Prevalence : 0.6020
Balanced Accuracy : 0.9590
'Positive' Class : Positive
The Naive Bayes model clearly outperforms the dictionary analysis. The sensitivity of the classifier is now 94%, and the accuracy is 95%. We are doing a much better job at detecting sentiment in the training set using the Naive Bayes approach.
- Repeat the assessment of classifier accuracy but now for the test set observations. How does the classifier perform on these data compared to its performance on the training data?
Reveal code
## Test set accuracy
nhs_reviews_dfm_test$predicted_classification_nb <- predict(nb_train,
newdata = nhs_reviews_dfm_test,
type = "class")
confusion_test <- table(predicted_classification = nhs_reviews_dfm_test$predicted_classification_nb,
true_classification = nhs_reviews_dfm_test$review_positive)
confusion_test_statistics <- confusionMatrix(confusion_test,
positive = "Positive")
confusion_test_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 170 35
Positive 6 207
Accuracy : 0.9019
95% CI : (0.8693, 0.9287)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8032
Mcnemar's Test P-Value : 1.226e-05
Sensitivity : 0.8554
Specificity : 0.9659
Pos Pred Value : 0.9718
Neg Pred Value : 0.8293
Prevalence : 0.5789
Detection Rate : 0.4952
Detection Prevalence : 0.5096
Balanced Accuracy : 0.9106
'Positive' Class : Positive
The accuracy on the test set is 90%, with a sensitivity of 86% and a specificity of 97%. The performance on the test set is slightly worse than on the training set, though it is still substantially better than the dictionary analysis.
3.6 Homework
Create two new DFMs from the nhs_reviews data, each time making different feature selection decisions (e.g. trimming, n-grams, stopword removal, etc.). Train two new Naive Bayes models using these DFMs and compare their performance on the test-set data to the performance of the model that you created above. Create a table comparing the results of the models you have estimated. Report the feature selection decisions you made for each model. Which performs best in terms of accuracy, sensitivity and specificity?
Upload your table to this Moodle page.
Reveal code
## No punctuation DFM
# Convert to a DFM
nhs_reviews_dfm_nopunct <- nhs_reviews_corpus %>%
tokens(remove_punct = T) %>%
dfm()
# Subset to training set observations
nhs_reviews_dfm_nopunct_train <- dfm_subset(nhs_reviews_dfm_nopunct, train)
# Subset to test set observations
nhs_reviews_dfm_nopunct_test <- dfm_subset(nhs_reviews_dfm_nopunct, !train)
# Train a new Naive Bayes model
nb_train_nopunct <- textmodel_nb(x = nhs_reviews_dfm_nopunct_train,
y = nhs_reviews_dfm_nopunct_train$review_positive,
prior = "docfreq")
## Test set accuracy
nhs_reviews_dfm_nopunct_test$predicted_classification <- predict(nb_train_nopunct,
newdata = nhs_reviews_dfm_nopunct_test,
type = "class")
confusion_test_nopunct <- table(predicted_classification = nhs_reviews_dfm_nopunct_test$predicted_classification,
true_classification = nhs_reviews_dfm_nopunct_test$review_positive)
confusion_test_nopunct_statistics <- confusionMatrix(confusion_test_nopunct, positive = "Positive")
confusion_test_nopunct_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 169 34
Positive 7 208
Accuracy : 0.9019
95% CI : (0.8693, 0.9287)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8029
Mcnemar's Test P-Value : 4.896e-05
Sensitivity : 0.8595
Specificity : 0.9602
Pos Pred Value : 0.9674
Neg Pred Value : 0.8325
Prevalence : 0.5789
Detection Rate : 0.4976
Detection Prevalence : 0.5144
Balanced Accuracy : 0.9099
'Positive' Class : Positive
## N-gram model
# Convert to a DFM
nhs_reviews_dfm_ngram <- nhs_reviews_corpus %>%
tokens() %>%
tokens_ngrams(1:3) %>%
dfm()
# Subset to training set observations
nhs_reviews_dfm_ngram_train <- dfm_subset(nhs_reviews_dfm_ngram, train)
# Subset to test set observations
nhs_reviews_dfm_ngram_test <- dfm_subset(nhs_reviews_dfm_ngram, !train)
# Train a new Naive Bayes model
nb_train_ngram <- textmodel_nb(x = nhs_reviews_dfm_ngram_train,
y = nhs_reviews_dfm_ngram_train$review_positive,
prior = "docfreq")
## Test set accuracy
nhs_reviews_dfm_ngram_test$predicted_classification <- predict(nb_train_ngram,
newdata = nhs_reviews_dfm_ngram_test,
type = "class")
confusion_test_ngram <- table(predicted_classification = nhs_reviews_dfm_ngram_test$predicted_classification,
true_classification = nhs_reviews_dfm_ngram_test$review_positive)
confusion_test_ngram_statistics <- confusionMatrix(confusion_test_ngram, positive = "Positive")
confusion_test_ngram_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 169 35
Positive 7 207
Accuracy : 0.8995
95% CI : (0.8666, 0.9266)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7983
Mcnemar's Test P-Value : 3.097e-05
Sensitivity : 0.8554
Specificity : 0.9602
Pos Pred Value : 0.9673
Neg Pred Value : 0.8284
Prevalence : 0.5789
Detection Rate : 0.4952
Detection Prevalence : 0.5120
Balanced Accuracy : 0.9078
'Positive' Class : Positive
accuracy <- c(confusion_test_statistics$overall[1],
confusion_test_nopunct_statistics$overall[1],
confusion_test_ngram_statistics$overall[1])
sensitivity <- c(confusion_test_statistics$byClass[1],
confusion_test_nopunct_statistics$byClass[1],
confusion_test_ngram_statistics$byClass[1])
specificity <- c(confusion_test_statistics$byClass[2],
confusion_test_nopunct_statistics$byClass[2],
confusion_test_ngram_statistics$byClass[2])
model_names <- c("Unigram", "No punctuation", "N-gram")
results <- data.frame(model_names,
accuracy,
sensitivity,
specificity)
results
model_names accuracy sensitivity specificity
1 Unigram 0.9019139 0.8553719 0.9659091
2 No punctuation 0.9019139 0.8595041 0.9602273
3 N-gram 0.8995215 0.8553719 0.9602273
There is very little difference between the different feature selection approaches. This is largely because predicting positivity is a very straightforward task! The words indicating positivity are very clear, and so we get a strong signal from this data. Note that this will not always be the case, especially when we are trying to classify a more nuanced concept or a set of more fine-grained categories.
- (Hard question) Cross-validation
In this question you should implement k-fold cross-validation to assess the accuracy, sensitivity, and specificity of one of the Naive Bayes models that you estimated above.
I have provided some starter code below for a function that will take a vector of logical values as input, which should be named held_out. That vector should take on TRUE values for observations in the held-out set and FALSE values for observations in the training set for each fold of the cross-validation. Your job is to fill in the "Your code goes here!" sections of the code. To do so, you can essentially copy the material on slide 57 of this week’s lecture, making modifications to the existing code where appropriate.
get_performance_scores <- function(held_out){
# Set up train and test sets for this fold
dfm_train <- dfm_subset(nhs_reviews_dfm, !held_out)
dfm_test <- dfm_subset(nhs_reviews_dfm, held_out)
# Train model on everything except held-out fold
# Your code goes here!
# Predict for held-out fold
# Your code goes here!
# Calculate accuracy, specificity, sensitivity
# Your code goes here!
# Save results in a data.frame
return(output)
}
get_performance_scores <- function(held_out){
# Set up train and test sets for this fold
dfm_train <- dfm_subset(nhs_reviews_dfm, !held_out)
dfm_test <- dfm_subset(nhs_reviews_dfm, held_out)
# Train model on everything except held-out fold
nb_train <- textmodel_nb(x = dfm_train,
y = dfm_train$review_positive,
prior = "docfreq")
# Predict for held-out fold
dfm_test$predicted_classification <- predict(nb_train,
newdata = dfm_test,
type = "class")
# Calculate accuracy, specificity, sensitivity
confusion_nb <- table(predicted_classification = dfm_test$predicted_classification,
true_classification = dfm_test$review_positive)
confusion_nb_statistics <- confusionMatrix(confusion_nb, positive = "Positive")
accuracy <- confusion_nb_statistics$overall[1]
sensitivity <- confusion_nb_statistics$byClass[1]
specificity <- confusion_nb_statistics$byClass[2]
return(data.frame(accuracy, sensitivity, specificity))
}
- Once you have completed this function, you will then need to create a vector to represent the \(k\) folds on which you are going to test the data in the cross-validation. To do so, you will need to use the sample() function to produce a vector that has the same number of elements as there are rows in the nhs_reviews_dfm object. Again, you can see how to do this in the lecture slides.
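A sketch, assuming five folds (the number of folds, k, is your choice):

# Number of cross-validation folds (an arbitrary choice for illustration)
k <- 5

# Randomly assign each document in the DFM to one of the k folds
folds <- sample(1:k, size = ndoc(nhs_reviews_dfm), replace = TRUE)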
- Finally, you will need to apply the get_performance_scores() function to each fold. To do so, you may find the lapply() function helpful. This function takes a vector as an input, applies a function to that vector, and returns a list object which includes the results of the function for each value of the vector that you input. Again, see the lecture for example code.
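Continuing the sketch above:

# Apply the function once per fold, holding out a different fold each time
cv_output <- lapply(1:k, function(i) get_performance_scores(held_out = folds == i))

# Combine the per-fold results and average them across folds
cv_results <- do.call(rbind, cv_output)
colMeans(cv_results)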
- Report your result!