3.1 Classifying NHS reviews
How satisfied are citizens with the healthcare services their governments provide? One interesting source of data on this question comes in the form of text, as users of healthcare services are often able to leave comments and feedback on their experiences of accessing government-provided care. In this seminar, we will use Naive Bayes models to predict whether patient reviews of doctors’ surgeries in the UK are positive or negative. To do so, we will use a corpus of patient reviews of the UK’s National Health Service (NHS) facilities near to where they live.
3.1.1 Data
We will use one data source for today’s assignment.
- NHS Patient Reviews – nhs_reviews.Rdata
This file contains 2000 reviews of NHS doctors’ surgeries across the UK. The data contains the following variables:
Variable | Description |
---|---|
review_title | The title of the patient’s review |
review_text | The text of the patient’s review |
star_rating | The star rating (out of five) that the patient gave |
review_positive | A categorical indicator equal to "Positive" if the patient gave 3 stars or more in their review, and "Negative" if they gave 1 or 2 stars |
review_date | The date of the review |
gp_response | A categorical variable which measures whether the doctors’ surgery provided a response to the patient’s review ("Responded") or has not yet provided a response ("Has not responded") |
Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following command:
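# Load the data (this assumes the file is saved in your current working
# directory -- adjust the file path if you stored it somewhere else)
load("nhs_reviews.Rdata")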
As ever, you can take a quick look at the variables in the data by using the glimpse() function from the tidyverse package:
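# The loaded object is called nhs_reviews
glimpse(nhs_reviews)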
Rows: 2,000
Columns: 6
$ review_title <chr> "Great service", "Friendly helpful staff", "Excellent …
$ review_text <chr> "Phoned up to book a appointment receptionist very hel…
$ star_rating <dbl> 4, 5, 5, 1, 1, 5, 5, 5, 5, 1, 3, 5, 2, 5, 2, 5, 1, 1, …
$ review_positive <fct> Positive, Positive, Positive, Negative, Negative, Posi…
$ review_date <date> 2021-10-13, 2021-07-26, 2021-09-18, 2021-06-19, 2021-…
$ gp_response <chr> "Has not responded", "Responded", "Has not responded",…
3.1.2 Packages
You will need to load the following packages before beginning the assignment:
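# At a minimum, the functions used in this seminar come from these packages
library(tidyverse)           # glimpse() and general data manipulation
library(quanteda)            # corpus(), dfm() and related functions
library(quanteda.textmodels) # textmodel_nb()
library(caret)               # confusionMatrix()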
3.1.3 Random number generation
This seminar requires you to randomly generate a training set and a test set. Any procedure which requires random number generation can cause the results you get to fluctuate each time you run the code. In this case, rerunning the code will result in different observations being allocated into the test and training sets, for example.
To make the results fully replicable, you will therefore first need to use the set.seed()
function, which takes as an argument a single integer value. Once you have specified a seed, R will generate the same sequence of random numbers every time you execute your code.
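# Set the seed so that the random train/test split below is reproducible
set.seed(12345)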
I have used 12345 here, but you can choose any value you like. If you choose a different number, then your results will be somewhat different to mine. However, so long as you execute this line of code whenever you run your R scripts, you will get the same results each time!
3.2 Training and Test Sets
- Create a new variable in the nhs_reviews data that indicates whether an observation is a training set or test set observation. Ensure that approximately 80% of your observations are in the training set and 20% are in the test set. Label this variable as train.
Reveal code
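One way to do this is with the sample() function:

# Randomly assign each review to the training set (TRUE) or the test set (FALSE),
# with roughly 80% of observations in the training set
nhs_reviews$train <- sample(x = c(TRUE, FALSE),
                            size = nrow(nhs_reviews),
                            replace = TRUE,
                            prob = c(0.8, 0.2))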
The sample() function randomly samples from the elements provided to the x argument. Here, we are sampling from a vector which has only two elements: TRUE and FALSE. The size argument specifies the number of items to choose at random from x, and here we are asking to sample the same number of elements as there are observations in the nhs_reviews data. The replace argument specifies whether we want to sample from x with or without replacement (we want to sample with replacement here), and the prob argument takes a vector which gives the probability with which we would like to sample each value of x.
- Convert the nhs_reviews data into a corpus and then into a DFM. Make some feature selection decisions.
Reveal code
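Here is one possible approach; your feature selection decisions may well differ. The text_field argument tells corpus() which variable contains the review text.

# Convert the data to a quanteda corpus, keeping the other variables as docvars
nhs_reviews_corpus <- corpus(nhs_reviews, text_field = "review_text")

# Tokenise, remove English stopwords, and construct the document-feature matrix
nhs_reviews_dfm <- nhs_reviews_corpus %>%
  tokens() %>%
  tokens_remove(stopwords("en")) %>%
  dfm()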
Remember that there is no single “right” way to make the feature selection decisions. Here, I have simply used a unigram representation and have removed stop words. You may make different decisions!
- Use the dfm_subset() function to subset your DFM into a training DFM and a test DFM. Use the dim() function to make sure that they have the numbers of rows and columns that you expect.
Reveal code
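For example, using the DFM and the train indicator created above:

# Training-set DFM: documents for which train is TRUE
nhs_reviews_dfm_train <- dfm_subset(nhs_reviews_dfm, train)

# Test-set DFM: documents for which train is FALSE
nhs_reviews_dfm_test <- dfm_subset(nhs_reviews_dfm, !train)

# Check the dimensions of each DFM
dim(nhs_reviews_dfm_train)
dim(nhs_reviews_dfm_test)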
Here I use the logical operator !, which reverses the TRUE and FALSE values of the train vector. So, the elements of train that are TRUE become FALSE when preceded by !, and the elements that are FALSE become TRUE.
Our DFMs contain different numbers of documents (as they should, given that we specified above that we wanted 80% of our data to be training data and 20% to be test data) but they contain the same number of features (again, as they should, since we will want to create predictions for our test-set observations).
3.3 Dictionaries
- Construct a simple dictionary using the dictionary() function which includes some words that you think are likely to capture positive sentiment in the patient reviews. Apply the dictionary to the training data DFM that you created above and create a binary variable which is equal to "Positive" when the relevant text includes a word from your dictionary and "Negative" otherwise.
Reveal code
## Create the dictionary
positive_dictionary <- dictionary(list(positive = c("great", "excellent", "fantastic", "thank")))
# Apply the dictionary to the DFM
nhs_dictionary_dfm_train <- dfm_lookup(x = nhs_reviews_dfm_train,
dictionary = positive_dictionary)
# Create a logical (TRUE/FALSE) vector with one element for each training set text
predicted_positive <- as.numeric(nhs_dictionary_dfm_train[,1]) > 0
# Convert the logical vector to a character vector, with TRUE entries equal to
# "Positive" and FALSE entries equal to "Negative", and store it as a document
# variable of the dictionary DFM
nhs_dictionary_dfm_train$predicted_positive_dictionary <- ifelse(predicted_positive, "Positive", "Negative")
- Use the table() function to create a confusion matrix which compares the predicted positive reviews (from your dictionary measure) to the true coding for positive reviews (which is stored in the review_positive variable). Save the output of the table() function as a new object.
Reveal code
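For example, using the document variables stored on the dictionary DFM (confusion_dictionary is the object name used in the next question):

# Cross-tabulate the dictionary-based predictions against the true labels
confusion_dictionary <- table(predicted_classification = nhs_dictionary_dfm_train$predicted_positive_dictionary,
                              true_classification = nhs_dictionary_dfm_train$review_positive)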
- Use the confusionMatrix() function from the caret package to calculate a variety of performance statistics about your dictionary classifier.1 What is the accuracy of the predictions? What is the sensitivity? What is the specificity? Interpret these numbers. What do they tell us about the performance of the dictionary for classifying positive patient reviews?
1 You should also set the positive argument equal to "Positive" to tell R the level of the outcome that corresponds to a “positive” result.
Reveal code
confusion_dictionary_statistics <- confusionMatrix(confusion_dictionary,
positive = "Positive")
confusion_dictionary_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 542 569
Positive 36 421
Accuracy : 0.6142
95% CI : (0.5895, 0.6383)
No Information Rate : 0.6314
P-Value [Acc > NIR] : 0.9247
Kappa : 0.3045
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.4253
Specificity : 0.9377
Pos Pred Value : 0.9212
Neg Pred Value : 0.4878
Prevalence : 0.6314
Detection Rate : 0.2685
Detection Prevalence : 0.2915
Balanced Accuracy : 0.6815
'Positive' Class : Positive
Our dictionary produces predictions which have an accuracy of 61%. Although the specificity is high – 94% – this is because the dictionary is classifying the vast majority of reviews as not-positive, and so is correctly assigning that label to the true not-positive reviews. The sensitivity is much lower, at only 43%: we are correctly classifying less than half of the true “positive” reviews. This makes sense because the dictionary we are using above is very short, and so is likely to be missing a lot of words that would indicate positivity.
We could try to improve these scores by changing the list of dictionary words and trying again, but instead we will now use a Naive Bayes classification model to learn which words are most strongly associated with positive and negative patient reviews.
3.4 Naive Bayes
- Use the training DFM that you created earlier to estimate a Naive Bayes model for the review_positive outcome using the textmodel_nb() function. This function takes three main arguments:
Argument | Description |
---|---|
x | The DFM on which the Naive Bayes model is trained. |
y | An outcome vector which we are trying to predict (here, review_positive). |
prior | The form of prior distribution for the categories of the outcome, y. Most typically we will set this to "docfreq", which means that the prior probability for each category will be equal to the relative proportions of those classes in the training data. |
Reveal code
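Something like the following, where nb_train is the object name used in the rest of the code:

# Estimate the Naive Bayes model on the training DFM
nb_train <- textmodel_nb(x = nhs_reviews_dfm_train,
                         y = nhs_reviews_dfm_train$review_positive,
                         prior = "docfreq")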
- Examine the class-conditional word probabilities from your estimated Naive Bayes model.2 You can access these by using the coef() function on the estimated model object. Which words have the highest probabilities of occurring under each class?3
2 You will find the sort() function helpful here.
3 Note that your list of words might look different to the list here if you used different feature selection decisions at the beginning of the assignment. In particular, if you did not remove stopwords then you will find that both classes assign high probabilities to those words (as they should, given that they are indeed common across all patient review texts).
Reveal code
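One way to produce the output below; this assumes that coef() returns a matrix of word probabilities with one column per class, named after the outcome categories.

# Class-conditional word probabilities (one row per feature, one column per class)
nb_word_probs <- coef(nb_train)

# The 20 most probable words in the "Positive" class
sort(nb_word_probs[, "Positive"], decreasing = TRUE)[1:20]

# The 20 most probable words in the "Negative" class
sort(nb_word_probs[, "Negative"], decreasing = TRUE)[1:20]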
. , staff practice surgery appointment
0.075730102 0.037033087 0.009983574 0.009215607 0.008916953 0.007594343
always gp service ! time helpful
0.006058408 0.006037076 0.005994411 0.005887749 0.005759754 0.005717089
thank doctor care well doctors friendly
0.005461100 0.004650468 0.004543806 0.004479809 0.004181155 0.004074493
reception get
0.003669177 0.003605180
. , appointment get call surgery
0.067154285 0.032439121 0.012286207 0.011866354 0.009015778 0.008662218
! told phone practice time gp
0.008087683 0.007579440 0.007468953 0.006143103 0.006032616 0.005767446
doctor day back one can just
0.004905644 0.004552084 0.004331109 0.004242719 0.004132231 0.004132231
appointments ?
0.004088036 0.004043841
These words largely make sense. In both positive and negative reviews, patients are likely to use language related to healthcare – “staff”, “practice”, “surgery”, “appointment”, “doctor”, etc – but the language model for the “Positive” category also puts higher probabilities on many positively-valenced words such as “thank”, “helpful”, and “friendly”. By contrast, the “Negative” category puts high probabilities on words that are likely to be linked to unsatisfactory service, such as “told”, “phone”, “day”, and “!”.
- Estimate the predicted probability of a positive review from the Naive Bayes model. Use these probabilities to examine the review_title variable: which patient review titles are most likely to be positive reviews? Which are most likely to be negative? Do these make sense?
Reveal code
# Predicted probability that each training-set review is positive
nhs_reviews_dfm_train$positive_nb_probability <- predict(nb_train, type = "probability")[,2]

# Titles of the ten reviews with the highest predicted probability of being positive
nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = T)[1:10]]
[1] "Good GP Practice" "Efficient and compassionate service"
[3] "Fantastic practice" "Caring and professional service"
[5] "Nurse was amazing" "Fantastic Practice"
[7] "Consistently good care" "Exemplary Practice"
[9] "Fantastic practice" "Fantastic Practice"
# Titles of the ten reviews with the lowest predicted probability of being positive
nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = F)[1:10]]
[1] "Shambles"
[2] "Unacceptable practice"
[3] "Impossible to get a service"
[4] "Unprofessional, Poor and Dangerous"
[5] "Impossible to get an appointment!!!"
[6] "Honestly, you’re better off not having a GP"
[7] "Disappointed with appointments."
[8] "Not so great practice!!"
[9] "Very bad practice and arrogant receptionists"
[10] "Extremely rude receptionist"
Yes! It is very clear from these lists that the model is doing a reasonably good job of distinguishing between positive and negative reviews. To find out how good a job it is doing, we need to do more than a face validity check.
- Use your fitted Naive Bayes model to predict the category (positive or negative) for each observation in the training data. Then use the table() function to create a confusion matrix which compares the predicted positive reviews to the true coding for positive reviews (which is stored in the review_positive variable). As with the dictionary analysis, save the output of the table() function as a new object.
Reveal code
## Training set accuracy
nhs_reviews_dfm_train$predicted_classification_nb <- predict(nb_train, type = "class")
confusion_train <- table(predicted_classification = nhs_reviews_dfm_train$predicted_classification_nb,
true_classification = nhs_reviews_dfm_train$review_positive)
confusion_train
true_classification
predicted_classification Negative Positive
Negative 565 58
Positive 13 932
- Calculate the accuracy, sensitivity and specificity of your classifier using the confusionMatrix() function. Compare these scores to those that you obtained from the dictionary analysis. Which performs best?
Reveal code
confusion_train_statistics <- confusionMatrix(confusion_train,
positive = "Positive")
confusion_train_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 565 58
Positive 13 932
Accuracy : 0.9547
95% CI : (0.9432, 0.9645)
No Information Rate : 0.6314
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9043
Mcnemar's Test P-Value : 1.772e-07
Sensitivity : 0.9414
Specificity : 0.9775
Pos Pred Value : 0.9862
Neg Pred Value : 0.9069
Prevalence : 0.6314
Detection Rate : 0.5944
Detection Prevalence : 0.6027
Balanced Accuracy : 0.9595
'Positive' Class : Positive
The Naive Bayes model clearly outperforms the dictionary analysis. The sensitivity of the classifier is now 94%, and the accuracy is 95%. We are doing a much better job at detecting sentiment in the training set using the Naive Bayes approach.
- Repeat the assessment of classifier accuracy but now for the test set observations. How does the classifier perform on these data compared to its performance on the training data?
Reveal code
## Test set accuracy
nhs_reviews_dfm_test$predicted_classification_nb <- predict(nb_train,
newdata = nhs_reviews_dfm_test,
type = "class")
confusion_test <- table(predicted_classification = nhs_reviews_dfm_test$predicted_classification_nb,
true_classification = nhs_reviews_dfm_test$review_positive)
confusion_test_statistics <- confusionMatrix(confusion_test,
positive = "Positive")
confusion_test_statistics
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 170 35
Positive 6 207
Accuracy : 0.9019
95% CI : (0.8693, 0.9287)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8032
Mcnemar's Test P-Value : 1.226e-05
Sensitivity : 0.8554
Specificity : 0.9659
Pos Pred Value : 0.9718
Neg Pred Value : 0.8293
Prevalence : 0.5789
Detection Rate : 0.4952
Detection Prevalence : 0.5096
Balanced Accuracy : 0.9106
'Positive' Class : Positive
The accuracy on the test set is 90%, with a sensitivity of 86% and a specificity of 97%. The performance on the test set is somewhat worse than on the training set, though it is still substantially better than the dictionary analysis.
3.5 Homework
- Create two new DFMs from the nhs_reviews data, each time making different feature selection decisions (e.g. trimming, n-grams, stopword removal, etc). Train two new Naive Bayes models using these DFMs and compare their performance on the test-set data to the performance of the model that you created above.
- Create a table comparing the results of the models you have estimated. Report the feature selection decisions you made for each model. Which performs best in terms of accuracy, sensitivity and specificity?
Upload your table to this Moodle page.
- (Hard question) Cross-validation
In this question you should implement k-fold cross-validation to assess the accuracy, sensitivity, and specificity of one of the Naive Bayes models that you estimated above.
I have provided some starter code below for a function that will take a vector of logical values as input, which should be named held_out. That vector should take on TRUE values for observations in the held-out set and FALSE values for observations in the training set for each fold of the cross-validation.
Your job is to fill in the "Your code goes here!" sections of the code. To do so, you can essentially copy the material from slide 58 of this week’s lecture, making modifications where appropriate to adapt it to this task.
get_performance_scores <- function(held_out){

  # Set up train and test sets for this fold
  dfm_train <- dfm_subset(nhs_reviews_dfm, !held_out)
  dfm_test <- dfm_subset(nhs_reviews_dfm, held_out)

  # Train model on everything except held-out fold
  # Your code goes here!

  # Predict for held-out fold
  # Your code goes here!

  # Calculate accuracy, specificity, sensitivity
  # Your code goes here!

  # Save results in a data.frame
  return(output)

}
- Once you have completed this function, you will then need to create a vector to represent the \(k\) folds on which you are going to test the data in the cross-validation. To do so, you will need to use the sample() function to produce a vector that has the same number of elements as there are rows in the nhs_reviews_dfm object. Again, you can see how to do this in the lecture slides.
- Finally, you will need to apply the get_performance_scores() function to each fold. To do so, you may find the lapply() function helpful. This function takes a vector as input, applies a function to each element of that vector, and returns a list object which contains the results of the function for each element of the vector that you input. Again, see the lecture for example code.
- Report your result!