## 4.1 Classifying NHS reviews

How satisfied are citizens with the healthcare services their governments provide? One interesting source of data on this question comes in the form of text, as users of healthcare services are often able to provide comments and feedback on the experiences they have when accessing government provided healthcare services. In this seminar, we will use Naive Bayes models to predict whether patient reviews of doctors’ surgeries in the UK are positive or negative. To do so, we will use a corpus of patient reviews of the UK’s National Health Service (NHS) facilities near to where they live.

### 4.1.1 Data

We will use one data source for today’s assignment.

**NHS Patient Reviews**–`nhs_reviews.Rdata`

This file contains 2000 reviews of NHS doctors’ surgeries across the UK. The data contains the following variables:

Variable | Description |
---|---|

`review_title` |
The title of the patient’s review |

`review_text` |
The text of the patient’s review |

`star_rating` |
The star rating (out of five) that the patient gave |

`review_positive` |
A categorical indicator equal to `"Positive"` if the patient gave 3 stars or more in their review, and `"Negative"` if they gave 1 or 2 stars |

`review_date` |
The date of the review |

`gp_response` |
A categorical variable which measures whether the doctors’ surgery provided a response to the patient’s review (`"Responded"` ) or has not yet provided a response (`"Has not responded"` ) |

Once you have downloaded this file and stored it somewhere sensible, you can load it into R using the following command:

As ever, you can take a quick look at the variables in the data by using the `glimpse()`

function from the `tidyverse`

package:

```
Rows: 2,000
Columns: 6
$ review_title <chr> "Great service", "Friendly helpful staff", "Excellent …
$ review_text <chr> "Phoned up to book a appointment receptionist very hel…
$ star_rating <dbl> 4, 5, 5, 1, 1, 5, 5, 5, 5, 1, 3, 5, 2, 5, 2, 5, 1, 1, …
$ review_positive <fct> Positive, Positive, Positive, Negative, Negative, Posi…
$ review_date <date> 2021-10-13, 2021-07-26, 2021-09-18, 2021-06-19, 2021-…
$ gp_response <chr> "Has not responded", "Responded", "Has not responded",…
```

### 4.1.2 Packages

You will need to load the following packages before beginning the assignment

### 4.1.3 Random number generation

This seminar requires you to randomly generate a training set and a test set. Any procedure which requires random number generation can cause the results you get to fluctuate each time you run the code. In this case, rerunning the code will result in different observations being allocated into the test and training sets, for example.

To make the results fully replicable, you will therefore first need to use the `set.seed()`

function, which takes as an argument a single integer value. Once you have specified a seed, R will generate the same sequence of random numbers every time you execute your code.

I have used `12345`

here, but you can choose any value you like. If you choose a different number, then your results will be somewhat different to mine. However, so long as you execute this line of code whenever you run your R scripts, you will get the same results each time!

## 4.2 Training and Test Sets

- Create a new variable in the
`nhs_reviews`

data that indicates whether an observation is a training set or test set observation. Ensure that approximately 80% of your observations in the training set and 20% are in the test set. Label this variable as`train`

## Reveal code

The

`sample()`

function randomly samples from the elements provided to the`x`

argument. Here, we are sampling from a vector which has only two elements:`TRUE`

and`FALSE`

. The`size`

argument specifies the number of items to choose at random from`x`

, and here we are asking to sample the same number of elements as there are observations in`nhs_reviews`

data. The`replace`

argument specifies whether we want to sample from`x`

with or without replacement (we want to sample with replacement here), and the`prob`

argument takes a vector which gives the probability with which we would like to sample each value of`x`

.

- Convert the
`nhs_reviews`

data into a corpus and then into a DFM. Make some feature selection decisions.

## Reveal code

Remember that there is no single “right” way to make the feature selection decisions. Here, I have simply used a unigram representation and have removed stop words. You may make different decisions!

- Use the
`dfm_subset()`

function to subset your DFM into a training DFM and a test DFM. Use the`dim()`

function to make sure that they have the numbers of rows and columns that you expect.

## Reveal code

Here I use the logical operator

`!`

which reverses the`TRUE`

and`FALSE`

values of the`train`

vector. So, the elements of`train`

that are`TRUE`

become`FALSE`

when preceded by`!`

and the elements that are`FALSE`

become`TRUE`

.

Our DFMs contain different numbers of documents (as they should, given that we specified above that we wanted 80% of our data to be training data and 20% to be test data) but they contain the same number of features (again, as they should, since we will want to create predictions for our test-set observations).

## 4.3 Dictionaries

- Construct a simple dictionary using the
`dictionary()`

function which includes some words that you think are likely to capture positive sentiment in the patient reviews. Apply the dictionary to the training data DFM that you created above and create a binary variable which is equal to`Positive`

when the relevant text includes a word from your dictionary and`Negative`

otherwise.

## Reveal code

```
## Create the dictionary
positive_dictionary <- dictionary(list(positive = c("great", "excellent", "fantastic", "thank")))
# Apply the dictionary to the DFM
nhs_dictionary_dfm_train <- dfm_lookup(x = nhs_reviews_dfm_train,
dictionary = positive_dictionary)
# Create a logical (TRUE/FALSE) vector with one element for each training set text
predicted_positive <- as.numeric(nhs_dictionary_dfm_train[,1]) > 0
# Convert the logical vector to a character vector, with TRUE entries equal to "Positive" and FALSE entries equal to "Negative". Assign the character vector to the training set data.
nhs_dictionary_dfm_train$predicted_positive_dictionary <- ifelse(predicted_positive, "Positive", "Negative")
```

- Use the
`table()`

function to create a confusion matrix which compares the predicted positive reviews (from your dictionary measure) to the true coding for positive reviews (which is stored in the`review_positive`

variable). Save the output of the`table()`

function as a new object.

## Reveal code

- Use the
`confusionMatrix()`

function from the`caret`

package to calculate a variety of performance statistics about your dictionary classifier.^{1}What is the accuracy of the predictions? What is the sensitivity? What is the specificity? Interpret these numbers. What do they tell us about the performance of the dictionary for classifying positive patient reviews?

^{1} You should also set the `positive`

argument equal to `"Positive"`

to tell R the level of the outcome that corresponds to a “positive” result.

## Reveal code

```
confusion_dictionary_statistics <- confusionMatrix(confusion_dictionary,
positive = "Positive")
confusion_dictionary_statistics
```

```
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 542 569
Positive 36 421
Accuracy : 0.6142
95% CI : (0.5895, 0.6383)
No Information Rate : 0.6314
P-Value [Acc > NIR] : 0.9247
Kappa : 0.3045
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.4253
Specificity : 0.9377
Pos Pred Value : 0.9212
Neg Pred Value : 0.4878
Prevalence : 0.6314
Detection Rate : 0.2685
Detection Prevalence : 0.2915
Balanced Accuracy : 0.6815
'Positive' Class : Positive
```

Our dictionary produces predictions which have an accuracy of 61%. Although the specificity is high – 94% – this is because the dictionary is classifying the vast majority of reviews as not-positive, and so is correctly assigning that label to the true not-positive reviews. The sensitivity is much lower, at only 43%. That is, we are correctly classifying less than half of the true “positive” reviews correctly. This makes sense because the dictionary we are using above is very short and so is likely to be missing a lot of words that would indicate positivity.

We could try to improve these scores by changing the list of dictionary words and then trying again, but instead we will now use a Naive Bayes classification model to instead

learnwhich words are most strongly associated with positive and negative patient reviews.

## 4.4 Naive Bayes

- Use the training DFM that you created earier to estimate a Naive Bayes model for the
`review_positive`

outcome using the`textmodel_nb()`

function. This function takes three main arguments:

Argument | Description |
---|---|

`x` |
The DFM on which the Naive Bayes model is trained. |

`y` |
An outcome vector which we are trying to predict (here, `review_positive` ). |

`prior` |
The form of prior distribution for the categories of the outcome, `y` . Most typically we will set this to be `"docfreq"` which means that the prior probability for each category will be equal to the relative proportions of those classes in the training data. |

## Reveal code

- Examine the class-conditional word probabilities from your estimated Naive Bayes model.
^{2}You can access these by using the`coef()`

function on the estimated model object. Which words have the highest probabilities of occurring under each class?^{3}

^{2} You will find the `sort()`

function helpful here.

^{3} Note that your list of words might look different to the list here if you used different feature selection decisions at the beginning of the assignment. In particular, if you did not remove stopwords then you will find that both classes have high probability of those words (as they should, given that they are indeed common across all patient review texts).

## Reveal code

```
. , staff practice surgery appointment
0.075730102 0.037033087 0.009983574 0.009215607 0.008916953 0.007594343
always gp service ! time helpful
0.006058408 0.006037076 0.005994411 0.005887749 0.005759754 0.005717089
thank doctor care well doctors friendly
0.005461100 0.004650468 0.004543806 0.004479809 0.004181155 0.004074493
reception get
0.003669177 0.003605180
```

```
. , appointment get call surgery
0.067154285 0.032439121 0.012286207 0.011866354 0.009015778 0.008662218
! told phone practice time gp
0.008087683 0.007579440 0.007468953 0.006143103 0.006032616 0.005767446
doctor day back one can just
0.004905644 0.004552084 0.004331109 0.004242719 0.004132231 0.004132231
appointments ?
0.004088036 0.004043841
```

These words largely make sense. In both positive and negative reviews, patients are likely to use language related to healthcare – “staff”, “practice”, “surgery”, “appointment”, “doctor”, etc – but the language model for the “Positive” category also puts higher probabilities on many positively-valanced words such as “thank”, “helpful”, and “friendly”. By contrast, the negative category puts high probabilities on words that are likely to be linked to unsatisfactory service such as “time”, “service”, “day”, and “!”.

- Estimate the predicted probability of a positive review from the Naive Bayes model. Use these probabilities to examine the
`review_title`

variable: which patient review titles are most likely to be positive reviews? Which are most likely to be negative? Do these make sense?

## Reveal code

```
nhs_reviews_dfm_train$positive_nb_probability <- predict(nb_train, type = "probability")[,2]
nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = T)[1:10]]
```

```
[1] "Good GP Practice" "Efficient and compassionate service"
[3] "Fantastic practice" "Caring and professional service"
[5] "Nurse was amazing" "Fantastic Practice"
[7] "Consistently good care" "Exemplary Practice"
[9] "Fantastic practice" "Fantastic Practice"
```

`nhs_reviews_dfm_train$review_title[order(nhs_reviews_dfm_train$positive_nb_probability, decreasing = F)[1:10]]`

```
[1] "Shambles"
[2] "Unacceptable practice"
[3] "Impossible to get a service"
[4] "Unprofessional, Poor and Dangerous"
[5] "Impossible to get an appointment!!!"
[6] "Honestly, you’re better off not having a GP"
[7] "Disappointed with appointments."
[8] "Not so great practice!!"
[9] "Very bad practice and arrogant receptionists"
[10] "Extremely rude receptionist"
```

Yes! It is very clear from these lists that the model is doing a reasonably good job of distinguishing between positive and negative reviews. To find out

howgood a job it is doing, we need to do more than a face validity check.

- Use your fitted Naive Bayes model to predict the category (positive or negative) for each observation in the
*training*data. Then use the`table()`

function to create a confusion matrix which compares the predicted positive reviews to the true coding for positive reviews (which is stored in the`review_positive`

variable). As with the dictionary analysis, save the output of the`table()`

function as a new object.

## Reveal code

```
## Training set accuracy
nhs_reviews_dfm_train$predicted_classification_nb <- predict(nb_train, type = "class")
confusion_train <- table(predicted_classification = nhs_reviews_dfm_train$predicted_classification_nb,
true_classification = nhs_reviews_dfm_train$review_positive)
confusion_train
```

```
true_classification
predicted_classification Negative Positive
Negative 565 58
Positive 13 932
```

- Calculate the accuracy, sensitivity and specificity of your classifier using the
`confusionMatrix()`

function. Compare these scores to those that you obtained from the dictionary analysis. Which performs best?

## Reveal code

```
confusion_train_statistics <- confusionMatrix(confusion_train,
positive = "Positive")
confusion_train_statistics
```

```
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 565 58
Positive 13 932
Accuracy : 0.9547
95% CI : (0.9432, 0.9645)
No Information Rate : 0.6314
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9043
Mcnemar's Test P-Value : 1.772e-07
Sensitivity : 0.9414
Specificity : 0.9775
Pos Pred Value : 0.9862
Neg Pred Value : 0.9069
Prevalence : 0.6314
Detection Rate : 0.5944
Detection Prevalence : 0.6027
Balanced Accuracy : 0.9595
'Positive' Class : Positive
```

The Naive Bayes model clearly outperforms the dictionary analysis. The sensitivity of the classifier is now 94%, and the accuracy is 95%. We are doing a

muchbetter job at detecting sentiment in the training set using the Naive Bayes approach.

- Repeat the assessment of classifier accuracy but now for the test set observations. How does the classifier perform on these data compared to its performance on the training data?

## Reveal code

```
## Test set accuracy
nhs_reviews_dfm_test$predicted_classification_nb <- predict(nb_train,
newdata = nhs_reviews_dfm_test,
type = "class")
confusion_test <- table(predicted_classification = nhs_reviews_dfm_test$predicted_classification_nb,
true_classification = nhs_reviews_dfm_test$review_positive)
confusion_test_statistics <- confusionMatrix(confusion_test,
positive = "Positive")
confusion_test_statistics
```

```
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 170 35
Positive 6 207
Accuracy : 0.9019
95% CI : (0.8693, 0.9287)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8032
Mcnemar's Test P-Value : 1.226e-05
Sensitivity : 0.8554
Specificity : 0.9659
Pos Pred Value : 0.9718
Neg Pred Value : 0.8293
Prevalence : 0.5789
Detection Rate : 0.4952
Detection Prevalence : 0.5096
Balanced Accuracy : 0.9106
'Positive' Class : Positive
```

The accuracy on the test set is 90%, with a sensitivity of 86% and a specificity of 97%. The performance on the test set is a fraction worse than on the training set, though it is still significantly better than the dictionary analysis.

## 4.5 Homework

Create two new DFMs from the

`nhs_reviews`

data, each time making different feature selection decisions (i.e. trimming, n-grams, stopword removal, etc). Train two new Naive Bayes models using these DFMs and compare their performance on the test set data to the performance of the model that you created above.Create a table comparing the results the models you have estimated. Report the feature selection decisions you made for each model. Which performs best in terms of accuracy, sensitivity and specificity ?

Upload your table to this Moodle page.

## Reveal code

```
## No punctuation DFM
# Convert to a DFM
nhs_reviews_dfm_nopunct <- nhs_reviews_corpus %>%
tokens(remove_punct = T) %>%
dfm()
# Subset to training set observations
nhs_reviews_dfm_nopunct_train <- dfm_subset(nhs_reviews_dfm_nopunct, train)
# Subset to test set observations
nhs_reviews_dfm_nopunct_test <- dfm_subset(nhs_reviews_dfm_nopunct, !train)
# Train a new Naive Bayes model
nb_train_nopunct <- textmodel_nb(x = nhs_reviews_dfm_nopunct_train,
y = nhs_reviews_dfm_nopunct_train$review_positive,
prior = "docfreq")
## Test set accuracy
nhs_reviews_dfm_nopunct_test$predicted_classification <- predict(nb_train_nopunct,
newdata = nhs_reviews_dfm_nopunct_test,
type = "class")
confusion_test_nopunct <- table(predicted_classification = nhs_reviews_dfm_nopunct_test$predicted_classification,
true_classification = nhs_reviews_dfm_nopunct_test$review_positive)
confusion_test_nopunct_statistics <- confusionMatrix(confusion_test_nopunct, positive = "Positive")
confusion_test_nopunct_statistics
```

```
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 169 34
Positive 7 208
Accuracy : 0.9019
95% CI : (0.8693, 0.9287)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8029
Mcnemar's Test P-Value : 4.896e-05
Sensitivity : 0.8595
Specificity : 0.9602
Pos Pred Value : 0.9674
Neg Pred Value : 0.8325
Prevalence : 0.5789
Detection Rate : 0.4976
Detection Prevalence : 0.5144
Balanced Accuracy : 0.9099
'Positive' Class : Positive
```

```
## N-gram model
# Convert to a DFM
nhs_reviews_dfm_ngram <- nhs_reviews_corpus %>%
tokens() %>%
tokens_ngrams(1:3) %>%
dfm()
# Subset to training set observations
nhs_reviews_dfm_ngram_train <- dfm_subset(nhs_reviews_dfm_ngram, train)
# Subset to test set observations
nhs_reviews_dfm_ngram_test <- dfm_subset(nhs_reviews_dfm_ngram, !train)
# Train a new Naive Bayes model
nb_train_ngram <- textmodel_nb(x = nhs_reviews_dfm_ngram_train,
y = nhs_reviews_dfm_ngram_train$review_positive,
prior = "docfreq")
## Test set accuracy
nhs_reviews_dfm_ngram_test$predicted_classification <- predict(nb_train_ngram,
newdata = nhs_reviews_dfm_ngram_test,
type = "class")
confusion_test_ngram <- table(predicted_classification = nhs_reviews_dfm_ngram_test$predicted_classification,
true_classification = nhs_reviews_dfm_ngram_test$review_positive)
confusion_test_ngram_statistics <- confusionMatrix(confusion_test_ngram, positive = "Positive")
confusion_test_ngram_statistics
```

```
Confusion Matrix and Statistics
true_classification
predicted_classification Negative Positive
Negative 169 35
Positive 7 207
Accuracy : 0.8995
95% CI : (0.8666, 0.9266)
No Information Rate : 0.5789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7983
Mcnemar's Test P-Value : 3.097e-05
Sensitivity : 0.8554
Specificity : 0.9602
Pos Pred Value : 0.9673
Neg Pred Value : 0.8284
Prevalence : 0.5789
Detection Rate : 0.4952
Detection Prevalence : 0.5120
Balanced Accuracy : 0.9078
'Positive' Class : Positive
```

```
accuracy <- c(confusion_test_statistics$overall[1],
confusion_test_nopunct_statistics$overall[1],
confusion_test_ngram_statistics$overall[1])
sensitivity <- c(confusion_test_statistics$byClass[1],
confusion_test_nopunct_statistics$byClass[1],
confusion_test_ngram_statistics$byClass[1])
specificity <- c(confusion_test_statistics$byClass[2],
confusion_test_nopunct_statistics$byClass[2],
confusion_test_ngram_statistics$byClass[2])
model_names <- c("Unigram", "No punctuation", "N-gram")
results <- data.frame(model_names,
accuracy,
sensitivity,
specificity)
results
```

```
model_names accuracy sensitivity specificity
1 Unigram 0.9019139 0.8553719 0.9659091
2 No punctuation 0.9019139 0.8595041 0.9602273
3 N-gram 0.8995215 0.8553719 0.9602273
```

There is very little difference between the different feature selection approaches. This is largely because predicting positivity is a very straightforward task! The words indicating positivity are very clear and so we get a strong signal from this data. Note that this will not always be the case, especially we are trying to classify a more nuanced concept or a set of more fine-grained categories.

- (
*Hard question*) Cross-validation

In this question you should implement k-fold cross-validation to assess the accuracy, sensitivity, and specificity of one of the Naive Bayes models that you estimated above.

I have provided some starter code below for a function that will take a vector of logical values as input, which should be named

`held_out`

. That vector should take on`TRUE`

values for observations in the held-out set and`FALSE`

for observations in the training set for each fold of the cross-validation.Your job is to fill in the

`Your code goes here!`

sections of the code. To do so, you can essentially copy the material on slide 58 of this week’s lecture, making modifications to the existing task where appropriate.

```
get_performance_scores <- function(held_out){
# Set up train and test sets for this fold
dfm_train <- dfm_subset(nhs_reviews_dfm, !held_out)
dfm_test <- dfm_subset(nhs_reviews_dfm, held_out)
# Train model on everything except held-out fold
# Your code goes here!
# Predict for held-out fold
# Your code goes here!
# Calculate accuracy, specificity, sensitivity
# Your code goes here!
# Save results in a data.frame
return(output)
}
```

```
get_performance_scores <- function(held_out){
# Set up train and test sets for this fold
dfm_train <- dfm_subset(nhs_reviews_dfm, !held_out)
dfm_test <- dfm_subset(nhs_reviews_dfm, held_out)
# Train model on everything except held-out fold
nb_train <- textmodel_nb(x = dfm_train,
y = dfm_train$review_positive,
prior = "docfreq")
# Predict for held-out fold
dfm_test$predicted_classification <- predict(nb_train,
newdata = dfm_test,
type = "class")
# Calculate accuracy, specificity, sensitivity
confusion_nb <- table(predicted_classification = dfm_test$predicted_classification,
true_classification = dfm_test$review_positive)
confusion_nb_statistics <- confusionMatrix(confusion_nb, positive = "Positive")
accuracy <- confusion_nb_statistics$overall[1]
sensitivity <- confusion_nb_statistics$byClass[1]
specificity <- confusion_nb_statistics$byClass[2]
return(data.frame(accuracy, sensitivity, specificity))
}
```

- Once you have completed this function, you will then need to create a vector to represent the \(k\) folds on which you are going to test the data in the cross validation. To do so, you will need to use the
`sample()`

function to produce a vector that has the same number of elements as there are rows in the`nhs_reviews_dfm`

object. Again, you can see how to do this in the lecture slides.

- Finally, you will need to apply the
`get_function_scores()`

function to each fold. To do so, you may find the`lapply()`

function helpful. This function takes a vector as an input, applies a function to that vector, and returns a list object which includes the results of the function for each value of the vector that you input. Again, see the lecture for example code.

- Report your result!