4: Supervised Learning for Text Data

Jack Blumenau & Gloria Gennaro

Course Outline

  1. Text as Data
  2. Dictionaries
  3. Similarity, Difference, and Complexity
  4. Supervised Learning for Text Data 👈
  5. Collecting Text Data
  6. Text Scaling Models
  7. Topic Models
  8. Word Embeddings
  9. Causal Inference with Text
  10. Review and Large Language Models

Motivation

Motivation - Is this a curry?

Motivation

What is a curry?

Oxford English Dictionary:

“A preparation of meat, fish, fruit, or vegetables, cooked with a quantity of bruised spices and turmeric, and used as a relish or flavouring, esp. for dishes composed of or served with rice. Hence, a curry = a dish or stew (of rice, meat, etc.) flavoured with this preparation (or with curry-powder).”

Motivation

  • If a curry can be defined by the spices a dish contains, then we ought to be able to predict whether a recipe is a curry from ingredients listed in recipes

  • We will evaluate the probability that #TheStew is a curry by training a curry classifier on a set of recipes

  • We will use data on 9384 recipes from the BBC recipe archive

  • This data includes information on

    • Recipe names
    • Recipe ingredients
    • Recipe instructions

Motivation

Our data includes information on each recipe:

recipes$recipe_name[1]
[1] "Mustard and thyme crusted rib-eye of beef "
recipes$ingredients[1]
[1] "2.25kg/5lb rib-eye of beef, boned and rolled 450ml/¾ pint red wine 150ml/¼ pint red wine vinegar 1 tbsp sugar 1 tsp ground allspice 2 bay leaves 1 tbsp chopped fresh thyme 2 tbsp black peppercorns, crushed 2 tbsp English or Dijon mustard"
recipes$directions[1]
[1] "Place the rib-eye of beef into a large non-metallic dish. In a jug, mix together the red wine, vinegar, sugar, allspice, bay leaf and half of the thyme until well combined. Pour the mixture over the beef, turning to coat the joint evenly in the liquid. Cover the dish loosely with cling film and set aside to marinate in the fridge for at least four hours, turning occasionally. (The beef can be marinated for up to two days.) When the beef is ready to cook, preheat the oven to 190C/375F/Gas 5. Lift the beef from the marinade, allowing any excess liquid to drip off, and place on a plate, loosely covered, until the meat has returned to room temperature. Sprinkle the crushed peppercorns and the remaining thyme onto a plate. Spread the mustard evenly all over the surface of the beef, then roll the beef in the peppercorn and thyme mixture to coat. Place the crusted beef into a roasting tin and roast in the oven for 1 hour 20 minutes (for medium-rare) or 1 hour 50 minutes (for well-done). Meanwhile, for the horseradish cream, mix the crème frâiche, creamed horseradish, mustard and chives together in a bowl until well combined. Season, to taste, with salt and freshly ground black pepper, then spoon into a serving dish and chill until needed. When the beef is cooked to your liking, transfer to a warmed platter and cover with aluminium foil, then set aside to rest in a warm place for 25-30 minutes. To serve, carve the rib-eye of beef into slices and arrange on warmed plates. Spoon the roasted root vegetables alongside. Serve with the horseradish cream."

We also have “hand-coded” information on whether each dish is really a curry:

table(recipes$curry)

    Curry Not Curry 
      290      9094 

Defining a curry

head(recipes$recipe_name[recipes$curry == "Curry"])
[1] "Venison massaman curry"             "Almond and cauliflower korma curry"
[3] "Aromatic beef curry"                "Aromatic blackeye bean curry"      
[5] "Aubergine curry"                    "Bangladeshi venison curry"         

A curry dictionary

Given that we have some idea of the concept we would like to measure, perhaps we can just use a dictionary:

## Convert to corpus
recipe_corpus <- corpus(recipes, text_field = "ingredients")

# Tokenize
recipe_tokens <- tokens(recipe_corpus, remove_punct = TRUE, 
                        remove_numbers = TRUE, remove_symbols = TRUE) %>%
                 tokens_remove(c(stopwords("en"),
                    "ml","fl","x","mlâ","mlfl","g","kglb",
                    "tsp","tbsp","goz","oz", "glb", "gâ", "â"))

# Convert to DFM
recipe_dfm <- recipe_tokens %>%
    dfm() %>%
    dfm_trim(max_docfreq = .3, 
             min_docfreq = .002, 
             docfreq_type = "prop") 

topfeatures(recipe_dfm, 20)
   finely     sugar     flour    sliced    garlic    peeled       cut freerange 
     3707      3118      2486      2456      2362      2333      2299      2196 
   leaves     juice     white       red     large     extra    caster     seeds 
     1859      1757      1730      1673      1658      1626      1615      1541 
    small vegetable     onion     plain 
     1498      1493      1485      1450 

A curry dictionary

curry_dict <- dictionary(list(curry = c("spices", 
                                        "turmeric")))

curry_dfm <- dfm_lookup(recipe_dfm, dictionary = curry_dict)

curry_dfm$recipe_name[order(curry_dfm[,1], decreasing = T)[1:10]]
 [1] "Indonesian stir-fried rice (Nasi goreng)"                       
 [2] "Pineapple, prawn and scallop curry"                             
 [3] "Almond and cauliflower korma curry"                             
 [4] "Aloo panchporan (Stir-fried potatoes tempered with five spices)"
 [5] "Aromatic beef curry"                                            
 [6] "Asian-spiced rice with coriander-crusted lamb and rosemary oil" 
 [7] "Beef chilli flash-fry with yoghurt rice"                        
 [8] "Beef rendang with mango chutney and sticky rice"                
 [9] "Beef curry with jasmine rice"                                   
[10] "Beef Madras"                                                    

Classification Performance

Let’s classify a recipe as a “curry” if it includes any of our dictionary words

recipes$curry_dictionary <- ifelse(as.numeric(curry_dfm[,1]) > 0, "Curry", "Not Curry")

confusion_dictionary <- table(predicted_classification = recipes$curry_dictionary,
                                   true_classification = recipes$curry)
library(caret)

confusionMatrix(confusion_dictionary, positive = "Curry")
Confusion Matrix and Statistics

                        true_classification
predicted_classification Curry Not Curry
               Curry        95       179
               Not Curry   195      8915
                                        
               Accuracy : 0.9601        
                 95% CI : (0.956, 0.964)
    No Information Rate : 0.9691        
    P-Value [Acc > NIR] : 1.000         
                                        
                  Kappa : 0.3164        
                                        
 Mcnemar's Test P-Value : 0.438         
                                        
            Sensitivity : 0.32759       
            Specificity : 0.98032       
         Pos Pred Value : 0.34672       
         Neg Pred Value : 0.97859       
             Prevalence : 0.03090       
         Detection Rate : 0.01012       
   Detection Prevalence : 0.02920       
      Balanced Accuracy : 0.65395       
                                        
       'Positive' Class : Curry         
                                        

\[\text{Accuracy} = \frac{\#\text{True Positives} + \#\text{True Negatives}}{\# \text{Observations} }\]

\[\text{Sensitivity} = \frac{\#\text{True Positives}}{\# \text{True Positives} + \# \text{False Negatives} }\] \[\text{Specificity} = \frac{\#\text{True Negatives}}{\# \text{True Negative} + \# \text{False Positives} }\]
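These quantities can also be computed directly from the cells of the confusion matrix above; a quick sketch using the confusion_dictionary table (rows are predictions, columns are the true classes):

TP <- confusion_dictionary["Curry", "Curry"]          # true positives
TN <- confusion_dictionary["Not Curry", "Not Curry"]  # true negatives
FP <- confusion_dictionary["Curry", "Not Curry"]      # false positives
FN <- confusion_dictionary["Not Curry", "Curry"]      # false negatives

(TP + TN) / sum(confusion_dictionary)  # accuracy    ~ 0.96
TP / (TP + FN)                         # sensitivity ~ 0.33
TN / (TN + FP)                         # specificity ~ 0.98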

Implication:

  • We can pick up some signal with the dictionary, but we are not doing a great job of classifying curries
  • Our sensitivity is very low
  • We need methods that are better at working out the relationships between words and categories

Supervised Learning for Text

Supervised Learning vs Dictionaries

Supervised learning methods classify documents into pre-defined categories on the basis of the words they contain.

  • Supervised learning can be conceptualized as a generalization of dictionary methods

  • Dictionaries:

    • Words associated with each category are pre-specified by the researcher
    • Words typically have a weight of either zero or one
    • Documents are scored on the basis of words they contain
  • Supervised learning:

    • Words are associated with categories on the basis of pre-labelled training data
    • Words are weighted according to their relative prevalence in each category
    • Documents are scored on the basis of words they contain
  • The key difference is that in supervised learning the features associated with each category (and their relative weight) are learned from the data

  • A major advantage of supervised learning methods is that the weights we estimate are specific to the corpus with which we are working (not true generally of dictionaries)

  • Supervised learning methods will often outperform dictionary methods in classification tasks, particularly when the training sample is large

Components of Supervised Learning

  • Labelled dataset

    • Labelled (normally hand-coded) data which categorizes texts into different categories
    • Training set: used to train the classifier
    • Test set: used to validate the classifier
  • Classification method

    • Statistical method to:

        1. learn the relationship between coded texts and words
        2. predict the categories of unlabelled documents from the words they contain
    • Examples: Naive Bayes, Logistic Regression, SVM, tree-based methods, many others…

  • Validation method

    • Predictive metrics such as confusion matrix, accuracy, sensitivity, specificity, etc
    • Normally we use a specific type of validation known as cross-validation
  • Out-of-sample prediction

    • Use the estimated statistical model to predict categories for documents that do not have labels

Creating a labelled dataset

How do we obtain a labelled set?

  • External sources of annotation, e.g.

    • Party labels for election manifestos
    • Disputed authorship of Federalist papers estimated based on known authors of other documents
  • Expert annotation, e.g.

    • In many projects, undergraduate students (“expertise” comes from training)
    • Existing expert annotations, e.g. Comparative Manifesto Project
  • Crowd-sourced coding, e.g.

    • Ask random people on the internet to code texts into categories
    • Tends to rely on the “wisdom of crowds” hypothesis: aggregated judgments of non-experts converge to judgments of experts at much lower cost

For the purposes of the running example, we are cheating a bit by assuming that any dish whose title contains the word “curry” is, in fact, a curry.

In a more serious application, we would hand-code individual curry recipes as “curry” or “not curry”, but we are taking a short-cut here.

Naive Bayes Classification

Language Models

  • Probabilistic language models tell a probabilistic story about how documents are generated

  • This data-generating process depends on a set of unknown parameters, which we infer from the data

  • Once we have inferred values for the parameters, we can reverse the data-generating process and calculate the probability that any given document was generated by a particular language model

  • The Naive Bayes text classification model is one example of a generative language model. In Naive Bayes:

    1. Estimate separate language models for each category of interest
    2. Calculate probability that each text was generated by each model
    3. Assign the text to the category for which it has the highest probability

Language Models

  • The basis of any language model is a probability distribution over words in a vocabulary.

  • A probability distribution over a discrete variable must have three properties

    • Each element must be greater than or equal to zero
    • Each element must be less than or equal to one
    • The sum of the elements must be 1

Language Models

  • Consider a 6 word vocabulary: “coriander”, “turmeric”, “garlic”, “sugar”, “flour”, “eggs”
  • When writing a curry recipe, you will

    • frequently use the words “coriander”, “turmeric”, and “garlic”
    • infrequently use the words “sugar”, “flour”, and “eggs”
  • When writing a cake recipe, you will

    • frequently use the words “sugar”, “flour”, and “eggs”
    • infrequently use the words “coriander”, “turmeric”, and “garlic”
  • We can represent these different “models” for language using a probability distribution over the words in the vocabulary:
Model                  coriander  turmeric  garlic  sugar  flour  eggs
\(\mu_\text{curry}\)        0.40      0.25    0.20   0.08   0.04  0.03
\(\mu_\text{cake}\)         0.02      0.01    0.01   0.26   0.40  0.30
  • Note that no word has a probability of 0 under either model

Language Models

Model                  coriander  turmeric  garlic  sugar  flour  eggs
\(\mu_\text{curry}\)        0.40      0.25    0.20   0.08   0.04  0.03
\(\mu_\text{cake}\)         0.02      0.01    0.01   0.26   0.40  0.30
  • Given these models, we can calculate the probability that a given set of word counts (i.e. a document) would be drawn from each distribution

\[P(W_i|\mu) = \frac{M_i!}{\prod_{j=1}^JW_{i,j}!}\prod_{j=1}^J\mu_j^{W_{ij}}\]

  • This is the multinomial distribution
  • \(\mu_j\) is the probability of observing word \(j\) under a given model
  • \(W_{i,j}\) is the number of times word \(j\) appears in document \(i\) (i.e. it is an element of a dfm)
  • \(M_i\) is the total number of words in document \(i\)
  • \(!\) is the factorial operator \((n! = n \times (n-1) \times (n-2) \times ... \times 1)\)

Language Models

Model                  coriander  turmeric  garlic  sugar  flour  eggs
\(\mu_\text{curry}\)        0.40      0.25    0.20   0.08   0.04  0.03
\(\mu_\text{cake}\)         0.02      0.01    0.01   0.26   0.40  0.30

Imagine we have two documents represented by the following DFM

Document   coriander  turmeric  garlic  sugar  flour  eggs
\(W_1\)            6         2       1      1      0     0
\(W_2\)            1         0       0      4      2     3

Which language model is most likely to have produced each document?

\[P(W_1|\mu_\text{curry}) = \frac{M_i!}{\prod_{j=1}^JW_{1,j}!}\prod_{j=1}^J\mu_j^{W_{1,j}} = \frac{10!}{(6!)(2!)(1!)(1!)}\times(.4)^6\times(.25)^2\times(.2)^1\times(.08)^1 = 0.01\]

\[P(W_1|\mu_\text{cake}) = \frac{M_i!}{\prod_{j=1}^JW_{1,j}!}\prod_{j=1}^J\mu_j^{W_{1,j}} = \frac{10!}{(6!)(2!)(1!)(1!)}\times(.02)^6\times(.01)^2\times(.01)^1\times(.26)^1 = 0.000000000000042\]

Implication: The probability of observing \(W_1\) is higher under \(\mu_\text{curry}\) than under \(\mu_\text{cake}\).

\[P(W_2|\mu_\text{curry}) = \frac{M_i!}{\prod_{j=1}^JW_{2,j}!}\prod_{j=1}^J\mu_j^{W_{2,j}} = \frac{10!}{(1!)(4!)(2!)(3!)}\times(.4)^1\times(.08)^4\times(.04)^2\times(.03)^3 = 0.0000000089\]

\[P(W_2|\mu_\text{cake}) = \frac{M_i!}{\prod_{j=1}^JW_{2,j}!}\prod_{j=1}^J\mu_j^{W_{2,j}} = \frac{10!}{(1!)(4!)(2!)(3!)}\times(.02)^1\times(.26)^4\times(.4)^2\times(.3)^3 = 0.005\]

Implication: The probability of observing \(W_2\) is higher under \(\mu_\text{cake}\) than under \(\mu_\text{curry}\).

Conclusion: Given a set of probabilities, we can work out which model was most likely to have generated any given document.

The likelihood of a document being generated by a given model will be

  • larger when the model gives higher probabilities to the words that occur frequently in the document
  • smaller when the model gives high probabilities to words that occur rarely (or not at all) in the document
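We can check these multinomial calculations in R with the dmultinom() function from base R, using the toy probabilities above:

mu_curry <- c(0.40, 0.25, 0.20, 0.08, 0.04, 0.03)  # coriander, turmeric, garlic, sugar, flour, eggs
mu_cake  <- c(0.02, 0.01, 0.01, 0.26, 0.40, 0.30)

W1 <- c(6, 2, 1, 1, 0, 0)
W2 <- c(1, 0, 0, 4, 2, 3)

dmultinom(W1, prob = mu_curry)  # ~0.01
dmultinom(W1, prob = mu_cake)   # ~4.2e-14
dmultinom(W2, prob = mu_curry)  # ~8.9e-09
dmultinom(W2, prob = mu_cake)   # ~0.005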

Naive Bayes

  • Naive Bayes is a model that classifies documents into categories on the basis of the words they contain

\[P(y_i = C_k|W_i) = \frac{P(y_i = C_k)P(W_i|y_i=C_k)}{P(W_i)}\]

\[{\color{violet}{P(y_i = C_k|W_i)}} = \frac{P(y_i = C_k)P(W_i|y_i=C_k)}{P(W_i)}\]

\[P(y_i = C_k|W_i) = \frac{P(y_i = C_k)\color{violet}{P(W_i|y_i=C_k)}}{P(W_i)}\]

\[P(y_i = C_k|W_i) = \frac{{\color{violet}{P(y_i = C_k)}}P(W_i|y_i=C_k)}{P(W_i)}\]

\[P(y_i = C_k|W_i) = \frac{P(y_i = C_k)P(W_i|y_i=C_k)}{\color{violet}{P(W_i)}}\]

  • \(\color{violet}{P(Y = C_k|W)}\) is the posterior distribution – this tells us the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\)
  • \(\color{violet}{P(W|Y=C_k)}\) is the conditional probability or likelihood – this tells us the probability that we would observe the words in \(W_i\) if the document were from category \(k\)
  • \(\color{violet}{P(Y = C_k)}\) is the prior probability that the document is from category \(k\) – this tells us the probability of the category of the document, absent any information about the words it contains
  • \(\color{violet}{P(W_i)}\) is the unconditional probability of the words in document \(i\) – this tells us the probability that we would observe the words in \(W_i\) across all categories

Naive Bayes

\[P(y_i = C_k|W_i) = \frac{P(y_i = C_k)P(W_i|y_i=C_k)}{P(W_i)}\]

  • Generally, we will want to make comparisons of the probabilities between different classes

    • e.g. Is \(P(y_i = C_\text{curry}|W_i) > P(y_i = C_\text{not curry}|W_i)\)
  • This means that we can drop the \(P(W_i)\) term and just focus on the likelihood and the prior probabilities

\[P(y_i = C_k|W_i) \propto P(y_i = C_k)P(W_i|y_i=C_k)\]

  • where \(\propto\) means “proportional to” (rather than “equal to”, as for \(=\))

Naive Bayes

\[P(y_i = C_k|W_i) \propto P(y_i = C_k)P(W_i|y_i=C_k)\]

To work out whether a document should be labelled as belonging to a particular class, we therefore need two quantities:

  • the prior probability (\(\color{violet}{P(Y = C_k)}\)) that the document is from category \(k\)

    • This is usually estimated by calculating the proportion of documents of category \(k\) in the training data
  • the conditional probability or likelihood (\(\color{violet}{P(W|Y=C_k)}\)) of the words in the document occurring in category \(k\)

    • We already know that we can calculate this probability from the multinomial distribution!
    • Again, because we are only interested in the relative probabilities of different classes, we can drop the multinomial coefficient

\[\begin{eqnarray} P(W_i|y_i = C_k) &=& \frac{M_i!}{\prod_{j=1}^JW_{i,j}!}\prod_{j=1}^J\mu_{j(k)}^{W_{ij}}\\ &\propto&\prod_{j=1}^J\mu_{j(k)}^{W_{ij}} \end{eqnarray}\]

Question: How do we estimate \(\mu\)?

Naive Bayes Estimation

  • \(\mu_{j(k)}\) is the probability that word \(j\) will occur in documents of category \(k\).
  • We can estimate these probabilities from our training data:

\[\hat{\mu}_{j(k)} = \frac{\color{violet}{W_{j(k)}}}{\color{darkred}{\sum_{j\in V}W_{j(k)}}} = \frac{\text{number of times j appears in category k}}{\text{total number of words in category k}}\]

Example:

  • In the curry recipes in our training data, we observe…

    • …77 instances of the word “turmeric” (\(\color{violet}{W_{\text{turmeric}(\text{curry})}} = \color{violet}{77}\))
    • …10586 total words (\(\color{darkred}{\sum_{j\in V}W_{j(\text{curry})}} = \color{darkred}{10586}\))
    • …and so \(\frac{\color{violet}{W_{\text{turmeric}(\text{curry})}}}{\color{darkred}{\sum_{j\in V}W_{j(\text{curry})}}} = \frac{\color{violet}{77}}{\color{darkred}{10586}} = 0.007\)
  • In the not-curry recipes in our training data, we observe…

    • …148 instances of the word “turmeric” (\(\color{violet}{W_{\text{turmeric}(\text{not curry})}} = \color{violet}{148}\))
    • …210805 total words (\(\color{darkred}{\sum_{j\in V}W_{j(\text{not curry})}} = \color{darkred}{210805}\))
    • …and so \(\frac{\color{violet}{W_{\text{turmeric}(\text{not curry})}}}{\color{darkred}{\sum_{j\in V}W_{j(\text{not curry})}}} = \frac{\color{violet}{148}}{\color{darkred}{210805}} = 0.0007\)
  • The word “turmeric” is about 10 times more common in curry recipes than other recipes
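A minimal sketch of this estimation step in R: group the dfm rows by class and divide each word’s count by the class total (these unsmoothed proportions will differ slightly from the smoothed estimates that textmodel_nb reports later):

# Class-by-word counts, then class-conditional word probabilities
class_counts <- as.matrix(dfm_group(recipe_dfm, groups = recipe_dfm$curry))
mu_hat <- class_counts / rowSums(class_counts)

# e.g. P("turmeric" | curry) vs P("turmeric" | not curry)
mu_hat["Curry", "turmeric"]
mu_hat["Not Curry", "turmeric"]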

Naive Bayes Estimation – Laplace Smoothing

  • What happens when a given word doesn’t appear at all for one of the classes in our training data?
  • Imagine that we never observe the word “duck” in the curry recipes in our training data

\[\frac{\color{violet}{W_{\text{duck}(\text{curry})}}}{\color{darkred}{\sum_{j\in V}W_{j(\text{curry})}}} = \frac{\color{violet}{0}}{\color{darkred}{10586}} = 0\]

  • Then, in our test data, we observe the following sentence:
> "For this curry you will need to coat the duck legs with 1 tsp ground turmeric"
  • Because we multiply together all the individual word probabilities when we calculate the probability of a sentence occurring in a category, we will get a probability of zero!
  • Solution: Add one to the counts for each word in each category

\[\frac{\color{violet}{W_{\text{duck}(\text{curry})}+1}}{\color{darkred}{\sum_{j\in V}(W_{j(\text{curry})}+1)}} = \frac{\color{violet}{1}}{\color{darkred}{10587}} = 0.00009\]

  • This solution is known as “add-one” or “Laplace” smoothing
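As a sketch, add-one smoothing amounts to adding 1 to every cell of the class-by-word count matrix (quanteda’s textmodel_nb applies this smoothing by default through its smooth argument):

# Add 1 to every count; each class total grows by the vocabulary size
class_counts <- as.matrix(dfm_group(recipe_dfm, groups = recipe_dfm$curry))
mu_smooth <- (class_counts + 1) / (rowSums(class_counts) + ncol(class_counts))

# Every word now has a non-zero probability under every class
min(mu_smooth) > 0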

Why is Naive Bayes “Naive”?

By treating documents as bags of words we are assuming:

  • Conditional independence of word counts

    • Knowing a document contains one word doesn’t tell us anything about the probability of observing other words in that document
    • e.g. The fact that a recipe includes the word “turmeric” doesn’t make it any more or less likely that it will also include the word “coriander”
  • Positional independence of word counts

    • The position of a word within a document doesn’t give us any information about the category of that document
    • e.g. Whether the word “turmeric” appears early or late in the recipe has no effect on the probability of it being a curry

While this is a very simple model of language which is “wrong”, it is nevertheless useful for classification.

Naive Bayes Classification

The classification decision made by the Naive Bayes model is simple: we assign document \(i\) to the category, \(k\), for which it has the highest posterior probability:

\[ \hat{Y}_i = \underset{k \in \{1,...,K\}}{\operatorname{argmax}} P(y_i = C_k) \times P(W_i|y_i = C_k) \]

where \(\underset{k \in \{1,...,K\}}{\operatorname{argmax}}\) means “which category, \(k\), has the maximum posterior probability”.

Intuition:

  • Assign documents to categories when the probability of observing the words in that document is high given the probability distribution for that category (i.e. when \(P(W_i|y_i = C_k)\) is large)

  • Assign more documents to categories that contain more documents in the training data (i.e. when \(P(y_i = C_k)\) is large)
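A minimal sketch of this decision rule, reusing the toy curry/cake probabilities from earlier (the equal class priors here are an assumption for illustration):

# Word counts for a new document, ordered as coriander, turmeric, garlic, sugar, flour, eggs
w_i <- c(6, 2, 1, 1, 0, 0)

mu <- rbind(curry = c(0.40, 0.25, 0.20, 0.08, 0.04, 0.03),
            cake  = c(0.02, 0.01, 0.01, 0.26, 0.40, 0.30))
prior <- c(curry = 0.5, cake = 0.5)

# Log posterior (up to a constant): log prior + sum_j W_ij * log(mu_kj)
log_post <- log(prior) + as.vector(log(mu) %*% w_i)
names(which.max(log_post))  # "curry"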

Naive Bayes Application

nb_output <- textmodel_nb(x = recipe_dfm, 
                         y = recipe_dfm$curry,
                         prior = "docfreq")
summary(nb_output)

Call:
textmodel_nb.dfm(x = recipe_dfm, y = recipe_dfm$curry, prior = "docfreq")

Class Priors:
(showing first 2 elements)
    Curry Not Curry 
   0.0309    0.9691 

Estimated Feature Scores:
              beef     boned    rolled     pint      red     wine  vinegar
Curry     0.001378 0.0014925 0.0001148 0.001148 0.011481 0.001607 0.001378
Not Curry 0.003304 0.0006107 0.0004031 0.003847 0.009619 0.007750 0.005154
            sugar  allspice      bay  leaves     thyme peppercorns  crushed
Curry     0.00620 0.0002296 0.002067 0.01378 0.0004592    0.003789 0.009070
Not Curry 0.01872 0.0003420 0.002821 0.01063 0.0048734    0.001826 0.005417
            english     dijon  mustard unsalted      room temperature      lard
Curry     0.0002296 0.0001148 0.005166 0.001493 0.0001148   0.0001148 0.0001148
Not Curry 0.0007023 0.0009832 0.002803 0.004953 0.0005741   0.0005924 0.0004153
             plain    flour   white    water   chilled     icing  chicken
Curry     0.004822 0.006085 0.00287 0.007003 0.0001148 0.0003444 0.005855
Not Curry 0.008611 0.014871 0.01042 0.006693 0.0005863 0.0023573 0.007035
              cut   pieces
Curry     0.01297 0.005511
Not Curry 0.01336 0.003786

Naive Bayes Application

Recall that we are interested in the probability of observing word \(j\) given class \(k\), i.e. 

\[\mu_{j(k)} = \frac{W_{j(k)}}{\sum_{j\in V}W_{j(k)}}\]

What are these word probabilities for our curry data?

We can examine the probability of each word given each class using the coef() function on the nb_output object.

head(coef(nb_output))
              Curry    Not Curry
beef   0.0013777268 0.0033038975
boned  0.0014925373 0.0006107019
rolled 0.0001148106 0.0004030633
pint   0.0011481056 0.0038474222
red    0.0114810563 0.0096185556
wine   0.0016073479 0.0077498076

Naive Bayes Application

Words with highest probability in the “curry” class (i.e. \(P(w_j|c_k = \text{``curry''})\)):

head(sort(coef(nb_output)[,1], decreasing = TRUE), 20)
      seeds      finely   coriander      peeled      garlic   vegetable 
0.030080367 0.023191734 0.020551091 0.018025258 0.016647532 0.015269805 
     ginger      cloves      leaves       green       cumin         cut 
0.015154994 0.014695752 0.013777268 0.013662457 0.013547646 0.012973594 
     powder      chilli         red    turmeric       onion       piece 
0.012973594 0.012743972 0.011481056 0.011136625 0.010332951 0.010103330 
     sliced       large 
0.009644087 0.009529277 

Words with highest probability in the “not curry” class (i.e. \(P(w_j|c_k = \text{``not curry''})\)):

head(sort(coef(nb_output)[,2], decreasing = TRUE), 20)
     finely       sugar       flour      sliced      garlic         cut 
0.021417317 0.018724122 0.014870592 0.014498064 0.013551476 0.013362158 
     peeled   freerange      leaves       white       juice      caster 
0.013301088 0.013252232 0.010632321 0.010424682 0.010296435 0.009844515 
      extra       large         red         egg       small       plain 
0.009807873 0.009630770 0.009618556 0.008659754 0.008653647 0.008610897 
      onion   vegetable 
0.008531506 0.008317760 

Naive Bayes Application

What are the class-conditional word probabilities for “Aromatic blackeye bean curry”?

          P(w|curry) P(w|not curry)
seeds          0.030          0.008
finely         0.023          0.021
coriander      0.021          0.005
peeled         0.018          0.013
garlic         0.017          0.014
ginger         0.015          0.005
cloves         0.015          0.008
leaves         0.014          0.011
cumin          0.014          0.002
chilli         0.013          0.006
onion          0.010          0.009
piece          0.010          0.002

What are the class-conditional word probabilities for “Schichttorte”?

          P(w|curry) P(w|not curry)
large          0.010          0.010
sugar          0.006          0.019
flour          0.006          0.015
paste          0.006          0.001
plain          0.005          0.009
lemon          0.004          0.008
freerange      0.003          0.013
eggs           0.002          0.008
zest           0.002          0.005
unsalted       0.001          0.005
caster         0.001          0.010
cornflour      0.000          0.001

Naive Bayes Application

Which recipes are predicted to have a high curry probability?

recipe_dfm$curry_nb_probability <- predict(nb_output, 
                                           type = "probability")

recipe_dfm$recipe_name[order(recipe_dfm$curry_nb_probability[,1], decreasing = T)[1:10]]
 [1] "Bengali butternut squash with chickpeas"        
 [2] "Chickpea curry with green mango and pomegranate"
 [3] "Green coconut fish curry"                       
 [4] "Thai green prawn curry"                         
 [5] "Rogan josh"                                     
 [6] "Bengal coconut dal"                             
 [7] "Tom yum soup"                                   
 [8] "Thai-style duck red curry"                      
 [9] "Peppery hot cabbage salad"                      
[10] "Peppery hot cabbage salad"                      

Which recipes are predicted to have a low curry probability?

recipe_dfm$recipe_name[order(recipe_dfm$curry_nb_probability[,1], decreasing = F)[1:10]]
 [1] "Sticky toffee apple pudding with calvados caramel sauce"
 [2] "Rich moist all-purpose fruit cake"                      
 [3] "Mini stollen "                                          
 [4] "Chocolate fruit cake"                                   
 [5] "Pheasant pithiviers"                                    
 [6] "Spiced poached pears with chocolate pudding"            
 [7] "Traditional Christmas pudding with brandy butter"       
 [8] "Intense chocolate cookies"                              
 [9] "Cookies and cream fudge brownies"                       
[10] "Bonfire night brioche"                                  

Was #TheStew really #TheCurry?

  • The purpose of training a classification model is to make out-of-sample predictions

  • Generally, we have a small hand-coded training dataset and then we predict for lots of other documents

  • Here, we are only predicting for one out-of-sample observation

ingredients <- c("cup olive oil, plus more for serving garlic cloves, chopped large yellow onion, chopped (2-inch) piece ginger, finely chopped Kosher salt and black pepper teaspoons ground turmeric, plus more for serving teaspoon red-pepper flakes, plus more for serving (15-ounce) cans chickpeas, drained and rinsed (15-ounce) cans full-fat coconut milk cups vegetable or chicken stock bunch Swiss chard, kale or collard greens, stems removed, torn into bite-size pieces cup leaves, mint for serving Yogurt, for serving (optional) Toasted pita, lavash or other flatbread, for serving (optional)")

dfm_stew <- tokens(ingredients) %>%
            dfm() %>%
            dfm_match(features = featnames(recipe_dfm))

predict(nb_output, newdata = dfm_stew, type = "probability")
          Curry  Not Curry
text1 0.9611718 0.03882815

Yes!

Advantages and Disadvantages of Naive Bayes

Advantages

  • Fast

    • Takes seconds to compute, even for very large vocabularies/corpora
  • Easy to apply

    • One line of code in quanteda
  • Can easily be extended to include…

    • … multiple categories
    • … different text representations (bigrams, tri-grams etc)

Advantages and Disadvantages of Naive Bayes

Disadvantages

  • Independence assumption

    • Independence means NB is unable to account for interactions between words

      • e.g. When the word “eggs” appears with the word “sugar” that should indicate something different from when “eggs” appears without the word “sugar”
    • Independence also means that NB is often overconfident

      • Each additional word counts as a new piece of information
    • In some contexts, the independence assumption can decrease predictive accuracy

  • Linear classifier

    • Other methods (e.g. SVM) allow the classification probabilities to change non-linearly in the word counts
    • e.g. Perhaps seeing the word “eggs” once should have a smaller effect on the probability that the recipe is a curry than seeing the word “eggs” five times

Break and Q&A

Validating Supervised Learning Classifiers

Validating Supervised Learning Classifiers

  • How can we assess the classification performance of our supervised learning classifier?

  • Our goal is to measure the degree to which the predictions we make correspond to the observed data

  • We have already seen some ways to do this

    • Accuracy – the proportion of all predictions that match the observed data
    • Sensitivity – the proportion of truly positive observations that are correctly predicted as positive
    • Specificity – the proportion of truly negative observations that are correctly predicted as negative
  • In order to get informative estimates of these quantities, we need to distinguish between the performance of the classifier on the training set and the test set

Training Error versus Test Error

  • The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.

  • In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training.

  • Training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

Bias-Variance Trade-Off

We can think of the test error associated with any given statistical estimator as coming from two fundamental quantities:

  1. Bias

    • The bias of an estimator is the error that is introduced by approximating a complicated set of relationships with a simple model that doesn’t characterise the full complexity
  2. Variance

    • The variance of an estimator is the amount that the predictions produced by the estimator would change if it had been estimated on different data

Ideally we would like to minimize both variance and bias, but these goals are often at odds with each other.

Training- versus Test-Set Performance

  • As we use more flexible models, the variance will increase and the bias will decrease

  • The relative rate of change of these two quantities determines whether the test error increases or decreases

  • As we start to make the model more flexible the bias will tend to decrease faster than the variance will increase

  • After some point, adding more flexibility will decrease the bias a bit, but the variance will increase a lot

Implication: We need tools which tell us when we have reached the optimal balance between bias and variance.

Test-set approach

Naive Bayes Application

Before we train a model, we need to separate our data into a training set and a test set:

## Training and test set

train <- sample(c(TRUE, FALSE), nrow(recipes), replace = TRUE, prob = c(.8, .2))
test <- !train
table(train)
train
FALSE  TRUE 
 1877  7507 
table(test)
test
FALSE  TRUE 
 7507  1877 

How many curry recipes are there in the training and test sets?

## Training and test set

prop.table(table(recipes$curry[train]))

     Curry  Not Curry 
0.03157053 0.96842947 
prop.table(table(recipes$curry[test]))

     Curry  Not Curry 
0.02823655 0.97176345 

Naive Bayes Application

We then subset the recipe_dfm object into a training dfm and a test dfm:

## Naive Bayes

recipe_dfm_train <- dfm_subset(recipe_dfm, train)
recipe_dfm_test <- dfm_subset(recipe_dfm, test)

We then train our Naive Bayes model on the training set:

nb_train <- textmodel_nb(x = recipe_dfm_train, 
                         y = recipe_dfm_train$curry,
                         prior = "docfreq")

And finally, we predict the category of each recipe in the test set:

recipe_dfm_test$predicted_curry_nb <- predict(nb_train,
                                                newdata = recipe_dfm_test,
                                                type = "class")

Naive Bayes Classification Performance

confusion_nb <- table(predicted_classification = recipe_dfm_test$predicted_curry_nb,
                      true_classification = recipe_dfm_test$curry)
library(caret)

confusionMatrix(confusion_nb, positive = "Curry")
Confusion Matrix and Statistics

                        true_classification
predicted_classification Curry Not Curry
               Curry        38       101
               Not Curry    15      1723
                                          
               Accuracy : 0.9382          
                 95% CI : (0.9263, 0.9487)
    No Information Rate : 0.9718          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3701          
                                          
 Mcnemar's Test P-Value : 2.973e-15       
                                          
            Sensitivity : 0.71698         
            Specificity : 0.94463         
         Pos Pred Value : 0.27338         
         Neg Pred Value : 0.99137         
             Prevalence : 0.02824         
         Detection Rate : 0.02025         
   Detection Prevalence : 0.07405         
      Balanced Accuracy : 0.83080         
                                          
       'Positive' Class : Curry           
                                          

Implication:

Relative to the dictionary approach we are…

  • …doing a better job on predicting true positives now (our sensitivity is much higher)

  • …predicting too many curries that are actually something else (our specificity is a little lower)

Training-Set and Test-Set Performance

  • The test set and training set accuracy can be very different

  • As a model becomes more flexible…

    • …the training set accuracy will almost always increase
    • …the test set accuracy will sometimes decrease
  • Imagine that we include a very large number of features in our dfm

    • All unigrams, all bi-grams, …, all 5-grams
    • Total number of features \(\approx\) 300k features
  • How does the training/test set accuracy change as we increase the number of features used to train the classifier?

Training-Set and Test-Set Accuracy

Overfitting and Test-Set Accuracy

  • Question: Why does the test-set accuracy decrease when we add additional features?

  • Answer: Because we are now overfitting our data.

  • Overfitting occurs when we find relationships between words (or n-grams) and curries in our training data that do not generalise to our test data

  • In this example, there are some n-gram phrases that appear frequently in the curry recipes in our training set but which never appear in our test-set curry recipes

Feature                           Training  Test
mustard_seeds_tsp                       18     0
tsp_black_mustard                       15     0
tsp_black_mustard_seeds                 15     0
leaves_and_stalks                       20     0
black_mustard_seeds_tsp                 10     0
coriander_leaves_and                    14     0
cumin_seeds_tsp_black                    7     0
coriander_leaves_and_stalks             13     0
large_garlic_cloves                      9     0
chopped_garlic_cloves_peeled_and        15     0

Test-Set Validation for Feature Selection

  • We can use the test-set performance statistics to select between model specifications

  • We will compare the accuracy, sensitivity and specificity for the following models:

    • Our “original” model (unigrams, no stopwords, trimmed)
    • A “raw” model (unigrams, nothing removed)
    • A “no stopwords” model (unigrams, stopwords removed)
    • A “trimmed” model (unigrams, trimmed)
    • An “n-gram” model (unigrams, bigrams, trigrams)
    • An “n-gram, trimmed” model (unigrams, bigrams, trigrams, with features occurring fewer than 10 times discarded)
  • The “best” model is the one which has the highest classification scores
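As a rough sketch, the alternative feature sets can be built from the objects defined earlier; the exact preprocessing settings below are assumptions rather than the precise ones used to produce the table that follows:

# "Raw" model: unigrams, nothing removed
dfm_raw <- dfm(tokens(recipe_corpus))

# "No stop words" model: unigrams, English stopwords removed
dfm_nostop <- dfm(tokens_remove(tokens(recipe_corpus, remove_punct = TRUE),
                                stopwords("en")))

# "N-gram" model: unigrams, bigrams and trigrams
dfm_ngram <- dfm(tokens_ngrams(recipe_tokens, n = 1:3))

# "N-gram, trimmed" model: discard features occurring fewer than 10 times
dfm_ngram_trimmed <- dfm_trim(dfm_ngram, min_termfreq = 10)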

Test-Set Validation for Feature Selection

Test-set validation
Model Accuracy Sensitivity Specificity N features
Original 0.94 0.78 0.95 902
Raw 0.96 0.65 0.97 4214
No stop words 0.96 0.66 0.97 4126
Trimmed 0.94 0.79 0.95 1339
N-gram 0.98 0.51 1 152215
N-gram, trimmed 0.94 0.83 0.94 6072
  • The “n-gram” model has the highest accuracy, but has very low sensitivity

  • The “n-gram, trimmed” model outperforms all other models in sensitivity

Cross-Validation

  • To calculate the test-set accuracy we randomly allocated observations to the test and training sets

  • If we repeat this process with a new randomization, we will get slightly different test-set performance scores

Original train-test split:

Test-set validation
Model             Accuracy  Sensitivity  Specificity  N features
Original              0.94         0.78         0.95         902
Raw                   0.96         0.65         0.97        4214
No stop words         0.96         0.66         0.97        4126
Trimmed               0.94         0.79         0.95        1339
N-gram                0.98         0.51         1.00      152215
N-gram, trimmed       0.94         0.83         0.94        6072

Rerandomization 1:

Test-set validation
Model             Accuracy  Sensitivity  Specificity  N features
Original              0.94         0.77         0.95         902
Raw                   0.96         0.66         0.97        4214
No stop words         0.96         0.66         0.97        4126
Trimmed               0.94         0.78         0.94        1339
N-gram                0.98         0.45         1.00      152215
N-gram, trimmed       0.93         0.83         0.94        6072

Re-randomization 2:

Test-set validation
Model             Accuracy  Sensitivity  Specificity  N features
Original              0.94         0.78         0.95         902
Raw                   0.96         0.65         0.97        4214
No stop words         0.96         0.66         0.97        4126
Trimmed               0.94         0.79         0.95        1339
N-gram                0.98         0.51         1.00      152215
N-gram, trimmed       0.94         0.83         0.94        6072

Rerandomization 3:

Test-set validation
Model             Accuracy  Sensitivity  Specificity  N features
Original              0.94         0.78         0.95         902
Raw                   0.96         0.64         0.97        4214
No stop words         0.96         0.69         0.97        4126
Trimmed               0.94         0.78         0.95        1339
N-gram                0.98         0.51         1.00      152215
N-gram, trimmed       0.94         0.83         0.94        6072
  • The simple validation approach suffers from two weaknesses:

    1. Estimates of test-set accuracy can be highly variable
    2. We are only using a subset of the data to train the model (the observations in the training set)

Implication: We need a method that uses all data for training and generates more stable test-set accuracy.

K-fold Cross-Validation

  • Cross-validation is an alternative to a simple train-test split

  • This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size

    • Typical choices are \(k=5\) or \(k=10\)
  • For each of the \(k\) folds we do the following

    1. Train the Naive Bayes model on all observations not included in the fold
    2. Generate predictions for the observations in the fold
    3. Calculate the accuracy etc of the predictions for the observations in the held-out fold
  • We then calculate the performance metrics by averaging over those computed on each fold

K-fold Cross-Validation Application

get_performance_scores <- function(held_out){
  
  # Set up train and test sets for this fold
  recipe_dfm_train <- dfm_subset(recipe_dfm, !held_out)
  recipe_dfm_test <- dfm_subset(recipe_dfm, held_out)
  
  # Train model on everything except held-out fold
  nb_train <- textmodel_nb(x = recipe_dfm_train, 
                         y = recipe_dfm_train$curry,
                         prior = "docfreq")
  
  # Predict for held-out fold
  recipe_dfm_test$predicted_curry <- predict(nb_train, 
                                             newdata = recipe_dfm_test, 
                                             type = "class")
  
  # Calculate accuracy, specificity, sensitivity
  confusion_nb <- table(predicted_classification = recipe_dfm_test$predicted_curry,
                        true_classification = recipe_dfm_test$curry)
  
  confusion_nb_statistics <- confusionMatrix(confusion_nb, positive = "Curry")
  
  accuracy <- confusion_nb_statistics$overall[1]
  sensitivity <- confusion_nb_statistics$byClass[1]
  specificity <- confusion_nb_statistics$byClass[2]
  
  return(data.frame(accuracy, sensitivity, specificity))
  
}

K-fold Cross-Validation Application

K <- 5
folds <- sample(1:K, nrow(recipe_dfm), replace = T)
get_performance_scores(folds == 1)
          accuracy sensitivity specificity
Accuracy 0.9418182    0.754386   0.9475375
all_folds <- lapply(1:5, function(k) get_performance_scores(folds == k))
all_folds
[[1]]
          accuracy sensitivity specificity
Accuracy 0.9418182    0.754386   0.9475375

[[2]]
          accuracy sensitivity specificity
Accuracy 0.9389356   0.6923077   0.9463358

[[3]]
         accuracy sensitivity specificity
Accuracy 0.935911   0.7666667   0.9414661

[[4]]
          accuracy sensitivity specificity
Accuracy 0.9420829   0.7704918   0.9478309

[[5]]
          accuracy sensitivity specificity
Accuracy 0.9380252   0.7166667   0.9452278
colMeans(bind_rows(all_folds))
   accuracy sensitivity specificity 
  0.9393546   0.7401038   0.9456796 

Cross-Validation for Model Selection

5-fold cross-validation
Model             Accuracy  Sensitivity  Specificity
Original              0.94         0.74         0.95
Raw                   0.96         0.60         0.97
No stop words         0.96         0.61         0.97
Trimmed               0.94         0.74         0.94
N-gram                0.96         0.21         0.99
N-gram, trimmed       0.93         0.76         0.93

10-fold cross-validation
Model             Accuracy  Sensitivity  Specificity
Original              0.94         0.74         0.94
Raw                   0.96         0.62         0.97
No stop words         0.96         0.64         0.97
Trimmed               0.94         0.74         0.94
N-gram                0.96         0.26         0.98
N-gram, trimmed       0.93         0.77         0.93
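One way to produce tables like these is to generalise the earlier helper so that it takes the candidate dfm as an argument, and then loop over feature sets and folds. A sketch, assuming candidate_dfms is a named list of pre-built dfm objects (e.g. list(original = recipe_dfm, ngram = dfm_ngram, ...)):

get_performance_scores_dfm <- function(held_out, this_dfm){
  
  # Train on everything except the held-out fold, predict for the fold
  dfm_train <- dfm_subset(this_dfm, !held_out)
  dfm_test <- dfm_subset(this_dfm, held_out)
  
  nb_train <- textmodel_nb(x = dfm_train, y = dfm_train$curry, prior = "docfreq")
  predicted <- predict(nb_train, newdata = dfm_test, type = "class")
  
  # Score the held-out fold
  confusion <- table(predicted_classification = predicted,
                     true_classification = dfm_test$curry)
  confusion_statistics <- confusionMatrix(confusion, positive = "Curry")
  
  data.frame(accuracy = confusion_statistics$overall[1],
             sensitivity = confusion_statistics$byClass[1],
             specificity = confusion_statistics$byClass[2])
}

# Average the fold-level scores for each candidate feature set
cv_scores <- lapply(candidate_dfms, function(this_dfm){
  folds <- sample(1:K, ndoc(this_dfm), replace = TRUE)
  colMeans(bind_rows(lapply(1:K, function(k) 
    get_performance_scores_dfm(folds == k, this_dfm))))
})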

Cross-Validation Uses

Cross-validation is a very general strategy for evaluating predictive fit

  • Which variables should I use to predict my outcome?

  • Should I use a linear model, or a non-linear model?

  • Should I just use \(X\) in my regression? Or should I also use \(X^2\)? (or \(X^3\)? or \(X^4\)?)

  • Etc

Extensions and Use Cases

Extensions

Naive Bayes is only one supervised learning text-classification method

  • Regularized Logistic Regression

    • Directly models the probability that each document is in class \(k\) using logistic regression

    • Regularization required to prevent overfitting data

    • textmodel_lr in quanteda (see the sketch after this list)

  • Support Vector Machines

    • SVMs draw a hyperplane through the multidimensional word space that best separates documents into different classes

    • Can accommodate non-linear boundaries between classes

    • textmodel_svm() in quanteda

  • “Tree-based” Classification Methods

    • Tree-based methods separate classes by segmenting the predictors (word counts) into a number of distinct regions

    • The modal outcome for observations that fall within a given region becomes the predicted category for any observation in that region

    • Like the SVM, this allows for non-linear relationships between features and categories

    • tree package in R
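As a brief sketch of how these alternatives can be swapped in for the Naive Bayes calls above, assuming the quanteda.textmodels package (which provides textmodel_nb(), textmodel_lr() and textmodel_svm() in recent quanteda versions):

library(quanteda.textmodels)

# Regularized logistic regression on the same training dfm
lr_train <- textmodel_lr(x = recipe_dfm_train, y = recipe_dfm_train$curry)
predict(lr_train, newdata = recipe_dfm_test, type = "class")

# Linear support vector machine on the same training dfm
svm_train <- textmodel_svm(x = recipe_dfm_train, y = recipe_dfm_train$curry)
predict(svm_train, newdata = recipe_dfm_test, type = "class")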

Use Cases of Supervised Learning

Conclusion

Summing Up

  • Supervised learning for text data allows us to learn the association between words and particular outcome categories

  • The Naive Bayes model is a simple model that is fast to implement and which, despite some strong assumptions, tends to provide good classification results

  • Once we have trained our supervised learning classifiers, it is important to validate their performance on a test-set that was not used to fit the model

  • Cross-validation is a general strategy for out-of-sample evaluation that can help us choose between different models

Seminars

Today we will learn how to implement Naive Bayes models, and cross-validation, in R.