3: Supervised Learning for Text

Ashrakat Elshehawy

Course Outline

  1. Representing Text as Data (I): Bag-of-Words
  2. Similarity, Difference, and Complexity
  3. Language Models (I): Supervised Learning for Text Data 👈
  4. Language Models (II): Topic Models
  5. Collecting Text Data
  6. Causal Inference with Text
  7. Representing Text as Data (II): Word Embeddings
  8. Representing Text as Data (III): Word Sequences
  9. Language Models (III): Neural Networks, Transfer Learning and Transformer Models
  10. Language Models (IV): Generative Language Models in Social Science Research

Motivation

Motivation - Is this a curry?

Motivation

What is a curry?

Oxford English Dictionary:

“A preparation of meat, fish, fruit, or vegetables, cooked with a quantity of bruised spices and turmeric, and used as a relish or flavouring, esp. for dishes composed of or served with rice. Hence, a curry = a dish or stew (of rice, meat, etc.) flavoured with this preparation (or with curry-powder).”

Motivation

  • If a curry can be defined by the spices a dish contains, then we ought to be able to predict whether a recipe is a curry from ingredients listed in recipes

  • We will evaluate the probability that #TheStew is a curry by training a curry classifier on a set of recipes

  • We will use data on 9384 recipes from the BBC recipe archive

  • This data includes information on

    • Recipe names
    • Recipe ingredients
    • Recipe instructions

Motivation

Our data includes information on each recipe:

recipes$recipe_name[1]
[1] "Mustard and thyme crusted rib-eye of beef "
recipes$ingredients[1]
[1] "2.25kg/5lb rib-eye of beef, boned and rolled 450ml/¾ pint red wine 150ml/¼ pint red wine vinegar 1 tbsp sugar 1 tsp ground allspice 2 bay leaves 1 tbsp chopped fresh thyme 2 tbsp black peppercorns, crushed 2 tbsp English or Dijon mustard"
recipes$directions[1]
[1] "Place the rib-eye of beef into a large non-metallic dish. In a jug, mix together the red wine, vinegar, sugar, allspice, bay leaf and half of the thyme until well combined. Pour the mixture over the beef, turning to coat the joint evenly in the liquid. Cover the dish loosely with cling film and set aside to marinate in the fridge for at least four hours, turning occasionally. (The beef can be marinated for up to two days.) When the beef is ready to cook, preheat the oven to 190C/375F/Gas 5. Lift the beef from the marinade, allowing any excess liquid to drip off, and place on a plate, loosely covered, until the meat has returned to room temperature. Sprinkle the crushed peppercorns and the remaining thyme onto a plate. Spread the mustard evenly all over the surface of the beef, then roll the beef in the peppercorn and thyme mixture to coat. Place the crusted beef into a roasting tin and roast in the oven for 1 hour 20 minutes (for medium-rare) or 1 hour 50 minutes (for well-done). Meanwhile, for the horseradish cream, mix the crème frâiche, creamed horseradish, mustard and chives together in a bowl until well combined. Season, to taste, with salt and freshly ground black pepper, then spoon into a serving dish and chill until needed. When the beef is cooked to your liking, transfer to a warmed platter and cover with aluminium foil, then set aside to rest in a warm place for 25-30 minutes. To serve, carve the rib-eye of beef into slices and arrange on warmed plates. Spoon the roasted root vegetables alongside. Serve with the horseradish cream."

We also have “hand-coded” information on whether each dish is really a curry:

rbind(
  head(recipes[recipes$curry == "Curry", c("recipe_name", "curry")], 2),
  head(recipes[recipes$curry == "Not Curry", c("recipe_name", "curry")], 2)
)
                                                recipe_name     curry
6                                    Venison massaman curry     Curry
24                                         Ajwain parathas      Curry
1                Mustard and thyme crusted rib-eye of beef  Not Curry
2   Banoffee millefeuilles with chocolate and caramel icing Not Curry

Defining a curry

head(recipes$recipe_name[recipes$curry == "Curry"])
[1] "Venison massaman curry"             "Almond and cauliflower korma curry"
[3] "Aromatic beef curry"                "Aromatic blackeye bean curry"      
[5] "Aubergine curry"                    "Bangladeshi venison curry"         

A curry dictionary

Given that we have some idea of the concept we would like to measure, perhaps we can just use a dictionary:

## Load the packages used throughout
library(quanteda)
library(dplyr)   # provides the %>% pipe and bind_rows()

## Convert to corpus
recipe_corpus <- corpus(recipes, text_field = "ingredients")

## Tokenize
## We remove punctuation, numbers, and symbols
recipe_tokens <- tokens(recipe_corpus, remove_punct = TRUE, 
                        remove_numbers = TRUE, remove_symbols = TRUE) %>%
  # Also remove stopwords, measurement units and cooking artefacts
                 tokens_remove(c(stopwords("en"),
                    "ml","fl","x","mlâ","mlfl","g","kglb",
                    "tsp","tbsp","goz","oz", "glb", "gâ", "â"))

# Convert to DFM
  # Trim the feature set:
  # - Remove words that appear in more than 30% of documents (too generic)
  # - Remove words that appear in fewer than 0.2% of documents (too rare)
  # This reduces noise and dimensionality
recipe_dfm <- recipe_tokens %>%
    dfm() %>%
    dfm_trim(max_docfreq = .3, 
             min_docfreq = .002, 
             docfreq_type = "prop") 

topfeatures(recipe_dfm, 20)
   finely     sugar     flour    sliced    garlic    peeled       cut freerange 
     3707      3118      2486      2456      2362      2333      2299      2196 
   leaves     juice     white       red     large     extra    caster     seeds 
     1859      1757      1730      1673      1658      1626      1615      1541 
    small vegetable     onion     plain 
     1498      1493      1485      1450 

A curry dictionary

curry_dict <- dictionary(list(curry = c("spices", 
                                        "turmeric")))
#dfm_lookup counts how often dictionary words appear in each document

curry_dfm <- dfm_lookup(recipe_dfm, dictionary = curry_dict)
# Rank recipes by how many dictionary words they contain
# curry_dfm[,1] extracts the count of curry dictionary words per recipe
curry_dfm$recipe_name[order(curry_dfm[,1], decreasing = T)[1:10]]
 [1] "Indonesian stir-fried rice (Nasi goreng)"                       
 [2] "Pineapple, prawn and scallop curry"                             
 [3] "Almond and cauliflower korma curry"                             
 [4] "Aloo panchporan (Stir-fried potatoes tempered with five spices)"
 [5] "Aromatic beef curry"                                            
 [6] "Asian-spiced rice with coriander-crusted lamb and rosemary oil" 
 [7] "Beef chilli flash-fry with yoghurt rice"                        
 [8] "Beef rendang with mango chutney and sticky rice"                
 [9] "Beef curry with jasmine rice"                                   
[10] "Beef Madras"                                                    


Classification Performance

Let’s classify a recipe as a “curry” if it includes any of our dictionary words

recipes$curry_dictionary <- ifelse(as.numeric(curry_dfm[,1]) > 0,
                                   "Curry", "Not Curry")

confusion_dictionary <- 
  table(predicted_classification =
          recipes$curry_dictionary,
          true_classification = recipes$curry)
library(caret)

confusionMatrix(confusion_dictionary, positive = "Curry")
Confusion Matrix and Statistics

                        true_classification
predicted_classification Curry Not Curry
               Curry        95       179
               Not Curry   195      8915
                                        
               Accuracy : 0.9601        
                 95% CI : (0.956, 0.964)
    No Information Rate : 0.9691        
    P-Value [Acc > NIR] : 1.000         
                                        
                  Kappa : 0.3164        
                                        
 Mcnemar's Test P-Value : 0.438         
                                        
            Sensitivity : 0.32759       
            Specificity : 0.98032       
         Pos Pred Value : 0.34672       
         Neg Pred Value : 0.97859       
             Prevalence : 0.03090       
         Detection Rate : 0.01012       
   Detection Prevalence : 0.02920       
      Balanced Accuracy : 0.65395       
                                        
       'Positive' Class : Curry         
                                        

\[\text{Accuracy} = \frac{\#\text{True Positives} + \#\text{True Negatives}}{\# \text{Observations} }\]

\[\text{Sensitivity} = \frac{\#\text{True Positives}}{\# \text{True Positives} + \# \text{False Negatives} }\]

\[\text{Specificity} = \frac{\#\text{True Negatives}}{\# \text{True Negatives} + \# \text{False Positives} }\]
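We can verify these headline numbers by hand from the confusion matrix above (values copied from the caret output):

# Quick by-hand check of accuracy, sensitivity and specificity
TP <- 95; FP <- 179; FN <- 195; TN <- 8915
accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # 0.960
sensitivity <- TP / (TP + FN)                    # 0.328
specificity <- TN / (TN + FP)                    # 0.980
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)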

Implication:

  • We can pick up some signal with the dictionary, but we are not doing a great job of classifying curries
  • Sensitivity (very low): Of all the real curries, how many did we correctly catch?
  • Specificity: Of all the non-curries, how many did we correctly leave alone?
  • We need methods that are better at working out the relationships between words and categories

Supervised Learning for Text

Supervised Learning vs Dictionaries

Supervised learning methods classify documents into pre-defined categories on the basis of the words they contain.

  • Supervised learning can be conceptualized as a generalization of dictionary methods

  • Dictionaries: (We decide in advance which words = curry)

    • Words associated with each category are pre-specified by the researcher
    • Words typically have a weight of either zero or one
    • Documents are scored on the basis of words they contain
  • Supervised learning: (we show the computer many examples of curries and non-curries, and it learns which words actually matter)

    • Words are associated with categories on the basis of pre-labelled training data
    • Words are weighted according to their relative prevalence in each category
    • Documents are scored on the basis of words they contain

Supervised Learning vs Dictionaries: Key Differences

  • The key difference is that in supervised learning the features associated with each category (and their relative weight) are learned from the data

  • A major advantage of supervised learning methods is that the weights we estimate are specific to the corpus with which we are working (not true generally of dictionaries)

  • Supervised learning methods will often outperform dictionary methods in classification tasks, particularly when the training sample is large

Components of Supervised Learning

  • Labelled dataset

    • Labelled (normally hand-coded) data which categorizes texts into different categories
    • Training set: used to train the classifier
    • Test set: used to validate the classifier
  • Classification method

    • Statistical method to:

      • learn the relationship between coded texts and words
      • predict unlabeled documents from the words they contain
    • Examples: Naive Bayes, Logistic Regression, SVM, tree-based methods, many others…

  • Validation method

    • Predictive metrics such as confusion matrix, accuracy, sensitivity, specificity, etc
    • Normally we use a specific type of validation known as cross-validation
  • Out-of-sample prediction

    • Use the estimated statistical model to predict categories for documents that do not have labels

Creating a labelled dataset

How do we obtain a labelled set?

  • External sources of annotation, e.g.

    • Party labels for election manifestos
    • Disputed authorship of Federalist papers estimated based on known authors of other documents
  • Expert annotation, e.g.

    • In many projects, undergraduate students (“expertise” comes from training)
    • Existing expert annotations, e.g. Comparative Manifesto Project
  • Crowd-sourced coding, e.g.

    • Ask random people on the internet to code texts into categories
    • Tends to rely on the “wisdom of crowds” hypothesis: aggregated judgments of non-experts converge to judgments of experts at much lower cost

For the purposes of the running example, we are cheating a bit 🫣 by assuming that any dish whose title contains the word “curry” is, in fact, a curry.

In a more serious application, we would hand-code individual curry recipes as “curry” or “not curry”, but we are taking a short-cut here.
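As a rough illustration, a short-cut label of this kind could be constructed as follows. This is an assumed rule, not the actual coding in the data (which evidently goes beyond recipe titles, since e.g. “Ajwain parathas” is labelled as a curry):

# Hypothetical sketch of a title-based labelling rule (illustration only)
recipes$curry_title_rule <- ifelse(grepl("curry", tolower(recipes$recipe_name)),
                                   "Curry", "Not Curry")
table(recipes$curry_title_rule)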

Naive Bayes Classification

Language Models

A language model is any model that represents text data probabilistically, learning patterns in how words appear together in order to make predictions, classify text, or generate new content.

Throughout this course, we will focus on language models for:

  1. Classification

    • This week
  2. Latent representations (e.g. Topic Models)

    • Next week
  3. Text sequences (Word order matters)

    • Week 8 and 9
  4. Text generation (e.g. LLMs)

    • Week 10

Language Models

  • Probabilistic language models describe a story about how documents are generated using probability
  • This data-generating process needs probabilities we don’t know yet; we estimate them by counting in real documents
  • Once we have estimated these parameters, we can reverse the process
  • Forward (Generative): Model → Generate documents
    • I have a recipe (model) → I bake a cake (generate document)
  • Reverse (Classification): Document → Which model generated it?
    • I taste a cake (see document) → Which recipe was used? Chocolate or vanilla? I can tell by comparing the taste to what each recipe would produce.
  • Instead of using the model to generate documents, we use it to evaluate which model best fits a document
  • We ask: Which language model would most likely have generated this document? (Each category has its own language model)
  • Naive Bayes is one example of a generative language model. In Naive Bayes, we:
    • Estimate a separate language model for each category
    • Compute how likely each text is under each model
    • Assign the text to the category with the highest probability

Language Models

  • The basis of any language model is a probability distribution over words in a vocabulary.

    • “How likely is this word to appear?”
  • A probability distribution over a discrete variable must have three properties

    • Non-negativity: Each element must be greater than or equal to zero
    • Upper bound: Each element must be less than or equal to one
    • Sum to 1: The sum of the elements must be 1

Language Models

  • Consider a six-word vocabulary: “coriander🌿”, “turmeric🧡”, “garlic🧄”, “sugar🧁”, “flour🍞”, “eggs🥚”

  • When writing a curry recipe, you will

    • frequently use the words “coriander”, “turmeric”, and “garlic”
    • infrequently use the words “sugar”, “flour”, and “eggs”
  • When writing a cake recipe, you will

    • frequently use the words “sugar”, “flour”, and “eggs”
    • infrequently use the words “coriander”, “turmeric”, and “garlic”

How do we learn these patterns?

For example: We collect many real recipes of each type and simply count how often each word appears. If we read 100 curry recipes and count all the words in those recipes, we might find that “coriander” appears 400 times out of 10,000 total words (4%), “turmeric” appears 250 times (2.5%), and so on. These word frequency patterns tell us what’s typical for each recipe type.

Once we’ve counted words in real recipes, we can represent these different “models” for language using a probability distribution over the words in the vocabulary:

Model      coriander  turmeric  garlic  sugar  flour  eggs
μ_curry         0.40      0.25    0.20   0.08   0.04  0.03
μ_cake          0.02      0.01    0.01   0.26   0.40  0.30

Note that no word has a probability of 0 under either model

Language Models

Model                 coriander  turmeric  garlic  sugar  flour  eggs
\(\mu_\text{curry}\)       0.40      0.25    0.20   0.08   0.04  0.03
\(\mu_\text{cake}\)        0.02      0.01    0.01   0.26   0.40  0.30
  • Given these models, we can calculate the probability that a given set of word counts (i.e. a document) would be drawn from each distribution

\[P(W_i|\mu) = \frac{M_i!}{\prod_{j=1}^JW_{i,j}!}\prod_{j=1}^J\mu_j^{W_{ij}}\]

  • This is the multinomial distribution
  • When applied to texts, the multinomial distribution describes the likelihood of seeing a specific combination of word counts in a document.
  • For instance, if we had a document that contains “coriander” 6 times, “turmeric” 2 times, and “garlic” 1 time, the multinomial allows us to calculate the probability of seeing those word counts under each model (\(\mu_\text{curry}\) and \(\mu_\text{cake}\))

Language Models

Model                 coriander  turmeric  garlic  sugar  flour  eggs
\(\mu_\text{curry}\)       0.40      0.25    0.20   0.08   0.04  0.03
\(\mu_\text{cake}\)        0.02      0.01    0.01   0.26   0.40  0.30
  • Given these models, we can calculate the probability that a given set of word counts (i.e. a document) would be drawn from each distribution/model ==> \(P(W_i|\mu)\)

\[P(W_i|\mu) = \frac{M_i!}{\prod_{j=1}^JW_{i,j}!}\prod_{j=1}^J\mu_j^{W_{ij}}\]

  • \(\mu_j\) is the probability of observing word \(j\) under a given model
  • \(W_{i,j}\) is the number of times word \(j\) appears in document \(i\) (i.e. it is an element of a dfm)
  • \(M_i\) is the total number of words in document \(i\)
  • \(!\) is the factorial operator \((n! = n \times (n-1) \times (n-2) \times ... \times 1)\)
  • Ultimately, we want to determine which model best predicts a given document; is it more likely to have come from the curry model or the cake model?

Language Models

Model                 coriander  turmeric  garlic  sugar  flour  eggs
\(\mu_\text{curry}\)       0.40      0.25    0.20   0.08   0.04  0.03
\(\mu_\text{cake}\)        0.02      0.01    0.01   0.26   0.40  0.30

Imagine we have two documents represented by the following DFM

Document  coriander  turmeric  garlic  sugar  flour  eggs
\(W_1\)           6         2       1      1      0     0
\(W_2\)           1         0       0      4      2     3

Which language model is most likely to have produced each document?

  • \(\mu_j\) is the probability of observing word \(j\) under a given model
  • \(W_{i,j}\) is the number of times word \(j\) appears in document \(i\) (i.e. it is an element of a dfm)
  • \(M_i\) is the total number of words in document \(i\)
  • \(!\) is the factorial operator \((n! = n \times (n-1) \times (n-2) \times ... \times 1)\)

\[P(W_1|\mu_\text{curry}) = \frac{M_i!}{\prod_{j=1}^JW_{1,j}!}\prod_{j=1}^J\mu_j^{W_{1,j}} = \frac{10!}{(6!)(2!)(1!)(1!)}\times(.4)^6\times(.25)^2\times(.2)^1\times(.08)^1 = 0.01\]

\[P(W_1|\mu_\text{cake}) = \frac{M_i!}{\prod_{j=1}^JW_{1,j}!}\prod_{j=1}^J\mu_j^{W_{1,j}} = \frac{10!}{(6!)(2!)(1!)(1!)}\times(.02)^6\times(.01)^2\times(.01)^1\times(.26)^1 = 0.000000000000042\]

Implication: The probability of observing \(W_1\) is higher under \(\mu_\text{curry}\) than under \(\mu_\text{cake}\).

\[P(W_2|\mu_\text{curry}) = \frac{M_i!}{\prod_{j=1}^JW_{2,j}!}\prod_{j=1}^J\mu_j^{W_{2,j}} = \frac{10!}{(1!)(4!)(2!)(3!)}\times(.4)^1\times(.08)^4\times(.04)^2\times(.03)^3 = 0.0000000089\]

\[P(W_2|\mu_\text{cake}) = \frac{M_i!}{\prod_{j=1}^JW_{2,j}!}\prod_{j=1}^J\mu_j^{W_{2,j}} = \frac{10!}{(1!)(4!)(2!)(3!)}\times(.02)^1\times(.26)^4\times(.4)^2\times(.3)^3 = 0.005\]

Implication: The probability of observing \(W_2\) is higher under \(\mu_\text{cake}\) than under \(\mu_\text{curry}\).

Conclusion: Given a set of probabilities, we can work out which model most likely generated any given document.

The likelihood of a document being generated by a given model will be

  • larger when the model gives higher probabilities to the words that occur frequently in the document
  • smaller when the model gives low probabilities to the words that occur frequently in the document
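We can check the worked example above with R’s built-in multinomial density, dmultinom(), using the \(\mu\) values from the table and the word counts from the toy DFM:

# Reproducing the worked example with the multinomial density
mu_curry <- c(coriander = .40, turmeric = .25, garlic = .20, sugar = .08, flour = .04, eggs = .03)
mu_cake  <- c(coriander = .02, turmeric = .01, garlic = .01, sugar = .26, flour = .40, eggs = .30)
W1 <- c(6, 2, 1, 1, 0, 0)
W2 <- c(1, 0, 0, 4, 2, 3)
dmultinom(W1, prob = mu_curry)  # ~0.01
dmultinom(W1, prob = mu_cake)   # ~4.2e-14
dmultinom(W2, prob = mu_curry)  # ~8.9e-09
dmultinom(W2, prob = mu_cake)   # ~0.005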

Why Do We Need Bayes’ Theorem?

You might notice something:

  • We did classification by comparing P(W|curry) vs P(W|cake) - The probability of W given curry or cake
  • That worked fine - we correctly identified which documents were associated with which recipes

Here’s the subtle issue:

  • What we compared: which model assigns higher probability to the document
  • What we technically want: which category has higher probability given the document
  • We compared P(W|curry) versus P(W|cake); what we actually want is P(curry|W) versus P(cake|W)

Why our simple comparison worked:

  • We were implicitly assuming both categories are equally common
  • This is called using “uniform priors”

What Bayes’ theorem adds:

  • Shows us what to do when categories aren’t equally common: if curry recipes are rare (e.g. only 3% of all recipes), we should require stronger word evidence before classifying a recipe as a curry.
  • Gives us the mathematical machinery to properly incorporate prior probabilities
  • Transforms P(W|category) into P(category|W)

Naive Bayes

  • Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[P(y_i = C_k|W_i) = \frac{P(y_i = C_k)P(W_i|y_i=C_k)}{P(W_i)}\]

\[{\color{violet}{P(y_i = C_k|W_i)}} = \frac{P(y_i = C_k)P(W_i|y_i=C_k)}{P(W_i)}\]

\[P(y_i = C_k|W_i) = \frac{P(y_i = C_k)\color{violet}{P(W_i|y_i=C_k)}}{P(W_i)}\]

\[P(y_i = C_k|W_i) = \frac{{\color{violet}{P(y_i = C_k)}}P(W_i|y_i=C_k)}{P(W_i)}\]

\[P(y_i = C_k|W_i) = \frac{P(y_i = C_k)P(W_i|y_i=C_k)}{\color{violet}{P(W_i)}}\]

  • \(\color{violet}{P(y_i = C_k|W_i)}\) is the posterior distribution – this is what we actually want: it tells us the probability that document \(i\) belongs to category \(k\), given the words it contains
  • \(\color{violet}{P(W_i|y_i=C_k)}\) is the conditional probability or likelihood – this tells us the probability that we would observe the words in \(W_i\) if the document were from category \(k\) (as in the earlier examples)
  • \(\color{violet}{P(y_i = C_k)}\) is the prior probability (base rate) that the document is from category \(k\) – this tells us the probability of the category of the document, absent any information about the words it contains
    • e.g. if 90 of the 150 documents in the training set are labelled as curry recipes, then P(curry) = 90/150 = 0.6
  • \(\color{violet}{P(W_i)}\) is the unconditional probability of the words in document \(i\) – this tells us the probability that we would observe the words in \(W_i\) across all categories (i.e. how common these words are overall)

When categories aren’t equally common, Bayes’ theorem multiplies each likelihood by its prior probability before comparing. This adjusts our classification to account that some categories might be more prevalent.

Naive Bayes

\[P(y_i = C_k|W_i) = \frac{P(y_i = C_k)P(W_i|y_i=C_k)}{P(W_i)}\] So when we compare:

\[ \frac{P(C_{\text{curry}})\,P(W_i \mid C_{\text{curry}})}{P(W_i)} \;\;\text{vs}\;\; \frac{P(C_{\text{not curry}})\,P(W_i \mid C_{\text{not curry}})}{P(W_i)} \]

  • Our goal in classification is to compare categories for the same document

    • e.g. is
      \(P(y_i = C_{\text{curry}} \mid W_i) > P(y_i = C_{\text{not curry}} \mid W_i)\)?
  • Notice that the denominator \(P(W_i)\) is the same for all categories
  • Therefore, when comparing classes, \(P(W_i)\) does not affect the ranking, we can safely ignore it

\[ P(y_i = C_k \mid W_i) \propto P(y_i = C_k)\,P(W_i \mid y_i = C_k) \]

  • where \(\propto\) means “proportional to”

    • We don’t need the exact posterior probabilities (e.g. 2/5 vs 3/5): dropping \(P(W_i)\) changes the values, but not their ordering

Naive Bayes

\[P(y_i = C_k|W_i) \propto P(y_i = C_k)P(W_i|y_i=C_k)\]

To work out whether a document should be labelled as belonging to a particular class, we need:

  • the prior probability (\(\color{violet}{P(Y = C_k)}\)) that the document is from category \(k\)

    • This is usually estimated by calculating the proportion of documents of category \(k\) in the training data
  • the conditional probability or likelihood (\(\color{violet}{P(W|Y=C_k)}\)) of the words in the document occurring in category \(k\)

    • We already know that we can calculate this probability from the multinomial distribution!
    • Again, because we are only interested in the relative probabilities of different classes, we can drop the multinomial coefficient

\[\begin{eqnarray} P(W_i|y_i = C_k) &=& \frac{M_i!}{\prod_{j=1}^JW_{i,j}!}\prod_{j=1}^J\mu_{j(k)}^{W_{ij}}\\ &\propto&\prod_{j=1}^J\mu_{j(k)}^{W_{ij}} \end{eqnarray}\]

Naive Bayes: zooming in on the conditional probability

\[P(y_i = C_k|W_i) \propto P(y_i = C_k){\color{violet}{P(W_i|y_i=C_k)}}\] Because we are only interested in the relative probabilities of different classes, we can drop the multinomial coefficient

\[\begin{eqnarray} P(W_i|y_i = C_k) &=& \frac{M_i!}{\prod_{j=1}^JW_{i,j}!}\prod_{j=1}^J\mu_{j(k)}^{W_{ij}}\\ &\propto&\prod_{j=1}^J\mu_{j(k)}^{W_{ij}} \end{eqnarray}\]

  • If doc A had 6 x “coriander”, 2 x “turmeric”, 1 x “garlic” (9 words total).
  • The multinomial coefficient would be \(\frac{9!}{6!\,2!\,1!} = 252\). (Remember \(M_i\) is the total number of words in the document and \(W_{i,j}\) is the number of times word \(j\) appears in document \(i\).)
  • When comparing the curry and cake models, both calculations include this exact same multiplier of 252.
  • The multinomial coefficient depends only on the document’s word counts, not on which category model we’re using, so it cancels out.
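A one-line check of that coefficient in R:

# Multinomial coefficient for 6 x "coriander", 2 x "turmeric", 1 x "garlic"
factorial(9) / (factorial(6) * factorial(2) * factorial(1))  # = 252, the same under both models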

Naive Bayes (we’re back!)

\[P(y_i = C_k|W_i) \propto P(y_i = C_k)P(W_i|y_i=C_k)\]

To work out whether a document should be labelled as belonging to a particular class, we need:

  • the prior probability (\(\color{violet}{P(Y = C_k)}\)) that the document is from category \(k\)

    • This is usually estimated by calculating the proportion of documents of category \(k\) in the training data
  • the conditional probability or likelihood (\(\color{violet}{P(W|Y=C_k)}\)) of the words in the document occurring in category \(k\)

    • We already know that we can calculate this probability from the multinomial distribution!
    • Again, because we are only interested in the relative probabilities of different classes, we can drop the multinomial coefficient

\[\begin{eqnarray} P(W_i|y_i = C_k) &=& \frac{M_i!}{\prod_{j=1}^JW_{i,j}!}\prod_{j=1}^J\mu_{j(k)}^{W_{ij}}\\ &\propto&\prod_{j=1}^J\mu_{j(k)}^{W_{ij}} \end{eqnarray}\]

Question: How do we estimate \(\mu\)?

Naive Bayes Estimation

  • \(\mu_{j(k)}\) is the probability that word \(j\) will occur in documents of category \(k\).
  • We can estimate these probabilities from our training data:

\[\hat{\mu}_{j(k)} = \frac{\color{violet}{W_{j(k)}}}{\color{darkred}{\sum_{j\in V}W_{j(k)}}} = \frac{\text{number of times j appears in category k}}{\text{total number of words in category k}}\]

Example:

  • In the curry recipes our training data, we observe…

    • …77 instances of the word “turmeric” (\(\color{violet}{W_{\text{turmeric}(\text{curry})}} = \color{violet}{77}\))
    • …10586 total words (\(\color{darkred}{\sum_{j\in V}W_{j(\text{curry})}} = \color{darkred}{10586}\))
    • …and so \(\hat{\mu}_{\text{turmeric},\text{curry}} = \frac{\color{violet}{W_{\text{turmeric}(\text{curry})}}}{\color{darkred}{\sum_{j\in V}W_{j(\text{curry})}}} = \frac{\color{violet}{77}}{\color{darkred}{10586}} = 0.007\)
  • In the not-curry recipes our training data, we observe…

    • …148 instances of the word “turmeric” (\(\color{violet}{W_{\text{turmeric}(\text{not curry})}} = \color{violet}{148}\))
    • …210805 total words (\(\color{darkred}{\sum_{j\in V}W_{j(\text{not curry})}} = \color{darkred}{210805}\))
    • …and so \(\hat{\mu}_{\text{turmeric},\text{not curry}} =\frac{\color{violet}{W_{\text{turmeric}(\text{not curry})}}}{\color{darkred}{\sum_{j\in V}W_{j(\text{not curry})}}} = \frac{\color{violet}{148}}{\color{darkred}{210805}} = 0.0007\)
  • The word “turmeric” is about 10 times more common in curry recipes than other recipes
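As a sketch, these class-conditional word probabilities can be computed directly from the dfm by grouping documents by their label. These are the raw (unsmoothed) estimates; textmodel_nb, used later in these slides, smooths the counts, so its coefficients will differ slightly:

# Estimate mu_j(k) directly from the dfm (raw proportions, no smoothing)
mu_hat <- recipe_dfm %>%
  dfm_group(groups = recipe_dfm$curry) %>%  # sum word counts within each class
  dfm_weight(scheme = "prop")               # convert counts to within-class proportions
mu_hat[, "turmeric"]                        # compare with the hand calculation above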

Naive Bayes Estimation – Laplace Smoothing

  • What happens when a given word doesn’t appear at all for one of the classes in our training data?
  • Imagine that we never observe the word “duck” in the curry recipes in our training data

\[\hat{\mu}_{\text{duck},\text{curry}} =\frac{\color{violet}{W_{\text{duck}(\text{curry})}}}{\color{darkred}{\sum_{j\in V}W_{j(\text{curry})}}} = \frac{\color{violet}{0}}{\color{darkred}{10586}} = 0\]

  • Then, in our test data, we observe the following sentence:
> "For this curry you will need to coat the duck legs with 1 tsp ground turmeric"
  • Because we multiply together all the individual word probabilities when we calculate the probability of a sentence occurring in a category, we will get a probability of zero!
  • Solution (Add One or Laplace Smoothing): Add one to the counts for each word in each category

\[\hat{\mu}_{\text{duck},\text{curry}} =\frac{\color{violet}{W_{\text{duck}(\text{curry})}+1}}{\color{darkred}{\sum_{j\in V}(W_{j(\text{curry})}+1)}} = \frac{\color{violet}{0+1}}{\color{darkred}{10586 + J}} \approx 0.00009\]

where \(J\) is the number of features in the vocabulary (adding one to every word’s count adds \(J\) to the denominator; with a vocabulary of roughly 900 features, as in the trimmed dfm used here, this gives about 0.00009).

  • Small but non-zero probability: allows other words in the document to influence classification decisions.
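By hand, with the toy numbers from this slide (the vocabulary size \(J\) is assumed here to be about 900, matching the trimmed dfm used earlier):

# Add-one (Laplace) smoothing by hand
duck_in_curry <- 0      # "duck" never appears in the training-set curry recipes
total_curry   <- 10586  # total words in the training-set curry recipes
J             <- 902    # assumed vocabulary size
(duck_in_curry + 1) / (total_curry + J)  # small but non-zero (~9e-05)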

Naive Bayes Classification

The classification decision made by the Naive Bayes model is simple: we assign document \(i\) to the category, \(k\), for which it has the highest posterior probability:

\[ \hat{Y}_i = \underset{k \in \{1,...,K\}}{\operatorname{argmax}} P(y_i = C_k) \times P(W_i|y_i = C_k) \]

where \(\underset{k \in \{1,...,K\}}{\operatorname{argmax}}\) means “which category, \(k\), has the maximum posterior probability”.

Intuition:

  • \(\hat{Y}_i\) is the predicted category for document \(i\)

  • Assign documents to categories for which the probability of observing the words in that document is high (i.e. when \(P(W_i|y_i = C_k)\) is large; remember the conditional probability captures how well the document’s words match what is typical for that category)

  • Assign more documents to categories that contain more documents in the training data (i.e. when \(P(y_i = C_k)\) is large) (remember prior probability)
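A toy sketch of this decision rule, re-using the \(\mu\) values from the earlier table and assuming equal priors for illustration (working with logs avoids numerical underflow on long documents):

# Toy Naive Bayes decision rule in log space
mu_curry <- c(coriander = .40, turmeric = .25, garlic = .20, sugar = .08, flour = .04, eggs = .03)
mu_cake  <- c(coriander = .02, turmeric = .01, garlic = .01, sugar = .26, flour = .40, eggs = .30)
prior    <- c(curry = 0.5, cake = 0.5)  # assumed equal base rates
W1       <- c(6, 2, 1, 1, 0, 0)         # word counts for one document
log_posterior <- c(curry = log(prior[["curry"]]) + sum(W1 * log(mu_curry)),
                   cake  = log(prior[["cake"]])  + sum(W1 * log(mu_cake)))
names(which.max(log_posterior))  # predicted category: "curry"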

Why is Naive Bayes “Naive”?

By treating documents as bags of words we are assuming:

  • Conditional independence of word counts

    • Knowing a document contains one word doesn’t tell us anything about the probability of observing other words in that document
    • e.g. The fact that a recipe includes the word “turmeric” doesn’t make it any more or less likely that it will also include the word “coriander”
  • Positional independence of word counts

    • The position of a word within a document doesn’t give us any information about the category of that document
    • e.g. Whether the word “turmeric” appears early or late in the recipe has no effect on the probability of it being a curry
    • e.g. Whether the word “good” appears after the word “not” has no effect on the probability of it being a “positive” document

While this is a very simple model of language which is “wrong”, it is nevertheless useful for classification.

Despite its naive assumptions, Naive Bayes often performs well because words in different categories tend to occur in distinct patterns, even if they’re not truly independent.

Naive Bayes Application

library(quanteda.textmodels)  # provides textmodel_nb()

nb_output <- textmodel_nb(x = recipe_dfm, 
                          y = recipe_dfm$curry, # label we want to predict: curry vs not curry
                          prior = "docfreq")
summary(nb_output)

Call:
textmodel_nb.dfm(x = recipe_dfm, y = recipe_dfm$curry, prior = "docfreq")

Class Priors:
(showing first 2 elements)
    Curry Not Curry 
   0.0309    0.9691 

Estimated Feature Scores:
              beef     boned    rolled     pint      red     wine  vinegar
Curry     0.001378 0.0014925 0.0001148 0.001148 0.011481 0.001607 0.001378
Not Curry 0.003304 0.0006107 0.0004031 0.003847 0.009619 0.007750 0.005154
            sugar  allspice      bay  leaves     thyme peppercorns  crushed
Curry     0.00620 0.0002296 0.002067 0.01378 0.0004592    0.003789 0.009070
Not Curry 0.01872 0.0003420 0.002821 0.01063 0.0048734    0.001826 0.005417
            english     dijon  mustard unsalted      room temperature      lard
Curry     0.0002296 0.0001148 0.005166 0.001493 0.0001148   0.0001148 0.0001148
Not Curry 0.0007023 0.0009832 0.002803 0.004953 0.0005741   0.0005924 0.0004153
             plain    flour   white    water   chilled     icing  chicken
Curry     0.004822 0.006085 0.00287 0.007003 0.0001148 0.0003444 0.005855
Not Curry 0.008611 0.014871 0.01042 0.006693 0.0005863 0.0023573 0.007035
              cut   pieces
Curry     0.01297 0.005511
Not Curry 0.01336 0.003786
  • The class priors represent the prior probability of a document belonging to a particular category, \(k\), before considering any of the words in the document (the number of documents in category \(k\) divided by the total number of documents).
  • The feature scores represent how likely a word is to occur in a document from a particular category, based on the training data (\(\mu_{j(k)}\) inside the likelihood term).

Naive Bayes Application

Recall that we are interested in the probability of observing word \(j\) given class \(k\), i.e. 

\[\mu_{j(k)} = \frac{W_{j(k)}}{\sum_{j\in V}W_{j(k)}}\]

What are these word probabilities for our curry data?

We can examine the probability of each word given each class using the coef() function on the nb_output object.

head(coef(nb_output))
              Curry    Not Curry
beef   0.0013777268 0.0033038975
boned  0.0014925373 0.0006107019
rolled 0.0001148106 0.0004030633
pint   0.0011481056 0.0038474222
red    0.0114810563 0.0096185556
wine   0.0016073479 0.0077498076

Naive Bayes Application

Words with highest probability in the “curry” class (i.e. \(P(w_j|c_k = \text{``curry''})\)):

head(sort(coef(nb_output)[,1], decreasing = TRUE), 20)
      seeds      finely   coriander      peeled      garlic   vegetable 
0.030080367 0.023191734 0.020551091 0.018025258 0.016647532 0.015269805 
     ginger      cloves      leaves       green       cumin         cut 
0.015154994 0.014695752 0.013777268 0.013662457 0.013547646 0.012973594 
     powder      chilli         red    turmeric       onion       piece 
0.012973594 0.012743972 0.011481056 0.011136625 0.010332951 0.010103330 
     sliced       large 
0.009644087 0.009529277 

Words with highest probability in the “not curry” class (i.e. \(P(w_j|c_k = \text{``not curry''})\)):

head(sort(coef(nb_output)[,2], decreasing = TRUE), 20)
     finely       sugar       flour      sliced      garlic         cut 
0.021417317 0.018724122 0.014870592 0.014498064 0.013551476 0.013362158 
     peeled   freerange      leaves       white       juice      caster 
0.013301088 0.013252232 0.010632321 0.010424682 0.010296435 0.009844515 
      extra       large         red         egg       small       plain 
0.009807873 0.009630770 0.009618556 0.008659754 0.008653647 0.008610897 
      onion   vegetable 
0.008531506 0.008317760 

Naive Bayes Application

For a given recipe, we extract the words (ingredients) that appear in it and then look up the class-conditional probabilities that Naive Bayes has learned for those words.

What are the class-conditional word probabilities for “Aromatic blackeye bean curry”?

          P(w|curry) P(w|not curry)
seeds          0.030          0.008
finely         0.023          0.021
coriander      0.021          0.005
peeled         0.018          0.013
garlic         0.017          0.014
ginger         0.015          0.005
cloves         0.015          0.008
leaves         0.014          0.011
cumin          0.014          0.002
chilli         0.013          0.006
onion          0.010          0.009
piece          0.010          0.002

What are the class-conditional word probabilities for “Schichttorte”?

          P(w|curry) P(w|not curry)
large          0.010          0.010
sugar          0.006          0.019
flour          0.006          0.015
paste          0.006          0.001
plain          0.005          0.009
lemon          0.004          0.008
freerange      0.003          0.013
eggs           0.002          0.008
zest           0.002          0.005
unsalted       0.001          0.005
caster         0.001          0.010
cornflour      0.000          0.001
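A hedged sketch of how tables like these could be produced, assuming recipe_name is stored as a docvar on recipe_dfm (as the predict() examples below suggest):

# Class-conditional word probabilities for the words present in one recipe
one_recipe    <- dfm_subset(recipe_dfm, recipe_name == "Aromatic blackeye bean curry")
words_present <- featnames(one_recipe)[colSums(one_recipe) > 0]
round(coef(nb_output)[words_present, ], 3)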

Naive Bayes Application

Which recipes are predicted to have a high curry probability?

#Applying the trained Naive Bayes model to classify all the recipes
recipe_dfm$curry_nb_probability <- predict(nb_output, 
                                           type = "probability")

recipe_dfm$recipe_name[order(recipe_dfm$curry_nb_probability[,1], decreasing = T)[1:10]]
 [1] "Bengali butternut squash with chickpeas"        
 [2] "Chickpea curry with green mango and pomegranate"
 [3] "Green coconut fish curry"                       
 [4] "Thai green prawn curry"                         
 [5] "Rogan josh"                                     
 [6] "Bengal coconut dal"                             
 [7] "Tom yum soup"                                   
 [8] "Thai-style duck red curry"                      
 [9] "Peppery hot cabbage salad"                      
[10] "Peppery hot cabbage salad"                      

Which recipes are predicted to have a low curry probability?

recipe_dfm$recipe_name[order(recipe_dfm$curry_nb_probability[,1], decreasing = F)[1:10]]
 [1] "Sticky toffee apple pudding with calvados caramel sauce"
 [2] "Rich moist all-purpose fruit cake"                      
 [3] "Mini stollen "                                          
 [4] "Chocolate fruit cake"                                   
 [5] "Pheasant pithiviers"                                    
 [6] "Spiced poached pears with chocolate pudding"            
 [7] "Traditional Christmas pudding with brandy butter"       
 [8] "Intense chocolate cookies"                              
 [9] "Cookies and cream fudge brownies"                       
[10] "Bonfire night brioche"                                  

Was #TheStew really #TheCurry?

  • The purpose of training a classification model is to make out-of-sample predictions

  • Generally, we have a small hand-coded training dataset and then we predict for lots of other documents

  • Here, we are only predicting for one out-of-sample observation

ingredients <- c("cup olive oil, plus more for serving garlic cloves, chopped large yellow onion, chopped (2-inch) piece ginger, finely chopped Kosher salt and black pepper teaspoons ground turmeric, plus more for serving teaspoon red-pepper flakes, plus more for serving (15-ounce) cans chickpeas, drained and rinsed (15-ounce) cans full-fat coconut milk cups vegetable or chicken stock bunch Swiss chard, kale or collard greens, stems removed, torn into bite-size pieces cup leaves, mint for serving Yogurt, for serving (optional) Toasted pita, lavash or other flatbread, for serving (optional)")

dfm_stew <- tokens(ingredients) %>%
            dfm() %>%
            #function that aligns the vocabularies
            dfm_match(features = featnames(recipe_dfm))


predict(nb_output, newdata = dfm_stew, type = "probability")
       
docs        Curry  Not Curry
  text1 0.9611718 0.03882815

According to this model, it is a yes

Advantages and Disadvantages of Naive Bayes

Advantages

  • Fast

    • Takes seconds to compute, even for very large vocabularies/corpora
  • Easy to apply

    • One line of code in quanteda
  • Can easily be extended to include…

    • … multiple categories
    • … different text representations (bigrams, trigrams, etc.)

Advantages and Disadvantages of Naive Bayes

Disadvantages

  • Independence assumption

    • Independence means NB is unable to account for interactions between words

      • e.g. When the word “eggs” appears with the word “sugar” that should indicate something different from when “eggs” appears without the word “sugar”
    • Independence also means that NB is often overconfident

      • Each additional word counts as a new piece of information
    • In some contexts, the independence assumption can decrease predictive accuracy

  • Linear classifier

    • In Naive Bayes, every additional occurrence of a word shifts the (log) classification score by the same amount
    • Other methods (e.g. SVM) allow the classification probabilities to change non-linearly in the word counts, for example:

      • The first occurrence of a word might matter a lot, while additional occurrences matter less and less
      • Word combinations might matter differently than the same words on their own

Break

Validating Supervised Learning Classifiers

Validating Supervised Learning Classifiers

  • How can we assess the classification performance of our supervised learning classifier?

  • Our goal is to measure the degree to which the predictions we make correspond to the observed data

  • We have already seen some ways to do this

    • Accuracy – the proportion of all predictions that match the observed data
    • Sensitivity – the proportion of “true positive” predictions that match the observed data
    • Specificity – the proportion of “true negative” predictions that match the observed data
  • In order to get informative estimates of these quantities, we need to distinguish between the performance of the classifier on the training set and the test set

Training Error versus Test Error

  • The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.

  • In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training.

  • The training error rate is often quite different from the test error rate; in particular, the former can dramatically underestimate the latter.

Bias-Variance Trade-Off

We can think of the test error associated with any given statistical estimator as coming from two fundamental quantities:

  1. Bias

    • the error that is introduced by approximating a complicated set of relationships with a simple model that doesn’t characterise the full complexity
  2. Variance

    • the amount that the predictions produced by the estimator would change if it had been estimated on different data. (too sensitive to the specific training data)

Ideally we would like to minimize both variance and bias, but these goals are often at odds with each other.

High Bias, Low Variance Example: If “America” appears → domestic policy; otherwise → foreign policy

  • High bias: This model is too simple to capture the true complexity
    • Many domestic policy speeches don’t mention “America” (healthcare, education reforms)
    • The model is systematically wrong because it oversimplifies the problem
  • Low variance: The model is very stable across different training datasets
    • Train on any collection of speeches → you’ll always get the same “America” rule
    • Predictions are consistent, even if consistently wrong

Training- versus Test-Set Performance

  • A very complex model can essentially memorize every single training example, achieving near-perfect accuracy on the training set.
  • Training-specific patterns don’t exist in the test data, so they don’t help prediction and actually hurt it
  • When the model is moderately complex, it learns the real underlying patterns that are common to both training and test data
  • Implication: We need tools which tell us when we have reached the optimal balance between bias and variance.

Test-set approach

Naive Bayes Application

Before we train a model, we need to separate our data into a training set and a test set:

## Training and test set

#each recipe independently has an 80% probability of getting TRUE (training) and 20% probability of getting FALSE (test)
train <- sample(c(TRUE, FALSE), nrow(recipes), replace = TRUE, prob = c(.8, .2))
test <- !train
table(train)
train
FALSE  TRUE 
 1877  7507 
table(test)
test
FALSE  TRUE 
 7507  1877 

How many curry recipes are there in the training and test sets?

## Training and test set

prop.table(table(recipes$curry[train]))

     Curry  Not Curry 
0.03157053 0.96842947 
prop.table(table(recipes$curry[test]))

     Curry  Not Curry 
0.02823655 0.97176345 

Naive Bayes Application

We then subset the recipe_dfm object into a training dfm and a test dfm:

## Naive Bayes

recipe_dfm_train <- dfm_subset(recipe_dfm, train)
recipe_dfm_test <- dfm_subset(recipe_dfm, test)

We then train our Naive Bayes model on the training set:

nb_train <- textmodel_nb(x = recipe_dfm_train, 
                         y = recipe_dfm_train$curry,
                         prior = "docfreq")

And finally, we predict the category of each recipe in the test set:

recipe_dfm_test$predicted_curry_nb <- predict(nb_train,
                                                newdata = recipe_dfm_test,
                                                type = "class")

Naive Bayes Classification Performance

confusion_nb <- table(predicted_classification = recipe_dfm_test$predicted_curry_nb,
                      true_classification = recipe_dfm_test$curry)
library(caret)

confusionMatrix(confusion_nb, positive = "Curry")
Confusion Matrix and Statistics

                        true_classification
predicted_classification Curry Not Curry
               Curry        38       101
               Not Curry    15      1723
                                          
               Accuracy : 0.9382          
                 95% CI : (0.9263, 0.9487)
    No Information Rate : 0.9718          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3701          
                                          
 Mcnemar's Test P-Value : 2.973e-15       
                                          
            Sensitivity : 0.71698         
            Specificity : 0.94463         
         Pos Pred Value : 0.27338         
         Neg Pred Value : 0.99137         
             Prevalence : 0.02824         
         Detection Rate : 0.02025         
   Detection Prevalence : 0.07405         
      Balanced Accuracy : 0.83080         
                                          
       'Positive' Class : Curry           
                                          

Implication:

Relative to the dictionary approach we are…

  • …doing a better job on predicting true positives now (our sensitivity is much higher)

  • …predicting too many curries that are actually something else (our specificity is a little lower)

Training-Set and Test-Set Performance

  • The test set and training set accuracy can be very different

  • As a model becomes more flexible…

    • …the training set accuracy will almost always increase
    • …the test set accuracy will sometimes decrease
  • Imagine that we include a very large number of features in our dfm

    • All unigrams, all bi-grams, …, all 5-grams
    • Total number of features \(\approx\) 300k features
  • How does the training/test set accuracy change as we increase the number of features used to train the classifier?

  • Because more features make the model better at memorising the training data, but also more sensitive to noise, which hurts performance on new (test) data.
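A sketch of how such a feature set might be constructed (the exact settings used for the comparison below are assumed):

# Build an n-gram feature set: unigrams, bigrams and trigrams
recipe_dfm_ngram <- recipe_tokens %>%
  tokens_ngrams(n = 1:3) %>%
  dfm()
nfeat(recipe_dfm_ngram)  # a very large feature set (~150k in the comparison below)

# The "trimmed" variant drops rare n-grams, e.g. those occurring fewer than 10 times
recipe_dfm_ngram_trimmed <- dfm_trim(recipe_dfm_ngram, min_termfreq = 10)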

Training-Set and Test-Set Accuracy

Increasing model complexity improves training accuracy, but test accuracy peaks and then declines due to overfitting.

Overfitting and Test-Set Accuracy

  • Question: Why does the test-set accuracy decrease when we add additional features?

  • Answer: Because we are now overfitting our data.

  • Overfitting occurs when we find relationships between words (or n-grams) and curries in our training data that do not generalise to our test data

  • In this example, there are some n-gram phrases that appear frequently in the curry recipes in our training set but which never appear in our test-set curry recipes

Feature                           Training  Test
mustard_seeds_tsp                       18     0
tsp_black_mustard                       15     0
tsp_black_mustard_seeds                 15     0
leaves_and_stalks                       20     0
black_mustard_seeds_tsp                 10     0
coriander_leaves_and                    14     0
cumin_seeds_tsp_black                    7     0
coriander_leaves_and_stalks             13     0
large_garlic_cloves                      9     0
chopped_garlic_cloves_peeled_and        15     0

Test-Set Validation for Feature Selection

  • We can use the test-set performance statistics to select between model specifications

  • We will compare the accuracy, sensitivity and specificity for the following models:

    • Our “original” model (unigrams, no stopwords, trimmed)
    • A “raw” model (unigrams, nothing removed)
    • A “no stopwords” model (unigrams, stopwords removed)
    • A “trimmed” model (unigrams, trimmed)
    • An “n-gram” model (unigrams, bigrams, trigrams)
    • An “n-gram, trimmed” model (unigrams, bigrams, trigrams, words occurring fewer than 10 times discarded)
  • The “best” model is the one which has the highest classification scores

Test-Set Validation for Feature Selection

Test-set validation

Model            Accuracy  Sensitivity  Specificity  N features
Original             0.94         0.78         0.95         902
Raw                  0.96         0.65         0.97        4214
No stop words        0.96         0.66         0.97        4126
Trimmed              0.94         0.79         0.95        1339
N-gram               0.98         0.51         1.00      152215
N-gram, trimmed      0.94         0.83         0.94        6072
  • The “n-gram” model has the highest accuracy, but has very low sensitivity

  • The “n-gram, trimmed” model outperforms all other models in sensitivity

Cross-Validation

  • To calculate the test-set accuracy we randomly allocated observations to the test and training sets

  • If we repeat this process with a new randomization, we will get different test-set performance scores

Test-set validation

Model            Accuracy  Sensitivity  Specificity  N features
Original             0.94         0.78         0.95         902
Raw                  0.96         0.65         0.97        4214
No stop words        0.96         0.66         0.97        4126
Trimmed              0.94         0.79         0.95        1339
N-gram               0.98         0.51         1.00      152215
N-gram, trimmed      0.94         0.83         0.94        6072

Test-set validation

Model            Accuracy  Sensitivity  Specificity  N features
Original             0.94         0.77         0.95         902
Raw                  0.96         0.66         0.97        4214
No stop words        0.96         0.66         0.97        4126
Trimmed              0.94         0.78         0.94        1339
N-gram               0.98         0.45         1.00      152215
N-gram, trimmed      0.93         0.83         0.94        6072

Test-set validation

Model            Accuracy  Sensitivity  Specificity  N features
Original             0.94         0.78         0.95         902
Raw                  0.96         0.65         0.97        4214
No stop words        0.96         0.66         0.97        4126
Trimmed              0.94         0.79         0.95        1339
N-gram               0.98         0.51         1.00      152215
N-gram, trimmed      0.94         0.83         0.94        6072

Test-set validation

Model            Accuracy  Sensitivity  Specificity  N features
Original             0.94         0.78         0.95         902
Raw                  0.96         0.64         0.97        4214
No stop words        0.96         0.69         0.97        4126
Trimmed              0.94         0.78         0.95        1339
N-gram               0.98         0.51         1.00      152215
N-gram, trimmed      0.94         0.83         0.94        6072
  • The simple validation approach suffers from two weaknesses:

    1. Estimates of test-set accuracy can be highly variable
    2. We are only using a subset of the data to train the model (the observations in the training set)

Implication: We need a method that uses all data for training and generates more stable test-set accuracy.

K-fold Cross-Validation

  • Cross-validation is an alternative to a simple train-test split

  • This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size

    • Typical choices are \(k=5\) or \(k=10\)
  • For each of the \(k\) folds we do the following

    1. Train the Naive Bayes model on all observations not included in the fold
    2. Generate predictions for the observations in the fold
    3. Calculate the accuracy etc of the predictions for the observations in the held-out fold
  • We then calculate the performance metrics by averaging over those computed on each fold

K-fold Cross-Validation Application

# "held_out" is a logical vetor of true and false values
get_performance_scores <- function(held_out){
  
  # Set up train and test sets for this fold
  recipe_dfm_train <- dfm_subset(recipe_dfm, !held_out)
  recipe_dfm_test <- dfm_subset(recipe_dfm, held_out)
  
  # Train model on everything except held-out fold
  nb_train <- textmodel_nb(x = recipe_dfm_train, 
                         y = recipe_dfm_train$curry,
                         prior = "docfreq")
  
  # Predict for held-out fold
  recipe_dfm_test$predicted_curry <- predict(nb_train, 
                                             newdata = recipe_dfm_test, 
                                             type = "class")
  
  # Calculate accuracy, specificity, sensitivity
  confusion_nb <- table(predicted_classification = recipe_dfm_test$predicted_curry,
                        true_classification = recipe_dfm_test$curry)
  
  confusion_nb_statistics <- confusionMatrix(confusion_nb, positive = "Curry")
  
  accuracy <- confusion_nb_statistics$overall[1]
  sensitivity <- confusion_nb_statistics$byClass[1]
  specificity <- confusion_nb_statistics$byClass[2]
  
  return(data.frame(accuracy, sensitivity, specificity))
  
}

K-fold Cross-Validation Application

K <- 5
folds <- sample(1:K, nrow(recipe_dfm), replace = T)
get_performance_scores(folds == 1)
          accuracy sensitivity specificity
Accuracy 0.9418182    0.754386   0.9475375
all_folds <- lapply(1:5, function(k) get_performance_scores(folds == k))
all_folds
[[1]]
          accuracy sensitivity specificity
Accuracy 0.9418182    0.754386   0.9475375

[[2]]
          accuracy sensitivity specificity
Accuracy 0.9389356   0.6923077   0.9463358

[[3]]
         accuracy sensitivity specificity
Accuracy 0.935911   0.7666667   0.9414661

[[4]]
          accuracy sensitivity specificity
Accuracy 0.9420829   0.7704918   0.9478309

[[5]]
          accuracy sensitivity specificity
Accuracy 0.9380252   0.7166667   0.9452278
colMeans(bind_rows(all_folds))
   accuracy sensitivity specificity 
  0.9393546   0.7401038   0.9456796 

Cross-Validation for Model Selection

5-fold cross-validation

Model            Accuracy  Sensitivity  Specificity
Original             0.94         0.74         0.95
Raw                  0.96         0.60         0.97
No stop words        0.96         0.61         0.97
Trimmed              0.94         0.74         0.94
N-gram               0.96         0.21         0.99
N-gram, trimmed      0.93         0.76         0.93

10-fold cross-validation

Model            Accuracy  Sensitivity  Specificity
Original             0.94         0.74         0.94
Raw                  0.96         0.62         0.97
No stop words        0.96         0.64         0.97
Trimmed              0.94         0.74         0.94
N-gram               0.96         0.26         0.98
N-gram, trimmed      0.93         0.77         0.93

Cross-Validation Uses

Cross-validation is a very general strategy for evaluating predictive fit

  • Which variables should I use to predict my outcome?

  • Should I use a linear model, or a non-linear model?

  • Etc

Extensions and Use Cases

Extensions

Naive Bayes is only one supervised learning text-classification method

  • Regularized Logistic Regression

    • Directly models the probability that each document is in class \(k\) using logistic regression

    • Regularization required to prevent overfitting data

    • textmodel_lr() in the quanteda.textmodels package

  • Support Vector Machines

    • SVMs draw a hyperplane through the multidimensional word space that best separates documents into different classes

    • Can accommodate non-linear boundaries between classes

    • textmodel_svm() in the quanteda.textmodels package (see the sketch after this list)

  • “Tree-based” Classification Methods

    • Tree-based methods separate classes by segmenting the predictors (word counts) into a number of distinct regions (sequence of yes/no questions)

    • The modal outcome for observations that fall within a given region becomes the predicted category for any observation in that region

    • Like the SVM, this allows for non-linear relationships between features and categories

    • tree package in R
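As a sketch, the quanteda.textmodels alternatives above can be swapped in for textmodel_nb() with minimal changes (assuming the same training and test objects created earlier):

# Regularised logistic regression and SVM on the same training data
library(quanteda.textmodels)   # textmodel_lr() additionally requires the glmnet package
lr_train  <- textmodel_lr(x = recipe_dfm_train, y = recipe_dfm_train$curry)
svm_train <- textmodel_svm(x = recipe_dfm_train, y = recipe_dfm_train$curry)

head(predict(lr_train,  newdata = recipe_dfm_test, type = "class"))
head(predict(svm_train, newdata = recipe_dfm_test, type = "class"))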

Use Cases of Supervised Learning

Conclusion

Summing Up

  • Supervised learning for text data allows us to learn the association between words and particular outcome categories

  • The Naive Bayes model is a simple model that is fast to implement and which, despite some strong assumptions, tends to provide good classification results

  • Once we have trained our supervised learning classifiers, it is important to validate their performance on a test-set that was not used to fit the model

  • Cross-validation is a general strategy for out-of-sample evaluation that can help us to choose between different models