- Text as Data
- Dictionaries
- Similarity, Difference, and Complexity
- Supervised Learning for Text Data 👈
- Collecting Text Data
- Text Scaling Models
- Topic Models
- Word Embeddings
- Causal Inference with Text
- Review and Large Language Models
Oxford English Dictionary:
“A preparation of meat, fish, fruit, or vegetables, cooked with a quantity of bruised spices and turmeric, and used as a relish or flavouring, esp. for dishes composed of or served with rice. Hence, a curry = a dish or stew (of rice, meat, etc.) flavoured with this preparation (or with curry-powder).”
If a curry can be defined by the spices a dish contains, then we ought to be able to predict whether a recipe is a curry from the ingredients listed in that recipe
We will evaluate the probability that #TheStew is a curry by training a curry classifier on a set of recipes
We will use data on 9384 recipes from the BBC recipe archive
Our data includes information on each recipe: the recipe title, the list of ingredients, and the preparation instructions:
[1] "Place the rib-eye of beef into a large non-metallic dish. In a jug, mix together the red wine, vinegar, sugar, allspice, bay leaf and half of the thyme until well combined. Pour the mixture over the beef, turning to coat the joint evenly in the liquid. Cover the dish loosely with cling film and set aside to marinate in the fridge for at least four hours, turning occasionally. (The beef can be marinated for up to two days.) When the beef is ready to cook, preheat the oven to 190C/375F/Gas 5. Lift the beef from the marinade, allowing any excess liquid to drip off, and place on a plate, loosely covered, until the meat has returned to room temperature. Sprinkle the crushed peppercorns and the remaining thyme onto a plate. Spread the mustard evenly all over the surface of the beef, then roll the beef in the peppercorn and thyme mixture to coat. Place the crusted beef into a roasting tin and roast in the oven for 1 hour 20 minutes (for medium-rare) or 1 hour 50 minutes (for well-done). Meanwhile, for the horseradish cream, mix the crème frâiche, creamed horseradish, mustard and chives together in a bowl until well combined. Season, to taste, with salt and freshly ground black pepper, then spoon into a serving dish and chill until needed. When the beef is cooked to your liking, transfer to a warmed platter and cover with aluminium foil, then set aside to rest in a warm place for 25-30 minutes. To serve, carve the rib-eye of beef into slices and arrange on warmed plates. Spoon the roasted root vegetables alongside. Serve with the horseradish cream."
[1] "Venison massaman curry" "Almond and cauliflower korma curry"
[3] "Aromatic beef curry" "Aromatic blackeye bean curry"
[5] "Aubergine curry" "Bangladeshi venison curry"
Given that we have some idea of the concept we would like to measure, perhaps we can just use a dictionary:
# Convert to corpus
recipe_corpus <- corpus(recipes, text_field = "ingredients")

# Tokenize
recipe_tokens <- tokens(recipe_corpus, remove_punct = TRUE,
                        remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(c(stopwords("en"),
                  "ml", "fl", "x", "mlâ", "mlfl", "g", "kglb",
                  "tsp", "tbsp", "goz", "oz", "glb", "gâ", "â"))

# Convert to DFM
recipe_dfm <- recipe_tokens %>%
  dfm() %>%
  dfm_trim(max_docfreq = .3,
           min_docfreq = .002,
           docfreq_type = "prop")
topfeatures(recipe_dfm, 20)
[1] "Indonesian stir-fried rice (Nasi goreng)"
[2] "Pineapple, prawn and scallop curry"
[3] "Almond and cauliflower korma curry"
[4] "Aloo panchporan (Stir-fried potatoes tempered with five spices)"
[5] "Aromatic beef curry"
[6] "Asian-spiced rice with coriander-crusted lamb and rosemary oil"
[7] "Beef chilli flash-fry with yoghurt rice"
[8] "Beef rendang with mango chutney and sticky rice"
[9] "Beef curry with jasmine rice"
[10] "Beef Madras"
Let’s classify a recipe as a “curry” if it includes any of our dictionary words
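The dictionary itself and the code that applies it are not shown in the slides; a minimal sketch of this step is below. The dictionary entries are purely illustrative, and recipe_tokens, recipe_dfm, and the title-based labels in recipe_dfm$curry are assumed to come from the earlier preparation steps.

# Hypothetical curry dictionary -- the actual word list used in the slides is not shown
curry_dict <- dictionary(list(curry = c("curry", "masala", "turmeric", "cumin", "garam")))

# Count dictionary hits in each recipe's ingredient list
dict_dfm <- dfm_lookup(dfm(recipe_tokens), dictionary = curry_dict)

# Classify a recipe as a "Curry" if it contains at least one dictionary word
predicted_classification <- ifelse(rowSums(dict_dfm) > 0, "Curry", "Not Curry")

# Compare predictions with the title-based labels
library(caret)
confusion_dict <- table(predicted_classification,
                        true_classification = recipe_dfm$curry)
confusionMatrix(confusion_dict, positive = "Curry")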
Confusion Matrix and Statistics
true_classification
predicted_classification Curry Not Curry
Curry 95 179
Not Curry 195 8915
Accuracy : 0.9601
95% CI : (0.956, 0.964)
No Information Rate : 0.9691
P-Value [Acc > NIR] : 1.000
Kappa : 0.3164
Mcnemar's Test P-Value : 0.438
Sensitivity : 0.32759
Specificity : 0.98032
Pos Pred Value : 0.34672
Neg Pred Value : 0.97859
Prevalence : 0.03090
Detection Rate : 0.01012
Detection Prevalence : 0.02920
Balanced Accuracy : 0.65395
'Positive' Class : Curry
\[\text{Accuracy} = \frac{\#\text{True Positives} + \#\text{True Negatives}}{\# \text{Observations} }\]
\[\text{Sensitivity} = \frac{\#\text{True Positives}}{\# \text{True Positives} + \# \text{False Negatives} }\] \[\text{Specificity} = \frac{\#\text{True Negatives}}{\# \text{True Negatives} + \# \text{False Positives} }\]
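Plugging in the dictionary classifier's confusion matrix from above gives the figures reported by confusionMatrix():

\[\text{Accuracy} = \frac{95 + 8915}{9384} = 0.960 \qquad \text{Sensitivity} = \frac{95}{95 + 195} = 0.328 \qquad \text{Specificity} = \frac{8915}{8915 + 179} = 0.980\]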
Implication: the headline accuracy is high only because most recipes are not curries (the no-information rate is 0.97); the dictionary's sensitivity is poor, and it misses roughly two thirds of the true curries.
Supervised learning methods classify documents into pre-defined categories on the basis of the words they contain.
Supervised learning can be conceptualized as a generalization of dictionary methods
Dictionaries: the researcher specifies in advance the features (words) that indicate each category, and their weights
Supervised learning: a set of labelled documents is used to learn which features indicate each category
The key difference is that in supervised learning the features associated with each category (and their relative weight) are learned from the data
A major advantage of supervised learning methods is that the weights we estimate are specific to the corpus with which we are working (not true generally of dictionaries)
Supervised learning methods will often outperform dictionary methods in classification tasks, particularly when the training sample is large
Labelled dataset
Classification method
A statistical method to learn the association between the features of the labelled documents and the document categories
Examples: Naive Bayes, Logistic Regression, SVM, tree-based methods, many others…
Validation method
Out-of-sample prediction
How do we obtain a labelled set?
External sources of annotation, e.g.
Expert annotation, e.g.
Crowd-sourced coding, e.g.
For the purposes of the running example, we are cheating a bit by assuming that any dish whose title contains the word “curry” is, in fact, a curry.
In a more serious application, we would hand-code individual curry recipes as “curry” or “not curry”, but we are taking a short-cut here.
Probabilistic language models describe a story, stated in terms of probabilities, about how documents are generated
This data-generating process is based on a set of unknown parameters which we infer based on the data
Once we have inferred values for the parameters, we can reverse the data-generating process and calculate the probability that any given document was generated by a particular language model
The Naive Bayes text classification model is one example of a generative language model. In Naive Bayes, each category of documents has its own probability distribution over words, and each document is treated as a sample of words drawn from the distribution of its category.
The basis of any language model is a probability distribution over words in a vocabulary.
A probability distribution over a discrete variable must have three properties: each probability must be greater than or equal to zero, each must be less than or equal to one, and together the probabilities must sum to one
When writing a curry recipe, you will tend to use words like "coriander", "turmeric", and "garlic" with high probability
When writing a cake recipe, you will tend to use words like "sugar", "flour", and "eggs" with high probability
Model | coriander | turmeric | garlic | sugar | flour | eggs |
---|---|---|---|---|---|---|
\(\mu_\text{curry}\) | 0.4 | 0.25 | 0.20 | 0.08 | 0.04 | 0.03 |
\(\mu_\text{cake}\) | 0.02 | 0.01 | 0.01 | 0.26 | 0.4 | 0.3 |
Under this model, the probability of observing a particular document \(W_i\), given the vector of word probabilities \(\mu\), follows a multinomial distribution:
\[P(W_i|\mu) = \frac{M_i!}{\prod_{j=1}^JW_{i,j}!}\prod_{j=1}^J\mu_j^{W_{ij}}\]
Imagine we have two documents represented by the following DFM
Document | coriander | turmeric | garlic | sugar | flour | eggs |
---|---|---|---|---|---|---|
\(W_1\) | 6 | 2 | 1 | 1 | 0 | 0 |
\(W_2\) | 1 | 0 | 0 | 4 | 2 | 3 |
Which language model is most likely to have produced each document?
\[P(W_1|\mu_\text{curry}) = \frac{M_1!}{\prod_{j=1}^JW_{1,j}!}\prod_{j=1}^J\mu_j^{W_{1,j}} = \frac{10!}{(6!)(2!)(1!)(1!)}\times(.4)^6\times(.25)^2\times(.2)^1\times(.08)^1 = 0.01\]
\[P(W_1|\mu_\text{cake}) = \frac{M_1!}{\prod_{j=1}^JW_{1,j}!}\prod_{j=1}^J\mu_j^{W_{1,j}} = \frac{10!}{(6!)(2!)(1!)(1!)}\times(.02)^6\times(.01)^2\times(.01)^1\times(.26)^1 = 0.000000000000042\]
Implication: The probability of observing \(W_1\) is higher under \(\mu_\text{curry}\) than under \(\mu_\text{cake}\).
\[P(W_2|\mu_\text{curry}) = \frac{M_2!}{\prod_{j=1}^JW_{2,j}!}\prod_{j=1}^J\mu_j^{W_{2,j}} = \frac{10!}{(1!)(4!)(2!)(3!)}\times(.4)^1\times(.08)^4\times(.04)^2\times(.03)^3 = 0.0000000089\]
\[P(W_2|\mu_\text{cake}) = \frac{M_2!}{\prod_{j=1}^JW_{2,j}!}\prod_{j=1}^J\mu_j^{W_{2,j}} = \frac{10!}{(1!)(4!)(2!)(3!)}\times(.02)^1\times(.26)^4\times(.4)^2\times(.3)^3 = 0.005\]
Implication: The probability of observing \(W_2\) is higher under \(\mu_\text{cake}\) than under \(\mu_\text{curry}\).
Conclusion: Given a set of probabilities, we can work out which model was most likely to have generated any given document.
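We can check these calculations with R's built-in multinomial density. This is just a quick verification of the worked example, using the \(\mu\) vectors and word counts from the tables above:

# Word probabilities for each language model (from the table above)
mu_curry <- c(coriander = 0.40, turmeric = 0.25, garlic = 0.20,
              sugar = 0.08, flour = 0.04, eggs = 0.03)
mu_cake  <- c(coriander = 0.02, turmeric = 0.01, garlic = 0.01,
              sugar = 0.26, flour = 0.40, eggs = 0.30)

# Word counts for the two documents
W1 <- c(6, 2, 1, 1, 0, 0)
W2 <- c(1, 0, 0, 4, 2, 3)

dmultinom(W1, prob = mu_curry)  # approx. 0.01
dmultinom(W1, prob = mu_cake)   # approx. 4.2e-14
dmultinom(W2, prob = mu_curry)  # approx. 8.9e-09
dmultinom(W2, prob = mu_cake)   # approx. 0.005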
The likelihood of a document under each language model is only part of what we need: to classify a document we want the probability that it belongs to each category, given its words. Bayes' theorem gives us this quantity:

\[P(y_i = C_k|W_i) = \frac{P(y_i = C_k)P(W_i|y_i=C_k)}{P(W_i)}\]

- \(P(y_i = C_k|W_i)\) is the posterior probability that document \(i\) belongs to category \(k\)
- \(P(W_i|y_i=C_k)\) is the likelihood of the words in document \(i\) given category \(k\)
- \(P(y_i = C_k)\) is the prior probability of category \(k\)
- \(P(W_i)\) is the unconditional probability of the words in document \(i\)
Generally, we will want to compare these probabilities across classes for the same document. Because \(P(W_i)\) does not depend on the class, we can drop the denominator and work with:

\[P(y_i = C_k|W_i) \propto P(y_i = C_k)P(W_i|y_i=C_k)\]
To work out whether a document should be labelled as belonging to a particular class, we therefore need to work out:
the prior probability (\(\color{violet}{P(Y = C_k)}\)) that the document is from category \(k\)
the conditional probability or likelihood (\(\color{violet}{P(W|Y=C_k)}\)) of the words in the document occurring in category \(k\)
\[\begin{eqnarray} P(W_i|y_i = C_k) &=& \frac{M_i!}{\prod_{j=1}^JW_{i,j}!}\prod_{j=1}^J\mu_{j(k)}^{W_{ij}}\\ &\propto&\prod_{j=1}^J\mu_{j(k)}^{W_{ij}} \end{eqnarray}\]
Question: How do we estimate \(\mu\)?
\[\hat{\mu}_{j(k)} = \frac{\color{violet}{W_{j(k)}}}{\color{darkred}{\sum_{j\in V}W_{j(k)}}} = \frac{\text{number of times j appears in category k}}{\text{total number of words in category k}}\]
Example:
In the curry recipes in our training data, we observe…
In the not-curry recipes in our training data, we observe…
The word “turmeric” is about 10 times more common in curry recipes than in other recipes
Problem: the word “duck” never appears in any curry recipe in our training data, so its estimated probability under the curry model is zero:

\[\frac{\color{violet}{W_{\text{duck}(\text{curry})}}}{\color{darkred}{\sum_{j\in V}W_{j(\text{curry})}}} = \frac{\color{violet}{0}}{\color{darkred}{10586}} = 0\]
> "For this curry you will need to coat the duck legs with 1 tsp ground turmeric"
Because document probabilities are products of word probabilities, any recipe containing the word “duck” would then receive zero probability of being a curry, regardless of its other words. The standard fix is Laplace (add-one) smoothing, which adds one to every count:

\[\frac{\color{violet}{W_{\text{duck}(\text{curry})}+1}}{\color{darkred}{\sum_{j\in V}(W_{j(\text{curry})}+1)}} = \frac{\color{violet}{1}}{\color{darkred}{10587}} = 0.00009\]
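The textmodel_nb() function used below applies this kind of add-one smoothing by default. As a rough sketch of the same calculation done by hand, assuming recipe_dfm$curry holds the class labels:

# Total count of each word within each class
class_counts <- dfm_group(recipe_dfm, groups = recipe_dfm$curry)

# Add-one (Laplace) smoothing, then normalise within each class
smoothed <- as.matrix(class_counts) + 1
mu_hat <- sweep(smoothed, 1, rowSums(smoothed), "/")

# Highest-probability words in the curry class
sort(mu_hat["Curry", ], decreasing = TRUE)[1:10]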
By treating documents as bags of words we are assuming:
Conditional independence of word counts
Positional independence of word counts
While this is a very simple model of language which is “wrong”, it is nevertheless useful for classification.
The classification decision made by the Naive Bayes model is simple: we assign document \(i\) to the category, \(k\), for which it has the highest posterior probability:
\[ \hat{Y}_i = \underset{k \in \{1,...,K\}}{\operatorname{argmax}} P(y_i = C_k) \times P(W_i|y_i = C_k) \]
where \(\underset{k \in \{1,...,K\}}{\operatorname{argmax}}\) means “which category, \(k\), has the maximum posterior probability”.
Intuition:
Assign documents to categories when the probability of observing the words in that document is high given the probability distribution for that category (i.e. when \(P(W_i|y_i = C_k)\) is large)
Assign more documents to categories that contain more documents in the training data (i.e. when \(P(y_i = C_k)\) is large)
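The fitting code itself is not shown in the slides, but given the call echoed in the output below it would look roughly as follows; nb_output is assumed to be the object name, since that is the name used for the predictions later on:

# Fit the Naive Bayes classifier on the full dfm, with class priors set to the
# observed document frequencies of each class
library(quanteda.textmodels)
nb_output <- textmodel_nb(x = recipe_dfm,
                          y = recipe_dfm$curry,
                          prior = "docfreq")
summary(nb_output)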
Call:
textmodel_nb.dfm(x = recipe_dfm, y = recipe_dfm$curry, prior = "docfreq")
Class Priors:
(showing first 2 elements)
Curry Not Curry
0.0309 0.9691
Estimated Feature Scores:
beef boned rolled pint red wine vinegar
Curry 0.001378 0.0014925 0.0001148 0.001148 0.011481 0.001607 0.001378
Not Curry 0.003304 0.0006107 0.0004031 0.003847 0.009619 0.007750 0.005154
sugar allspice bay leaves thyme peppercorns crushed
Curry 0.00620 0.0002296 0.002067 0.01378 0.0004592 0.003789 0.009070
Not Curry 0.01872 0.0003420 0.002821 0.01063 0.0048734 0.001826 0.005417
english dijon mustard unsalted room temperature lard
Curry 0.0002296 0.0001148 0.005166 0.001493 0.0001148 0.0001148 0.0001148
Not Curry 0.0007023 0.0009832 0.002803 0.004953 0.0005741 0.0005924 0.0004153
plain flour white water chilled icing chicken
Curry 0.004822 0.006085 0.00287 0.007003 0.0001148 0.0003444 0.005855
Not Curry 0.008611 0.014871 0.01042 0.006693 0.0005863 0.0023573 0.007035
cut pieces
Curry 0.01297 0.005511
Not Curry 0.01336 0.003786
Recall that we are interested in the probability of observing word \(j\) given class \(k\), i.e.
\[\mu_{j(k)} = \frac{W_{j(k)}}{\sum_{j\in V}W_{j(k)}}\]
What are these word probabilities for our curry data?
We can examine the probability of each word given each class by calling the coef() function on the fitted nb_output object.
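For example (a sketch, assuming coef() returns a feature-by-class matrix of these probabilities, which is what the output below suggests):

# Class-conditional word probabilities, P(w_j | C_k)
word_probs <- coef(nb_output)

# The 20 highest-probability words in each class
sort(word_probs[, "Curry"], decreasing = TRUE)[1:20]
sort(word_probs[, "Not Curry"], decreasing = TRUE)[1:20]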
Words with highest probability in the “curry” class (i.e. \(P(w_j|c_k = \text{``curry''})\)):
seeds finely coriander peeled garlic vegetable
0.030080367 0.023191734 0.020551091 0.018025258 0.016647532 0.015269805
ginger cloves leaves green cumin cut
0.015154994 0.014695752 0.013777268 0.013662457 0.013547646 0.012973594
powder chilli red turmeric onion piece
0.012973594 0.012743972 0.011481056 0.011136625 0.010332951 0.010103330
sliced large
0.009644087 0.009529277
Words with highest probability in the “not curry” class (i.e. \(P(w_j|c_k = \text{``not curry''})\)):
finely sugar flour sliced garlic cut
0.021417317 0.018724122 0.014870592 0.014498064 0.013551476 0.013362158
peeled freerange leaves white juice caster
0.013301088 0.013252232 0.010632321 0.010424682 0.010296435 0.009844515
extra large red egg small plain
0.009807873 0.009630770 0.009618556 0.008659754 0.008653647 0.008610897
onion vegetable
0.008531506 0.008317760
What are the class-conditional word probabilities for “Aromatic blackeye bean curry”?
P(w|curry) P(w|not curry)
seeds 0.030 0.008
finely 0.023 0.021
coriander 0.021 0.005
peeled 0.018 0.013
garlic 0.017 0.014
ginger 0.015 0.005
cloves 0.015 0.008
leaves 0.014 0.011
cumin 0.014 0.002
chilli 0.013 0.006
onion 0.010 0.009
piece 0.010 0.002
What are the class-conditional word probabilities for “Schichttorte”?
P(w|curry) P(w|not curry)
large 0.010 0.010
sugar 0.006 0.019
flour 0.006 0.015
paste 0.006 0.001
plain 0.005 0.009
lemon 0.004 0.008
freerange 0.003 0.013
eggs 0.002 0.008
zest 0.002 0.005
unsalted 0.001 0.005
caster 0.001 0.010
cornflour 0.000 0.001
Which recipes are predicted to have a high curry probability?
recipe_dfm$curry_nb_probability <- predict(nb_output,
type = "probability")
recipe_dfm$recipe_name[order(recipe_dfm$curry_nb_probability[,1], decreasing = T)[1:10]]
[1] "Bengali butternut squash with chickpeas"
[2] "Chickpea curry with green mango and pomegranate"
[3] "Green coconut fish curry"
[4] "Thai green prawn curry"
[5] "Rogan josh"
[6] "Bengal coconut dal"
[7] "Tom yum soup"
[8] "Thai-style duck red curry"
[9] "Peppery hot cabbage salad"
[10] "Peppery hot cabbage salad"
Which recipes are predicted to have a low curry probability?
[1] "Sticky toffee apple pudding with calvados caramel sauce"
[2] "Rich moist all-purpose fruit cake"
[3] "Mini stollen "
[4] "Chocolate fruit cake"
[5] "Pheasant pithiviers"
[6] "Spiced poached pears with chocolate pudding"
[7] "Traditional Christmas pudding with brandy butter"
[8] "Intense chocolate cookies"
[9] "Cookies and cream fudge brownies"
[10] "Bonfire night brioche"
The purpose of training a classification model is to make out-of-sample predictions
Generally, we have a small hand-coded training dataset and then we predict for lots of other documents
Here, we are only predicting for one out-of-sample observation
ingredients <- c("cup olive oil, plus more for serving garlic cloves, chopped large yellow onion, chopped (2-inch) piece ginger, finely chopped Kosher salt and black pepper teaspoons ground turmeric, plus more for serving teaspoon red-pepper flakes, plus more for serving (15-ounce) cans chickpeas, drained and rinsed (15-ounce) cans full-fat coconut milk cups vegetable or chicken stock bunch Swiss chard, kale or collard greens, stems removed, torn into bite-size pieces cup leaves, mint for serving Yogurt, for serving (optional) Toasted pita, lavash or other flatbread, for serving (optional)")
dfm_stew <- tokens(ingredients) %>%
dfm() %>%
dfm_match(features = featnames(recipe_dfm))
predict(nb_output, newdata = dfm_stew, type = "probability")
Curry Not Curry
text1 0.9611718 0.03882815
Yes!
Advantages
Fast
Easy to apply
Can easily be extended to include…
Disadvantages
Independence assumption
Independence means NB is unable to account for interactions between words
Independence also means that NB is often overconfident
In some contexts, the independence assumption can decrease predictive accuracy
Linear classifier: the Naive Bayes decision boundary is a linear function of the word counts, so it cannot capture non-linear relationships between features and categories
How can we assess the classification performance of our supervised learning classifier?
Our goal is to measure the degree to which the predictions we make correspond to the observed data
We have already seen some ways to do this
In order to get informative estimates of these quantities, we need to distinguish between the performance of the classifier on the training set and the test set
The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.
In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training.
Training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.
We can think of the test error associated with any given statistical estimator as coming from two fundamental quantities:
Bias: the error introduced by approximating a potentially complicated relationship with a much simpler model
Variance: the amount by which our estimates would change if we estimated them using a different training dataset
Ideally we would like to minimize both variance and bias, but these goals are often at odds with each other.
As we use more flexible models, the variance will increase and the bias will decrease
The relative rate of change of these two quantities determines whether the test error increases or decreases
As we start to make the model more flexible the bias will tend to decrease faster than the variance will increase
After some point, adding more flexibility will decrease the bias a bit, but the variance will increase a lot
Implication: We need tools which tell us when we have reached the optimal balance between bias and variance.
We randomly divide the available set of samples into two parts: a training set and a test set.
The model is fit on the training set, and the fitted model is used to predict the responses for the test set.
We then calculate classification performance scores (accuracy, sensitivity, specificity, etc) for the test set.
Before we train a model, we need to separate our data into a training set and a test set:
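The code for the split is not shown in the slides; a minimal sketch is below. A 20% test set is assumed, which matches the 1,877 test-set observations in the confusion matrix that follows; the seed value is arbitrary.

# Randomly assign 80% of recipes to the training set and 20% to the test set
set.seed(123)
n_docs <- ndoc(recipe_dfm)
train_ids <- sample(1:n_docs, size = floor(0.8 * n_docs))
recipe_dfm$training_set <- (1:n_docs) %in% train_ids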
We then subset the recipe_dfm object into a training dfm and a test dfm:
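Using the training_set indicator created above, this might look like:

recipe_dfm_train <- dfm_subset(recipe_dfm, training_set)
recipe_dfm_test  <- dfm_subset(recipe_dfm, !training_set)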
We then train our Naive Bayes model on the training set:
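The training and evaluation code is not shown either; a sketch that mirrors the cross-validation function later in the lecture (and whose output is reproduced below) is:

# Train the Naive Bayes classifier on the training set only
nb_train <- textmodel_nb(x = recipe_dfm_train,
                         y = recipe_dfm_train$curry,
                         prior = "docfreq")

# Predict classes for the held-out test set
recipe_dfm_test$predicted_curry <- predict(nb_train,
                                           newdata = recipe_dfm_test,
                                           type = "class")

# Compare predictions with the true labels (confusionMatrix is from caret)
confusion_nb <- table(predicted_classification = recipe_dfm_test$predicted_curry,
                      true_classification = recipe_dfm_test$curry)
confusionMatrix(confusion_nb, positive = "Curry")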
Confusion Matrix and Statistics
true_classification
predicted_classification Curry Not Curry
Curry 38 101
Not Curry 15 1723
Accuracy : 0.9382
95% CI : (0.9263, 0.9487)
No Information Rate : 0.9718
P-Value [Acc > NIR] : 1
Kappa : 0.3701
Mcnemar's Test P-Value : 2.973e-15
Sensitivity : 0.71698
Specificity : 0.94463
Pos Pred Value : 0.27338
Neg Pred Value : 0.99137
Prevalence : 0.02824
Detection Rate : 0.02025
Detection Prevalence : 0.07405
Balanced Accuracy : 0.83080
'Positive' Class : Curry
Implication:
Relative to the dictionary approach we are…
…doing a better job on predicting true positives now (our sensitivity is much higher)
…predicting too many curries that are actually something else (our specificity is a little lower)
The test set and training set accuracy can be very different
As a model becomes more flexible, its training-set accuracy will tend to keep improving, while its test-set accuracy will eventually begin to fall
Imagine that we include a very large number of features in our dfm
How does the training/test set accuracy change as we increase the number of features used to train the classifier?
Question: Why does the test-set accuracy decrease when we add additional features?
Answer: Because we are now overfitting our data.
Overfitting occurs when we find relationships between words (or n-grams) and curries in our training data that do not generalise to our test data
In this example, there are some n-gram phrases that appear frequently in the curry recipes in our training set but which never appear in our test-set curry recipes
Feature | Training | Test |
---|---|---|
mustard_seeds_tsp | 18 | 0 |
tsp_black_mustard | 15 | 0 |
tsp_black_mustard_seeds | 15 | 0 |
leaves_and_stalks | 20 | 0 |
black_mustard_seeds_tsp | 10 | 0 |
coriander_leaves_and | 14 | 0 |
cumin_seeds_tsp_black | 7 | 0 |
coriander_leaves_and_stalks | 13 | 0 |
large_garlic_cloves | 9 | 0 |
chopped_garlic_cloves_peeled_and | 15 | 0 |
We can use the test-set performance statistics to select between model specifications
We will compare the accuracy, sensitivity and specificity for the following models:
The “best” model is the one which has the highest classification scores
Model | Accuracy | Sensitivity | Specificity | N features |
---|---|---|---|---|
Original | 0.94 | 0.78 | 0.95 | 902 |
Raw | 0.96 | 0.65 | 0.97 | 4214 |
No stop words | 0.96 | 0.66 | 0.97 | 4126 |
Trimmed | 0.94 | 0.79 | 0.95 | 1339 |
N-gram | 0.98 | 0.51 | 1 | 152215 |
N-gram, trimmed | 0.94 | 0.83 | 0.94 | 6072 |
The “n-gram” model has the highest accuracy, but has very low sensitivity
The “n-gram, trimmed” model outperforms all other models in sensitivity
To calculate the test-set accuracy we randomly allocated observations to the test and training sets
If we repeat this process with a new randomization, we will get slightly different test-set performance scores
Original train-test split:
Model | Accuracy | Sensitivity | Specificity | N features |
---|---|---|---|---|
Original | 0.94 | 0.78 | 0.95 | 902 |
Raw | 0.96 | 0.65 | 0.97 | 4214 |
No stop words | 0.96 | 0.66 | 0.97 | 4126 |
Trimmed | 0.94 | 0.79 | 0.95 | 1339 |
N-gram | 0.98 | 0.51 | 1 | 152215 |
N-gram, trimmed | 0.94 | 0.83 | 0.94 | 6072 |
Re-randomization 1:
Model | Accuracy | Sensitivity | Specificity | N features |
---|---|---|---|---|
Original | 0.94 | 0.77 | 0.95 | 902 |
Raw | 0.96 | 0.66 | 0.97 | 4214 |
No stop words | 0.96 | 0.66 | 0.97 | 4126 |
Trimmed | 0.94 | 0.78 | 0.94 | 1339 |
N-gram | 0.98 | 0.45 | 1 | 152215 |
N-gram, trimmed | 0.93 | 0.83 | 0.94 | 6072 |
Re-randomization 2:
Model | Accuracy | Sensitivity | Specificity | N features |
---|---|---|---|---|
Original | 0.94 | 0.78 | 0.95 | 902 |
Raw | 0.96 | 0.65 | 0.97 | 4214 |
No stop words | 0.96 | 0.66 | 0.97 | 4126 |
Trimmed | 0.94 | 0.79 | 0.95 | 1339 |
N-gram | 0.98 | 0.51 | 1 | 152215 |
N-gram, trimmed | 0.94 | 0.83 | 0.94 | 6072 |
Re-randomization 3:
Model | Accuracy | Sensitivity | Specificity | N features |
---|---|---|---|---|
Original | 0.94 | 0.78 | 0.95 | 902 |
Raw | 0.96 | 0.64 | 0.97 | 4214 |
No stop words | 0.96 | 0.69 | 0.97 | 4126 |
Trimmed | 0.94 | 0.78 | 0.95 | 1339 |
N-gram | 0.98 | 0.51 | 1 | 152215 |
N-gram, trimmed | 0.94 | 0.83 | 0.94 | 6072 |
The simple validation approach suffers from two weaknesses: the test-set performance estimates can be highly variable, depending on exactly which observations fall into the training and test sets, and only a subset of the observations is used to train the model
Implication: We need a method that uses all data for training and generates more stable test-set accuracy.
Cross-validation is an alternative to a simple train-test split
This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size
For each of the \(k\) folds, we hold that fold out as a test set, train the model on the remaining \(k-1\) folds, and predict the classes of the held-out observations
We then calculate the performance metrics by averaging over those computed on each fold
get_performance_scores <- function(held_out){
# Set up train and test sets for this fold
recipe_dfm_train <- dfm_subset(recipe_dfm, !held_out)
recipe_dfm_test <- dfm_subset(recipe_dfm, held_out)
# Train model on everything except held-out fold
nb_train <- textmodel_nb(x = recipe_dfm_train,
y = recipe_dfm_train$curry,
prior = "docfreq")
# Predict for held-out fold
recipe_dfm_test$predicted_curry <- predict(nb_train,
newdata = recipe_dfm_test,
type = "class")
# Calculate accuracy, specificity, sensitivity
confusion_nb <- table(predicted_classification = recipe_dfm_test$predicted_curry,
true_classification = recipe_dfm_test$curry)
confusion_nb_statistics <- confusionMatrix(confusion_nb, positive = "Curry")
accuracy <- confusion_nb_statistics$overall[1]
sensitivity <- confusion_nb_statistics$byClass[1]
specificity <- confusion_nb_statistics$byClass[2]
return(data.frame(accuracy, sensitivity, specificity))
}
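The slides do not show how the folds are constructed; a minimal sketch, assuming five folds of roughly equal size (matching the five sets of scores reported below), is:

# Randomly assign each document to one of k = 5 folds (seed value is arbitrary)
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = ndoc(recipe_dfm)))

# Performance when the first fold is held out
get_performance_scores(folds == 1)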
accuracy sensitivity specificity
Accuracy 0.9418182 0.754386 0.9475375
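Applying the function to each held-out fold in turn (again a sketch, using the hypothetical folds assignment created above):

fold_scores <- lapply(1:k, function(i) get_performance_scores(folds == i))
fold_scores

The cross-validated metrics in the tables further below are the averages of these fold-level scores (e.g. colMeans(do.call(rbind, fold_scores))).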
[[1]]
accuracy sensitivity specificity
Accuracy 0.9418182 0.754386 0.9475375
[[2]]
accuracy sensitivity specificity
Accuracy 0.9389356 0.6923077 0.9463358
[[3]]
accuracy sensitivity specificity
Accuracy 0.935911 0.7666667 0.9414661
[[4]]
accuracy sensitivity specificity
Accuracy 0.9420829 0.7704918 0.9478309
[[5]]
accuracy sensitivity specificity
Accuracy 0.9380252 0.7166667 0.9452278
Model | Accuracy | Sensitivity | Specificity |
---|---|---|---|
Original | 0.94 | 0.74 | 0.95 |
Raw | 0.96 | 0.6 | 0.97 |
No stop words | 0.96 | 0.61 | 0.97 |
Trimmed | 0.94 | 0.74 | 0.94 |
N-gram | 0.96 | 0.21 | 0.99 |
N-gram, trimmed | 0.93 | 0.76 | 0.93 |
Model | Accuracy | Sensitivity | Specificity |
---|---|---|---|
Original | 0.94 | 0.74 | 0.94 |
Raw | 0.96 | 0.62 | 0.97 |
No stop words | 0.96 | 0.64 | 0.97 |
Trimmed | 0.94 | 0.74 | 0.94 |
N-gram | 0.96 | 0.26 | 0.98 |
N-gram, trimmed | 0.93 | 0.77 | 0.93 |
Cross-validation is a very general strategy for evaluating predictive fit
Which variables should I use to predict my outcome?
Should I use a linear model, or a non-linear model?
Should I just use \(X\) in my regression? Or should I also use \(X^2\)? (or \(X^3\)? or \(X^4\)?)
Etc
Naive Bayes is only one supervised learning text-classification method
Regularized Logistic Regression
Directly models the probability that each document is in class \(k\) using logistic regression
Regularization is required to prevent overfitting the data
textmodel_lr() in quanteda
Support Vector Machines
SVMs draw a hyperplane through the multidimensional word space that best separates documents into different classes
Can accommodate non-linear boundaries between classes
textmodel_svm() in quanteda
“Tree-based” Classification Methods
Tree-based methods separate classes by segmenting the predictors (word counts) into a number of distinct regions
The modal outcome for observations that fall within a given region becomes the predicted category for any observation in that region
Like the SVM, this allows for non-linear relationships between features and categories
tree package in R
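The regularized logistic regression and SVM classifiers in quanteda.textmodels share the interface we used for textmodel_nb(); a minimal sketch, reusing the training and test dfms from earlier (the object names are assumptions):

library(quanteda.textmodels)

# Regularized (penalized) logistic regression
lr_output <- textmodel_lr(x = recipe_dfm_train, y = recipe_dfm_train$curry)

# Linear support vector machine
svm_output <- textmodel_svm(x = recipe_dfm_train, y = recipe_dfm_train$curry)

# Out-of-sample class predictions, exactly as with textmodel_nb()
predict(lr_output, newdata = recipe_dfm_test, type = "class")
predict(svm_output, newdata = recipe_dfm_test, type = "class")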
Supervised learning for text data allows us to learn the association between words and particular outcome categories
The Naive Bayes model is a simple model that is fast to implement and which, despite some strong assumptions, tends to provide good classification results
Once we have trained our supervised learning classifiers, it is important to validate their performance on a test-set that was not used to fit the model
Cross-validation is a general strategy for out-of-sample evaluation that can help us to choose between different models
Today we will learn how to implement Naive Bayes models, and cross-validation, in R.
PUBL0099