2  Dictionaries

2.1 Measuring Moral Sentiment

Moral Foundations Theory is a social psychological theory which proposes that people’s moral reasoning is based on five distinct moral values. These foundations are care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, and sanctity/degradation. According to the theory, people rely on these values when making moral judgments about different issues and situations. The theory also suggests that people differ in the importance they place on each of these values, and that these differences can help explain cultural and individual variation in moral judgment.1 The table below gives a brief overview of these foundations:

  • 1 Moral Foundations Theory was developed and popularised by Jonathan Haidt and his coauthors. For a good and accessible introduction, see Haidt’s book on the topic.

  • Moral Foundations Theory categories.

    Foundation   Description
    ----------   -----------
    Care         Concern for caring behaviour, kindness, compassion
    Fairness     Concern for fairness, justice, trustworthiness
    Authority    Concern for obedience and deference to figures of authority (religious, state, etc.)
    Loyalty      Concern for loyalty, patriotism, self-sacrifice
    Sanctity     Concern for temperance, chastity, piety, cleanliness

    Can we detect the use of these moral foundations from written text? Moral framing and rhetoric play an important role in political argument and entrenched moral divisions are frequently cited as the root cause of political polarization, particularly in online settings. Before we can answer important research questions such as whether there are large political differences in the use of moral language (as here, for example), or whether moral argument can reduce political polarization (as here, for example), we need to be able to measure the use of moral language at scale. In this seminar, we will therefore use a simple dictionary analysis to measure the degree to which a set of online texts display the types of moral language described by Moral Foundations Theory.

    2.1.1 Data

    We will use two sources of data for today’s assignment.

    1. Moral Foundations Dictionary: mft_dictionary.csv

      • This file contains lists of words that are thought to indicate the presence of different moral concerns in text. The dictionary was originally developed by Jesse Graham and Jonathan Haidt and is described in more detail in this paper.
      • The file includes 5 categories of moral concern – authority, loyalty, sanctity, fairness, and care – each of which is associated with a different list of words.
    2. Moral Foundations Reddit Corpus: mft_texts.csv

      • This file contains 17886 English Reddit comments that have been curated from 11 distinct subreddits. In addition to the texts – which cover a wide variety of different topics – the data also includes hand-annotations by trained annotators for the different categories of moral concern described by Moral Foundations Theory.

    Once you have downloaded these files and stored them somewhere sensible, you can load them into R using the following commands:

    mft_dictionary_words <- read_csv("mft_dictionary.csv")
    mft_texts <- read_csv("mft_texts.csv")

    2.1.2 Packages

    You will need to load the following packages before beginning the assignment:

    library(tidyverse)
    library(quanteda)
    # Run the following two lines of code to install quanteda.dictionaries, then
    # comment them out again once the installation is complete
    # install.packages("devtools")
    # devtools::install_github("kbenoit/quanteda.dictionaries")
    library(quanteda.dictionaries)

    2.2 Dictionaries

    2.2.1 Descriptive statistics

    1. Convert the mft_texts object into a corpus using the corpus() function (you will need to set the text_field argument to be equal to "text").
    Reveal code
    mft_corpus <- mft_texts %>% corpus(text_field = "text")
    2. Use the ntoken() function to measure the number of tokens in each text in the corpus. Assign the output of that function to a new object so that you can use it later.
    Reveal code
    mft_n_words <- ntoken(mft_corpus)
    3. Create a histogram using the hist() function to show the distribution of document lengths in the corpus. Use the summary() function to report some measures of central tendency. Interpret these outputs.
    Reveal code
    hist(mft_n_words)

    summary(mft_n_words)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       7.00   20.00   31.00   39.04   53.00  176.00 

    The median length of the comments in the Reddit corpus is 31 words. The figure shows that the distribution of lengths across the corpus is positively skewed, as the mass of the distribution is concentrated on the left.

    4. Create a document-feature matrix from the corpus object that you created above. What decisions are you going to make about feature selection? Does it matter for this week?
    Reveal code
    mft_dfm <- mft_corpus %>% 
      tokens(remove_punct = TRUE) %>% 
      dfm() %>%
      dfm_trim(min_termfreq = 5)

    Here, we have removed punctuation and kept only words that appear at least 5 times in the corpus. However, when we apply a dictionary method, the feature-selection decisions we make here matter less, as we are about to ignore any word that does not appear in our dictionary.
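To see why, here is a minimal toy sketch (the texts and the one-category dictionary are made up for illustration): dfm_lookup() only counts features that match a dictionary entry, so words outside the dictionary are ignored regardless of how we selected features.

```r
library(quanteda)

# Toy texts and a toy one-category dictionary (hypothetical, for illustration)
toy_dfm <- dfm(tokens(c("we care about harm", "taxes and trade")))
toy_dict <- dictionary(list(care = c("care", "harm")))

# Only "care" and "harm" are counted; all other features are dropped
dfm_lookup(toy_dfm, toy_dict)
```

The first text matches two dictionary words, the second none, whatever else appears in the dfm.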

    2.2.2 Create a dictionary

    Dictionaries are named lists, consisting of a “key” and a set of entries defining the equivalence class for the given key. In quanteda, dictionaries are created using the dictionary() function. This function takes as input a list, which should contain a number of named character vectors.

    For instance, say we wanted to create a simple dictionary for measuring words related to the two courses: quantitative text analysis and causal inference. We would do so by first creating vectors of words that we think measures each concept and store them in a list:

    teaching_dictionary_list <- list(qta = c("quantitative", "text", "analysis", "document", "feature", "matrix"),
                                     causal = c("potential", "outcomes", "framework", "counterfactual"))

    And then we would pass that vector to the dictionary() function:

    teaching_dictionary <- dictionary(teaching_dictionary_list)
    teaching_dictionary
    Dictionary object with 2 key entries.
    - [qta]:
      - quantitative, text, analysis, document, feature, matrix
    - [causal]:
      - potential, outcomes, framework, counterfactual

    We could then, of course, expand the number of categories and also add other words to each category.
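For instance, continuing the toy example above, we could add a third (hypothetical) category and rebuild the dictionary object. The list is repeated here so the snippet runs on its own:

```r
library(quanteda)

# The list from above, repeated so the snippet is self-contained
teaching_dictionary_list <- list(
  qta = c("quantitative", "text", "analysis", "document", "feature", "matrix"),
  causal = c("potential", "outcomes", "framework", "counterfactual"))

# Add a hypothetical third category and rebuild the dictionary
teaching_dictionary_list$stats <- c("regression", "estimate", "variance")
teaching_dictionary <- dictionary(teaching_dictionary_list)
```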

    Before starting on the coding questions, take a look at the mft_dictionary.csv file. Do the words associated with each foundation make sense to you? Do any of them not make sense? Remember, constructing a dictionary requires a lot of subjective judgement and the words that are included will, of course, have a large bearing on the results of any analysis that you conduct!

    1. Create a quanteda dictionary object from the words in the mft_dictionary_words data. Your dictionary should have 2 categories – one for the “care” foundation and one for the “sanctity” foundation.

    Hint: To do so, you will need to subset the word variable to only those words associated with each foundation.

    Reveal code
    mft_dictionary_list <- list(
      care = mft_dictionary_words$word[mft_dictionary_words$foundation == "care"],
      sanctity = mft_dictionary_words$word[mft_dictionary_words$foundation == "sanctity"]
      )
    
    mft_dictionary <- dictionary(mft_dictionary_list)
    mft_dictionary
    Dictionary object with 2 key entries.
    - [care]:
      - alleviate, alleviated, alleviates, alleviating, alleviation, altruism, altruist, beneficence, beneficiary, benefit, benefits, benefitted, benefitting, benevolence, benevolent, care, cared, caregiver, cares, caring [ ... and 444 more ]
    - [sanctity]:
      - abstinance, abstinence, allah, almighty, angel, apostle, apostles, atone, atoned, atonement, atones, atoning, beatification, beatify, beatifying, bible, bibles, biblical, bless, blessed [ ... and 640 more ]
    2. Use the dictionary that you created in the question above and apply it to the document-feature matrix that you created earlier in this assignment. To do so, you will need to use the dfm_lookup() function. Look at the help file for this function (?dfm_lookup) if you need to.
    Reveal code
    mft_dfm_dictionary <- dfm_lookup(mft_dfm, mft_dictionary)
    3. The dictionary-dfm that you just created records the number of words in each text that are related to each moral foundations category. As we saw earlier, however, not all texts have the same number of words! Create a new version of the dictionary-dfm which contains the proportion of words in each text that are related to each of the moral foundation categories.
    Reveal code
    mft_dfm_dictionary_proportions <- mft_dfm_dictionary/mft_n_words
    4. Store the dictionary scores for each foundation as new variables in the original mft_texts data.frame. You will need to use the as.numeric function to coerce each column of the dfm to the right format to be stored in the data.frame.
    Reveal code
    mft_texts$care_dictionary <- as.numeric(mft_dfm_dictionary_proportions[,1])
    mft_texts$sanctity_dictionary <- as.numeric(mft_dfm_dictionary_proportions[,2])

    Note that the code here is simply taking the output of applying the dictionaries and assigning those values to the mft_texts data.frame. This just makes later steps of the analysis a little simpler.

    2.2.3 Validity checks

    Now that we have constructed our dictionary measure, we will conduct some basic validity checks. We will start by directly examining the texts that are scored highly by the dictionaries in each of the categories.

    To do so, we need to order the texts by the scores they were assigned by the dictionary analysis. For example, for the “care” foundation, we could use the following code:

    mft_texts$text[order(mft_texts$care_dictionary, decreasing = TRUE)][1:5]
    • the square brackets operator ([]) allows us to subset the mft_texts$text variable
    • the order() function orders observations according to their value on the mft_texts$care_dictionary variable, and decreasing = TRUE means that the order will be from largest values to smallest values
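As a quick toy illustration (with made-up scores), order() returns the positions that would sort the vector, and we then use those positions to index into the texts:

```r
# Hypothetical scores and texts
scores <- c(0.2, 0.9, 0.5)
texts <- c("first", "second", "third")

# order() gives the positions of the largest values first
order(scores, decreasing = TRUE)

# Using those positions to reorder the texts themselves
texts[order(scores, decreasing = TRUE)]
```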
    1. Find the 5 texts that have the highest dictionary scores for the care and sanctity foundations.
    Reveal code
    mft_texts$text[order(mft_texts$care_dictionary, decreasing = TRUE)][1:5]
    [1] "i doubt she'd want \"help\" from a childhood bully 🙄"                                                                                                        
    [2] "We are talking about threats to human safety, not threats to property."                                                                                       
    [3] "What guarantees adoptive parents would have been loving? That the child wouldn’t suffer"                                                                      
    [4] "The dress design didn’t have anything to do with child sexual assault victims. https://www.papermag.com/alexander-mcqueen-dancing-girls-dress-2645945769.html"
    [5] "This right here. Threatening violence of any kind is not ok, sexual rape violence SO not ok."                                                                 
    mft_texts$text[order(mft_texts$sanctity_dictionary, decreasing = TRUE)][1:5]
    [1] "hot fucking damn macron this last bit was inspiring as hell"              
    [2] "Fuck you. Seriously, fuck you. Get your shit together you fucking junkie."
    [3] "Hell yeah! That’s some hardcore nostalgia right there god DAMN."          
    [4] "This is so fucking sad. I hate this god damned country"                   
    [5] "Holy fuck. This makes OP TA forever. How fucking awful."                  
    2. Read the texts most strongly associated with both the care and sanctity foundations. Come to a judgement about whether you think the dictionaries are accurately capturing the concepts of care and sanctity.
    Reveal code

    In many cases these seem like reasonable texts to associate with these categories. Many of the “care” texts do seem to involve descriptions of harm to people, for example. In other cases, the dictionaries seem to be picking up on the wrong thing. For instance, all of the sanctity examples score highly essentially because of their use of swear words. While swearing might be correlated with moral sensitivity towards sanctity issues, it is probably not constitutive of such concerns, which suggests that the sanctity dictionary might be improved.

    3. The mft_texts object includes a series of variables that record the human annotations of which category each text falls into. Use the table() function to create a confusion matrix which compares the human annotations to the scores generated by the dictionary analysis.

    Hint: For this comparison, convert your dictionary scores to a logical vector that is equal to TRUE if the text contains any word from a given dictionary, and FALSE otherwise.

    Reveal code
    care_confusion <- table( 
      dictionary = mft_texts$care_dictionary > 0,
      human_coding = mft_texts$care)
    care_confusion
              human_coding
    dictionary FALSE  TRUE
         FALSE 11304  2621
         TRUE   1842  2119
    sanctity_confusion <- table(
      dictionary = mft_texts$sanctity_dictionary > 0,
      human_coding = mft_texts$sanctity)
    sanctity_confusion
              human_coding
    dictionary FALSE  TRUE
         FALSE 13434   927
         TRUE   2705   820
    4. For each foundation, what is the accuracy of the classifier?
    Reveal code
    # Care
    (2119 + 11304)/(2119 + 11304 + 2621 + 1842)
    [1] 0.7504752
    # Sanctity
    (820 + 13434)/(820 + 13434 + 927 + 2705)
    [1] 0.7969362
    5. For each foundation, what is the sensitivity of the classifier?
    Reveal code
    # Care
    (2119)/(2119 + 2621)
    [1] 0.4470464
    # Sanctity
    820/(820 + 927)
    [1] 0.4693761
    6. For each foundation, what is the specificity of the classifier?
    Reveal code
    # Care
    (11304)/(11304 + 1842)
    [1] 0.8598813
    # Sanctity
    13434/(13434 + 2705)
    [1] 0.8323936
    7. What do these figures tell us about the performance of our dictionaries?
    Reveal code

    Although the accuracy figures are relatively high, the sensitivity scores suggest that the dictionaries are not doing a very good job of picking up on the true “care” and “sanctity” codings in these texts. The dictionaries identify under 50% of the true-positive cases for both concepts.
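The arithmetic in the answers above can be wrapped in a small helper function (not part of the assignment code, just a convenience sketch) that computes all three metrics from a 2×2 confusion table laid out as above, with dictionary predictions in the rows and human codings in the columns:

```r
# Compute accuracy, sensitivity, and specificity from a 2x2 confusion table
# with rows = dictionary (FALSE/TRUE) and columns = human_coding (FALSE/TRUE)
classification_metrics <- function(confusion) {
  tn <- confusion["FALSE", "FALSE"]  # true negatives
  fn <- confusion["FALSE", "TRUE"]   # false negatives
  fp <- confusion["TRUE", "FALSE"]   # false positives
  tp <- confusion["TRUE", "TRUE"]    # true positives
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}
```

Applying it as classification_metrics(care_confusion) and classification_metrics(sanctity_confusion) should reproduce the numbers calculated by hand above.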

    2.2.4 Applications

    With these validation checks complete (notwithstanding the relatively weak agreement between our dictionary scores and the manual codings), we are now in a position to move on to a simple application.

    1. Calculate the mean care and sanctity dictionary scores for each of the 11 subreddits in the data.

    Hint: There are multiple ways of doing this, but one way is to combine the group_by() and summarise() functions described in the Introduction to Tidyverse page of this site.

    Reveal code
    dictionary_means_by_subreddit <- mft_texts %>%
      group_by(subreddit) %>%
      summarise(care_dictionary = mean(care_dictionary),
                sanctity_dictionary = mean(sanctity_dictionary)) 
    
    dictionary_means_by_subreddit
    # A tibble: 11 × 3
       subreddit           care_dictionary sanctity_dictionary
       <chr>                         <dbl>               <dbl>
     1 AmItheAsshole               0.0112              0.00893
     2 Conservative                0.00901             0.00934
     3 antiwork                    0.0101              0.0111 
     4 confession                  0.0110              0.0118 
     5 europe                      0.00471             0.00420
     6 geopolitics                 0.00394             0.00137
     7 neoliberal                  0.00419             0.00617
     8 nostalgia                   0.00753             0.00906
     9 politics                    0.00880             0.0103 
    10 relationship_advice         0.0123              0.0102 
    11 worldnews                   0.00623             0.00583
    • group_by() is a special function which allows us to apply data operations on groups of the named variable (here, we are grouping by subreddit, as we want to know the mean dictionary scores for each subreddit)
    • summarise() is a function which allows us to calculate various types of summary statistic for our data
    2. Interpret the subreddit averages that you have just constructed.
    Reveal code

    It is probably easier to display this information in a plot:

    dictionary_means_by_subreddit %>%
      # Transform the data to "long" format for plotting
      pivot_longer(-subreddit) %>%
      # Use ggplot
      ggplot(aes(x = name, y = subreddit, fill = value)) + 
      # geom_tile creates a heatmap
      geom_tile() + 
      # change the colours to make them prettier
      scale_fill_gradient(low = "white", high = "purple") + 
      # remove the axis labels
      xlab("") + 
      ylab("") 

    The “geopolitics” subreddit appears to feature very little language related to sanctity, while the “confession” subreddit uses a lot of sanctity-related language.

    Care-based language is most prevalent in the “relationship_advice” and “AmItheAsshole” subreddits.

    3. What is the correlation between the dictionary scores for care and sanctity? Are the foundations strongly related to each other?

    Hint: The correlation between two variables can be calculated using the cor() function.

    Reveal code
    cor(mft_texts$care_dictionary, mft_texts$sanctity_dictionary)
    [1] 0.03597461

    No, the correlation between care- and sanctity-based language is very low.

    2.3 Homework

    1. Replicate the analyses in this assignment for the three remaining foundations (fairness, loyalty, and authority) included in the dictionary. Create a plot showing the average dictionary score for each foundation, for each subreddit.
    Reveal code
    mft_dictionary_list <- list(
      care = mft_dictionary_words$word[mft_dictionary_words$foundation == "care"],
      sanctity = mft_dictionary_words$word[mft_dictionary_words$foundation == "sanctity"],
      authority = mft_dictionary_words$word[mft_dictionary_words$foundation == "authority"],
      fairness = mft_dictionary_words$word[mft_dictionary_words$foundation == "fairness"],
      loyalty = mft_dictionary_words$word[mft_dictionary_words$foundation == "loyalty"]
      )
    
    mft_dictionary <- dictionary(mft_dictionary_list)
    
    mft_dfm_dictionary <- dfm_lookup(mft_dfm, mft_dictionary)
    
    mft_dfm_dictionary_proportions <- mft_dfm_dictionary/mft_n_words
    
    mft_texts$care_dictionary <- as.numeric(mft_dfm_dictionary_proportions[,1])
    mft_texts$sanctity_dictionary <- as.numeric(mft_dfm_dictionary_proportions[,2])
    mft_texts$authority_dictionary <- as.numeric(mft_dfm_dictionary_proportions[,3])
    mft_texts$fairness_dictionary <- as.numeric(mft_dfm_dictionary_proportions[,4])
    mft_texts$loyalty_dictionary <- as.numeric(mft_dfm_dictionary_proportions[,5])
    
    dictionary_means_by_subreddit <- mft_texts %>%
      group_by(subreddit) %>%
      summarise(care_dictionary = mean(care_dictionary),
                sanctity_dictionary = mean(sanctity_dictionary),
                authority_dictionary = mean(authority_dictionary),
                fairness_dictionary = mean(fairness_dictionary),
                loyalty_dictionary = mean(loyalty_dictionary)) 
    
    
    dictionary_means_by_subreddit %>%
      # Transform the data to "long" format for plotting
      pivot_longer(-subreddit) %>%
      # Use ggplot
      ggplot(aes(x = name, y = subreddit, fill = value)) + 
      # geom_tile creates a heatmap
      geom_tile() + 
      # change the colours to make them prettier
      scale_fill_gradient(low = "white", high = "purple") + 
      # remove the axis labels
      xlab("") + 
      ylab("") 

    2. Write a short new dictionary that captures a concept of interest to you. Implement this dictionary using quanteda and apply it to the Moral Foundations Reddit Corpus. Create one plot communicating something interesting that you have found from this exercise.

    Upload both of your plots to this Moodle page.

    Reveal code

    You could have specified any dictionary here. Some of the more creative ones this week included a dictionary of gratitude (words like “thanks”, “grateful”, “thank you”, etc.), a dictionary of political words, and a dictionary of references to Donald Trump!