2: Similarity, Difference and Complexity

Jack Blumenau

Motivating Example

How similar are these two modules?

[1] "Causal Inference (PUBL0050)"
[1] "This course provides an introduction to statistical methods used for causal inference in the social sciences. We will be concerned with understanding how and when it is possible to make causal claims in empirical research. In particular, we will focus on understanding which assumptions are necessary for giving research a causal interpretation, and on learning a range of approaches that can be used..."
[1] "Quantitative Text Analysis for Social Science (PUBL0099)"
[1] "Growth of text data in recent years, and the development of a set of sophisticated tools for analysing that data, offers important opportunities for social scientists to study questions that were previously amenable to only qualitative analyses.\n\nThis module will allow students to take advantage of these opportunities by providing them with an understanding of, and ability to apply, tools of quantitative text analysis..."

We will use data from the universe of modules taught at UCL to evaluate the similarity between these courses and others that you might have taken (given very different life choices).

Module Catalogue

Course Outline

  1. Representing Text as Data (I): Bag-of-Words
  2. Similarity, Difference, and Complexity 👈
  3. Language Models (I): Supervised Learning for Text Data
  4. Language Models (II): Topic Models
  5. Collecting Text Data
  6. Causal Inference with Text
  7. Representing Text as Data (II): Word Embeddings
  8. Representing Text as Data (III): Word Sequences
  9. Language Models (III): Neural Networks, Transfer Learning and Transformer Models
  10. Language Models (IV): Applying Large Language Models in Social Science Research

Similarity

Vector Space Model

  • We previously represented our text data as a document-feature matrix

    • Rows: Documents
    • Columns: Features
  • Each document is therefore described by a vector of word counts

  • This representation allows us to measure several important properties of our documents

Vector notation

We denote a vector representation of a document using a bold letter:

\[\mathbf{a} = \{a_1, a_2, ...,a_J\} \]

where \(a_1\) is the number of times feature \(1\) appears in the document, \(a_2\) is the number of times feature 2 appears in the document, and so on.

Similarity

  • Idea: Each document can be represented by a vector of (weighted) feature counts, and these vectors can be compared using similarity metrics

  • A document’s vector is simply (for now) its row in the document-feature matrix

  • Key question: how do we measure distance or similarity between the vector representation of two (or more) different documents?

Similarity

There are many different metrics we might use to capture similarity/difference between texts:

  1. Edit distances

  2. Inner product

  3. Euclidean distance

  4. Cosine similarity

The choice of metric comes down to an assumption about which kinds of differences are most important to consider when comparing documents.

Edit Distance

  • Edit distances measure the similarity/difference between text strings

  • A commonly used edit distance is the Levenshtein distance

  • Measures the minimal number of operations (replacing, inserting, or deleting) required to transform one string into another

  • Example: the Levenshtein distance between “kitten” and “sitting” is 3

    • kitten ➡️ sitten (substitute “k” for “s”)
    • sitten ➡️ sittin (substitute “e” for “i”)
    • sittin ➡️ sitting (insert “g” at the end)
  • In R:

x <- c("kitten", "sitting")

adist(x)
     [,1] [,2]
[1,]    0    3
[2,]    3    0
  • Generally not used in large-scale applications because it is computationally burdensome to compute on long texts

Inner Product

Inner product

The inner product, or “dot” product, between two vectors is the sum of the element-wise multiplication of the vectors (a scalar):

\[\begin{eqnarray} \mathbf{a}\cdot\textbf{b} &=& \mathbf{a}^T\textbf{b} \\ &=& a_1b_1 + a_2b_2 + ... + a_Jb_J \end{eqnarray}\]

NB: When the document vectors are dichotomized (entries of only 0s and 1s), the inner product gives the number of features that the two documents share in common.

Imagine three documents with a six-word vocabulary:

             causal  estimate  identification  text  document  feature
Document a        2         3               3     0         0        1
Document b        2         0               0     3         2        3
Document c        1         2               1     1         0        1


The inner product of the \(\textbf{a}\) and \(\textbf{b}\) word vectors:

\[\begin{eqnarray} \mathbf{a}\cdot\textbf{b} &=& 2\times 2 + 3\times0 + 3\times0 + 0 \times 3 + 0 \times 2 + 1 \times 3\\ \mathbf{a}\cdot\textbf{b} &=& 4 + 3 \\ \mathbf{a}\cdot\textbf{b} &=& 7 \end{eqnarray}\]

The inner product of the \(\textbf{a}\) and \(\textbf{c}\) word vectors:

\[\begin{eqnarray} \mathbf{a}\cdot\textbf{c} &=& 2\times 1 + 3\times 2 + 3\times 1 + 0 \times 1 + 0 \times 0 + 1 \times 1\\ \mathbf{a}\cdot\textbf{c} &=& 2 + 6 + 3 + 1 \\ \mathbf{a}\cdot\textbf{c} &=& 12 \end{eqnarray}\]

Comparing across document pairs:

\[\begin{eqnarray} \mathbf{a}\cdot\textbf{b} &=& 7 \\ \mathbf{a}\cdot\textbf{c} &=& 12 \end{eqnarray}\]

\(\mathbf{a}\cdot\textbf{c} > \mathbf{a}\cdot\textbf{b}\), and so document C is more similar to document A than document B is.

But! Notice that the inner product is sensitive to document length. What happens if we multiply the word counts in document B by 2?

             causal  estimate  identification  text  document  feature
Document a        2         3               3     0         0        1
Document b        4         0               0     6         4        6
Document c        1         2               1     1         0        1

The new inner product of the \(\textbf{a}\) and \(\textbf{b}\) word vectors:

\[\begin{eqnarray} \mathbf{a}\cdot\textbf{b} &=& 2\times 4 + 3\times 0 + 3\times 0 + 0 \times 6 + 0 \times 4 + 1 \times 6\\ \mathbf{a}\cdot\textbf{b} &=& 8 + 6 \\ \mathbf{a}\cdot\textbf{b} &=& 14 \end{eqnarray}\]

Comparing across document pairs:

\[\begin{eqnarray} \mathbf{a}\cdot\textbf{b} &=& 14 \\ \mathbf{a}\cdot\textbf{c} &=& 12 \end{eqnarray}\]

Now \(\mathbf{a}\cdot\textbf{c} < \mathbf{a}\cdot\textbf{b}\), and so document B is more similar to document A than document C is, even though nothing has changed except the length of document B!
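
These calculations are easy to verify directly in R. A minimal sketch, using the toy count vectors from the table above (the object names are purely illustrative; c_vec is used to avoid masking R’s c() function):

# Toy word-count vectors from the six-word example above
a <- c(causal = 2, estimate = 3, identification = 3, text = 0, document = 0, feature = 1)
b <- c(causal = 2, estimate = 0, identification = 0, text = 3, document = 2, feature = 3)
c_vec <- c(causal = 1, estimate = 2, identification = 1, text = 1, document = 0, feature = 1)

# Inner products via element-wise multiplication
sum(a * b)       # 7
sum(a * c_vec)   # 12

# Doubling document b's counts doubles the inner product
sum(a * (2 * b)) # 14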

Euclidean Distance

Euclidean Distance

The Euclidean Distance between two document vectors, \(\mathbf{a}\) and \(\mathbf{b}\), is given by:

\[\begin{eqnarray} d(\mathbf{a},\textbf{b}) &=& \sqrt{\sum_{j=1}^J(a_j - b_j)^2}\\ &=&\left|\left| \mathbf{a}-\mathbf{b} \right|\right| \end{eqnarray}\]

Where \(J\) is the total number of features in the dfm.

  • The Euclidean distance is based on the Pythagorean theorem

  • Similar problem to the inner product: sensitive to document length (see the sketch below)
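
Using the toy vectors a and b defined in the earlier sketch, the Euclidean distance can be computed directly from the formula (dist() is base R):

# Euclidean distance between the toy vectors a and b
sqrt(sum((a - b)^2))                     # direct implementation of the formula, ~5.92
dist(rbind(a, b), method = "euclidean")  # same result using base R's dist()

# Doubling b's counts changes the distance, illustrating the length sensitivity
sqrt(sum((a - 2 * b)^2))                 # ~9.95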

Euclidean Distance Illustration

Cosine Similarity

  • Measures of document similarity should not be sensitive to the number of words in each of the documents

    • We don’t want long documents to be “more similar” than shorter documents just as a function of length
  • A natural way to adapt the inner product measure is to normalise by document length, which we do by calculating the magnitude of the document vectors

  • Cosine similarity is a measure of similarity that is based on the normalized inner product of two vectors

  • It can be interpreted as…

    • …a normalized version of the inner product or Euclidean distance

    • …the cosine of the angle between the two vectors

Cosine Similarity

Cosine similarity

The cosine similarity (\(cos(\theta)\)) between two vectors \(\textbf{a}\) and \(\textbf{b}\) is defined as:

\[cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\left|\left| \mathbf{a} \right|\right| \left|\left| \mathbf{b} \right|\right|}\]

where \(\theta\) is the angle between the two vectors and \(\left| \left| \mathbf{a} \right| \right|\) and \(\left| \left| \mathbf{b} \right| \right|\) are the magnitudes of the vectors \(\mathbf{a}\) and \(\mathbf{b}\), respectively.

Vector Magnitude (or “length”)

The magnitude of a vector (also known as the “length”) is the square-root of the inner product of the vector with itself:

\[\begin{eqnarray} \left|\left| \mathbf{a} \right|\right| &=& \sqrt{\textbf{a}\cdot\textbf{a}}\\ &=& \sqrt{a_1^2 + a_2^2 + ... + a_J^2} \end{eqnarray}\]

Interpretation

The value of cosine similarity ranges from -1 to 1

  • A value of 1 indicates that the vectors point in exactly the same direction
  • A value of 0 indicates that the vectors are orthogonal (i.e., not similar at all)
  • A value of -1 indicates that the vectors are diametrically opposed.

Thus, the closer the value is to 1, the more similar the vectors are.

Calculated for vectors of word counts (or any positively-valued vectors), the cosine similarity ranges from 0 to 1.
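
Returning to the toy vectors from earlier, a short sketch showing that cosine similarity is insensitive to document length (cosine_sim is simply a direct transcription of the definition above):

# Cosine similarity: inner product divided by the product of the magnitudes
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cosine_sim(a, b)      # ~0.29
cosine_sim(a, c_vec)  # ~0.88
cosine_sim(a, 2 * b)  # identical to cosine_sim(a, b): doubling b's counts changes nothing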

Cosine Similarity Illustration

Module Catalogue Data

str(modules)
tibble [6,248 × 10] (S3: tbl_df/tbl/data.frame)
 $ teaching_department   : chr [1:6248] "Greek and Latin" "Greek and Latin" "Bartlett School of Sustainable Construction" "Bartlett School of Architecture" ...
 $ level                 : num [1:6248] 5 4 7 5 7 7 4 7 7 7 ...
 $ intended_teaching_term: chr [1:6248] "Term 1|Term 2" "Term 1" "Term 1" "Term 2" ...
 $ credit_value          : chr [1:6248] "15" "15" "15" "30" ...
 $ mode                  : chr [1:6248] "" "" "" "" ...
 $ subject               : chr [1:6248] "Ancient Greek|Ancient Languages and Cultures|Classics" "Ancient Greek|Ancient Languages and Cultures|Classics" "" "" ...
 $ keywords              : chr [1:6248] "ANCIENT GREEK|LANGUAGE" "ANCIENT GREEK|LANGUAGE" "Infrastructure finance|Financial modelling|Investment" "DESIGN PROJECT ARCHITECTURE" ...
 $ title                 : chr [1:6248] "Advanced Greek A (GREK0009)" "Greek for Beginners A (GREK0002)" "Infrastructure Finance (BCPM0016)" "Design Project (BARC0135)" ...
 $ module_description    : chr [1:6248] "Teaching Delivery: This module is taught in 20 bi-weekly lectures and 10 weekly PGTA-led seminars.\n\nContent: "| __truncated__ "Teaching Delivery: This module is taught in 20 bi-weekly lectures and 30 tri-weekly PGTA-led seminars.\n\nConte"| __truncated__ "This module offers a broad overview of infrastructure project development, finance, and investment. By explorin"| __truncated__ "Students take forward the unit themes, together with personal ideas and concepts from BARC0097, and develop the"| __truncated__ ...
 $ code                  : chr [1:6248] "GREK0009" "GREK0002" "BCPM0016" "BARC0135" ...
modules$module_description[modules$code == "PUBL0099"]
[1] "Growth of text data in recent years, and the development of a set of sophisticated tools for analysing that data, offers important opportunities for social scientists to study questions that were previously amenable to only qualitative analyses.\n\nThis module will allow students to take advantage of these opportunities by providing them with an understanding of, and ability to apply, tools of quantitative text analysis to answer important questions in the fields of social science and public policy.\n\nThe module is centred around three core components of a typical quantitative text analysis project.\nFirst, students will learn to collect text data at scale. Students will study how to scrape text data from the web, and how to use methods such as Optimal Character Recognition to extract digitized text data from printed physical documents.\nSecond, students will learn the various ways in which written texts can be converted into data to use in quantitative analyses. This includes extracting features from the texts, such as coded categories, word counts, word types, dictionary counts, parts of speech, and so on.\nThird, the module covers a range of methods for systematically extracting quantitative information from digitized texts, including traditional approaches such as content analysis and dictionary-based methods; supervised learning approaches for text classification methods and text scaling; and also recent advances in semi-supervised and unsupervised learning for texts, such as topic models and word-embedding models.\n\nThe module has a strongly applied focus, with students learning to collect, manipulate, and analyse data themselves. The module covers a wide range of examples of how these methods are used in the social sciences, in business, and in government.\n\n \n"
modules$module_description[modules$code == "PUBL0050"]
[1] "This course provides an introduction to statistical methods used for causal inference in the social sciences. We will be concerned with understanding how and when it is possible to make causal claims in empirical research. In particular, we will focus on understanding which assumptions are necessary for giving research a causal interpretation, and on learning a range of approaches that can be used to establish causality empirically. The course will be practical – in that you can expect to learn how to apply a suite of methods in your own research – and theoretical – in that you can expect to think hard about what it means to make claims of causality in the social sciences.\nWe will address a variety of topics that are popular in the current political science literature. Topics may include experiments (laboratory, field, and natural); matching; regression; weighting; fixed-effects; difference-in-differences; regression discontinuity designs; instrumental variables; and synthetic control. Examples are typically drawn from many areas of political science, including political behaviour, institutions, international relations, and public administration.\nThe goal of the module is to teach students to understand and confidently apply various statistical methods and research designs that are essential for modern day data analysis. Students will also learn data analytic skills using the statistical software package R.\nThis is an advanced module intended for students who have already had some training in quantitative methods for data analysis. One previous course in quantitative methods, statistics, or econometrics is required for all students participating on this course. Students should therefore have a working knowledge of the methods covered in typical introductory quantitative methods courses (i.e. at least to the level of PUBL0055 or equivalent). At a minimum, this should include experience with hypothesis testing and multiple linear regression.\n"

Question: Which other modules at UCL are most similar to these two modules?

Cosine Similarity – Application

PUBL0050

# Load required packages: quanteda for text processing and
# quanteda.textstats for the textstat_*() functions used below
library(quanteda)
library(quanteda.textstats)
library(tidyverse) # for %>%, arrange(), select(), and ggplot()

# Create a corpus object from module catalogue data
modules_corpus <- corpus(modules, 
                         text_field = "module_description", 
                         docid_field = "code")

# Convert modules data into a dfm
modules_dfm <- modules_corpus %>% 
                tokens() %>% 
                dfm()

# Calculate the cosine similarity between PUBL0050 and all other modules
cosine_sim_50 <- textstat_simil(x = modules_dfm, 
                             y = modules_dfm[modules$code == "PUBL0050",],
                             method = "cosine")

head(cosine_sim_50)
          PUBL0050
GREK0009 0.6801510
GREK0002 0.6209725
BCPM0016 0.5782462
BARC0135 0.5060876
BCPM0036 0.4374233
BIDI0002 0.6731816

PUBL0099

# Calculate the cosine similarity between PUBL0099 and all other modules
cosine_sim_99 <- textstat_simil(x = modules_dfm, 
                             y = modules_dfm[modules$code == "PUBL0099",],
                             method = "cosine")

head(cosine_sim_99)
          PUBL0099
GREK0009 0.6773298
GREK0002 0.6985603
BCPM0016 0.6990896
BARC0135 0.5664775
BCPM0036 0.5114095
BIDI0002 0.7100411

Cosine Similarity – Application

Which modules are most similar to PUBL0050?

# Create a new variable in original data frame
modules$cosine_sim_50 <- as.numeric(cosine_sim_50)

# Arrange the data.frame in order of similarity and extract titles
modules %>%
  arrange(-cosine_sim_50) %>%
  select(title)
# A tibble: 6,248 × 1
   title                                                    
   <chr>                                                    
 1 Causal Inference (PUBL0050)                              
 2 Research Methods and Skills (ANTH0104)                   
 3 Regression Modelling (IEHC0050)                          
 4 Selected Topics in Statistics (STAT0017)                 
 5 Dissertation - MSc CPIPP (PHAY0053)                      
 6 Advanced Photonics Devices (ELEC0109)                    
 7 User-Centred Data Visualization (PSYC0102)               
 8 Introduction to Assessment (MDSC0002)                    
 9 Quantitative Methods and Mathematical Thinking (BASC0003)
10 Core Principles of Mental Health Research (PSBS0002)     
# ℹ 6,238 more rows

Which modules are most similar to PUBL0099?

# Create a new variable in original data frame
modules$cosine_sim_99 <- as.numeric(cosine_sim_99)

# Arrange the data.frame in order of similarity and extract titles
modules %>%
  arrange(-cosine_sim_99) %>%
  select(title)
# A tibble: 6,248 × 1
   title                                                                  
   <chr>                                                                  
 1 Quantitative Text Analysis for Social Science (PUBL0099)               
 2 Archaeological Glass and Glazes (ARCL0099)                             
 3 User-Centred Data Visualization (PSYC0102)                             
 4 Understanding and Analysing Data (SESS0006)                            
 5 Understanding and Analysing Data (SEES0107)                            
 6 Data Analysis (POLS0010)                                               
 7 Laboratory and Instrumental Skills in Archaeological Science (ARCL0170)
 8 The Anthropology of Violent Aftermaths (ANTH0136)                      
 9 Fashion Cultures (LITC0044)                                            
10 Anthropology of Politics, Violence and Crime (ANTH0175)                
# ℹ 6,238 more rows

Misleading Word Counts

Why do we recover so many strange matches for our PUBL0050 and PUBL0099 documents?

Let’s compare the most common features of the following four modules:

PUBL0099 – Quantitative Text Analysis for Social Science

topfeatures(modules_dfm[modules$code=="PUBL0099",], 8)
   ,  and   of   to    .  the   in text 
  21   12   12   11   10   10    8    8 

PUBL0050 – Causal Inference

topfeatures(modules_dfm[modules$code=="PUBL0050",], 8)
  .  in   ,  to and the  of   ; 
 14  11  11  10   9   9   8   8 

ELEC0109 – Advanced Photonics Devices

topfeatures(modules_dfm[modules$code=="ELEC0109",], 8)
and   ,  of the   ;   .  to  in 
104 100  77  77  62  55  51  48 

ARCL0099 – Archaeological Glass and Glazes

topfeatures(modules_dfm[modules$code=="ARCL0099",], 8)
   ,  the   of  and   to    .   in this 
  27   24   23   20   17   16    9    6 

Feature selection matters! Similarities here are being driven by substantively unimportant words.

One solution would be to remove stopwords and try again. An alternative is to use weighted vector representations.

Weighted Vectors

  • The bag-of-words representation characterises documents according to the raw counts of each word

  • The critical problem with using raw term frequency is that all terms are considered equally important when it comes to assessing similarity

  • One way of avoiding this problem is to weight the vectors of word counts in ways that make our text representations more informative

  • There are several strategies for weighting the word vectors that represent our documents, the most common of which is tf-idf weighting

Tf-idf intuition

  • Tf-idf stands for “term-frequency-inverse-document-frequency”

  • Tf-idf weighting can improve our representations of documents because it assigns higher weights to…

    • … words that are common in a given document (“term-frequency”) and
    • … words that are rare in the corpus as a whole (“inverse-document-frequency”)
  • Down-weighted words include…

    • …stop words (e.g. and, if, the, but, etc) and also…
    • … terms that are domain-specific but used frequently across documents (e.g. module, class, assessment, exam)
  • Up-weighted terms are therefore those words that are more distinctive and thus are more useful for characterising a given text

TF-idf

Term-frequency-inverse-document-frequency (tf-idf)

The tf-idf weighting scheme assigns to feature \(j\) a weight in document \(i\) according to:

\[\begin{eqnarray} \text{tf-idf}_{i,j} &=& W_{i,j} \times idf_j \\ &=& W_{i,j} \times log_{10}(\frac{N}{df_j}) \end{eqnarray}\]

  • \(W_{i,j}\) is the number of times feature \(j\) appears in document \(i\)
  • \(df_j\) is the number of documents in the corpus that contain feature \(j\)
  • \(N\) is the total number of documents

NB: tf-idf is specific to a feature in a document

Implications

\(\text{tf-idf}_{i,j}\) will be…

  1. …highest when feature \(j\) occurs many times in a small number of documents

  2. …lower when feature \(j\) occurs few times in a document, or occurs in many documents

  3. …lowest when feature \(j\) occurs in virtually all documents

We use \(log_\text{10}(\frac{N}{df_j})\) rather than \(\frac{N}{df_j}\) in order to avoid extremely large weights on extremely rare words.

Tf-idf – Example

Term-frequency-inverse-document-frequency (tf-idf)

The tf-idf weighting scheme assigns to feature \(j\) a weight in document \(i\) according to:

\[\begin{eqnarray} \text{tf-idf}_{i,j} &=& W_{i,j} \times idf_j \\ &=& W_{i,j} \times log_{10}(\frac{N}{df_j}) \end{eqnarray}\]

The word “the” appears…

  • …in the PUBL0099 module outline 10 times (\(W_{i,j}\))
  • …in 6119 (\(df_j\)) of the 6248 documents (\(N\))

\[\begin{eqnarray} \text{tf-idf}_{\text{PUBL0099,the}} &=& 10 \times log_{10}(\frac{6248}{6119}) \\ &=& 10 \times 0.009060568 \\ &=& 0.09060568 \end{eqnarray}\]

The word “text” appears…

  • …in the PUBL0099 text 8 times (\(W_{i,j}\))
  • …in 198 (\(df_j\)) of the 6248 documents (\(N\))

\[\begin{eqnarray} \text{tf-idf}_{\text{PUBL0099,text}} &=& 8 \times log_{10}(\frac{6248}{198}) \\ &=& 8 \times 1.499076 \\ &=& 11.99261 \end{eqnarray}\]
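
Both hand calculations are straightforward to reproduce in R (the counts below are those quoted above):

# tf-idf for "the" and "text" in the PUBL0099 description, computed by hand
N <- 6248             # total number of module descriptions
10 * log10(N / 6119)  # "the": 10 occurrences, appears in 6,119 documents -> ~0.09
8 * log10(N / 198)    # "text": 8 occurrences, appears in 198 documents   -> ~11.99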

Tf-idf – Application

# Convert modules data into a dfm *with tf-idf weights*
modules_dfm_tfidf <- modules_corpus %>% 
                        tokens() %>% 
                        dfm() %>% 
                        dfm_tfidf()

modules_dfm_tfidf
Document-feature matrix of: 6,248 documents, 35,194 features (99.68% sparse) and 8 docvars.
          features
docs        teaching delivery         :      this    module        is   taught
  GREK0009 0.8071821 1.083091 2.7381878 0.3076292 0.5888072 0.3026049 1.791841
  GREK0002 1.6143641 1.083091 1.6429127 0.0769073 0.3680045 0.1513024 1.791841
  BCPM0016 0         1.083091 0.5476376 0.2307219 0.2208027 0.1513024 0       
  BARC0135 0         0        0.8214563 0.0769073 0         0.3026049 0       
  BCPM0036 0         0        1.0952751 0.0769073 0.0736009 0         0       
  BIDI0002 0         0        0.2738188 0.2307219 0.2944036 0.1513024 0       
          features
docs               in       20 bi-weekly
  GREK0009 0.43350179 1.851258  2.453318
  GREK0002 0.07881851 1.851258  2.453318
  BCPM0016 0.03940925 0         0       
  BARC0135 0          0         0       
  BCPM0036 0.03940925 0         0       
  BIDI0002 0.15763701 0         0       
[ reached max_ndoc ... 6,242 more documents, reached max_nfeat ... 35,184 more features ]

Tf-idf – Application

What are the features with the highest tf-idf scores for our four modules?

PUBL0099 – Quantitative Text Analysis for Social Science

topfeatures(modules_dfm_tfidf[modules$code=="PUBL0099",], 8)
        text    digitized quantitative       counts         data   extracting 
   11.992607     7.591482     5.431962     5.363595     5.182539     4.831060 
     collect        texts 
    4.049778     4.030291 

PUBL0050 – Causal Inference

topfeatures(modules_dfm_tfidf[modules$code=="PUBL0050",], 8)
      causal    causality   regression            ;      methods       expect 
    7.093132     5.183242     5.065593     4.634490     4.512744     4.344983 
quantitative       claims 
    4.073971     3.979122 

ELEC0109 – Advanced Photonics Devices

topfeatures(modules_dfm_tfidf[modules$code=="ELEC0109",], 8)
            ;       optical         laser        lasers semiconductor 
     35.91729      35.64063      31.79536      28.92651      26.55579 
     photonic        liquid       devices 
     25.16167      24.44904      24.23796 

ARCL0099 – Archaeological Glass and Glazes

topfeatures(modules_dfm_tfidf[modules$code=="ARCL0099",], 8)
        glass        glazes      pigments         beads     materials 
    14.753215      7.591482      6.989422      6.637240      5.632121 
chronological     siliceous    ornamental 
     4.285057      3.795741      3.795741 

Tf-idf cosine similarity

# Calculate the cosine similarity between PUBL0050 and all other modules
cosine_sim_tfidf_50 <- textstat_simil(x = modules_dfm_tfidf, 
                             y = modules_dfm_tfidf[modules$code == "PUBL0050",],
                             method = "cosine")

# Calculate the cosine similarity between PUBL0099 and all other modules
cosine_sim_tfidf_99 <- textstat_simil(x = modules_dfm_tfidf, 
                             y = modules_dfm_tfidf[modules$code == "PUBL0099",],
                             method = "cosine")

Tf-idf – Application

Which modules are most similar to PUBL0050?

# Create a new variable in original data frame
modules$cosine_sim_tfidf_50 <- as.numeric(cosine_sim_tfidf_50)

# Arrange the data.frame in order of similarity and extract titles
modules %>%
  arrange(-cosine_sim_tfidf_50) %>%
  select(title)
# A tibble: 6,248 × 1
   title                                                     
   <chr>                                                     
 1 Causal Inference (PUBL0050)                               
 2 Causal Analysis in Data Science (POLS0012)                
 3 Advanced Quantitative Methods (PHDE0084)                  
 4 Quantitative Data Analysis (POLS0083)                     
 5 Advanced Statistics for Records Research (CHME0015)       
 6 Understanding and Analysing Data (SESS0006)               
 7 Understanding and Analysing Data (SEES0107)               
 8 Quantitative and Qualitative Research Methods 1 (IEHC0020)
 9 Statistics for Health Economics (STAT0039)                
10 Introduction to Statistics for Social Research (ANTH0107) 
# ℹ 6,238 more rows

Which modules are most similar to PUBL0099?

# Create a new variable in original data frame
modules$cosine_sim_tfidf_99 <- as.numeric(cosine_sim_tfidf_99)

# Arrange the data.frame in order of similarity and extract titles
modules %>%
  arrange(-cosine_sim_tfidf_99) %>%
  select(title)
# A tibble: 6,248 × 1
   title                                                                        
   <chr>                                                                        
 1 Quantitative Text Analysis for Social Science (PUBL0099)                     
 2 Data Science for Crime Scientists (SECU0050)                                 
 3 Understanding and Analysing Data (SESS0006)                                  
 4 Understanding and Analysing Data (SEES0107)                                  
 5 Data Analysis (POLS0010)                                                     
 6 Quantitative Data Analysis (POLS0083)                                        
 7 Literary Linguistics A (ENGL0042)                                            
 8 Analysing Research Data (IOEF0026)                                           
 9 Middle Bronze Age to the Iron Age in the Near East: City-States and Empires …
10 Research Methods - Quantitative (CENG0045)                                   
# ℹ 6,238 more rows

Tf-idf Does Not Solve All Problems

Consider these two sentences:

  • “Quantitative text analysis is very successful.”

  • “Natural language processing is tremendously effective.”

Represented as a DFM:

     quantitative  text  analysis  very  successful  natural  language  processing  tremendously  effective
D1              1     1         1     1           1        0         0           0             0          0
D2              0     0         0     0           0        1         1           1             1          1

The cosine similarity between these vectors is:

\[cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\left|\left| \mathbf{a} \right|\right| \left|\left| \mathbf{b} \right|\right|}=0\]

No weighting can address the core problem: the sentences are non-overlapping sets of words.
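
A quick, self-contained check of this zero similarity in quanteda (the shared stopword “is” is dropped so that the features match the table above):

library(quanteda)
library(quanteda.textstats)

# The two example sentences
sentences <- c(d1 = "Quantitative text analysis is very successful.",
               d2 = "Natural language processing is tremendously effective.")

# Tokenise, build a dfm, and remove the single shared (stop)word "is"
sent_toks <- tokens(corpus(sentences), remove_punct = TRUE)
sent_dfm <- dfm_remove(dfm(sent_toks), "is")

# The documents share no features, so the cosine similarity is 0
textstat_simil(sent_dfm, method = "cosine")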

We will see one powerful way of addressing this problem when we consider word-embedding approaches.

Cosine Similarity Example

Does public opinion affect political speech? (Hager and Hilbig, 2020)

Does learning about the public’s attitudes on a political issue change how much attention politicians pay to that issue in their public statements?

Set up:

  • Politicians in Germany have historically received public opinion research on citizens’ attitudes

  • Release of the polling data is exogenously determined, providing causal identification (via a regression-discontinuity design)

  • Strategy: Measure the linguistic (cosine) similarity between reports summarising public opinion and political speeches

Cosine Similarity Example

Implication: Public statements of politicians move closer to summaries of public opinion after publication of polling research.

Break

Difference

Detecting discriminating words

Sometimes we want to characterise differences between documents, not just measure the similarity.

We want to find a set of words that conveys the distinct content between documents.

We might be interested in, for example, how language use differs between…

  1. …politicians on the left and the right (Diermeier et al., 2012)

  2. …male and female voters (Cunha et al., 2014)

  3. …blog posts written by people in different geographic regions (Eisenstein et al., 2010)

Identifying discriminating words between groups is useful because these words tend to help us characterise the type of language/arguments/linguistic frames that a group employs.

Word clouds

The high-dimensional nature of natural language means that often the best methods for detecting discriminating words are those that allow us to visualise differences between groups.

A very common method for visualising corpus- or group-wide word use is via a word cloud.

  • A word cloud is a visual representation of the frequency and importance of words in a given text.

  • The size of each word in the cloud reflects its frequency or importance within the text.

  • The layout of the words in the cloud is usually random, but it is possible to arrange terms such that their placement reflects some variation of interest

We will use word clouds to explore the differences between Economics and Political Science modules.

Word clouds – Application

# Load library for plotting
library(quanteda.textplots)

# Remove stopwords
modules_dfm <- modules_dfm %>% dfm_remove(stopwords("en"))

# Subset the modules_dfm object to only modules in PS or Econ
ps_dfm <- modules_dfm[docvars(modules_dfm)$teaching_department == "Political Science",]
econ_dfm <- modules_dfm[docvars(modules_dfm)$teaching_department == "Economics",]

# Create word clouds with top 300 features in each dfm
textplot_wordcloud(ps_dfm, max_words = 300)
textplot_wordcloud(econ_dfm, max_words = 300)

Word clouds – Application

Although there are some differences, many words are common across both sets of documents.

Can we do better using tf-idf weighting?

Word clouds – Application

library(quanteda.textplots)

# Create a corpus object from module catalogue data
modules_corpus <- corpus(modules, 
                         text_field = "module_description", 
                         docid_field = "code")

# Convert modules data into a dfm with tf-idf weights
ps_econ_dfm_tf_idf <- modules_corpus %>% 
                tokens(remove_punct = T) %>% 
                dfm() %>%
                dfm_tfidf()

Word clouds – Application

Even with tf-idf weighting, it is hard to identify many of the distinguishing words.

The Problem with Word Clouds

Humans can only visualise a limited number of dimensions:

  • Width (i.e. x-axis position)
  • Height (i.e. y-axis position)
  • Depth (i.e. z-axis position, hard for most people)
  • Colour
  • Shape
  • Size
  • Opacity

Core problem of word clouds: they do not take full advantage of the dimensional space available to them

  • The primary dimensions for visualizing variation (x-axis and y-axis) are meaningless!

  • Word clouds are poorly suited to visualizing differences in word use across documents

Fightin’ Words

An alternative approach is to directly visualise the difference in word use across groups.

One such approach is provided by the “Fightin’ Words” method (Monroe et al., 2008).

Fightin’ Words

We start by calculating the probability of observing a given word for a given category of documents (here, departments):

\[\hat{\mu}_{j,k} = \frac{W^*_{j,k} + \alpha_j}{n_k + \sum_{j=1}^J \alpha_{j}}\]

  • \(W^*_{j,k}\) is the number of times feature \(j\) appears in documents in category \(k\)
  • \(n_k\) is the total number of tokens in documents in category \(k\)
  • \(\alpha_j\) is a “regularization” parameter, which shrinks differences in very common words towards 0

Next, we take the difference in log-odds between categories \(k\) and \(k'\) (the log-odds-ratio):

\[\text{log-odds-ratio}_{j,k} = log\left( \frac{\hat{\mu}_{j,k}}{1-\hat{\mu}_{j,k}}\right) - log\left( \frac{\hat{\mu}_{j,k'}}{1-\hat{\mu}_{j,k'}}\right) \]

Intuitively, this ratio estimates the relative probability of the use of word \(j\) between the two groups. When this ratio is positive, group \(k\) uses the word more often. When it is negative, group \(k'\) uses it more often.

Finally, we standardize the log-odds-ratio by its standard deviation (which downweights differences about which we are uncertain):

\[\text{Fightin' Words Score}_j = \frac{\text{log-odds-ratio}_{j,k}}{\sqrt{Var(\text{log-odds-ratio}_{j,k})}}\]
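
To make the formulas concrete, here is a small sketch for a single feature using made-up counts (all of the numbers below are illustrative; none are taken from the module data):

# Toy counts for one feature j in two groups, k and k'
W_k <- 60;  n_k <- 10000    # count of feature j and total tokens in group k
W_k2 <- 20; n_k2 <- 12000   # count of feature j and total tokens in group k'
alpha_j <- 1                # regularisation parameter for this feature
alpha_sum <- 500            # sum of alpha over the whole vocabulary (illustrative)

# Regularised probabilities of feature j in each group
mu_k <- (W_k + alpha_j) / (n_k + alpha_sum)
mu_k2 <- (W_k2 + alpha_j) / (n_k2 + alpha_sum)

# Log-odds-ratio and its approximate variance
lor <- log(mu_k / (1 - mu_k)) - log(mu_k2 / (1 - mu_k2))
lor_var <- 1 / (W_k + alpha_j) + 1 / (W_k2 + alpha_j)

# Standardised Fightin' Words score
lor / sqrt(lor_var)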

Fightin’ Words implementation

There is no good existing implementation of the Fightin’ Words method in R.

This function implements the scores described previously.

fightin_words <- function(dfm_input, covariate, group_1 = "Political Science", group_2 = "Economics", alpha_0 = 0){
  
  # Subset DFM
  fw_dfm <- dfm_subset(dfm_input, get(covariate) %in% c(group_1, group_2)) 
  fw_dfm <- dfm_group(fw_dfm, get(covariate))
  fw_dfm <- fw_dfm[,colSums(fw_dfm)!=0]
  dfm_input_trimmed <- dfm_match(dfm_input, featnames(fw_dfm))
  
  # Calculate word-specific priors
  alpha_w <- (colSums(dfm_input_trimmed))*(alpha_0/sum(dfm_input_trimmed))
   
  for(i in 1:nrow(fw_dfm)) fw_dfm[i,] <- fw_dfm[i,] + alpha_w
  fw_dfm <- as.dfm(fw_dfm)
  mu <- fw_dfm %>% dfm_weight("prop")

  # Calculate log-odds ratio
  lo_g1 <- log(as.numeric(mu[group_1,])/(1-as.numeric(mu[group_1,])))
  lo_g2 <- log(as.numeric(mu[group_2,])/(1-as.numeric(mu[group_2,])))
  fw <- lo_g1 - lo_g2
  
  # Calculate variance
  
  fw_var <- as.numeric(1/(fw_dfm[1,])) + as.numeric(1/(fw_dfm[2,]))
  
  fw_scores <- data.frame(score = fw/sqrt(fw_var),
                          n = colSums(fw_dfm),
                          feature = featnames(fw_dfm))

  return(fw_scores)
  
}

Fightin’ Words application

# Construct a dfm
department_dfm <- modules_corpus %>%
  tokens(remove_punct = T, remove_symbols = T, remove_numbers = T) %>%
  dfm() %>%
  dfm_remove(stopwords("en")) 

# Apply the fightin' words method
fw_scores <- fightin_words(department_dfm, 
                           covariate = "teaching_department",
                           group_1 = "Political Science",
                           group_2 = "Economics")

# Plot the results
fw_scores %>% ggplot(aes(x = log(n), # x-axis
                         y = score, # y-axis
                         label = feature, # text labels
                         cex = abs(score), # text size
                         alpha = abs(score))) + # opacity
                geom_text() + # plot text
                xlab("log(n)") + # x-axis label 
                ylab("Fightin' words score") + # y-axis label
                theme_bw() + # nice black and white theme
                theme(panel.grid = element_blank()) + # remove grid lines
                scale_size_continuous(guide = "none") + # remove size legend
                scale_alpha_continuous(guide = "none") # remove opacity legend

Fightin’ words – Politics v Economics

Political Science    Economics
political            economics
international        year
politics             students
rights               models
public               prerequisites
policy               aims
global               suitable
human                bsc
law                  knowledge
questions            final
institutions         assumed
contemporary         degree
debates              equivalent
democracy            microeconomics
conflict             econometrics
conflict econometrics

Fightin’ words – Politics v History

Political Science    History
policy               history
international        please
politics             period
rights               term
human                historical
public               sources
questions            description
law                  full
research             war
empirical            century
theoretical          cultural
political            religious
institutions         content
theories             culture
able                 half

Fightin’ words – Politics v Geography

Political Science    Geography
political            urban
policy               geography
international        data
public               spatial
politics             modelling
rights               skills
law                  change
questions            climate
institutions         environmental
democracy            thinking
gender               course
conflict             conservation
theories             cities
states               model
actors               water

Example: Fightin’ words

Complexity

Text Complexity

An additional potential quantity of interest is the complexity of the language used across a corpus of text.

Examples:

  • Does central bank communicative clarity affect financial market volatility? (Jansen, 2011)

  • Which Supreme Court justices write the most complex legal opinions? (Owens and Wedeking, 2011)

  • How do the communication strategies of politicians respond to electoral reforms? (Spirling, 2016)

Lexical Diversity

  • One simple measure of the linguistic diversity of a text is the Type-to-Token Ratio (TTR)

\[ TTR = \frac{\text{Number of Types} (V)}{\text{Number of Tokens} (N)} \]

  • Example:

    • “PUBL0099 is a really really really good course.” ➡️ \(TTR = \frac{6}{8} = .75\)
    • “PUBL0099 is a terrific, fantastic, and inspiring, course.” ➡️ \(TTR = \frac{8}{8} = 1\)
  • Problem: Very sensitive to overall document length

    • Shorter texts may exhibit fewer word repetitions
    • Longer texts may introduce additional subjects, which will also increase lexical richness
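
The two hand calculations above are easy to verify with quanteda’s ntoken() and ntype() functions (punctuation is removed so that the token counts match the example):

library(quanteda)

# The two example sentences
ttr_examples <- c("PUBL0099 is a really really really good course.",
                  "PUBL0099 is a terrific, fantastic, and inspiring, course.")

ttr_toks <- tokens(ttr_examples, remove_punct = TRUE)

# Type-to-token ratios: 0.75 and 1
ntype(ttr_toks) / ntoken(ttr_toks)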

Implementation:

module_toks <- tokens(modules_corpus)
module_ttr <- textstat_lexdiv(module_toks, "TTR")
head(module_ttr)
  document       TTR
1 GREK0009 0.4968750
2 GREK0002 0.5597826
3 BCPM0016 0.6279070
4 BARC0135 0.7200000
5 BCPM0036 0.5297619
6 BIDI0002 0.6232877

Lexical Diversity and Document Length

Lexical Diversity and Corpus Length

In most text, the rate at which new types appear is very high at first, but then diminishes.

Each point on the plot indicates 500 additional module descriptions:

Readability Scores

  • Most commonly used readability scores focus on a combination of syllables and sentence length

  • Shorter sentences = more readable

  • Fewer syllables = more readable

Flesch Readability Score:

\[ 206.835 - 1.015\left(\frac{\text{total number of words}}{\text{total number of sentences}}\right) - 84.6\left(\frac{\text{total number of syllables}}{\text{total number of words}}\right) \]
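
The formula is easy to transcribe directly into R (the counts passed in below are arbitrary, purely for illustration):

# Flesch Reading Ease as a function of raw counts
flesch_score <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# e.g. a 100-word, 5-sentence passage containing 150 syllables
flesch_score(100, 5, 150)  # ~59.6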

Where did these numbers come from? Flesch (1948)…

  • …asked high-school students to read a set of texts and answer some comprehension questions

  • …calculated the average grade of the students who correctly answered 75%+ of the questions

  • …transformed the average grades to the 0-100 scale

  • …regressed these scores on the sentence- and word-length variables

Readability Extensions

Is the Flesch measure really sufficient? What might be missing?

Benoit, Spirling, and Munger (2019):

  1. Other features of complexity/readability (word rarity; syntactic and grammatical structure)

    • Use relative frequency of terms compared to “the” in Google Books (dynamic over time)
    • Use number of clauses; proportion of nouns/verbs/adjectives/adverbs
  2. In-domain validation (are the predictors of “complexity” the same in politics and education?)

    • Crowdsource comparison task of pairs of political sentences (SOTU addresses)
  3. Uncertainty estimates (is a text with FRE = 50 really more readable than one with FRE = 55?)

    • Bradley-Terry model for paired comparisons to provide probabilistic statements of relative complexity

Readability Extensions

Benoit, Spirling, and Munger (2019) findings:

  1. Most important predictors are sentence length, the proportion of nouns, word rarity, word length

    • Sound familiar?
  2. Modest improvement over FRE score (3 percentage point improvement over 70% baseline)

  3. Very high correlation with basic Flesch measure

Readability Example

Text Readability in Quanteda

# Readability
module_read <- textstat_readability(modules_corpus,
                                    measure = "Flesch")
head(module_read, n = 3)
  document    Flesch
1 GREK0009  44.72685
2 GREK0002  41.56281
3 BCPM0016 -10.97096

Most Readable Module Description

docvars(modules_corpus)$title[which.max(module_read$Flesch)]
[1] "Elliptic Curves (MATH0036)"
as.character(modules_corpus[which.max(module_read$Flesch)])
MATH0036 
"This is a course in number theory. An elliptic curve is an equation of the form y2 = x3 + ax2 + bx + c, where a, b, c are given rational numbers. The aim of the course is to be able to find the solutions (x, y) to this equation with x and y rational numbers. The methods used are from geometry and algebra. The study of elliptic curves is an important part of current research in number theory and cryptography. It was central to the proof of Fermat's last theorem. There are still many unsolved problems in this area, in particular the Birch-Swinnerton-Dyer conjecture, for which there is a $1 million prize offered by the Clay Institute.\n" 

Note that the most “readable” course description does not mean the most accessible course!

Least Readable Module Description

docvars(modules_corpus)$title[which.min(module_read$Flesch)]
[1] "Fundamental Medical Retina (OPHT0048)"
as.character(modules_corpus[which.min(module_read$Flesch)])
OPHT0048 
"This 15 - credit module aims to provide the knowledge of common medical retina conditions and includes topics covering screening, referral and possible treatment pathways, with an emphasis on optical coherence tomography (OCT) interpretation and diabetic retinopathy grading. It also aims to provide you with the knowledge and skills to make accurate and appropriate referral decisions for patients with medical retina conditions and prepares you to commence working under supervision in medical retina new patient triage clinics, AMD treatment-retreatment clinics and to work in photography based diabetic retinopathy screening services\n\nThe learning outcomes of this module include:\n\na detailed knowledge of the anatomy, physiology and pathophysiology of the retina, with emphasis on the macula\n an understanding of the risk factors and differential diagnosis of disorders of retinal and macular pathology\n an understanding of treatments of medical retina disorders including the patient’s response to treatment\n an ability to communicate effectively with patients\n an ability to interpret OCT images and fundus photographs for AMD and diabetic retinopathy, with appropriate patient management\n an awareness of the use of fluorescein, ICG angiography and autofluorescence in medical retina service delivery\n an understanding of the principles, processes, protocols, potential benefits and limitations of national diabetic retinopathy screening programmes\n an understanding of diabetes and its relevance to retinopathy screening\n an ability to detect and classify diabetic retinal disease\n an ability to recognise acute retinal pathology, conduct appropriate tests and make appropriate referrals, clearly stating the level of urgency\n an awareness of UK national referral guidelines and detailed knowledge of local referral pathways for patients with medical retina disorders\n an awareness of the rapidly evolving nature of medical retina treatments including pertinent treatment trials\n an understanding of current guidelines for management of medical retina disorders\n safeguarding adults and children\nThis module aims to provide you with the knowledge of common medical retina conditions and includes topics covering screening, referral and possible treatment pathways, with an emphasis on optical coherence tomography (OCT) interpretation and diabetic retinopathy grading. 
It also aims to provide you with the knowledge and skills to make accurate and appropriate referral decisions for patients with medical retina conditions and prepares you to commence working under supervision in medical retina new patient triage clinics, AMD treatment-retreatment clinics and to work in photography based diabetic retinopathy screening services  \n\nThe learning outcomes of this module include: \n\n\n a detailed knowledge of the anatomy, physiology and pathophysiology of the retina, with emphasis on the macula  \n \n \n an understanding of the risk factors and differential diagnosis of disorders of retinal and macular pathology  \n \n \n an understanding of treatments of medical retina disorders including the patient’s response to treatment  \n \n \n an ability to communicate effectively with patients  \n \n \n an ability to interpret OCT images and fundus photographs for AMD and diabetic retinopathy, with appropriate patient management  \n \n \n an awareness of the use of fluorescein, ICG angiography and autofluorescence in medical retina service delivery  \n \n \n an understanding of the principles, processes, protocols, potential benefits and limitations of national diabetic retinopathy screening programmes  \n \n \n an understanding of diabetes and its relevance to retinopathy screening  \n \n \n an ability to detect and classify diabetic retinal disease  \n \n \n an ability to recognise acute retinal pathology, conduct appropriate tests and make appropriate referrals, clearly stating the level of urgency  \n \n \n an awareness of UK national referral guidelines and detailed knowledge of local referral pathways for patients with medical retina disorders  \n \n \n an awareness of the rapidly evolving nature of medical retina treatments including pertinent treatment trials  \n \n \n an understanding of current guidelines for management of medical retina disorders  \n \n \n safeguarding adults and children  \n \n" 

Readability by Department

Department                                        Readability
Most readable
  Slade School of Fine Art                               45.8
  Greek and Latin                                        34.3
  Philosophy                                             34.2
  SSEES - History                                        32.5
  Institute of Clinical Trials and Methodology           31.7
Least readable
  Biochemical Engineering                                 0.9
  Institute of Ophthalmology                             -0.6
  Engineering Sciences                                   -0.8
  Chemical Engineering                                  -21.6
  UCL Institute of Education                            -22.2

Readability Example (Spirling, 2016)

  • Research question: Do Members of Parliament use less complex language when appealing to a more diverse electorate?
  • Context: Parliamentary speeches before and after the Second Reform Act (1867)

Readability Example (Spirling, 2016)

Conclusion

Summing Up

  1. Using a vector-based representation allows us to calculate the similarity between documents

  2. Visualising differences in relative word use can help us to characterise texts pertaining to different groups

  3. The linguistic diversity and sentence length of documents can provide measures of textual sophistication