1. Representing Text as Data (I): Bag-of-Words
In the first week of the course we will learn about typical goals of quantitative text analysis projects. We will learn one common approach to representing text as quantitative data – the bag-of-words approach. We will discuss the assumptions we make (often implicitly) when defining our corpora and how we represent texts in quantitative forms. In practical terms, we will learn about document-feature matrices, feature selection, stemming, lemmatization, and n-grams. We will also discuss how selection of documents and features can be consequential for the outcomes and conclusions of any text-as-data analyses.
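To make the document-feature matrix concrete, here is a minimal sketch in Python using scikit-learn (the corpus, and the choice of unigrams plus bigrams, are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus standing in for the documents you would actually analyse.
docs = [
    "The economy is growing and employment is growing.",
    "The government proposed a new economic policy.",
    "Health policy dominated the parliamentary debate.",
]

# Build a document-feature matrix of unigrams and bigrams,
# lowercasing text and dropping English stop words.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
dfm = vectorizer.fit_transform(docs)  # sparse matrix: documents x features

print(dfm.shape)
print(vectorizer.get_feature_names_out()[:10])
```

Note that every preprocessing choice here – stop-word removal, the n-gram range – is exactly the kind of feature-selection decision whose consequences we will discuss.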
We will also discuss some of the core challenges of using statistical methods to measure latent concepts in text data. We will consider a very simple text analysis approach – dictionary methods – which form a bridge between traditional qualitative approaches and the quantitative methods that we study on this course. We will introduce some of the principles of good measurement in social science and discuss the essential role of validation in any text analysis project. We will learn how to apply and interpret dictionaries to capture a range of concepts relevant to a wide variety of questions in social science.
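At its simplest, a dictionary method scores each document by the share of its tokens matching a hand-curated word list. A minimal sketch, with an invented two-category dictionary (real analyses would use a validated dictionary such as LIWC or Lexicoder):

```python
# An invented mini-dictionary, purely for illustration.
positive = {"growth", "success", "improve", "hope"}
negative = {"crisis", "failure", "decline", "fear"}

def sentiment_score(text: str) -> float:
    """Net rate of positive minus negative terms per token."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return (pos - neg) / len(tokens)

print(sentiment_score("There is hope of growth and success."))  # 3/7 ≈ 0.43
```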
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart, Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, 2022. – Chapters 3, 5 and 16
- K. Benoit, “Text as data: An overview”, In: The SAGE Handbook of Research Methods in Political Science and International Relations, SAGE Publishing, London (2020).
Recommended readings:
- J. Grimmer and B. M. Stewart, “Text as data: The promise and pitfalls of automatic content analysis methods for political texts”, In: Political Analysis 21.3 (2013), pp. 267–297.
- M. J. Denny and A. Spirling, “Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it”, In: Political Analysis 26.2 (2018), pp. 168–189.
- A. D. Kramer, J. E. Guillory, and J. T. Hancock, “Experimental evidence of massive-scale emotional contagion through social networks”, In: Proceedings of the National Academy of Sciences 111.24 (2014), pp. 8788–8790.
- T. Loughran and B. McDonald, “When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks”, In: The Journal of Finance 66.1 (2011), pp. 35–65.
- L. Hargrave and J. Blumenau, “No Longer Conforming to Stereotypes? Gender, Political Style and Parliamentary Debate in the UK”, In: British Journal of Political Science (2021), pp. 1–18.
Seminar application: Text-as-Data and Dictionaries.
2. Similarities, Differences, Complexity
This week we will study methods for grouping texts that are similar and distinguishing between texts that are different. We will learn about the vector space model, which connects to many of the approaches that we cover later in the course. We will also learn about weighting strategies for text – such as tf-idf weighting – and discuss why word clouds fail to use the full set of visual dimensions available for communicating meaning. Finally, we will consider methods for measuring the lexical complexity and readability of different texts.
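As an illustration of the vector space model and tf-idf weighting, a minimal sketch using scikit-learn (toy documents, purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The budget cut taxes on income.",
    "The budget raised taxes on fuel.",
    "The striker scored twice in the final.",
]

# Represent each document as a tf-idf-weighted vector...
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# ...and compare documents by the cosine of the angle between their vectors.
print(cosine_similarity(tfidf).round(2))  # the two budget texts are most similar
```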
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart, Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, 2022. – Chapters 7 and 11
Recommended readings:
- A. Hager and H. Hilbig, “Does public opinion affect political speech?”, In: American Journal of Political Science 64.4 (2020), pp. 921–937.
- B. L. Monroe, M. P. Colaresi, and K. M. Quinn, “Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict”, In: Political Analysis 16.4 (2008), pp. 372–403.
- K. Benoit, K. Munger, and A. Spirling, “Measuring and explaining political sophistication through textual complexity”, In: American Journal of Political Science 63.2 (2019), pp. 491–508.
- A. Spirling, “Democratization and linguistic complexity: The effect of franchise extension on parliamentary discourse, 1832–1915”, In: The Journal of Politics 78.1 (2016), pp. 120–136.
Seminar application: Applying similarity metrics and describing differences between groups of texts.
3. Language Models (I): Supervised Learning for Text
This week we will study methods for categorising texts into sets of pre-defined categories, which is an example of supervised machine learning. We will focus on the basic mechanics that structure a supervised learning analysis; introduce the Naive Bayes classifier as a model for text classification; and discuss strategies for assessing predictive performance, including training and test sets and cross-validation.
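A minimal sketch of this workflow in scikit-learn, with an invented six-document corpus (far too small for real inference, but enough to show the moving parts):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = [
    "tax cuts boost growth", "inflation and interest rates rise",
    "budget deficit widens", "hospital waiting lists grow",
    "new funding for nurses", "vaccination campaign expands",
]
labels = ["econ", "econ", "econ", "health", "health", "health"]

# Bag-of-words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 3-fold cross-validation: repeatedly hold out a test set and
# score the classifier's predictions on the held-out documents.
print(cross_val_score(model, texts, labels, cv=3))
```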
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart, Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, 2022. – Chapters 18, 19, and 20
- G. James, D. Witten, T. Hastie, et al., An Introduction to Statistical Learning, Springer, 2022. – Section 5.1, “Cross-Validation”
Recommended readings:
- S. Müller, “The temporal focus of campaign communication”, In: The Journal of Politics 84.1 (2022), pp. 585–590.
Seminar application: Supervised learning for text classification.
4. Language Models (II): Topic Models
Almost all the methods we study on this course try to understand the ways in which words vary across documents and whether this variation conveys “meaning” of some sort. A key source of linguistic variation across documents is the topic that a document addresses. This week we study a class of models that can be used to infer the topics present in a set of documents and, subsequently, the degree to which each document is relevant to each topic. Topic models have become extraordinarily popular throughout the social sciences, largely because they are easy to apply to very large text corpora. We will introduce a canonical topic model – Latent Dirichlet Allocation (LDA) – and will discuss strategies for selecting the appropriate number of topics and for validating topic models. We will also discuss a number of extensions to this model.
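A minimal LDA sketch using scikit-learn (a tiny corpus and two topics purely for illustration; real applications need far more documents and a careful choice of the number of topics):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "taxes budget spending deficit",
    "budget taxes growth economy",
    "doctors hospital patients nurses",
    "hospital nurses waiting patients",
]

vec = CountVectorizer()
dtm = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# Top words per topic, the usual starting point for labelling and validation.
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    print(k, [terms[i] for i in weights.argsort()[::-1][:4]])

# Document-topic proportions: each document's mixture over the two topics.
print(lda.transform(dtm).round(2))
```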
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart, Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, 2022. – Chapters 12 and 13
- D. M. Blei, “Probabilistic topic models”, In: Communications of the ACM 55.4 (2012), pp. 77–84.
- M. E. Roberts, B. M. Stewart, D. Tingley, et al., “Structural topic models for open-ended survey responses”, In: American Journal of Political Science 58.4 (2014), pp. 1064–1082.
Recommended readings:
- L. Ying, J. M. Montgomery, and B. M. Stewart, “Topics, concepts, and measurement: a crowdsourced procedure for validating topics as measures”, In: Political Analysis (2021), pp. 1–20.
- D. M. Butler and J. L. Sutherland, “Have State Policy Agendas Become More Nationalized?”, In: The Journal of Politics 85.1 (2023).
Seminar application: Applying and interpreting structural topic models.
5. Designing Research Projects and Collecting Text Data
Thus far we have focused on ways of representing and manipulating digitized collections of text. But how do we go about collecting texts that we want to analyse? This week we will discuss how data collection should be pursued when conducting academic research. We will think about different modes of data collection and the advantages and disadvantages of each of them when designing social research projects. We will cover several strategies for collecting text data, including using APIs and web-scraping. We will also discuss the legal and ethical challenges of working with data from the web.
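As a flavour of both modes of collection, a minimal sketch using the requests and BeautifulSoup libraries (the URLs here are placeholders, not real endpoints; always check a site’s terms of service and robots.txt before scraping):

```python
import requests
from bs4 import BeautifulSoup

# Mode 1: a structured API that returns JSON (hypothetical endpoint).
resp = requests.get(
    "https://api.example.org/speeches",
    params={"year": 2020, "page": 1},
    timeout=30,
)
resp.raise_for_status()
speeches = resp.json()

# Mode 2: extracting text from raw HTML (hypothetical page).
page = requests.get("https://example.org/debates/2020-01-01", timeout=30)
soup = BeautifulSoup(page.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
```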
Essential readings:
- M. J. Salganik, Bit by Bit: Social Research in the Digital Age, Princeton University Press, 2019. – Chapter 6
Seminar application: Web-scraping and using APIs.
6. Causal Inference with Text Data
How should we use quantitative text data when we are interested in estimating causal relationships? Causal questions are often central to social scientific research, but measuring causal effects with text data is difficult because language is inherently multidimensional. This week, we will spend time thinking about how to use text data in the context of causal analyses, particularly focusing on text as an outcome, text as a treatment, and text as a control variable.
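To fix ideas for the “text as outcome” case: once a validated text-based measure has been computed for each document, the causal analysis itself can be entirely standard. A minimal sketch with invented data (statsmodels assumed):

```python
import numpy as np
import statsmodels.api as sm

# Invented data: a randomised binary treatment and a text-derived
# outcome (e.g. a per-document sentiment or topic score).
treatment = np.array([0, 0, 0, 0, 1, 1, 1, 1])
text_outcome = np.array([0.11, 0.08, 0.14, 0.10, 0.27, 0.31, 0.24, 0.29])

# Regress the text measure on treatment; with random assignment,
# the slope estimates the average treatment effect on the measure.
fit = sm.OLS(text_outcome, sm.add_constant(treatment)).fit()
print(fit.params)
```

The hard part, as the readings emphasise, is not the regression but justifying the text-based measure and the identification assumptions behind it.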
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart, Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, 2022. – Chapters 25, 26 and 27
- J. D. Angrist and J.-S. Pischke, Mastering ’Metrics: The Path from Cause to Effect, Princeton University Press, 2014. – For those requiring an introduction/review of causal inference methods.
Recommended readings:
- N. Egami, C. J. Fong, J. Grimmer, et al., “How to make causal inferences using texts”, In: Science Advances 8.42 (2022), p. eabg2652.
- C. Fong and J. Grimmer, “Causal inference with latent treatments”, In: American Journal of Political Science (2021).
- M. E. Roberts, B. M. Stewart, and R. A. Nielsen, “Adjusting for confounding with text matching”, In: American Journal of Political Science 64.4 (2020), pp. 887–903.
Seminar application: Using text as an outcome and text as a treatment.
7. Representing Text as Data (II): Word Embeddings
How can we represent the “meaning” of words using quantitative methods? One answer is that we can understand a word’s meaning by looking at “the company it keeps”. That is, we can glean something about the substantive meaning of a word by paying attention to the specific contexts in which it appears. Word embeddings – a class of techniques that infer word meaning from the distribution of words surrounding each term in a text – represent words as vectors of numbers, and these vectors turn out to have some remarkable properties. This week we will discuss the idea behind word embeddings, strategies for estimating embedding vectors, and applications of embedding techniques to several social science questions.
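A minimal sketch of these properties using pre-trained GloVe vectors loaded via gensim (the first call downloads the vectors, roughly 70 MB):

```python
import gensim.downloader as api

# Load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword.
vectors = api.load("glove-wiki-gigaword-50")

# Nearest neighbours in the embedding space: "the company a word keeps".
print(vectors.most_similar("parliament", topn=3))

# The classic analogy: king - man + woman ≈ queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```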
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart, Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, 2022. – Chapters 7 and 8
- P. L. Rodriguez and A. Spirling, “Word embeddings: What works, what doesn’t, and how to tell the difference for applied research”, In: The Journal of Politics 84.1 (2022), pp. 101–115.
Recommended readings:
- L. Hargrave and J. Blumenau, “No Longer Conforming to Stereotypes? Gender, Political Style and Parliamentary Debate in the UK”, In: British Journal of Political Science (2021), pp. 1–18.
- A. Caliskan, J. J. Bryson, and A. Narayanan, “Semantics derived automatically from language corpora contain human-like biases”, In: Science 356.6334 (2017), pp. 183–186.
- E. Rodman, “A timely intervention: Tracking the changing meanings of political concepts with word vectors”, In: Political Analysis 28.1 (2020), pp. 87–111.
- A. C. Kozlowski, M. Taddy, and J. A. Evans, “The geometry of culture: Analyzing the meanings of class through word embeddings”, In: American Sociological Review 84.5 (2019), pp. 905–949.
- P. L. Rodriguez, A. Spirling, and B. M. Stewart, “Embedding Regression: Models for Context-Specific Description and Inference”, In: American Political Science Review (2021), pp. 1–20.
Seminar application: Using word embeddings to calculate similarities, compute analogies, and expand dictionaries.
8. Representing Text as Data (III): Word Sequences
In this lecture, we will explore how sequences of words can be represented as data. Building on last week’s representation of words as ‘dense vectors’, this week we will learn how to generate dense vector representations of longer text sequences (sentences, paragraphs, documents).
To do this, we will go beyond ‘bag-of-words’ models of text and begin using models that can capture the sequential dependencies between words. We will learn about ‘contextualised’ word embeddings, which capture the meaning of a specific token in the context of the surrounding text. We will also learn how these contextualised embeddings form the basis for representing whole text sequences in a meaningful semantic space (sentence/document embeddings). These methods rely on neural network models – we will save the technical details for next week, focusing instead on intuition and applications.
With these new tools, we will revisit the task of clustering texts to see if we can improve on methods introduced earlier in the course (e.g. topic models). We will also see how these dense representations of text sequences can be used for information retrieval and supervised learning.
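A minimal sketch using the sentence-transformers library (the model name is one popular off-the-shelf choice; the first call downloads it):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The chancellor announced a cut in income tax.",
    "Taxes on earnings will fall, the Treasury said.",
    "The home team won the cup final on penalties.",
]

# Each sentence becomes one dense vector in a shared semantic space.
embeddings = model.encode(sentences)

# The two paraphrases about tax should be far more similar to each
# other than either is to the football sentence.
print(cosine_similarity(embeddings).round(2))
```

The same embeddings can then be fed to a clustering algorithm or a classifier, which is how we will revisit the earlier tasks.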
Essential readings:
- N. Smith, “Contextual Word Representations: A Contextual Introduction”, arXiv preprint (2020).
- P. Rodriguez, A. Spirling, and B. Stewart, “Embedding Regression: Models for Context-Specific Description and Inference”, In: American Political Science Review (2023).
Recommended readings:
- N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (2019).
- D. Jurafsky and J. H. Martin, “Speech and Language Processing” – Chapter 3
Seminar application: Applying sequence models to text analysis tasks.
9. Language Models (III): Neural Networks, Transfer Learning, and Transformer Models
This week, we examine how modern language models use neural networks and transfer learning to achieve state-of-the-art performance on text analysis tasks.
We will start by introducing the concept of neural networks and how they can learn complex relationships in text data. We will discuss key architectures like feed-forward networks and recurrent models (RNNs and LSTMs), highlighting their strengths and limitations in natural language processing (NLP).
Next, we will introduce the Transformer architecture, which forms the backbone of many modern language models, including BERT and GPT. We will explain how Transformers use self-attention mechanisms to process entire sequences in parallel, enabling more efficient and scalable text representations.
We will also cover transfer learning, emphasising how pre-trained language models can be adapted for specific tasks through fine-tuning, saving time and computational resources. Finally, we will introduce HuggingFace’s ecosystem as a practical tool for working with such models.
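A minimal sketch of transfer learning in practice via Hugging Face pipelines (both calls download pre-trained models from the Hub on first use; fine-tuning your own model is the subject of the seminar):

```python
from transformers import pipeline

# A model already fine-tuned for sentiment classification.
classifier = pipeline("sentiment-analysis")
print(classifier("The committee's report was a complete disaster."))

# Zero-shot classification: an NLI-based model applied to labels
# it was never explicitly trained on.
zero_shot = pipeline("zero-shot-classification")
print(zero_shot(
    "The government will raise income tax next year.",
    candidate_labels=["economy", "health", "education"],
))
```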
Essential readings:
- D. Jurafsky and J. H. Martin, “Speech and Language Processing” – Chapters 7 and 9
- M. Laurer et al., “Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI”, In: Political Analysis (2023).
Recommended readings:
- D. Jurafsky and J. H. Martin, “Speech and Language Processing” – Chapter 8
Seminar application: Fine-tuning Transformer-based models.