1. Representing Text as Data (I): Bag-of-Words
In the first week of the course we will learn about typical goals of quantitative text analysis projects. We will focus on the fact that a unifying goal of almost all text analysis projects is the measurement of some kind of social concept. We will discuss some of the core challenges of using statistical methods to characterise latent concepts using text data and we will explore the criteria that we would ideally like to meet in applied projects.
We will also learn one common approach to representing text as quantitative data – the bag-of-words approach. We will discuss the assumptions we make (often implicitly) when defining our corpora and how we represent texts in quantitative forms. In practical terms, we will learn about document-feature matrices, feature selection, stemming, lemmatization, and n-grams. We will also discuss how selection of documents and features can be consequential for the outcomes and conclusions of any text-as-data analyses.
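As a rough illustration of the bag-of-words idea, the sketch below builds a small document-feature matrix using only the Python standard library. The toy corpus, the simple tokenizer, and the function names (`tokenize`, `ngrams`, `document_feature_matrix`) are assumptions made for this example and are not the tools used in the seminars.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def ngrams(tokens, n):
    """Return the n-grams of a token list as underscore-joined strings."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_feature_matrix(docs, n=1):
    """Count n-gram features per document.

    Returns (features, matrix): a sorted feature list and one row of
    counts per document, with columns in the same order as `features`.
    """
    counts = [Counter(ngrams(tokenize(doc), n)) for doc in docs]
    features = sorted(set().union(*counts))
    matrix = [[c[f] for f in features] for c in counts]
    return features, matrix

# Hypothetical toy corpus, for illustration only.
docs = [
    "The minister cut taxes.",
    "Taxes rise; the minister resigns.",
]

features, dfm = document_feature_matrix(docs, n=1)
print(features)
for row in dfm:
    print(row)

# Bigrams recover some of the local word order that unigram counts discard.
bigram_features, bigram_dfm = document_feature_matrix(docs, n=2)
print(bigram_features)
```

Each row of the matrix records how often each feature occurs in a document and nothing about where it occurs, which is the core bag-of-words assumption. Choices such as lowercasing, punctuation removal, and the value of `n` change the set of columns and therefore can change downstream results.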
Finally, we will consider a very simple text analysis approach – dictionary methods – which form a bridge between traditional qualitative approaches and the quantitative methods that we study on this course. We will introduce some of the principles of good measurement in social science and discuss the essential role of validation in any text analysis project. We will learn how to apply and interpret dictionaries to capture a range of concepts relevant to a wide variety of questions in social science.
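A minimal sketch of how a dictionary might be applied is given below. The two-category dictionary, the example sentence, and the `dictionary_score` function are hypothetical and chosen only to make the mechanics concrete; real dictionaries are far larger and need validation against human judgement.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def dictionary_score(text, dictionary):
    """Share of a document's tokens that fall in each dictionary category."""
    tokens = tokenize(text)
    if not tokens:
        return {category: 0.0 for category in dictionary}
    return {
        category: sum(tok in words for tok in tokens) / len(tokens)
        for category, words in dictionary.items()
    }

# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {
    "positive": {"good", "great", "success"},
    "negative": {"bad", "failure", "crisis"},
}

speech = "The reform was a great success, not a failure."
scores = dictionary_score(speech, toy_dictionary)
print(scores)
```

Note that "not a failure" still adds to the negative score here, because the dictionary counts words without regard to context. Errors of this kind are one reason why validation is essential before treating dictionary scores as measures of a concept.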
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart., Text as Data: A New Framework for Machine Learning and the Social Sciences., Princeton University Press, 2022. – Chapters 3, 4 and 5
- K. Benoit., “Text as data: An overview”., In: The SAGE Handbook of Research Methods in Political Science and International Relations, SAGE Publishing, London (forthcoming) (2020). – Available here
- B. E. Lauderdale., Pragmatic Social Measurement., In Progress, 2022. – Chapter 3, “Measurement Error”, available here
- J. Grimmer, M. E. Roberts, and B. M. Stewart., Text as Data: A New Framework for Machine Learning and the Social Sciences., Princeton University Press, 2022. – Chapters 7 and 11
Recommended readings:
- J. Grimmer and B. M. Stewart., “Text as data: The promise and pitfalls of automatic content analysis methods for political texts”., In: Political Analysis 21.3 (2013), pp. 267–297.
- M. J. Denny and A. Spirling., “Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it”., In: Political Analysis 26.2 (2018), pp. 168–189.
- A. D. Kramer, J. E. Guillory, and J. T. Hancock., “Experimental evidence of massive-scale emotional contagion through social networks”., In: Proceedings of the National Academy of Sciences 111.24 (2014), pp. 8788–8790.
- T. Loughran and B. McDonald., “When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks”., In: The Journal of Finance 66.1 (2011), pp. 35–65.
- L. Hargrave and J. Blumenau., “No Longer Conforming to Stereotypes? Gender, Political Style and Parliamentary Debate in the UK”., In: British Journal of Political Science (2021), pp. 1–18.
Seminar application: Text-as-data.
2. Similarities, Differences, Complexity
This week we will study methods for grouping texts that are similar and distinguishing between texts that are different. We will learn about the vector space model, which connects to many of the approaches that we cover later in the course. We will also learn about weighting strategies for text – such as tf-idf weighting – and we will learn about why wordclouds fail to use the full set of visual dimensions available for communicating meaning. Finally, we will consider methods for measuring the lexical complexity and readability of different texts.
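To make the vector space model and tf-idf weighting concrete, the sketch below weights term counts and compares documents with cosine similarity. It assumes one common variant of tf-idf (raw count multiplied by log(N / document frequency)); other variants exist, and the toy documents and function names are invented for this example.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Weight raw term counts by log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy documents, already tokenized.
docs = [
    "tax cuts help the economy".split(),
    "tax rises hurt the economy".split(),
    "the match ended in a draw".split(),
]

vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])
sim_02 = cosine_similarity(vecs[0], vecs[2])
print(sim_01, sim_02)
```

Because "the" appears in every document, its weight is zero, so the first and third documents share no weighted features even though they share a word. The first two documents share "tax" and "economy" and so have a positive similarity.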
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart., Text as Data: A New Framework for Machine Learning and the Social Sciences., Princeton University Press, 2022. – Chapters 15 and 16
Recommended readings:
- A. Hager and H. Hilbig., “Does public opinion affect political speech?”, In: American Journal of Political Science 64.4 (2020), pp. 921–937.
- B. L. Monroe, M. P. Colaresi, and K. M. Quinn., “Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict”., In: Political Analysis 16.4 (2008), pp. 372–403.
- K. Benoit, K. Munger, and A. Spirling., “Measuring and explaining political sophistication through textual complexity”., In: American Journal of Political Science 63.2 (2019), pp. 491–508.
- A. Spirling., “Democratization and linguistic complexity: The effect of franchise extension on parliamentary discourse, 1832–1915”., In: The Journal of Politics 78.1 (2016), pp. 120–136.
Seminar application: Constructing, applying and validating dictionaries.
Seminar application: Applying similarity metrics and describing differences between groups of texts.
3. Language Models (I): Supervised Learning for Text
This week we will study methods for categorising texts into sets of pre-defined categories, which is an example of supervised machine learning. We will focus on the basic mechanics that structure a supervised learning analysis; introduce the Naive Bayes classifier as a model for text classification; and discuss strategies for assessing predictive performance, including training and test sets and cross-validation.
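The sketch below shows one way the mechanics of a Naive Bayes text classifier can be written down: a multinomial model with add-one (Laplace) smoothing, fitted on a training set and evaluated on held-out documents. The labelled data, the smoothing parameter `alpha`, and the function names are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from (tokens, label) pairs."""
    class_counts = Counter(label for _, label in labelled_docs)
    word_counts = defaultdict(Counter)
    for tokens, label in labelled_docs:
        word_counts[label].update(tokens)
    vocab = {tok for tokens, _ in labelled_docs for tok in tokens}
    n_docs = len(labelled_docs)
    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for label, count in class_counts.items():
        model["log_prior"][label] = math.log(count / n_docs)
        # Smoothing adds alpha to every word count so no probability is zero.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_lik"][label] = {
            w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
        }
    return model

def predict(model, tokens):
    """Return the label with the highest posterior log-probability."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        # Words never seen in training are ignored.
        scores[label] = log_prior + sum(
            model["log_lik"][label][t] for t in tokens if t in model["vocab"]
        )
    return max(scores, key=scores.get)

# Hypothetical labelled documents, for illustration only.
training_set = [
    ("cut taxes grow economy".split(), "economy"),
    ("economy jobs taxes budget".split(), "economy"),
    ("hospital nurses waiting lists".split(), "health"),
    ("doctors hospital funding nurses".split(), "health"),
]
held_out = [
    ("budget taxes".split(), "economy"),
    ("nurses doctors".split(), "health"),
]

model = train_naive_bayes(training_set)
predictions = [predict(model, tokens) for tokens, _ in held_out]
accuracy = sum(p == y for p, (_, y) in zip(predictions, held_out)) / len(held_out)
print(predictions, accuracy)
```

Accuracy is computed only on documents that were not used for fitting, which is the logic behind training and test sets. Cross-validation repeats this split several times so that every document is held out once.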
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart., Text as Data: A New Framework for Machine Learning and the Social Sciences., Princeton University Press, 2022. – Chapters 18, 19, and 20
- G. James, D. Witten, T. Hastie, et al., An Introduction to Statistical Learning., Springer, 2022. – Chapter 5.1, “Cross-Validation”
Recommended readings:
- S. Müller., “The temporal focus of campaign communication”., In: The Journal of Politics 84.1 (2022), pp. 585–590.
Seminar application: Supervised learning for text classification.
4. Language Models (II): Topic Models
Almost all the methods we study on this course try to understand the ways in which words vary across documents and whether this variation conveys “meaning” of some sort. A key source of linguistic variation across documents is the topic to which a document is addressed. This week we study a class of models that can be used to infer the topics present in a set of documents and, subsequently, the degree to which each document is relevant to each topic. Topic models have become extraordinarily popular throughout the social sciences, largely thanks to the fact that they are easy to apply to very large text corpora. We will introduce a canonical topic model – Latent Dirichlet Allocation (LDA) – and will discuss strategies for selecting the appropriate number of topics and for validating topic models. We will also discuss a number of extensions to this model.
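One way to build intuition for LDA is to simulate the data-generating story it assumes: each document draws its own topic proportions, and each word is produced by first picking a topic and then picking a word from that topic. The sketch below simulates this process with hand-written topics; it does not estimate a topic model. The vocabulary, the two topic-word distributions, and the function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, doc_alpha, vocab, length, rng):
    """Generate one document under the LDA generative story."""
    # Step 1: draw this document's topic proportions.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # Step 2: draw a topic for this token from the document's proportions.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Step 3: draw a word from the chosen topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

# Hypothetical vocabulary and topic-word distributions (each row sums to 1).
vocab = ["tax", "budget", "economy", "hospital", "nurse", "doctor"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # an "economy" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "health" topic
]

rng = random.Random(0)  # fixed seed so the simulation is reproducible
theta, words = generate_document(topic_word, [0.5, 0.5], vocab, 10, rng)
print(theta)
print(words)
```

Fitting a topic model runs this logic in reverse: given only the observed words, it infers plausible topic-word distributions and document-level topic proportions.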
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart., Text as Data: A New Framework for Machine Learning and the Social Sciences., Princeton University Press, 2022. – Chapters 12 and 13
- D. M. Blei., “Probabilistic topic models”., In: Communications of the ACM 55.4 (2012), pp. 77–84.
- M. E. Roberts, B. M. Stewart, D. Tingley, et al., “Structural topic models for open-ended survey responses”., In: American Journal of Political Science 58.4 (2014), pp. 1064–1082.
Recommended readings:
- L. Ying, J. M. Montgomery, and B. M. Stewart., “Topics, concepts, and measurement: a crowdsourced procedure for validating topics as measures”., In: Political Analysis (2021), pp. 1–20.
- D. M. Butler and J. L. Sutherland., “Have State Policy Agendas Become More Nationalized?”, In: The Journal of Politics 85.1 (2023), pp. 000–000.
Seminar application: Applying and interpreting structural topic models.
5. Designing Research Projects and Collecting Text Data
Thus far we have focused on ways of representing and manipulating digitized collections of text. But how do we go about collecting texts that we want to analyse? This week we will discuss how data collection should be pursued when conducting academic research. We will think about different modes of data collection and the advantages and disadvantages of each of them when designing social research projects. We will cover several strategies for collecting text data, including using APIs and web-scraping. We will also discuss the legal and ethical challenges for working with data from the web.
Essential readings:
- M. J. Salganik., Bit by bit: Social research in the digital age., Princeton University Press, 2019. – Chapter 6
Seminar application: Web-scraping and using APIs.
6. Causal Inference with Text Data
How should we use quantitative text data when we are interested in estimating causal relationships? Causal questions are often central to social scientific research, but measuring causal effects with text data is difficult because language is inherently multidimensional. This week, we will spend time thinking about how to use text data in the context of causal analyses, particularly focusing on text as an outcome, text as a treatment, and text as a control variable.
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart., Text as Data: A New Framework for Machine Learning and the Social Sciences., Princeton University Press, 2022. – Chapters 25, 26 and 27
- J. D. Angrist and J. Pischke., Mastering ’Metrics: The Path from Cause to Effect., Princeton University Press, 2014. – For those requiring an introduction/review of causal inference methods.
Recommended readings:
- N. Egami, C. J. Fong, J. Grimmer, et al., “How to make causal inferences using texts”., In: Science Advances 8.42 (2022), p. eabg2652.
- C. Fong and J. Grimmer., “Causal inference with latent treatments”., In: American Journal of Political Science (2021).
- M. E. Roberts, B. M. Stewart, and R. A. Nielsen., “Adjusting for confounding with text matching”., In: American Journal of Political Science 64.4 (2020), pp. 887–903.
Seminar application: Using text as an outcome and text as a treatment.
7. Representing Text as Data (II): Word Embeddings
How can we represent the “meaning” of words using quantitative methods? One answer to this question is that we can understand a word’s meaning by looking at the “company it keeps”. That is, we can glean something about the substantive meaning of a word by paying attention to the specific contexts in which words appear. “Word-embeddings” – a class of techniques which infer word meaning from the distribution of words that surround each term in the text – represent words as vectors of numbers, which turn out to have pretty remarkable properties. This week we will discuss the idea behind word-embeddings, strategies for estimating embedding vectors, and applications of embedding-techniques to several social science questions.
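The distributional idea can be illustrated without any estimation at all: the sketch below represents each word by counts of the words that appear near it and compares words by cosine similarity. This is only the raw intuition, under the assumption of a tiny invented corpus and a window of two words; estimated embeddings are typically dense and much lower-dimensional than these count vectors.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words appearing within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two Counter-based context vectors."""
    dot = sum(c * v[w] for w, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, for illustration only.
sentences = [
    "the senator praised the bill".split(),
    "the minister praised the bill".split(),
    "the senator opposed the motion".split(),
    "the minister opposed the motion".split(),
    "voters read the bill".split(),
]

vectors = cooccurrence_vectors(sentences)
sim_people = cosine(vectors["senator"], vectors["minister"])
sim_mixed = cosine(vectors["senator"], vectors["bill"])
print(sim_people, sim_mixed)
```

In this corpus "senator" and "minister" occur in identical contexts, so their vectors point in the same direction, while "senator" and "bill" keep different company and are less similar.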
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart., Text as Data: A New Framework for Machine Learning and the Social Sciences., Princeton University Press, 2022. – Chapters 7 and 8
- P. L. Rodriguez and A. Spirling., “Word embeddings: What works, what doesn’t, and how to tell the difference for applied research”., In: The Journal of Politics 84.1 (2022), pp. 101–115.
Recommended readings:
- L. Hargrave and J. Blumenau., “No Longer Conforming to Stereotypes? Gender, Political Style and Parliamentary Debate in the UK”., In: British Journal of Political Science (2021), pp. 1–18.
- A. Caliskan, J. J. Bryson, and A. Narayanan., “Semantics derived automatically from language corpora contain human-like biases”., In: Science 356.6334 (2017), pp. 183–186.
- E. Rodman., “A timely intervention: Tracking the changing meanings of political concepts with word vectors”., In: Political Analysis 28.1 (2020), pp. 87–111.
- A. C. Kozlowski, M. Taddy, and J. A. Evans., “The geometry of culture: Analyzing the meanings of class through word embeddings”., In: American Sociological Review 84.5 (2019), pp. 905–949.
- P. L. Rodriguez, A. Spirling, and B. M. Stewart., “Embedding Regression: Models for Context-Specific Description and Inference”., In: American Political Science Review (2021), pp. 1–20.
Seminar application: Using word-embeddings to calculate similarities, compute analogies, and expand dictionaries.
8. Representing Text as Data (III): Word Sequences
[Placeholder, please change!]
In this lecture, we will explore how sequences of words can be represented as data, enabling the analysis of linguistic patterns beyond individual words. We will discuss sequential models that account for the order of words in a text, capturing syntactic and semantic relationships often missed by simpler representations like bag-of-words.
We will introduce n-gram models, which represent sequences of n consecutive words, and discuss how these models can capture context and improve text analysis tasks like language modeling and sentiment analysis. We will also cover key challenges such as sparsity and computational complexity when working with larger n-grams.
Building on this, we will explore more advanced sequence models, including recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), which can learn contextual dependencies in text sequences. While we won’t delve into all the technical details, we will emphasize how these models have been used to address tasks like text generation, named entity recognition, and machine translation.
Essential readings:
- J. Grimmer, M. E. Roberts, and B. M. Stewart., Text as Data: A New Framework for Machine Learning and the Social Sciences., Princeton University Press, 2022. – Chapters 8 and 9
- Maybe something from https://web.stanford.edu/~jurafsky/slp3/ ?
Recommended readings:
Seminar application: Applying sequence models to text analysis tasks.
9. Language Models (III): Neural Networks, Transfer Learning, and Transformer Models
[Placeholder, please change!]
This week, we examine how cutting-edge language models harness the power of neural networks and transfer learning to achieve state-of-the-art performance in text analysis tasks.
We will start by introducing the concept of neural networks and how they can learn complex relationships in text data. We will discuss key architectures like feed-forward networks and recurrent models (RNNs and LSTMs), highlighting their strengths and limitations in natural language processing (NLP).
Next, we will introduce the Transformer architecture, which forms the backbone of many modern language models, including BERT and GPT. We will explain how Transformers use self-attention mechanisms to process entire sequences in parallel, enabling more efficient and scalable text representations.
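The sketch below gives a stripped-down view of scaled dot-product self-attention. As a simplifying assumption it uses the input vectors directly as queries, keys, and values; actual Transformer layers apply learned projection matrices first and use several attention heads. The input vectors and function names are invented for illustration.

```python
import math

def softmax(xs):
    """Convert scores to positive weights that sum to one."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Compare this token's vector with every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # The output is a weighted average of all token vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, X)) for c in range(d)])
    return outputs, weights

# Hypothetical sequence of three 2-dimensional token vectors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(X)
for row in attn:
    print(row)
```

Each token's output depends on the whole sequence but not on the outputs for other tokens, so all rows can be computed at once as matrix operations. This independence is what allows parallel processing, in contrast to recurrent models that step through a sequence one token at a time.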
We will also cover transfer learning, emphasising how pre-trained language models can be adapted for specific tasks through fine-tuning, saving time and computational resources. Finally, we will introduce HuggingFace’s ecosystem as a practical tool for working with such models.
Essential readings:
- Maybe something from https://web.stanford.edu/~jurafsky/slp3/ ?
Recommended readings:
Seminar application: Fine-tuning Transformer-based models.