Schedule

1. Text As Data

In the first week of the course we will learn about the typical goals of quantitative text analysis projects. We will focus on the unifying goal of almost all text analysis projects: the measurement of some kind of social concept. We will discuss some of the core challenges of using statistical methods to characterise latent concepts with text data, and we will explore the criteria that we would ideally like to meet in applied projects.

We will also learn how to represent text as quantitative data. We will discuss the assumptions we make (often implicitly) when defining our corpora and representing texts in quantitative forms. In practical terms, we will learn about document-feature matrices, feature selection, stemming, lemmatization, and n-grams. We will also discuss how the selection of documents and features can be consequential for the outcomes and conclusions of any text-as-data analysis.

Essential readings:

  1. J. Grimmer, M. E. Roberts, and B. M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press, 2022. – Chapters 3, 4 and 5
  2. K. Benoit. “Text as data: An overview”. In: The SAGE Handbook of Research Methods in Political Science and International Relations. SAGE Publishing, London (forthcoming) (2020). – Available here

Recommended readings:

  1. J. Grimmer and B. M. Stewart. “Text as data: The promise and pitfalls of automatic content analysis methods for political texts”. In: Political Analysis 21.3 (2013), pp. 267–297.
  2. M. J. Denny and A. Spirling. “Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it”. In: Political Analysis 26.2 (2018), pp. 168–189.

Seminar application: Introduction to quanteda.
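
As a preview of the seminar, here is a minimal sketch of the document-feature-matrix workflow in quanteda; the two example “documents” are invented for illustration:

    # Build a corpus, tokenise, preprocess, and form a document-feature matrix
    library(quanteda)

    corp <- corpus(c(doc1 = "Parliament debated the budget today.",
                     doc2 = "The budget debate divided parliament again."))

    toks <- tokens(corp, remove_punct = TRUE) |>
      tokens_tolower() |>
      tokens_remove(stopwords("en")) |>
      tokens_wordstem()                      # stem features ("debated" -> "debat")

    toks_ng <- tokens_ngrams(toks, n = 1:2)  # unigrams and bigrams

    dfmat <- dfm(toks_ng)                    # rows are documents, columns are features
    dfmat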

2. Dictionaries

This week we will consider dictionary methods, which form a bridge between traditional qualitative approaches to text analysis and the quantitative methods that we study on this course. We will introduce some of the principles of good measurement in social science and discuss the essential role of validation in any text analysis project. We will learn how to apply and interpret dictionaries to capture a range of concepts relevant to a wide variety of questions in social science.

Essential readings:

  1. J. Grimmer, M. E. Roberts, and B. M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press, 2022. – Chapters 15 and 16
  2. B. E. Lauderdale. Pragmatic Social Measurement. In Progress, 2022. – Chapter 3, “Measurement Error”, available here

Recommended readings:

  1. A. D. Kramer, J. E. Guillory, and J. T. Hancock. “Experimental evidence of massive-scale emotional contagion through social networks”. In: Proceedings of the National Academy of Sciences 111.24 (2014), pp. 8788–8790.
  2. T. Loughran and B. McDonald. “When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks”. In: The Journal of Finance 66.1 (2011), pp. 35–65.
  3. L. Hargrave and J. Blumenau. “No Longer Conforming to Stereotypes? Gender, Political Style and Parliamentary Debate in the UK”. In: British Journal of Political Science (2021), pp. 1–18.

Seminar application: Constructing, applying and validating dictionaries.
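
As a preview of the seminar, a minimal sketch of constructing and applying a dictionary in quanteda; the categories and patterns below are invented for illustration, and real dictionaries require careful validation:

    library(quanteda)

    # A tiny illustrative dictionary; * is a glob wildcard
    dict <- dictionary(list(economy = c("budget", "tax*", "spend*"),
                            welfare = c("nhs", "pension*", "benefit*")))

    corp <- corpus(c(d1 = "The budget cuts taxes and pensions.",
                     d2 = "Spending on the NHS will rise."))

    # Count dictionary matches per document
    dfm_lookup(dfm(tokens(corp)), dictionary = dict)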

3. Similarity, Difference and Complexity

This week we will study methods for grouping texts that are similar and distinguishing between texts that are different. We will learn about the vector space model, which connects to many of the approaches that we cover later in the course. We will also learn about weighting strategies for text – such as tf-idf weighting – and about why wordclouds fail to use the full set of visual dimensions available for communicating meaning. Finally, we will consider methods for measuring the lexical complexity and readability of different texts.

Essential readings:

  1. J. Grimmer, M. E. Roberts, and B. M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press, 2022. – Chapters 7 and 11

Recommended readings:

  1. A. Hager and H. Hilbig. “Does public opinion affect political speech?” In: American Journal of Political Science 64.4 (2020), pp. 921–937.
  2. B. L. Monroe, M. P. Colaresi, and K. M. Quinn. “Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict”. In: Political Analysis 16.4 (2008), pp. 372–403.
  3. K. Benoit, K. Munger, and A. Spirling. “Measuring and explaining political sophistication through textual complexity”. In: American Journal of Political Science 63.2 (2019), pp. 491–508.
  4. A. Spirling. “Democratization and linguistic complexity: The effect of franchise extension on parliamentary discourse, 1832–1915”. In: The Journal of Politics 78.1 (2016), pp. 120–136.

Seminar application: Applying similarity metrics and describing differences between groups of texts.
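
As a preview of the seminar, a sketch of tf-idf weighting, cosine similarity, and readability scoring, using quanteda’s built-in corpus of US inaugural addresses (the quanteda.textstats package is assumed to be installed):

    library(quanteda)
    library(quanteda.textstats)

    dfmat <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE))

    # Down-weight ubiquitous words, up-weight distinctive ones
    dfmat_tfidf <- dfm_tfidf(dfmat)

    # Cosine similarities between addresses
    simil <- textstat_simil(dfmat_tfidf, method = "cosine")
    as.matrix(simil)[1:3, 1:3]

    # Flesch reading-ease score for each address
    head(textstat_readability(data_corpus_inaugural, measure = "Flesch"))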

4. Supervised Learning for Text

This week we will study methods for categorising texts into sets of pre-defined categories, a task which is an example of supervised machine learning. We will focus on the basic mechanics that structure a supervised learning analysis; introduce the Naive Bayes classifier as a model for text classification; and discuss strategies for assessing predictive performance, including training and test sets and cross-validation.

Essential readings:

  1. J. Grimmer, M. E. Roberts, and B. M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press, 2022. – Chapters 18, 19, and 20
  2. G. James, D. Witten, T. Hastie, et al. An Introduction to Statistical Learning. Springer, 2022. – Section 5.1, “Cross-Validation”

Recommended readings:

  1. S. Müller. “The temporal focus of campaign communication”. In: The Journal of Politics 84.1 (2022), pp. 585–590.

Seminar application: Supervised learning for text classification.
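
As a preview of the seminar, a sketch of a Naive Bayes classifier with a training/test split, using the labelled movie-review corpus shipped with quanteda.textmodels:

    library(quanteda)
    library(quanteda.textmodels)

    corp <- data_corpus_moviereviews          # 2,000 reviews labelled "pos"/"neg"
    dfmat <- dfm(tokens(corp, remove_punct = TRUE))

    set.seed(123)
    train <- sample(seq_len(ndoc(dfmat)), 1500)

    # Fit on the training set only
    nb <- textmodel_nb(dfmat[train, ], y = corp$sentiment[train])

    # Out-of-sample accuracy on the held-out test set
    preds <- predict(nb, newdata = dfmat[-train, ])
    mean(preds == corp$sentiment[-train])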

5. Designing Research Projects and Collecting Text Data

Thus far we have focused on ways of representing and manipulating digitized collections of text. But how do we go about collecting the texts that we want to analyse? This week we will discuss how data collection should be pursued when conducting academic research. We will think about different modes of data collection and the advantages and disadvantages of each when designing social research projects. We will cover several strategies for collecting text data, including using APIs and web-scraping. We will also discuss the legal and ethical challenges of working with data from the web.

Essential readings:

  1. M. J. Salganik. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019. – Chapter 6

Seminar application: Web-scraping and using APIs.
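
As a preview of the seminar, a sketch of scraping text with rvest; the URL and CSS selector are hypothetical placeholders, and any real scraping should respect a site’s terms of service and robots.txt:

    library(rvest)

    # Hypothetical page listing speeches
    page <- read_html("https://www.example.com/speeches")

    # Extract the text of each paragraph inside a (hypothetical) container
    speeches <- page |>
      html_elements(".speech-body p") |>
      html_text2()

    head(speeches)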

6. Text Scaling Models

When people read political texts, they often want to use them to infer the ideology or political position of the authors of those texts. Is it possible to do this type of inference using quantitative representations of texts? This week we study a series of approaches which aim to construct measures of political ideology from text. We discuss the assumptions that underpin these methods, strategies for validating them, and details of estimation and implementation.

Essential readings:

  1. J. Grimmer, M. E. Roberts, and B. M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press, 2022. – Chapter 21
  2. M. Laver, K. Benoit, and J. Garry. “Extracting policy positions from political texts using words as data”. In: American Political Science Review 97.2 (2003), pp. 311–331.
  3. J. B. Slapin and S. Proksch. “A scaling model for estimating time-series party positions from texts”. In: American Journal of Political Science 52.3 (2008), pp. 705–722.

Recommended readings:

  1. T. O’Grady. “Careerists versus coal-miners: welfare reforms and the substantive representation of social groups in the British Labour Party”. In: Comparative Political Studies 52.4 (2019), pp. 544–578.

Seminar application: Applying Wordscores and Wordfish.
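
As a preview of the seminar, a sketch of estimating a Wordfish model on the 2010 Irish budget debates shipped with quanteda.textmodels:

    library(quanteda)
    library(quanteda.textmodels)

    dfmat <- dfm(tokens(data_corpus_irishbudget2010, remove_punct = TRUE))

    # dir fixes the polarity of the latent dimension (the scale is
    # otherwise unidentified); here document 6 is set to the left of 5
    wf <- textmodel_wordfish(dfmat, dir = c(6, 5))

    # Estimated document positions (theta) and word parameters
    summary(wf)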

7. Topic Models

Almost all the methods we study on this course try to understand the ways in which words vary across documents and whether this variation conveys “meaning” of some sort. A key source of linguistic variation across documents is the topic that a document addresses. This week we study a class of models that can be used to infer the topics present in a set of documents and, subsequently, the degree to which each document is relevant to each topic. Topic models have become extraordinarily popular throughout the social sciences, largely because they are easy to apply to very large text corpora. We will introduce a canonical topic model – Latent Dirichlet Allocation (LDA) – and will discuss strategies for selecting the appropriate number of topics and for validating topic models. We will also discuss a number of extensions to this model.

Essential readings:

  1. J. Grimmer, M. E. Roberts, and B. M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press, 2022. – Chapters 12 and 13
  2. D. M. Blei. “Probabilistic topic models”. In: Communications of the ACM 55.4 (2012), pp. 77–84.
  3. M. E. Roberts, B. M. Stewart, D. Tingley, et al. “Structural topic models for open-ended survey responses”. In: American Journal of Political Science 58.4 (2014), pp. 1064–1082.

Recommended readings:

  1. L. Ying, J. M. Montgomery, and B. M. Stewart. “Topics, concepts, and measurement: a crowdsourced procedure for validating topics as measures”. In: Political Analysis (2021), pp. 1–20.
  2. D. M. Butler and J. L. Sutherland. “Have State Policy Agendas Become More Nationalized?” In: The Journal of Politics 85.1 (2023), pp. 000–000.

Seminar application: Applying and interpreting structural topic models.
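
As a preview of the seminar, a sketch of fitting a structural topic model with the stm package; K = 20 is an arbitrary choice for illustration, and selecting and validating K is a central theme of the week:

    library(quanteda)
    library(stm)

    dfmat <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE)) |>
      dfm_remove(stopwords("en")) |>
      dfm_trim(min_docfreq = 5)

    # quanteda dfms convert directly to stm's input format
    out <- convert(dfmat, to = "stm")

    fit <- stm(documents = out$documents, vocab = out$vocab,
               K = 20, verbose = FALSE)

    # Highest-probability words under each topic
    labelTopics(fit)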

8. Word Embeddings

How can we represent the “meaning” of words using quantitative methods? One answer is that we can understand a word’s meaning by looking at “the company it keeps”. That is, we can glean something about the substantive meaning of a word by paying attention to the specific contexts in which it appears. “Word embeddings” – a class of techniques which infer word meaning from the distribution of words that surround each term in the text – represent words as vectors of numbers, and these vectors turn out to have remarkable properties. This week we will discuss the idea behind word embeddings, strategies for estimating embedding vectors, and applications of embedding techniques to several social science questions.

Essential readings:

  1. J. Grimmer, M. E. Roberts, and B. M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press, 2022. – Chapters 7 and 8
  2. P. L. Rodriguez and A. Spirling. “Word embeddings: What works, what doesn’t, and how to tell the difference for applied research”. In: The Journal of Politics 84.1 (2022), pp. 101–115.

Recommended readings:

  1. L. Hargrave and J. Blumenau. “No Longer Conforming to Stereotypes? Gender, Political Style and Parliamentary Debate in the UK”. In: British Journal of Political Science (2021), pp. 1–18.
  2. A. Caliskan, J. J. Bryson, and A. Narayanan. “Semantics derived automatically from language corpora contain human-like biases”. In: Science 356.6334 (2017), pp. 183–186.
  3. E. Rodman. “A timely intervention: Tracking the changing meanings of political concepts with word vectors”. In: Political Analysis 28.1 (2020), pp. 87–111.
  4. A. C. Kozlowski, M. Taddy, and J. A. Evans. “The geometry of culture: Analyzing the meanings of class through word embeddings”. In: American Sociological Review 84.5 (2019), pp. 905–949.
  5. P. L. Rodriguez, A. Spirling, and B. M. Stewart. “Embedding Regression: Models for Context-Specific Description and Inference”. In: American Political Science Review (2021), pp. 1–20.

Seminar application: Using word embeddings to calculate similarities, compute analogies, and expand dictionaries.
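
As a preview of the seminar, a sketch of training small GloVe embeddings from a quanteda co-occurrence matrix with the text2vec package; a corpus this small yields noisy vectors, and applied work typically uses much larger corpora or pre-trained embeddings:

    library(quanteda)
    library(text2vec)

    toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
      tokens_tolower()

    # Feature co-occurrence matrix with a +/- 5-word window
    fcmat <- fcm(toks, context = "window", window = 5)

    glove <- GlobalVectors$new(rank = 50, x_max = 10)
    wv <- glove$fit_transform(fcmat, n_iter = 10) + t(glove$components)

    # Nearest neighbours of "freedom" in the embedding space
    sims <- sim2(wv, wv["freedom", , drop = FALSE], method = "cosine")
    head(sort(sims[, 1], decreasing = TRUE))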

9. Causal Inference with Text

How should we use quantitative text data when we are interested in estimating causal relationships? Causal questions are often central to social scientific research, but measuring causal effects with text data is difficult because language is inherently multidimensional. This week, we will spend time thinking about how to use text data in the context of causal analyses, particularly focusing on text as an outcome, text as a treatment, and text as a control variable.

Essential readings:

  1. J. Grimmer, M. E. Roberts, and B. M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press, 2022. – Chapters 25, 26 and 27
  2. J. D. Angrist and J. Pischke. Mastering ’Metrics: The Path from Cause to Effect. Princeton University Press, 2014. – For those requiring an introduction to or review of causal inference methods.

Recommended readings:

  1. N. Egami, C. J. Fong, J. Grimmer, et al. “How to make causal inferences using texts”. In: Science Advances 8.42 (2022), p. eabg2652.
  2. C. Fong and J. Grimmer. “Causal inference with latent treatments”. In: American Journal of Political Science (2021).
  3. M. E. Roberts, B. M. Stewart, and R. A. Nielsen. “Adjusting for confounding with text matching”. In: American Journal of Political Science 64.4 (2020), pp. 887–903.

Seminar application: Using text as an outcome and text as a treatment.
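
As a preview of the seminar, a sketch of using text as an outcome in a (hypothetical) randomised experiment: each open-ended response is reduced to a one-dimensional dictionary score, and the treatment effect on that score is estimated by regression. The data frame and dictionary below are invented for illustration:

    library(quanteda)

    d <- data.frame(
      treated  = c(1, 0, 1, 0),
      response = c("I strongly support the policy.",
                   "The policy is a disaster.",
                   "A sensible and fair reform.",
                   "An unfair and costly mistake."))

    dict <- dictionary(list(pos = c("support", "sensible", "fair"),
                            neg = c("disaster", "unfair", "costly")))

    scores <- convert(dfm_lookup(dfm(tokens(d$response)), dict),
                      to = "data.frame")
    d$score <- scores$pos - scores$neg

    # Difference in means = estimated average treatment effect on the score
    summary(lm(score ~ treated, data = d))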

10. Review and Large Language Models

This week we review some of the general lessons we have learned throughout the course, drawing connections between the methods that we have studied. We will also introduce large language models – such as BERT and ChatGPT – which represent the current state-of-the-art in quantitative text analysis. We will develop a heuristic understanding of how these models work, consider some potential applications of such models in the social sciences, and discuss some of the ethical issues that arise from such models.

Essential readings:

  1. G. Marcus. “AI’s Jurassic Park Moment”. In: Substack: The Road to AI We Can Trust (2022). URL: https://garymarcus.substack.com/p/ais-jurassic-park-moment.
  2. M. Shanahan. “Talking About Large Language Models”. In: arXiv preprint arXiv:2212.03551 (2022).

Seminar application: Developing your project.