We do not have knowledge of a thing until we have grasped its why, that is to say, its cause.
Aristotle, Physics
Throughout this course we have been interested in text-as-data approaches to quantitative measurement
Typically, social scientists are interested not only in measuring different quantities, but also in explaining how these quantities vary
Many of the most interesting and important research questions investigate causal relationships between phenomena
Causal inference involves questions about counterfactual outcomes
Quantitative text analyses that answer causal questions are a new and important area of research
Causality: The relationship between events where one set of events (the effects/outcomes) is a direct consequence of another set of events (the causes/treatments).
Causal Inference: The process by which one can use data to make claims about causal relationships.
Goal: Identify the effect of a treatment variable, \(D_i\), on an outcome variable \(Y_i\), sometimes by use of covariates \(X_i\).
Definition: Outcome
\(Y_i\) is the observed value of the outcome variable of interest for unit \(i\). For example, \(Y_i = 1\) if speech \(i\) contains aggressive language and \(Y_i = 0\) otherwise.
(Defined here for the binary case, but we can generalise to continuous outcomes)
Definition: Treatment
\(D_i\): treatment status (causal variable) for unit \(i\)
\[ D_i = \left\{ \begin{array}{ll} 1 & \mbox{if unit $i$ received the treatment}\\ 0 & \mbox{otherwise}. \end{array} \right. \]
E.g. \(D_i = 1\) if the speaker of speech \(i\) is male and \(D_i = 0\) if the speaker is female.
(Defined here for the binary case, but we can generalise to continuous treatments)
Definition: Covariate
\(X_i\): Observed covariate of interest for unit \(i\)
Definition: Potential Outcome
\(Y_{0i}\) and \(Y_{1i}\): Potential outcomes for unit \(i\)
\[ Y_{di} = \left\{ \begin{array}{ll} Y_{1i} & \mbox{potential outcome for unit $i$ with treatment ($d = 1$)}\\ Y_{0i} & \mbox{potential outcome for unit $i$ without treatment ($d = 0$)} \end{array} \right. \]
If \(D_i = 1\), only \(Y_{1i}\) is realised/observed. \(Y_{0i}\) is what the outcome would have been if \(D_i\) had been 0
If \(D_i = 0\), only \(Y_{0i}\) is realised/observed. \(Y_{1i}\) is what the outcome would have been if \(D_i\) had been 1
\(\rightarrow\) potential outcomes are fixed attributes for each \(i\) and represent the outcome that would be observed hypothetically if \(i\) were treated/untreated
Definition: Causal Effect
For each unit \(i\), the causal effect of the treatment on the outcome is defined as the difference between its two potential outcomes: \[ \tau_i \equiv Y_{1i} - Y_{0i} \]
\(\tau_i\) is the difference between two hypothetical states of the world
Definition: Fundamental Problem of Causal Inference
We cannot observe both potential outcomes \((Y_{1i},Y_{0i})\) for the same unit \(i\)
Causal inference is difficult because it is about something we can never see.
–Paul Rosenbaum, Observation and Experiment
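A minimal simulation sketch in R makes this concrete (all quantities below are simulated for illustration): we can construct both potential outcomes for every unit, but any realised dataset reveals only one of them.

```r
# The fundamental problem: only one potential outcome is ever observed
set.seed(99)
n  <- 1000
y0 <- rnorm(n)                          # Y_0i: potential outcome without treatment
y1 <- y0 + 2                            # Y_1i: potential outcome with treatment (tau_i = 2)
d  <- rbinom(n, size = 1, prob = 0.5)   # randomised treatment assignment
y  <- ifelse(d == 1, y1, y0)            # observed outcome: the other potential outcome is lost

mean(y1 - y0)                           # true ATE (knowable only because we simulated both)
mean(y[d == 1]) - mean(y[d == 0])       # difference in group means, close to the ATE here
```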
Definition: Difference in Group Means (DIGM)
Suppose units \(1,\dots,m\) are treated (\(D_i = 1\)) and units \(m+1,\dots,n\) are untreated (\(D_i = 0\)). Then
\[ \text{DIGM} \equiv \frac{1}{m}\sum_{i=1}^m Y_{i} - \frac{1}{n-m}\sum_{i=m+1}^n Y_{i} \]
Problem: \(\text{DIGM}\) captures two different quantities:
\[ \underbrace{E[Y_{1i}|D_i=1] - E[Y_{0i}|D_i=1]}_{\text{average treatment effect on the treated}} + \underbrace{E[Y_{0i}|D_i=1] - E[Y_{0i}|D_i=0]}_{\text{selection bias}} \]
For example:
\(Y_i = 1\) if speech \(i\) contains aggressive language
\(Y_i = 0\) if speech \(i\) does not contain aggressive language
\(D_i = 1\) if the speaker for speech \(i\) is male
\(D_i = 0\) if the speaker for speech \(i\) is female
Imagine that:
\[ \frac{1}{m}\sum_{i=1}^m Y_{i} - \frac{1}{n-m}\sum_{i=m+1}^n Y_{i} > 0 \]
where speeches \(1,\dots,m\) are given by men and speeches \(m+1,\dots,n\) by women.
What would you conclude?
Treatment effect: Men are more aggressive than women
Selection effect: male and female speakers may differ in other ways that affect aggressive language, e.g. they may speak in different debates or on different topics
Confounding/selection bias occurs if and only if there is an omitted variable, \(X\), that: (1) is associated with the treatment, \(D\); and (2) causally affects the outcome, \(Y\).
If such a variable exists, then the simple comparison of group means will not measure the causal effect of \(D\) on \(Y\).
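A small simulation sketch of this bias (all variables below are hypothetical and simulated): \(X\) drives both treatment take-up and the baseline outcome, so the raw difference in group means overstates the true effect, while conditioning on \(X\) recovers it.

```r
# Selection bias from an omitted variable X (simulated illustration)
set.seed(99)
n  <- 10000
x  <- rnorm(n)                      # omitted variable
d  <- rbinom(n, 1, plogis(2 * x))   # X is associated with treatment take-up
y0 <- x + rnorm(n)                  # X also affects the outcome
y1 <- y0 + 1                        # true treatment effect = 1
y  <- ifelse(d == 1, y1, y0)

mean(y[d == 1]) - mean(y[d == 0])   # > 1: treatment effect plus selection bias
coef(lm(y ~ d + x))["d"]            # controlling for X recovers approximately 1
```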
Several commonly used strategies to overcome the selection bias problem:
Randomization (experiments)
Controlling for Confounders (regression, matching)
Other strategies (RDD, IV, etc)
Every week on this course we have studied methods that allow us to take a corpus and create low-dimensional summaries of the texts contained within it
We can think of each summary as a function, \(g()\), which we apply to the dfm word counts, \(W_i\), and which results in a particular quantity, \(\pi_i\), for each document
Each mapping function produces a simplification of the text
There is no single “correct” mapping function, as it will depend on the research question of interest
Researchers typically do not know \(g()\) before they have seen their data
We typically need to look at the data in order to work out a good mapping function
Good measurement often requires iterating between exploration and validation, so looking at the data is essential to generating valid \(g()\) functions
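As a concrete, deliberately simple sketch, here is one possible \(g()\) in R with quanteda: the share of each document's tokens that match an "aggressive language" dictionary. The texts and the dictionary are hypothetical toy examples, not from any real study.

```r
# One possible mapping function g(): dictionary word share per document
library(quanteda)

texts <- c(doc1 = "this is an outrage and a disgrace",
           doc2 = "I thank the honourable member for the question")
dfmat <- dfm(tokens(texts))   # W_i: document-feature matrix of word counts

# Hypothetical dictionary of "aggressive" terms
aggr_dict <- dictionary(list(aggressive = c("outrage", "disgrace", "attack")))

counts <- convert(dfm_lookup(dfmat, aggr_dict), to = "data.frame")$aggressive
pi_i   <- counts / ntoken(dfmat)   # pi_i = g(W_i), one value per document
pi_i
```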
Text data might be used to construct measures for any of \(Y\), \(D\), or \(X\).
Text as Outcome
Text as Treatment
Text as Control
In each case, the conclusions of our causal analysis will depend on the mapping function used.
Text is a rich source of information about the opinions, views and responses of individuals.
Most large text datasets collected in political science so far have used text as the outcome
This includes a long history of manual coding of open-ended survey responses and manual content analysis of documents.
Estimand: Text as Outcome
\[ \tau \equiv E[Y_{1i} - Y_{0i}] = E[g(W_{1i}) - g(W_{0i})] \]
Intuition: We want to know how the potential outcome of the mapping function, \(g()\), differs between treatment and control conditions.
What assumptions do we need in order to estimate this quantity?
Independence assumption: text as outcome
\[ g(W_{0i}), g(W_{1i}) \perp\!\!\!\perp D_i \]
or
\[ g(W_{0i}), g(W_{1i}) \perp\!\!\!\perp D_i|X_i \]
Intuition:
If \(D\) is randomized (independent of the potential outcomes), then we can use the difference in means as an estimator of the average treatment effect.
If \(D\) is as good as random, conditional on \(X\), then we can estimate the average treatment effect by controlling for \(X\)
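A sketch of the estimation step, with hypothetical vectors pi (the \(g(W_i)\) scores for each document), d (the treatment indicator), and x (a covariate):

```r
# Text-as-outcome estimation (hypothetical pi, d, x)
mean(pi[d == 1]) - mean(pi[d == 0])   # difference in means: valid if D is randomised
summary(lm(pi ~ d))                   # same estimate, with standard errors
summary(lm(pi ~ d + x))               # if D is as good as random conditional on X
```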
The existence of the mapping function in any causal text analysis can create difficulties.
Problem one: Overfitting and Fishing
Typically, we look at the data in order to generate a good mapping function
This creates the potential for overfitting and fishing
E.g. continuously “refine” a dictionary until the scores from that dictionary differ significantly between treatment and control units
Problem two: Violations of Causal Assumptions
Causal inference methods require assumptions to identify average causal effects
A key assumption, the Stable Unit Treatment Value Assumption (SUTVA), requires that each unit’s potential outcomes are not affected by any other unit’s treatment status
In many methods we study, the mapping function, \(g()\), depends on the texts of all units. E.g. topic models, where the estimated topics (and hence each document’s topic proportions) depend on the full corpus
These represent SUTVA violations by design, because the measure for each unit will be dependent on all other units
Problem: simultaneously discovering \(g()\) and estimating sources of variation (causes) in \(g()\) can lead to erroneous conclusions.
Solution: Split the discovery and estimation of variation in \(g()\) into separate parts of the research process
Three alternative approaches:
1. Define \(g()\) before looking at the documents
2. Run sequential studies
3. Use a train/test split (see the sketch below)
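A sketch of the third approach in R, assuming a hypothetical dfm dfmat, hypothetical metadata meta with a treatment indicator, and a dictionary-based mapping function like the one sketched earlier (aggr_dict is again hypothetical):

```r
# Train/test split: discover g() on one half, estimate effects on the other
library(quanteda)
set.seed(99)
train_ids <- sample(seq_len(ndoc(dfmat)), size = floor(ndoc(dfmat) / 2))

dfm_train <- dfmat[train_ids, ]   # discovery set: explore freely, fit models,
dfm_test  <- dfmat[-train_ids, ]  # refine dictionaries; fix g() here only

# Suppose exploration of dfm_train leads us to a dictionary-based g():
g <- function(x) convert(dfm_lookup(x, aggr_dict), to = "data.frame")[, 2] / ntoken(x)

pi_test <- g(dfm_test)                      # apply the *frozen* mapping to held-out texts
t.test(pi_test ~ meta$treat[-train_ids])    # estimate on the test set only
```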
Example: respondents in a survey experiment were randomly assigned to one of the two vignettes below, and the resulting text data were analysed using a structural topic model (i.e. \(g()\))
Treatment
A 28-year-old single man, a citizen of another country, was convicted of illegally entering the US. Prior to this offense, he had served two previous prison sentences each more than a year. One of these previous sentences was for a violent crime and he had been deported back to his home country.
Control
A 28-year-old single man, a citizen of another country, was convicted of illegally entering the US. Prior to this offense, he had never been imprisoned before.
(Here, \(g()\) also includes an element of manual coding.)
Estimand: Text as Treatment
\[ E[Y_{1i} - Y_{0i}] \]
where
\(Y_{1i}\) is the potential outcome for unit \(i\) when the text assigned to unit \(i\) has a value of \(g(W_i) = 1\)
\(Y_{0i}\) is the potential outcome for unit \(i\) when the text assigned to unit \(i\) has a value of \(g(W_i) = 0\)
Intuition: We want to know how the potential outcomes for a unit differ between treatment and control conditions, where the treatment status is determined by the text.
Previously, we said that the difference in means is an unbiased estimator if the treatment is randomized.
Independence assumption: text as treatment
\[ Y_{0i},Y_{1i} \perp\!\!\!\perp g(W_i) \]
Here we also require that the output of the mapping function does not correlate with other features of the texts that might affect the outcome:
Sufficiency assumption: text as treatment
We have to assume either that
1. the measured treatment, \(g(W_i)\), captures all features of the text that affect the outcome, or
2. any other latent features of the text that affect the outcome are uncorrelated with \(g(W_i)\)
Implication: Randomization of texts alone is insufficient to identify the causal effect of a latent treatment.
Grimmer and Fong (2021) investigate which topics of Donald Trump tweets (\(D\)) were most/least appealing to voters (\(Y\))
\(g()\) is a type of topic model estimated on 752 tweets \(\rightarrow D_i\), the topic of each tweet
Donald Trump tweets randomly allocated to online survey respondents who evaluate them (5-point scale, great to terrible)
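A first-pass analysis of such an experiment might regress evaluations on topic indicators. This is a sketch with a hypothetical data frame evals (one row per respondent-tweet pair); as the discussion below notes, it does not by itself rule out latent confounding.

```r
# Regress tweet evaluations on estimated tweet topics
# (hypothetical: rating on the 5-point scale, topic = estimated topic of the tweet shown)
fit <- lm(rating ~ factor(topic), data = evals)
summary(fit)   # coefficients: mean favourability differences vs the baseline topic
```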
Question: Does this experiment allow us to estimate the causal effect of the topics on tweet favourability?
Potential problem: If the topics correlate with other features of the text (\(Z\)), we cannot necessarily attribute observed differences in tweet favourability to the topics!
Which other features might correlate with these topics?
Whenever we have a latent treatment concept, we have to assume that our measured treatment, \(g(W_i)\), is uncorrelated with any other unmeasured treatment in the text
We can try and control for potentially confounding treatments, but we have to be able to work out what they are and how to measure them!
Text is often used as a basis of treatments in social science experiments, though typically it is not treated as “data” in any systematic way
A very common form of experiment is a survey experiment in which some respondents are exposed to a treatment text while others are exposed to a control text
Texts are thought to differ in terms of some underlying concept of theoretical interest
This is really just a human-constructed mapping function! \(g(W_i)\) is constructed by the researcher before the experiment such that
\(D_i = g(W_i) = 1\) if the researcher deems text \(i\) to be representative of the concept of interest
\(D_i = g(W_i) = 0\) if the researcher deems text \(i\) to be unrepresentative of the concept of interest
It remains possible that the treatment texts written by a researcher differ in multiple (unintended) ways
Let’s imagine that we are interested in assessing the effectiveness of different forms of political rhetoric.
Key issue: the treatment we wish to test is latent, and we cannot directly manipulate latent properties of the text.
How might we estimate the effect of the latent concept of interest?
Write texts that differ only in terms of the latent concept
Write multiple texts for each latent concept and marginalise over confounding features
In this paper, we use option two (see the sketch below):
14 different rhetorical strategies
12 different policy issues
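A sketch of how option two might be analysed, using a hypothetical data frame dat with one row per respondent-text pair (y = persuasiveness outcome, style = one of the 14 rhetorical strategies, issue = one of the 12 policy issues). With many texts per style, crossing styles with issues lets the style estimates average over text-specific idiosyncrasies.

```r
# Marginalising over multiple texts per rhetorical style (hypothetical data)
fit <- lm(y ~ factor(style) + factor(issue), data = dat)
summary(fit)   # style effects averaged across the policy issues
```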
The estimate for any individual text might be confounded by any unmeasured latent treatment
The average effect for a given rhetorical style could be confounded if an unmeasured latent treatment is correlated with the style across texts
There is little evidence that text-based confounding is a problem in this application.
Definition: Independence assumption, text as confounder:
\[ Y_{1i},Y_{0i} \perp\!\!\!\perp D_i|g(W_i) \]
Intuition: When used as a control, we are assuming that once we condition on the text, \(W_i\), via some low-dimensional summary (\(g()\)), the potential outcomes are independent of our treatment.
Question: How do we “control” for the “content” of a text?
The “content” of a text doesn’t have a well-defined operationalization
The difficulty is that we do not know a priori which aspects of the text are related to both treatment and outcome
The choice of \(g()\) will likely be important and lead to different substantive answers
There is no statistical answer to this question! We need to think hard about the context and select the representation that we believe captures the relevant confounding concept
Research question: Does female authorship reduce citations?
In political science, articles written by women receive fewer citations on average than articles written by men
Even when controlling for tenure, rank, university, and publication venue, the gender-citation gap persists
However, author gender is not randomly assigned \(\rightarrow\) we cannot necessarily interpret this difference as causal
Why?
Women may write about different topics than men, and the textual content of the article might determine citations
\(\rightarrow\) text might be a confounder of the relationship between gender and citations
The most common approach for adjusting potential confounders in the social sciences is to include measures for those confounding factors in a regression
Given that for each document, our dfm records the number of times each word occurs within that document, can we just use our dfm for \(X\)?
The difficulty with this approach is that the dfm is very high-dimensional, normally with many more variables (\(P\)) than observations (\(N\))
Standard regression techniques break down when \(N<P\), meaning that we cannot simply include everything in an OLS model
Two Strategies:
Control for words in a penalized regression (e.g. Lasso regression; ridge regression)
Control for low-dimensional summary of texts, rather than for words directly (e.g. topic model)
OLS:
\[\arg \min_\beta \sum_{i=1}^N \left( y_i - \alpha - \sum_{j=1}^J\beta_jx_{i,j}\right)^2\]
Lasso:
\[\arg \min_\beta \sum_{i=1}^N \left( y_i - \alpha - \sum_{j=1}^J\beta_jx_{i,j}\right)^2 + \color{red}{\lambda\sum_{j=1}^J|\beta_j|}\]
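A sketch of how the penalised objective above can be implemented with glmnet. All objects here are hypothetical: dfmat (a dfm of article texts), female (the author-gender treatment indicator), and cites (the citation outcome). Leaving the treatment variable unpenalised, so the lasso can never drop it, is a choice of this sketch rather than a detail from the paper.

```r
# Strategy 1: control for words via lasso regression
library(glmnet)
library(quanteda)

W <- convert(dfmat, to = "matrix")   # word counts as a numeric matrix
X <- cbind(female = female, W)       # treatment column + word controls

fit <- cv.glmnet(x = X, y = cites,
                 alpha = 1,                               # alpha = 1 -> lasso penalty
                 penalty.factor = c(0, rep(1, ncol(W))))  # never penalise `female`
coef(fit, s = "lambda.min")["female", ]   # treatment coefficient after word selection
```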
In this example, the authors regress the citations on…
The Lasso model selects the words and covariates that are highly predictive of the outcome
Highly predictive words include: cause, effect, z-score; democ, anticolon, surgenc; weingast, shepsl
Even controlling for all these words, the coefficient on the treatment variable is still negative
The Lasso model is excellent at selecting predictors of citations, but has some limitations: individual words are hard to interpret as substantive confounding concepts, and they may be noisy proxies for the topical content that actually drives citations
An alternative is to first generate a low-dimensional summary (\(g()\)) of the texts, and then include the summary in a regression as a control
In this case, it makes sense to use a topic model: topical differences between articles are a plausible source of confounding in the relationship between gender and citations
Approach:
Estimate: \(Y_i = \alpha + \beta_0 Female_i + \sum_{k=1}^K\beta_k\theta_{k,i} + \epsilon_i\)
where \(\theta_{k,i}\) is the proportion of document \(i\) devoted to topic \(k\) from a structural topic model.
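A sketch of this approach with the stm package. The objects here are hypothetical: out (documents and vocabulary from stm::prepDocuments), meta (metadata containing female and cites), and K = 60 is an arbitrary choice for illustration, not the value used in the paper.

```r
# Strategy 2: control for topic proportions from a structural topic model
library(stm)

stm_fit <- stm(documents = out$documents, vocab = out$vocab,
               K = 60, prevalence = ~ female, data = meta,
               init.type = "Spectral")

theta <- stm_fit$theta   # theta[i, k]: proportion of document i on topic k, i.e. g(W_i)

# Drop one topic column: proportions sum to 1, so all K would be collinear
summary(lm(cites ~ female + theta[, -1], data = meta))
```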
Findings:
Papers about “war;conflict;…”, “trade;intern;…”, “intern;polit;…” and “variabl;model;…” receive more citations, on average
Even controlling for all these topics, the coefficient on the treatment variable is still negative
The intersection of causal inference and quantitative text analysis is a new frontier in quantitative social science research
Text can enter a causal analysis as the outcome, the treatment, or a control; in each case, this requires mapping the high-dimensional texts to a lower-dimensional summary to be included in the analyses
The use of a mapping function raises a set of issues that we must consider carefully in order to make valid inferences