8  Causal Inference with Text

8.1 Text as Treatment; Text as Outcome

In today’s seminar we will use two datasets to explore text as an outcome and text as a treatment in causal inference settings. In contrast to previous weeks, I have not provided any code to get you started on these questions. Instead, you are expected to draw on the code from previous weeks to make sensible analysis decisions that allow you to measure some interesting quantities and use them in downstream analyses.

The solutions I provide at the end of the week will give one approach to answering the questions below, but there is no single correct “solution” here. Indeed, one of the broader learning goals is to emphasise that the results of any causal analysis using text-based measures are inherently dependent on the analysis decisions you make. Accordingly, if your results differ substantially from the posted solutions, that simply illustrates the challenges of combining text-as-data methods with causal inference strategies.

8.2 Text as Outcome

What is the effect of anxiety on attitudes towards immigrants? Gadarian and Albertson (2013) examine how negative emotions can influence political behavior and attitudes. They conduct an experiment in which they induce anxiety about immigration for a set of survey respondents.[1] In order to evoke anxiety about immigration, respondents in the treatment condition read the following prompt:

    Now, we’d like you to take a moment to think about the debate over immigration in the United States. When you think about immigration what makes you worried? Please list everything that comes to mind.

By contrast, respondents in the control group were asked to list everything that came to mind when they thought about immigration with the prompt:

    First, we’d like you to take a moment to think about the debate over immigration in the United States. When you think about immigration, what do you think of?

Both sets of respondents were provided with text boxes in which they could list their thoughts. Our goal will be to analyse these texts, asking whether the anxiety-inducing prompt caused respondents to give systematically different sets of responses than the generic thought-listing prompt.

[1] Whether this type of experimental prompt is effective at stimulating the desired emotional state among survey respondents is the subject of ongoing debate. We will focus on the effects of this treatment on the open-text outcome, side-stepping the issue of whether the treatment is actually encouraging anxiety or something else.

    8.2.1 Data

    The immig_thoughts.csv data contains two variables:

    Variable    Description
    treat       Variable indicating whether the respondent was in the treatment group ("worried") or the control group ("think").
    response    The text produced by the respondent in response to the prompt.

    8.2.2 Tasks

    1. Why does the random assignment of the treatment allow us to make causal inferences in the context of this example? Why do Gadarian and Albertson not simply compare people who have higher levels of concern about immigration to those with lower levels of concern?

    2. Choose a measurement strategy that we have covered at some point on this course to represent the texts in the response variable. Think about the types of information that might be present in the texts and which might be linked to the treatment of interest. You can choose any method we have studied – dictionaries, topic models, supervised learning, etc – but your choice should be informed by the substantive case we are working with.

    3. “Validate” the output of your measurement approach (i.e. look at the high and low scoring documents, examine any estimated parameters of interest, etc.).
    4. Estimate the effect of the treatment on the outcome variable or variables that you created in answer to the question above. Does the treatment have an effect? Is it significantly different from zero?
    5. In order to estimate a treatment effect, you had to make a series of decisions that might have some influence on your estimates. Think about these now: which decisions did you make, and did you have some principled reason for making them? Try replicating your analysis but making different decisions (e.g. change the value of \(K\) in a topic model; use different feature selection decisions; pick different training documents, etc.). What are the consequences of these changes for your final estimates? What does this tell you about the challenges of using quantitative text analysis methods for making causal inferences?
    6. Create a plot which illustrates one of the treatment effects that you have estimated. Upload the plot to this Moodle page.
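
As one possible starting point for tasks 2, 4 and 5, the sketch below scores each response with a small dictionary and estimates the treatment effect as a difference in means. It is only an illustration under stated assumptions: the six rows of data are invented stand-ins for immig_thoughts.csv, the two term lists (`threat_terms`, `narrow_terms`) are hypothetical rather than validated lexicons, and the helper names `dictionary_share()` and `estimate_effect()` are made up for this example. It uses base R only.

```r
# Invented toy rows standing in for immig_thoughts.csv (not real responses).
immig <- data.frame(
  treat = c("worried", "worried", "worried", "think", "think", "think"),
  response = c(
    "I worry about crime and losing jobs",
    "Afraid of terrorism and illegal crossings",
    "Too many people, jobs are at risk",
    "Families looking for a better life",
    "The debate in Congress over reform",
    "People crossing the border for jobs"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical dictionaries (assumptions, not validated lexicons).
threat_terms <- c("worry", "afraid", "crime", "terrorism", "illegal", "risk")
narrow_terms <- c("crime", "terrorism")

# Share of tokens in each text that appear in the dictionary.
dictionary_share <- function(text, terms) {
  cleaned <- tolower(gsub("[^[:alpha:] ]", " ", text))
  tokens <- strsplit(cleaned, " +")
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]
    if (length(tok) == 0) return(NA_real_)
    mean(tok %in% terms)
  }, numeric(1))
}

# Binary treatment indicator: 1 for the "worried" prompt, 0 for "think".
immig$treated <- as.integer(immig$treat == "worried")

# Difference in mean dictionary share between treatment and control,
# estimated by regressing the score on the treatment indicator.
estimate_effect <- function(terms) {
  immig$score <- dictionary_share(immig$response, terms)
  fit <- lm(score ~ treated, data = immig)
  c(estimate = unname(coef(fit)["treated"]),
    p_value = summary(fit)$coefficients["treated", "Pr(>|t|)"])
}

# Same data, two measurement decisions: compare the resulting estimates.
effects <- rbind(
  broad = estimate_effect(threat_terms),
  narrow = estimate_effect(narrow_terms)
)
print(effects)
```

Because the treatment was randomly assigned, the coefficient on `treated` is simply the difference in mean scores between the two groups. Comparing the `broad` and `narrow` rows shows how the estimate moves when only the dictionary changes, which is the kind of sensitivity task 5 asks you to explore with your own measure.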

    8.3 Text as Treatment

    What are the features of module descriptions that make modules more popular with students? Is it when they use exciting, dynamic language? Is it when they are especially readable? Is it when they suggest the course is easy? The UCL module catalogue, which we have explored in previous weeks, includes information on the number of students who enrolled on each module in previous years. In this part of the seminar, you will construct a representation of the texts of the module descriptions and use that to predict student enrollments.

    8.3.1 Data

    The module_catalogue.Rdata file, which can be downloaded from the top of the page, contains the following variables:

    library(dplyr)  # provides glimpse()
    load("module_catalogue.Rdata")
    glimpse(modules)
    Rows: 6,252
    Columns: 11
    $ teaching_department <chr> "Greek and Latin", "Greek and Latin", "Bartlett Sc…
    $ level               <chr> "Level 5", "Level 4", "Level 7", "Level 5", "Level…
    $ credit_value        <chr> "15", "15", "15", "30", "15", "15", "15", "15", "6…
    $ subject             <chr> "Ancient Greek|Ancient Languages and Cultures|Clas…
    $ keywords            <chr> "ANCIENT GREEK|LANGUAGE", "ANCIENT GREEK|LANGUAGE"…
    $ description         <chr> "Teaching Delivery: This module is taught in 20 bi…
    $ title               <chr> "Advanced Greek A (GREK0009)", "Greek for Beginner…
    $ module_description  <chr> "Teaching Delivery: This module is taught in 20 bi…
    $ n_students          <chr> "2", "15", "22", "151", "117", "68", "0", "23", "3…
    $ module_leader       <chr> "Dr Fiachra Mac Gorain", "Dr Elena Cagnoli Fieccon…
    $ teaching_term       <chr> "Other", "1", "1", "2", "1", "2", "2", "1", "Other…
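
The output above shows every column stored as character (`<chr>`), including n_students and credit_value, so some type conversion is probably needed before modelling. The sketch below shows one way to do this on a toy data frame that mimics three of the columns; the object name `modules_toy` and its values are invented for illustration, and the same steps would need checking against the real `modules` object.

```r
# Toy rows mimicking the column types shown above (values for illustration only).
modules_toy <- data.frame(
  level = c("Level 5", "Level 4", "Level 7"),
  credit_value = c("15", "15", "30"),
  n_students = c("2", "15", "151"),
  stringsAsFactors = FALSE
)

# Convert counts and credits from character to numeric.
# as.numeric() returns NA (with a warning) for entries that are not numbers.
modules_toy$n_students <- as.numeric(modules_toy$n_students)
modules_toy$credit_value <- as.numeric(modules_toy$credit_value)

# Treat level as a categorical variable so it can be used as a control.
modules_toy$level <- factor(modules_toy$level)

str(modules_toy)

# Count entries that could not be converted.
sum(is.na(modules_toy$n_students))
```

Checking the number of `NA` values after conversion is a cheap way to spot entries that were not clean numbers in the original data.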

    8.3.2 Tasks

    1. Choose a representation for the module descriptions that you think is likely to be predictive of student numbers on a module. As with the earlier task, this could come from any measurement strategy we have covered on the course. Implement your measurement strategy and use some face validity checks to convince yourself that you are capturing the concept that you intended to capture.
    2. Estimate the effect of your chosen representation on the number of students enrolled on a module. Is the effect significantly different from zero? Discuss whether this is likely to represent a causal effect.
    3. If you think it is required on the basis of your answer to question 2, re-estimate the effect of your chosen representation but this time controlling for some other variables. Does this change the effect of your concept of interest?
    4. Create at least one plot or table which illustrates the results of your analysis. Upload it alongside a description of what you have done, and the interpretation of your result, to this Moodle page.
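
The tasks above could be approached in many ways. The sketch below shows one deliberately crude version: it uses mean word length as a rough readability proxy, regresses logged enrolment on it, and then adds module level as a control. Everything here is an assumption made for illustration: the eight rows in `mods` are invented, mean word length is not a validated readability measure, and `mean_word_length()` is a hypothetical helper. It uses base R only.

```r
# Invented toy rows with the same column names as the module catalogue.
mods <- data.frame(
  module_description = c(
    "An exciting introduction to data science with hands-on projects.",
    "This module examines epistemological foundations of jurisprudence.",
    "A dynamic, practical course on building web applications.",
    "Advanced seminar on morphosyntactic typology and diachronic variation.",
    "Explore exciting debates in global politics.",
    "Rigorous treatment of measure-theoretic probability.",
    "A practical and engaging tour of modern economics.",
    "Historiographical approaches to late antique epigraphy."
  ),
  level = c("Level 4", "Level 7", "Level 5", "Level 7",
            "Level 4", "Level 7", "Level 5", "Level 7"),
  n_students = c("151", "12", "117", "8", "68", "23", "90", "2"),
  stringsAsFactors = FALSE
)

# Enrolment is stored as character in the catalogue, so convert it first.
mods$n_students <- as.numeric(mods$n_students)

# Crude readability proxy: average number of letters per word.
mean_word_length <- function(text) {
  words <- regmatches(text, gregexpr("[[:alpha:]]+", text))
  vapply(words, function(w) {
    if (length(w) == 0) return(NA_real_)
    mean(nchar(w))
  }, numeric(1))
}

mods$word_len <- mean_word_length(mods$module_description)

# Face validity check: which descriptions score lowest and highest?
mods[order(mods$word_len), c("word_len", "module_description")]

# Unadjusted association between the text measure and logged enrolment.
fit_raw <- lm(log1p(n_students) ~ word_len, data = mods)

# Same model, controlling for module level.
fit_adj <- lm(log1p(n_students) ~ word_len + level, data = mods)

# Compare the coefficient on the text measure across the two models.
rbind(
  unadjusted = coef(fit_raw)["word_len"],
  adjusted = coef(fit_adj)["word_len"]
)
```

Unlike the immigration experiment, the text here was not randomly assigned, so a change in the `word_len` coefficient once level is added would suggest that part of the unadjusted association reflects differences between types of module rather than the language itself.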