6.1 Lecture slides
6.2 Text as Treatment; Text as Outcome
In today’s seminar we will use two datasets to explore the use of text as an outcome and text as a treatment in causal inference settings. In contrast to previous weeks, I have not provided any code to get you going on these questions. Instead, you are expected to go through the code from previous weeks to make some sensible analysis decisions that will allow you to measure some interesting quantities and use these in downstream analyses.
The solutions I provide at the end of the week will give one approach to answering the questions below, but there is no single correct “solution” here. Indeed, one of the broader learning goals here is to emphasise that the results from any given causal analysis using text-based measures are inherently dependent on the analysis decisions you make. Accordingly, if you get very different results from the solutions posted then that just illustrates the challenges associated with combining text-as-data methods and causal inference strategies.
6.3 Text as Outcome
What is the effect of anxiety on attitudes towards immigrants? Gadarian and Albertson (2013) examine how negative emotions can influence political behavior and attitudes. They conduct an experiment in which they induce anxiety about immigration for a set of survey respondents. (Whether this type of experimental prompt is effective at stimulating the desired emotional state among survey respondents is the subject of ongoing debate. We will focus on the effects of this treatment on the open-text outcome, side-stepping the issue of whether the treatment is actually encouraging anxiety or something else.) In order to evoke anxiety about immigration, respondents in the treatment condition read the following prompt:
Now, we’d like you to take a moment to think about the debate over immigration in the United States. When you think about immigration what makes you worried? Please list everything that comes to mind.
By contrast, respondents in the control group were asked to list everything that came to mind when they thought about immigration with the prompt:
First, we’d like you to take a moment to think about the debate over immigration in the United States. When you think about immigration, what do you think of?
Both sets of respondents were provided with text boxes in which they could list their thoughts. Our goal will be to analyse these texts, asking whether the anxiety-inducing prompt caused respondents to give systematically different sets of responses than the generic thought-listing prompt.
6.3.1 Data
The immig_thoughts.csv data contains two variables:
Variable | Description |
---|---|
treat | Variable indicating whether the respondent was in the treatment group ("worried") or the control group ("think"). |
response | The text produced by the respondent in response to the prompt. |
6.3.2 Tasks
- Why does the random assignment of the treatment allow us to make causal inferences in the context of this example? Why do Gadarian and Albertson not simply compare people who have higher levels of concern about immigration to those with lower levels of concern?
The random assignment of individuals to each condition means that, in expectation, the two groups of respondents will be identical to each other in terms of observed and unobserved covariates. That is, selection bias is not generally a concern when the treatment is randomly assigned.
By contrast, were we to simply compare the respondents with higher and lower levels of concern about immigration, we would be concerned that any differences in the outcome between those two groups might reflect some source of confounding. For instance, those who are more concerned about immigration might have different ideologies than those who are less concerned, or might come from different parts of the country, or might have different education levels, and so on. Therefore in order to identify the causal effect of immigration anxiety, we need a design that avoids sources of confounding like these – and that is what the randomized treatment assignment provides.
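The contrast between the two designs can be illustrated with a small simulation. This is only a sketch on made-up data, not the Gadarian and Albertson data: the variable names (`ideology`, `anxious_obs`, `anxious_rand`) and the true effect of 1 are assumptions chosen for illustration.

```r
# Simulated illustration: randomised vs. self-selected "treatment".
# All variables and effect sizes here are hypothetical.
set.seed(123)
n <- 10000

# An unobserved trait that also affects the outcome
ideology <- rnorm(n)

# (a) Observational world: anxiety is more likely for some ideologies
anxious_obs <- rbinom(n, 1, plogis(1.5 * ideology))
# (b) Experimental world: the anxiety prompt is assigned by coin flip
anxious_rand <- rbinom(n, 1, 0.5)

true_effect <- 1
noise <- rnorm(n)
outcome_obs  <- true_effect * anxious_obs  + 2 * ideology + noise
outcome_rand <- true_effect * anxious_rand + 2 * ideology + noise

# Simple difference in means under each design
diff_obs  <- mean(outcome_obs[anxious_obs == 1]) -
  mean(outcome_obs[anxious_obs == 0])
diff_rand <- mean(outcome_rand[anxious_rand == 1]) -
  mean(outcome_rand[anxious_rand == 0])

round(c(observational = diff_obs, randomised = diff_rand), 2)
```

Under random assignment the difference in means should land close to the true effect, while the observational comparison also absorbs the effect of the unobserved trait.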
- Choose a measurement strategy that we have covered at some point on this course to represent the texts in the response variable. Think about the types of information that might be present in the texts and which might be linked to the treatment of interest. You can choose any method we have studied – dictionaries, topic models, supervised learning, etc – but your choice should be informed by the substantive case we are working with.
You could have selected any number of potential measurement strategies here. I have opted to fit a structural topic model with 10 topics and with the treat variable as a prevalence covariate.
library(tidyverse)
library(quanteda)
library(stm)
# Load data (assumes immig_thoughts.csv is in the working directory)
immig <- read_csv("immig_thoughts.csv")
# Create corpus
immig_corpus <- immig %>%
  corpus(text_field = "response")
# Create DFM
immig_dfm <- immig_corpus %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  dfm_remove(stopwords("en"))
# Estimate STM
stm_out <- stm(immig_dfm,
               K = 10,
               prevalence = ~treat,
               verbose = FALSE) # verbose = FALSE prevents stm from printing model estimation updates
- “Validate” the output of your measurement approach (i.e. look at the high and low scoring documents, examine any estimated parameters of interest, etc).
Topic 1 Top Words:
Highest Prob: think, country, coming, legally, need, immigration, people
FREX: coming, think, legally, country, everyone, becoming, entering
Lift: accidents, actively, adapting, alien, allegiance, american's, americas
Score: think, coming, history, citizen, entering, receive, need
Topic 2 Top Words:
Highest Prob: jobs, americans, security, crime, social, illegals, take
FREX: crime, social, jobs, security, americans, healthcare, cost
Lift: crime, accountability, activity, altime, amercia, amounts, anxious
Score: jobs, cost, social, security, crime, lost, healthcare
Topic 3 Top Words:
Highest Prob: border, people, mexicans, citizens, mexico, immigrants, country
FREX: mexicans, wall, fences, border, political, rate, another
Lift: 18, 95, abroad, access, across, although, arizona
Score: mexicans, fences, another, political, consider, future, mostly
Topic 4 Top Words:
Highest Prob: immigrants, illegal, services, laws, paid, aliens, paying
FREX: immigrants, paid, services, losing, aliens, laws, illegal
Lift: abolition, accomplish, activities, allour, allowing, ancestors, assumption
Score: paid, laws, congress, failure, kicked, lobbyists, seem
Topic 5 Top Words:
Highest Prob: people, work, legal, think, us, immigrants, come
FREX: work, difficult, process, like, legal, $, freedoms
Lift: =, $, 125, 600, adjust, afford, agree
Score: process, $, vs, even, think, freedoms, difficult
Topic 6 Top Words:
Highest Prob: taxes, english, pay, people, better, jobs, life
FREX: taxes, nothing, better, pay, one, worried, make
Lift: agreed, alcohol, ancious, attention, blood, breed, child
Score: taxes, nothing, pay, worried, make, one, families
Topic 7 Top Words:
Highest Prob: poor, immigrants, system, care, health, legal, english
FREX: poor, system, terrorists, criminals, build, communities, contribute
Lift: #1, accross, americanized, annoying, anymore, asians, barriers
Score: poor, wanting, causing, dobbs, last, lou, grandparents
Topic 8 Top Words:
Highest Prob: people, immigrants, many, language, states, worry, come
FREX: states, united, things, worry, able, language, many
Lift: 11, 12, 9, abused, accomodated, accomodating, actually
Score: assimilate, things, pay, states, united, resources, language
Topic 9 Top Words:
Highest Prob: immigration, workers, economy, usa, worry, nation, concerns
FREX: concerns, state, workers, economy, bad, potential, usa
Lift: affects, afraid, along, alter, alternative, among, bad
Score: state, concerns, nation, workers, economy, capitalist, painting
Topic 10 Top Words:
Highest Prob: people, illegal, immigration, us, taking, schools, dont
FREX: dont, hospitals, strain, schools, taking, us, problems
Lift: aboration, abuses, allowed, anymore.and, arent, assilum, assistance
Score: us, im, taking, dont, stop, hospitals, problems
Topic 1:
when i think of immigration i think of people who enter this country legally, who go through the proper immigration process, no matter how long it takes. i think of people who are willing to learn the english language, make an honest living, honor our country and pledge allegiance to our flag. those who come to america by any other means, who sneak in here and file false paperwork, who think they have the right to drive and have a license, who manage to obtain false ssn's don't deserve to be here, and our borders need to be much, much more secure.
Topic 2:
legally entering the usa meeting the requirements is the law. entering the usa improperly is a crime. it is unfortunate that the american bar association has fought treating entering the usa as a crime. and our politicians from both parties have been so anxious to get the "vote" that they refuse to inforce the law. meanwhile, terorists can walk right in without any problem. no accountability on the part of politicians supported by the news media will bring our nation down eventually. and the normal person will wonder how it happened.
Topic 3:
as an arizona resident who lives 18 miles from the mexican-us border, and who has also spoken to some of these illegals while hiking in the huachuca mtns., i know these people, mostly, come here out of sheer desperation. sure, some are the same lazy, fat, undereducated jerks that lurk around our own mid-level businesses. but most simply are people who want what we all do: a comfortable life with as little thinking and suffering as possible, while reproducing at will. they have told me, babies in arms,that if they remain at home, they have no future but an early death. that they, maybe, should reduce their birth rate and/or not have children at all, if they cannot support them, simply will never occur to citizens of a catholic country, living a day's walk from a rich country that can be easily milked for what they consider a fortune in life support. there is no answer to this, so long as 95% of mexico's wealth is controlled by 5% of its people, and the only riches the others have lie in their children.
Topic 4:
i firmly believe that the u.s. is a melting pot of nationalities and races -- people of different ethnic backgrounds blending into one rich culture - without losing the distinctiveness of their own culture. in order to accomplish this safely and effectively: 1)immigrants must be in our country legally 2)immigrants need to learn our language 3) immigrants need to obey our laws. fears about illegal immigrants: terrorism, criminal activities, heavy drain on social services & medical services, weighing down the public school systems esl & other special needs.
Topic 5:
i think of the american born people, and how we've sacrificed to give them their freedom. i think a lot of people have becom frustrated with it all. when white people apply for a job (minimum wage) and they are told their salary is not negotiable, and yet the minority is able to negotiate a higher wage for the same job. it is discrimination against our own. i think of what andy rooney said a while ago about this very subject and agree completely. if you come to our country - speak english. respect our laws. our country doesn't owe you anything, work for it like the rest of us! simply, i am against illegal immigrants, and legal immigrants should adjust to our ways, not us - to theirs.
Topic 6:
what really makes me worried is that we are doing nothing to fix the system i agreed that we need to pay attention to our borders but at the same time there is people hard working people that are here illegaly and they are ancious to obtain some king of work permit so they can work legally, they did cross the border ok make them pay a fine of course criminals they must be deported
Topic 7:
poor people wanting a better life because their own country is so full of
corruption. they have found it too easy to slip accross the border and our government must have some reasons for wanting them here to keep our wages lower. it has kept pur young from summer jobs. they are a major drain on
our health care system as well as all the welfare that many get. i don't what the answer is to fixing the problem with the ones already here but amnesty as it was done the last time is not the answer either. i personally helped 3 women get their papers the last time and as far as i know only 1 became a citizen. i beleive they should all speak english and we shouldn't have to pay extra so they can learn. it should not be our job t learn spanish.
Topic 8:
i am enthusiastic about legal immigrants willing to assimilate and be productive members of american society.
i worry that illegal immigrants have no incentive, and often no desire to assimilate.
i worry that illegal immigrants are disproportionally involved in violent crimes, as well as drug and property crimes.
i worry that our culture and language may be diminished by those not willing to assimilate.
i worry that that our our society devotes more and more of it's resources providing health care, education and other services to those not willing to assimilate.
i worry that our careless attitude about enforcing our border and immigration policy will lead to another 9/11 style terrorist attack.
Topic 9:
i am most worried about the conception that forms in the relation of the modern nation state to that of the foreigner. a relation of same and other is established that marginalizes the other in such a way as to turn hospitality into slavery. the largest worries of immigration manifest themselves as concerns over economic effects primarily because the nation state has communally devolved into a neoliberal capitalist organization. it is the corporisation of the state that governs the question of immigration, the question that is then framed in terms of resources and production. my worry is that capitalist democracy will continue to perpetuate itself and replace community with individualism, consumerism, and the great american freedom-the freedom to buy
Topic 10:
close borders.
fine employers who employ illegal immigrats (im).
remove children of im from public schools.
no ssa or welfare for ims.
when picked up by police or any other government institution, they should be taken into custody and deported.
if an im is deported and returns to the us, they should be jailed and the family or mexican government made to paid for the cost of upkeep.
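Output like that shown above (top words per topic, followed by the most representative response for each topic) is what stm's `labelTopics()` and `findThoughts()` functions produce. The exact calls used here are not shown, so the following is only a sketch. To keep it self-contained, it uses the small Gadarian and Albertson example objects that I am assuming ship with your installed version of stm (`gadarian` and the pre-fitted three-topic model `gadarianFit`), rather than the `stm_out` model estimated above.

```r
library(stm)

# Assumption: the stm package bundles `gadarian` (data) and `gadarianFit`
# (a pre-fitted 3-topic model) as example objects.

# Top words for each topic under four weighting schemes
# (highest probability, FREX, lift, score)
topic_labels <- labelTopics(gadarianFit, n = 7)
print(topic_labels)

# The single most representative response for each topic
top_docs <- findThoughts(gadarianFit,
                         texts = gadarian$open.ended.response,
                         topics = 1:3,
                         n = 1)
print(top_docs)
```

For the model in this section, the analogous calls would take `stm_out` as the first argument, with `texts` supplied as the responses in the same order as the documents used to fit the model.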
There is some evidence here that we are capturing meaningful topics relating to different political attitudes about immigration. In the output above, there appear to be topics about jobs, crime and social security (topic 2); the border with Mexico (topic 3); taxes (topic 6); the strain on schools and hospitals (topic 10); and so on.
In general, the point here is that this is one possible representation of these texts and it is a matter of judgement as to whether it is a reasonable representation. There is some evidence here of repetition across topics – that is, though they are mostly coherent, the topics may be insufficiently exclusive – and so it would perhaps be better to estimate a smaller number of topics. If we were to do so, we would of course estimate a different set of treatment effects in the next step! It is for this reason that Egami et al. (2022) advocate for an approach in which we split the discovery and estimation steps in any causal inference process that uses text as either a treatment or an outcome.
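One simple way to separate discovery from estimation is a random sample split: measurement decisions (the number of topics, preprocessing, and so on) are made on one half of the documents, and the treatment effect is then estimated only on the held-out half. The sketch below shows just the splitting step in base R. The data frame `responses`, its size, and the 50/50 split are assumptions for illustration, not part of the original analysis.

```r
# Sketch of a discovery/estimation split (hypothetical data frame `responses`)
set.seed(2022)

responses <- data.frame(
  id    = 1:200,
  treat = sample(c("worried", "think"), 200, replace = TRUE)
)

# Randomly assign half of the documents to the discovery set
discovery_ids <- sample(responses$id, size = nrow(responses) / 2)

discovery_set  <- responses[responses$id %in% discovery_ids, ]
estimation_set <- responses[!responses$id %in% discovery_ids, ]

# The two sets are disjoint and together cover all documents
nrow(discovery_set)
nrow(estimation_set)
length(intersect(discovery_set$id, estimation_set$id))
```

The cost of this design is that only half of the data is used for the final estimate, in exchange for effects that do not depend on choices made after seeing the estimation data.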
- Estimate the effect of the treatment on the outcome variable or variables that you created in answer to the question above. Does the treatment have an effect? Is it significantly different from zero?
# Estimate treatment effects
stm_effects <- estimateEffect(~treat,
stm_out,
metadata = docvars(immig_dfm))
plot.estimateEffect(stm_effects,
model = stm_out,
covariate = "treat", # Covariate for which we would like to plot effects
topics = 1:10, # Topics to plot
method = "difference", # Plot difference between treatment and control
cov.value1 = "worried", # Treatment group
cov.value2 = "think", # Control group
labeltype = "frex", # Label type (frex labels)
n = 3, # Number of words to use in label
xlim = c(-.3,.2), # Limits of the x-axis
verbose.labels = FALSE) # Remove unnecessary information from labels
There is evidence of significant treatment effects for a number of these topics! Treatment group respondents speak more about violence; the threat to American jobs; and crime and hospitals. Control group respondents speak more about Mexican workers and a topic relating to ‘think, process, everyone’.
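A plot shows significance through whether each interval crosses zero; the same information can be read from the regression tables that `summary()` returns for an `estimateEffect` object. The sketch below is self-contained and therefore uses the example objects that I am assuming are bundled with stm (`gadarian`, `gadarianFit`), where the treatment indicator is called `treatment` rather than `treat`.

```r
library(stm)

# Assumption: `gadarian` and `gadarianFit` are example objects bundled with stm
set.seed(1)  # estimateEffect() uses simulation draws, so fix the seed

effects <- estimateEffect(1:3 ~ treatment, gadarianFit, metadata = gadarian)

# One coefficient table per topic: the `treatment` row gives the estimated
# difference in topic proportion between treatment and control
effect_tables <- summary(effects)$tables
effect_tables[[1]]
```

Each table reports an estimate, standard error, t value and p-value, so the question "is it significantly different from zero?" can be answered topic by topic.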
- In order to estimate a treatment effect, you had to make a series of decisions that might have some influence on your estimates. Think about these now: which decisions did you make, and did you have some principled reason for making them? Try replicating your analysis but making different decisions (e.g. change the value of \(K\) in a topic model; use different feature selection decisions; pick different training documents, etc). What are the consequences of these changes for your final estimates? What does this tell you about the challenges of using quantitative text analysis methods for making causal inferences?
Even conditional on selecting the STM, there are other choices I could have made – I might have made different feature selection decisions, selected a different value for K, and so on.
Let’s try changing the number of topics and see what happens to the estimated treatment effects.
library(stm)
# Estimate STM
stm_out_new <- stm(immig_dfm,
                   K = 15,
                   prevalence = ~treat,
                   verbose = FALSE) # verbose = FALSE prevents stm from printing model estimation updates
# Estimate treatment effects
stm_effects_new <- estimateEffect(~treat,
                                  stm_out_new,
                                  metadata = docvars(immig_dfm))
plot.estimateEffect(stm_effects_new, # Effects from the 15-topic model
                    model = stm_out_new,
                    covariate = "treat",
                    topics = 1:15, # Plot all 15 topics
                    method = "difference",
                    cov.value1 = "worried",
                    cov.value2 = "think",
                    labeltype = "frex",
                    n = 3,
                    xlim = c(-.3,.2),
                    verbose.labels = FALSE)
The results are similar but there are some tangible changes. In particular, we now have something that looks more specifically like a ‘language’ topic, which is used more by the treatment group and a ‘schools’ topic which is used more by the control group.
Again, the key point here is that the estimated treatment effects depend on the representation of the texts that we choose. We should be aware of this issue whenever using a text-based outcome in a causal inference analysis!
- Create a plot which illustrates one of the treatment effects that you have estimated.
Upload your answers, code and results on this Moodle page.
6.4 Homework: Text as Treatment
What are the features of module descriptions that make modules more popular with students? Is it when they use exciting, dynamic language? Is it when they are especially readable? Is it when they suggest the course is easy? The UCL module catalogue, which we have explored in previous weeks, includes information on the number of students who enrolled on each module in previous years. In this part of the seminar, you will construct a representation of the texts of the module descriptions and use that to predict student enrollments.
6.4.1 Data
The module_catalogue.Rdata file, which can be downloaded from the top of the page, contains the following variables:
Rows: 6,253
Columns: 12
$ teaching_department <chr> "Greek and Latin", "Greek and Latin", "Bartlett…
$ level <chr> "FHEQ Level 5", "FHEQ Level 4", "FHEQ Level 7",…
$ intended_teaching_term <chr> "Term 1|Term 2", "Term 1", "Term 1", "Term 2", …
$ credit_value <chr> "15", "15", "15", "30", "15", "15", "15", "15",…
$ mode <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
$ subject <chr> "Ancient Greek|Ancient Languages and Cultures|C…
$ keywords <chr> "ANCIENT GREEK|LANGUAGE", "ANCIENT GREEK|LANGUA…
$ description <chr> "Teaching Delivery: This module is taught in 20…
$ title <chr> "Advanced Greek A (GREK0009)", "Greek for Begin…
$ module_description <chr> "Teaching Delivery: This module is taught in 20…
$ n_students <chr> "2", "15", "22", "151", "117", "68", "0", "23",…
$ module_leader <chr> "Dr Fiachra Mac Gorain", "Dr Elena Cagnoli Fiec…
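Note that the listing above shows n_students (and credit_value) stored as character vectors. The regressions later in this section treat n_students as numeric, so it presumably needs converting first. That step is not shown, so the following is a sketch on a toy data frame (`modules_toy` is a hypothetical name, with values copied from the first few rows above).

```r
# Toy stand-in for the modules data (first four values from the listing above)
modules_toy <- data.frame(
  n_students   = c("2", "15", "22", "151"),
  credit_value = c("15", "15", "15", "30"),
  stringsAsFactors = FALSE
)

# Convert character columns to numeric before modelling
modules_toy$n_students   <- as.numeric(modules_toy$n_students)
modules_toy$credit_value <- as.numeric(modules_toy$credit_value)

str(modules_toy)
mean(modules_toy$n_students)
```

If any entries cannot be parsed as numbers, `as.numeric()` returns NA for them with a warning, which is worth checking before fitting a model.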
6.4.2 Tasks
- Choose a representation for the module descriptions that you think is likely to be predictive of student numbers on a module. As with the earlier task, this could come from any measurement strategy we have covered on the course. Implement your measurement strategy and use some face validity checks to convince yourself that you are capturing the concept that you intended to capture.
One option here would be to use a dictionary to measure some aspect of a module and evaluate whether that has an effect on the number of enrolled students. I have done that here, writing a dictionary that is intended to measure whether a module covers introductory material. This is based on the ‘theory’ that introductory modules will be more popular with students than more advanced modules.
We start by defining the dictionary and applying it to a dfm of the module descriptions.
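# Define dictionary
intro_words <- c("introductory", "basic", "first_year", "essential", "foundation", "compulsory", "simple", "required", "necessary", "principles", "requirements", "fundamental", "require", "requires", "example", "requirement")
intro_dictionary <- dictionary(list(intro = intro_words))
# Create corpus
modules_corpus <- modules %>%
  corpus(text_field = "module_description")
Warning: NA is replaced by empty string
# Create DFM (the tokenisation options are an assumption; this step is not shown in the original)
modules_dfm <- modules_corpus %>%
  tokens(remove_punct = TRUE) %>%
  dfm()
# Apply dictionary and assign dictionary scores to modules data
intro_dfm <- modules_dfm %>% dfm_lookup(intro_dictionary)
modules$intro_dictionary <- as.numeric(intro_dfm[,1])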
For validation, we can look at the top scoring texts and we can also see how our dictionary score varies with the ‘level’ of the module. See here for a description of the meaning of these levels.
# A tibble: 19 × 2
level intro_mean
<chr> <dbl>
1 FHEQ Level 4 2.04
2 FHEQ Level 4|FHEQ Level 5 1.5
3 FHEQ Level 4|FHEQ Level 7 1.29
4 FHEQ Level 5 1.24
5 FHEQ Level 5|FHEQ Level 4 2
6 FHEQ Level 5|FHEQ Level 6 1.75
7 FHEQ Level 5|FHEQ Level 7 1.53
8 FHEQ Level 6 0.987
9 FHEQ Level 6|FHEQ Level 4 0
10 FHEQ Level 6|FHEQ Level 5 0.6
11 FHEQ Level 6|FHEQ Level 5|FHEQ Level 7 0
12 FHEQ Level 6|FHEQ Level 7 0.921
13 FHEQ Level 7 1.11
14 FHEQ Level 7|FHEQ Level 4 1.5
15 FHEQ Level 7|FHEQ Level 5 0.407
16 FHEQ Level 7|FHEQ Level 5|FHEQ Level 6 0
17 FHEQ Level 7|FHEQ Level 6 0.95
18 FHEQ Level 7|FHEQ Level 6|FHEQ Level 5 2
19 FHEQ Level 8 0.391
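This is broadly encouraging: among modules assigned to a single level, Level 4 modules have the highest mean score on our ‘introductory’ dictionary (2.04) and Level 8 modules the lowest (0.391). The ordering is not perfectly monotonic, however, as Level 7 modules score slightly higher than Level 6 modules.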
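A table of this kind can be produced with a grouped summary. The sketch below uses dplyr on a toy data frame: the name `toy_modules` and the scores in it are made up for illustration, and only the column names mirror the output above.

```r
library(dplyr)

# Toy data: hypothetical dictionary counts for modules at three levels
toy_modules <- data.frame(
  level            = c("FHEQ Level 4", "FHEQ Level 4", "FHEQ Level 6",
                       "FHEQ Level 6", "FHEQ Level 8"),
  intro_dictionary = c(3, 1, 1, 0, 0)
)

# Mean dictionary score by module level
intro_by_level <- toy_modules %>%
  group_by(level) %>%
  summarise(intro_mean = mean(intro_dictionary, na.rm = TRUE))

intro_by_level
```

Because group means are sensitive to small groups, it can also help to add `n = n()` inside `summarise()` to see how many modules each mean is based on.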
Do the top scoring texts also make sense?
[1] "Overview:\n\nThis module provides an introduction to Mechanical Engineering, covering fundamental concepts of Thermofluids and Applied Mechanics (Statics).\n\nThe Thermofluids part of the module aims to teach fundamentals of thermofluid sciences. Building on the mathematical skills and physics learning from the A-levels and the concurrent first year Mathematics module, the basic concepts of control volume and control mass are introduced – this teaches students how to analyse systems with and without flows. These fundamental features are then used to perform massand energy balance – both in isothermal and non-isothermal systems – with various levels of assumptions. The energy balance is introduced via the first law of Thermodynamics and solution of several analytical problems drawn from practical engineering applications.\n\nThe Statics part will aim to teach the basic analytical methods, that is, the fundamental concepts and techniques of engineering mechanics (Statics). Building on mathematical skills from A-levels Mathematics (including, for some students, Mechanics modules) and the concurrent first year Mathematics module, basic concepts of Statics are introduced, practiced and applied to simple engineering problems. 
Students will obtain modelling knowledge, tools and experience appropriate for a first year engineering module, providing the foundation for higher level modules.\n\nTopics covered:\n\nIntroduction to Thermofluids\n\nIntroduction to thermodynamics and related concepts of fluid mechanics\n\tAnalysing systems and devices Pressure and hydrostatic head\n\tMass balance and energy in isothermal conditions Flow analysis\n\tPrinciple of energy conservation\n\tFirst Law of Thermodynamics with flow Application of First Law\nApplied Mechanics – Statics\n\nForces and moments\n\tRigid body equilibrium\n\tFriction\n\tAnalysis of structures\n\tDistributed forces and centre of gravity\n\tInternal forces and moments in structures\nLearning outcomes:\n\nUpon completion of this module students will be able to:\n\nDemonstrate knowledge and understanding of the essential facts, concepts, theories and principles underlying fundamental thermofluids and statics.\n\tApply basic scientific principles of thermofluid sciences to solve simple engineering problems involving modelling and analysis of basic engineering systems involving: simple flow systems, using appropriate conservation principles, and applying the principles of dimensional analysis and physical similarity to engineering model testing.\n\tApply basic principles of statics and equilibrium to solve problems of structures, bridges and components under simple loading conditions and with/without friction.\n\tUse principles of equilibrium and control volume analysis to understand practical working of an engine and bridge-like structure; analyse experimental results and draw conclusions, given specific guidance to the appropriate background material answer basic questions on the operations of similar engineering systems.\n\tApply a range of techniques to analyse available evidence and solve simple engineering problems pertaining to thermofluid and equilibrium solid mechanics.\n \n"
[2] "Description \n\nThe module will provide an introduction to the basic concepts and principles of remote sensing. It will include 3 components: i) radiometric principles underlying remote sensing: electromagnetic radiation; basic laws of electromagnetic radiation; absorption, reflection and emission; atmospheric effects; radiation interactions with the surface, radiative transfer; ii) assumptions and trade-offs for particular applications: orbital mechanics and choices; spatial, spectral, temporal, angular and radiometric resolution; data pre-processing; scanners; iii) time- resolved remote sensing including: RADAR principles; the RADAR equation; RADAR resolution; phase information and SAR interferometry; LIDAR remote sensing, the LIDAR equation and applications.\n\nThe course aims to:\n\nProvide knowledge and understanding of the fundamental concepts, principles and applications of remote sensing, particularly the electromagnetic spectrum – what it is, how it is measured, and what it tells us;\n\tProvide examples of applications of principles to a variety of topics in remote sensing, particularly related to climate and environment\n\tDevelop a detailed understanding of the fundamental trade-offs in the design and applications of remote sensing tools: spatial, spectral, orbital etc.\n\tIntroduce new technologies, missions and opportunities, including ground-based sensing, lidar at multiple scales, radar, UAVs, new science and commercial missions, open data and the tools that are emerging to exploit these opportunities;\n\tIntroduce the principles of the radiative transfer problem that underpins most remote sensing measurements and how it is modelled and solved; applications of radiative transfer modelling to terrestrial vegetation;\n\tIntroduce students to wider remote sensing organisations, policy and careers through invited seminars from professionals in the field, including former RSEM students.\nSessions - all delivered by Professor Disney unless 
specified.\n\nIntroduction to remote sensing\n\tRadiation principles, EM spectrum, blackbody\n\tEM spectrum terms, definitions and concepts\n\tRadiative transfer principles and assumptions\n\tSpatial, spectral resolution and sampling\n\tPre-processing chain, ground segment, radiometric resolution, scanners; poster discussion\n\tActive remote sensing: LIDAR – principles and applications\n\tActive remote sensing: RADAR –principles and applications\n\tNew missions and technologies including LIDAR, UAVs, Copernicus etc.\n\tApplication discussions around assessed posters\nThird year undergraduate students selecting the module via the FHEQ Level 6 route would usually be expected to have taken GEOG0027 Remote Sensing (2nd year). It is possible to do without but you should consult Prof. Disney in the first instance on this.\n\nMode of study\n\nIn person\n\nUseful pre-requisite knowledge\n\nWhile there are no specific pre-requisites for the module, it contains basic physics and elementary maths. The maths included is GCSE-level – at most some simple geometry and algebra – and forms only a small part of the content. The module caters for students from a wide variety of backgrounds and experience has shown that students from less quantitative backgrounds typically cope fine. If you have any doubts about this aspect, by all means contact Prof. Disney and he can discuss this with you in relation to your background.\n\nTransferrable career skills developed in the module\n\nCritical thinking: ability to assess data and ideas\n\tCommunication: academic writing, through guided reading and application discussions\n\tCommunication: conveying written ideas to non-experts, via the application discussions and assessed poster\n\tPresentation skills: via the assessed poster session\n\tStatistical / quantitative analysis: introduction to simple physical principles\n\tGIS / spatial data: understanding important assumptions underpinning spatial data capture, display and presentation.\n \n"
Yes! These both describe introductory courses.
- Estimate the effect of your chosen representation on the number of students enrolled on a module. Is the effect significantly different from zero? Discuss whether this is likely to represent a causal effect.
Call:
lm(formula = n_students ~ intro_dictionary, data = modules)
Residuals:
Min 1Q Median 3Q Max
-129.03 -29.06 -19.94 2.94 891.06
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.0622 1.0183 28.54 <2e-16 ***
intro_dictionary 5.9392 0.4807 12.36 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 67.1 on 6251 degrees of freedom
Multiple R-squared: 0.02384, Adjusted R-squared: 0.02368
F-statistic: 152.6 on 1 and 6251 DF, p-value: < 2.2e-16
There is clear evidence of a positive association between the dictionary scores and the number of students variable. In particular, each additional word a module description uses from the ‘introduction’ dictionary is associated with about 6 additional students enrolled on the module.
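To make the size of this association concrete, the fitted line can be evaluated by hand. The sketch below simply plugs the two coefficients reported above into the linear prediction. The dictionary counts chosen (0, 1, 2 and 5) are arbitrary illustrative values.

```r
# Coefficients copied from the regression output above
intercept <- 29.0622
slope     <- 5.9392

# Predicted enrollment for descriptions containing 0, 1, 2 and 5 dictionary words
intro_counts <- c(0, 1, 2, 5)
predicted <- intercept + slope * intro_counts

data.frame(intro_counts, predicted)
```

Note also that the R-squared of about 0.024 in the output means the dictionary score accounts for only around 2.4% of the variation in student numbers, even though the coefficient is precisely estimated.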
We should not be confident, however, that this represents the causal effect of a module being introductory, as there may be a large number of confounding factors. For instance, it might be the case that some departments offer more introductory modules and those departments also have higher student numbers. Similarly, there may be other aspects of the module descriptions that confound this effect. For instance, perhaps the course descriptions for introductory modules are written using more exciting language, or language that is less complicated, than other modules, and it is this language that encourages students to enroll.
In general, as we do not have a randomly assigned treatment here, we should not interpret the naive regression estimate as representative of a true causal effect. In the next question, we will try to control for a number of factors in order to strengthen our confidence in the inferences that we make here.
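The department example can be made concrete with a small simulation of omitted variable bias. Everything here is hypothetical: `large_dept` stands in for a department-level confounder, and the true effect of the dictionary score is set to 2 by construction.

```r
set.seed(42)
n <- 5000

# Hypothetical confounder: whether the module sits in a large department
large_dept <- rbinom(n, 1, 0.5)

# Large departments use more 'introductory' words AND have more students
intro_score <- rpois(n, lambda = 1 + 2 * large_dept)
true_effect <- 2
n_students  <- 20 + true_effect * intro_score + 30 * large_dept +
  rnorm(n, sd = 10)

# Regression without and with the confounder
naive      <- lm(n_students ~ intro_score)
controlled <- lm(n_students ~ intro_score + large_dept)

round(c(naive      = coef(naive)[["intro_score"]],
        controlled = coef(controlled)[["intro_score"]]), 2)
```

The naive coefficient overstates the effect because it picks up the department difference; controlling for the confounder recovers something close to the true value. This only works in the simulation because the confounder is observed, which is exactly what we cannot guarantee with real data.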
- If you think it is required on the basis of your answer to question 2, re-estimate the effect of your chosen representation but this time controlling for some other variables. Does this change the effect of your concept of interest?
I have added controls for a number of factors here, consistent with the discussion above. In particular, in addition to controlling for the department to which the module belongs, and the level of the module, I have also added controls for two additional text-based measures.
First, I have calculated the Flesch reading ease score for each module description. Second, I have applied a second dictionary which tries to capture the use of ‘exciting’ (or, more broadly, enthusiastic) language in each of the module descriptions. Both of these are plausibly confounding variables, as we might expect them to correlate with whether or not a module is introductory and also the number of students we might expect to take the module.
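For reference, the standard Flesch reading ease score combines average sentence length and average syllables per word. The formula below is the usual textbook definition, which I am assuming is what the Flesch measure computed here corresponds to:

\[
\text{Flesch} = 206.835 - 1.015 \times \frac{\text{total words}}{\text{total sentences}} - 84.6 \times \frac{\text{total syllables}}{\text{total words}}
\]

Higher scores indicate text that is easier to read, so longer sentences and longer words both lower the score.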
Once I have constructed these measures for each document, I include them in a regression alongside the intro_dictionary variable that I created above.
library(quanteda.textstats)
# Calculate readability scores
modules$readability <- textstat_readability(modules$module_description)$Flesch
Warning: NA is replaced by empty string
# Apply "exciting" dictionary
exciting_words <- c("exciting", "interesting", "fascinating", "extraordinary", "brilliant", "awesome", "innovative", "dynamic", "intriguing", "captivating", "engaging", "absorbing", "compelling", "thought-provoking", "entertaining", "informative", "creative", "original", "inventive", "imaginative", "ingenious", "new", "groundbreaking", "pioneering", "energetic", "active", "lively", "vibrant", "forceful", "powerful", "intense")
exciting_dictionary <- dictionary(list(exciting = exciting_words))
exciting_dfm <- modules_dfm %>% dfm_lookup(exciting_dictionary)
# Assign dictionary scores to modules data
modules$exciting_dictionary <- as.numeric(exciting_dfm[,1])
# Estimate model with controls
model_2 <- lm(n_students ~ intro_dictionary + readability + exciting_dictionary + level + teaching_department , data = modules)
# Print the first five coefficients, standard errors, p-values, etc
coef(summary(model_2))[1:5,]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 87.00264117 5.61811010 15.4861047 4.290901e-53
intro_dictionary 2.15320671 0.42820654 5.0284303 5.085292e-07
readability -0.07239206 0.03540187 -2.0448651 4.091066e-02
exciting_dictionary -0.69430941 0.71050324 -0.9772079 3.285047e-01
levelFHEQ Level 4|FHEQ Level 5 58.65871801 38.29624734 1.5317093 1.256456e-01
The results suggest that the inclusion of these variables has affected the estimated effect of the ‘introductory’ dictionary scores that we calculated previously. In this specification, although the coefficient associated with the intro_dictionary variable remains significantly different from zero, it is about a third of the magnitude of the coefficient in the naive regression. This implies that there is some confounding going on here, at least some of which we are now capturing with the new control variables.
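The comparison between the two specifications can be checked with a little arithmetic on the reported estimates. The sketch below copies the coefficients from the two regression outputs and uses a normal approximation for the confidence interval, which is my assumption rather than something computed in the original.

```r
# Estimates copied from the two regression outputs above
naive_est      <- 5.9392
controlled_est <- 2.15320671
controlled_se  <- 0.42820654

# Ratio of the controlled to the naive coefficient
ratio <- controlled_est / naive_est

# Approximate 95% confidence interval (normal approximation)
ci <- controlled_est + c(-1, 1) * qnorm(0.975) * controlled_se

round(ratio, 2)
round(ci, 2)
```

The interval excludes zero, consistent with the small p-value in the table, while the ratio confirms that the controlled estimate is a little over a third of the naive one.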
This example also illustrates the difficulties of making causal inferences with non-randomly assigned treatments. Does the coefficient on the treatment variable in model 2 represent the causal effect of introductory modules on student enrollment? We don’t know! In order for this to be interpreted as a causal effect, we have to be convinced that we have captured all of the potentially confounding factors. It is hard to assess whether that is true here, and so we should be very cautious in our interpretation of these results.
- Create at least one plot or table which illustrates the results of your analysis. Upload it alongside a description of what you have done, and the interpretation of your result, to this Moodle page.