9  Causal Inference with Text

9.1 Text as Treatment; Text as Outcome

In today’s seminar we will use two datasets to explore the use of text as an outcome and text as a treatment in causal inference settings. In contrast to previous weeks, I have not provided any code to get you going on these questions. Instead, you are expected to go through the code from previous weeks to make some sensible analysis decisions that will allow you to measure some interesting quantities and use these in downstream analyses.

The solutions I provide at the end of the week will give one approach to answering the questions below, but there is no single correct “solution” here. Indeed, one of the broader learning goals here is to emphasise that the results from any given causal analysis using text-based measures are inherently dependent on the analysis decisions you make. Accordingly, if you get very different results from the solutions posted, that simply illustrates the challenges associated with combining text-as-data methods and causal inference strategies.

9.2 Text as Outcome

What is the effect of anxiety on attitudes towards immigrants? Gadarian and Albertson (2013) examine how negative emotions can influence political behavior and attitudes. They conduct an experiment in which they induce anxiety about immigration for a set of survey respondents.[1] In order to evoke anxiety about immigration, respondents in the treatment condition read the following prompt:

  • [1] Whether this type of experimental prompt is effective at stimulating the desired emotional state among survey respondents is the subject of ongoing debate. We will focus on the effects of this treatment on the open-text outcome, side-stepping the issue of whether the treatment is actually encouraging anxiety or something else.

  • Now, we’d like you to take a moment to think about the debate over immigration in the United States. When you think about immigration what makes you worried? Please list everything that comes to mind.

    By contrast, respondents in the control group were asked to list everything that came to mind when they thought about immigration with the prompt:

    First, we’d like you to take a moment to think about the debate over immigration in the United States. When you think about immigration, what do you think of?

    Both sets of respondents were provided with text boxes in which they could list their thoughts. Our goal will be to analyse these texts, asking whether the anxiety-inducing prompt caused respondents to give systematically different sets of responses than the generic thought-listing prompt.

    9.2.1 Data

    The immig_thoughts.csv data contains two variables:

    Variable   Description
    treat      Variable indicating whether the respondent was in the treatment group ("worried") or the control group ("think").
    response   The text produced by the respondent in response to the prompt.

    9.2.2 Tasks

    1. Why does the random assignment of the treatment allow us to make causal inferences in the context of this example? Why do Gadarian and Albertson not simply compare people who have higher levels of concern about immigration to those with lower levels of concern?

    The random assignment of individuals to each condition means that, in expectation, the two groups of respondents will be identical to each other in terms of observed and unobserved covariates. That is, selection bias is not generally a concern when the treatment is randomly assigned.

    By contrast, were we to simply compare the respondents with higher and lower levels of concern about immigration, we would be concerned that any differences in the outcome between those two groups might reflect some source of confounding. For instance, those who are more concerned about immigration might have different ideologies than those who are less concerned, or might come from different parts of the country, or might have different education levels, and so on. Therefore, in order to identify the causal effect of immigration anxiety, we need a design that avoids sources of confounding like these, and that is what the randomized treatment assignment provides.
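    The contrast between the two designs can be illustrated with a small simulation. This is only a sketch with made-up numbers: the variable names (ideology, concern_obs, treat_rand) and the effect sizes are assumptions chosen for illustration, not quantities taken from Gadarian and Albertson's data.

```r
# Toy simulation: every number here is invented for illustration only
set.seed(123)
n <- 10000

# An unobserved confounder (e.g. ideology) that affects both concern and the outcome
ideology <- rnorm(n)

# Observational world: concern about immigration depends on ideology
concern_obs <- as.numeric(ideology + rnorm(n) > 0)

# Experimental world: treatment is assigned by coin flip, independently of ideology
treat_rand <- rbinom(n, size = 1, prob = 0.5)

# The true effect on the outcome is fixed at 1 in both worlds
true_effect <- 1
y_obs  <- true_effect * concern_obs + 2 * ideology + rnorm(n)
y_rand <- true_effect * treat_rand  + 2 * ideology + rnorm(n)

# Naive difference in means under each design
diff_obs  <- mean(y_obs[concern_obs == 1]) - mean(y_obs[concern_obs == 0])
diff_rand <- mean(y_rand[treat_rand == 1]) - mean(y_rand[treat_rand == 0])

# Balance check: difference in mean ideology between the two groups
bal_obs  <- mean(ideology[concern_obs == 1]) - mean(ideology[concern_obs == 0])
bal_rand <- mean(ideology[treat_rand == 1]) - mean(ideology[treat_rand == 0])

round(c(observational = diff_obs, randomized = diff_rand), 2)
round(c(observational = bal_obs, randomized = bal_rand), 2)
```

    In this toy setup the randomized comparison should land close to the true effect of 1, while the observational comparison is inflated because the more concerned group also differs on ideology.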
    2. Choose a measurement strategy that we have covered at some point on this course to represent the texts in the response variable. Think about the types of information that might be present in the texts and which might be linked to the treatment of interest. You can choose any method we have studied (dictionaries, topic models, supervised learning, etc.), but your choice should be informed by the substantive case we are working with.

    You could have selected any number of potential measurement strategies here. I have opted to fit a structural topic model with 10 topics and with the treat variable as a prevalence covariate.

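    library(dplyr)    # pipe operator (%>%)
    library(quanteda) # corpus(), tokens(), dfm()
    library(stm)
    
    # Load the data (assumes immig_thoughts.csv is in the working directory)
    immig <- read.csv("immig_thoughts.csv", stringsAsFactors = FALSE)
    
    # Create corpus
    immig_corpus <- immig %>%
      corpus(text_field = "response")
    
    # Create DFM
    immig_dfm <- immig_corpus %>%
      tokens(remove_punct = TRUE) %>%
      dfm() %>%
      dfm_remove(stopwords("en"))
    
    # Estimate STM
    stm_out <- stm(immig_dfm,
                   K = 10,
                   prevalence = ~treat,
                   verbose = FALSE) # verbose = FALSE prevents stm from printing model estimation updates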
    3. “Validate” the output of your measurement approach (i.e. look at the high and low scoring documents, examine any estimated parameters of interest, etc.).
    # View topics
    labelTopics(stm_out)
    Topic 1 Top Words:
         Highest Prob: immigrants, people, citizens, country, u.s, issue, want 
         FREX: issue, support, find, u.s, culture, anything, terrorism 
         Lift: --, -english, -jobs, -more, -us, 18, 95 
         Score: issue, support, find, congress, consider, heavy, kicked 
    Topic 2 Top Words:
         Highest Prob: people, jobs, many, better, life, way, states 
         FREX: states, better, united, back, life, way, one 
         Lift: activity, advantage, along, americanized, barriers, beauacrats, bending 
         Score: states, united, back, wanting, one, send, better 
    Topic 3 Top Words:
         Highest Prob: security, social, welfare, immigrants, english, healthcare, job 
         FREX: healthcare, security, social, unfair, society, welfare, law 
         Lift: 11, 9, aboration, accountability, actively, adjust, agree 
         Score: assimilate, security, society, social, healthcare, politicians, law 
    Topic 4 Top Words:
         Highest Prob: immigration, think, need, legal, workers, legally, borders 
         FREX: workers, need, immigration, legal, much, process, long 
         Lift: #2, alilens, allegiance, alter, arent, asians, bc 
         Score: legal, immigration, need, process, think, workers, fine 
    Topic 5 Top Words:
         Highest Prob: illegal, immigrants, border, mexico, government, paying, etc 
         FREX: illegal, control, fences, mexican, im, mexico, deported 
         Lift: =, agents, assilum, assistance, car, child, contries 
         Score: illegal, control, im, mexican, fences, mexico, close 
    Topic 6 Top Words:
         Highest Prob: jobs, taxes, pay, take, taking, citizens, care 
         FREX: taking, taxes, take, pay, jobs, getting, loss 
         Lift: accomodated, accomodating, actually, alcohol, annoying, assimilated, bankrupt 
         Score: pay, taxes, jobs, taking, take, crime, loss 
    Topic 7 Top Words:
         Highest Prob: immigrants, get, many, english, americans, u.s, 1 
         FREX: 1, 3, u.s, drain, contribute, native, medical 
         Lift: 125, 600, abolition, accross, act, affects, afford 
         Score: 1, 3, recieve, afford, last, drain, wanting 
    Topic 8 Top Words:
         Highest Prob: people, country, coming, us, work, think, poor 
         FREX: coming, illegally, anyone, poor, people, us, country 
         Lift: abuses, access, adapting, agreed, americas, ancious, anymore.and 
         Score: coming, illegally, poor, anyone, people, countries, something 
    Topic 9 Top Words:
         Highest Prob: illegals, americans, come, benefits, get, immigrants, cost 
         FREX: illegals, benefits, cost, just, americans, increasing, lived 
         Lift: 12, afraid, allour, allowed, allowing, america's, amounts 
         Score: cost, illegals, benefits, failure, values, robbing, dobbs 
    Topic 10 Top Words:
         Highest Prob: hospitals, care, schools, increased, immigration, mexico, state 
         FREX: hospitals, state, increased, building, deal, due, resources 
         Lift: -finding, abused, alternative, among, available, benifits, big 
         Score: state, hospitals, deal, due, capitalist, question, terms 
    # Plot topics
    plot(stm_out)

    findThoughts(stm_out, 
                 texts = immig$response,
                 topics = 1:10,
                 n = 1)
    
     Topic 1: 
         as an arizona resident who lives 18 miles from the mexican-us border, and who has also spoken to some of these illegals while hiking in the huachuca mtns., i know these people, mostly, come here out of sheer desperation.  sure, some are the same lazy, fat, undereducated jerks that lurk around our own mid-level businesses.  but most simply are people who want what we all do: a comfortable life with as little thinking and suffering as possible, while reproducing at will.  they have told me, babies in arms,that if they remain at home, they have no future but an early death.  that they, maybe, should reduce their birth rate and/or not have children at all, if they cannot support them, simply will never occur to citizens of a catholic country, living a day's walk from a rich country that can be easily milked for what they consider a fortune in life support.  there is no answer to this, so long as 95% of mexico's wealth is controlled by 5% of its people, and the only riches the others have lie in their children. 
     Topic 2: 
         i think if they come in to the country
    ileaglely they should be depored . do it the right way or stay the hell home
    and if you do get here the correct way learn the damn langwige(sp) and learn our trafic laws. 
     Topic 3: 
         i am enthusiastic about legal immigrants willing to assimilate and be productive members of american society.
    
    i worry that illegal immigrants have no incentive, and often no desire to assimilate.
    
    i worry that illegal immigrants are disproportionally involved in violent crimes, as well as drug and property crimes.
    
    i worry that our culture and language may be diminished by those not willing to assimilate.
    
    i worry that that our our society devotes more and more of it's resources providing health care, education and other services to those not willing to assimilate.
    
    i worry that our careless attitude about enforcing our border and immigration policy will lead to another 9/11 style terrorist attack. 
     Topic 4: 
         when i think of immigration i think of people who enter this country legally, who go through the proper immigration process, no matter how long it takes.  i think of people who are willing to learn the english language, make an honest living, honor our country and pledge allegiance to our flag.  those who come to america by any other means, who sneak in here and file false paperwork, who think they have the right to drive and have a license, who manage to obtain false ssn's don't deserve to be here, and our borders need to be much, much more secure. 
     Topic 5: 
         close borders.
    fine employers who employ illegal immigrats (im).
    remove children of im from public schools.
    no ssa or welfare for ims.
    when picked up by police or any other government institution, they should be taken into custody and deported.
    if an im is deported and returns to the us, they should be jailed and the family or mexican government made to paid for the cost of upkeep. 
     Topic 6: 
         expense.  why we are not enforcing the laws we already have on the books. possibility of terrorists getting in more easily. why are we accomodating the aliens by printing things in other languages and having to push buttons to hear things in our own language.  why do we give free medical services to them when we can even take care of our own first. insurances going up to cover immigrants.  it seems that they are better accomodated than our own citizens. they are given rights, which they have not earned. i'm retired and i'm tired of costs going up because they are able to use and abuse our system to our debtrement. 
     Topic 7: 
         poor people wanting a better life because their own country is so full of 
    corruption. they have found it too easy to slip accross the border and our government must have some reasons for wanting them here to keep our wages lower. it has kept pur young from summer jobs. they are a major drain on 
    our health care system as well as all the welfare that many get. i don't what the answer is to fixing the problem with the ones already here but amnesty as it was done the last time is not the answer either. i personally helped 3 women get their papers the last time and as far as i know only 1 became a citizen. i beleive they should all speak english and we shouldn't have to pay extra so they can learn. it should not be our job t learn spanish. 
     Topic 8: 
         i think with our (us) needing more  help here ,and less over seas  ,i think there should be a complet stop of letting people in the us so anyone of them cant kill anyone of us anymore.and all of our men and woman shoud be brought back to the us , instead of fighting a war we dont belong in ... 
     Topic 9: 
         robbing americans of their social security; rising costs of social programs to cover illegals; rising costs of incarceration due to illegals; not being able to understand what they are saying (spanish and other languages); the fact that illegal immigrants are here with their children who go through our school systems then cannot get financial aid for college because their parents never filed for naturalization leaving the kids to work dead end jobs and never get naturalized; the fact that they are doing the work that americans don't want to do anyway, and then the americans complain about it; border walls; increasing amounts of drugs in america because it is so easy to get into our country 
     Topic 10: 
         i am most worried about the conception that forms in the relation of the modern nation state to that of the foreigner.  a relation of same and other is established that marginalizes the other in such a way as to turn hospitality into slavery.  the largest worries of immigration manifest themselves as concerns over economic effects primarily because the nation state has communally devolved into a neoliberal capitalist organization.  it is the corporisation of the state that governs the question of immigration, the question that is then framed in terms of resources and production.  my worry is that capitalist democracy will continue to perpetuate itself and replace community with individualism, consumerism, and the great american freedom-the freedom to buy

    There is some evidence here that we are capturing meaningful topics relating to different political attitudes about immigration. Judging by the top words above, there appear to be topics about social security, welfare and healthcare (topic 3); legal immigration and workers (topic 4); illegal immigration and the Mexican border (topic 5); jobs and taxes (topic 6); hospitals and schools (topic 10); and so on.

    In general, the point here is that this is one possible representation of these texts and it is a matter of judgement as to whether it is a reasonable representation. There is some evidence here of repetition across topics (that is, though they are mostly coherent, the topics may be insufficiently exclusive) and so it would perhaps be better to estimate a smaller number of topics. If we were to do so, we would of course estimate a different set of treatment effects in the next step! It is for this reason that Egami et al. (2022) advocate for an approach in which we split the discovery and estimation steps in any causal inference process that uses text as either a treatment or an outcome.
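    The sample split behind that discovery/estimation idea can be sketched in a few lines of base R. This is a minimal illustration, not the procedure used in these solutions: immig_toy is a hypothetical stand-in with the same column names as the immig data, and the 50/50 split and the seed are arbitrary assumptions.

```r
set.seed(2024) # arbitrary seed so that the split is reproducible

# Hypothetical stand-in for the immig data: same column names, made-up rows
immig_toy <- data.frame(
  treat = rep(c("worried", "think"), each = 50),
  response = paste("response", 1:100),
  stringsAsFactors = FALSE
)

# Randomly allocate half of the documents to the discovery set
n <- nrow(immig_toy)
discovery_idx <- sample(n, size = floor(n / 2))

immig_discovery  <- immig_toy[discovery_idx, ]  # explore here: choose K, preprocessing, topic labels
immig_estimation <- immig_toy[-discovery_idx, ] # hold out: estimate effects once decisions are fixed

# Check that both treatment arms appear in each half
table(immig_discovery$treat)
table(immig_estimation$treat)
```

    All of the exploratory decisions would be made on immig_discovery only, and the resulting measurement would then be applied unchanged to immig_estimation. For a topic model, the fitNewDocuments() function in the stm package is one possible route for scoring the held-out documents, although that step is not shown here.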

    4. Estimate the effect of the treatment on the outcome variable or variables that you created in answer to the question above. Does the treatment have an effect? Is it significantly different from zero?
    # Estimate treatment effects
    stm_effects <- estimateEffect(~treat,
                                  stm_out,
                                  metadata = docvars(immig_dfm))
    
    plot.estimateEffect(stm_effects,
                        model = stm_out,
                        covariate = "treat", # Covariate for which we would like to plot effects
                        topics = 1:10, # Topics to plot
                        method = "difference", # Plot difference between treatment and control
                        cov.value1 = "worried", # Treatment group
                        cov.value2 = "think", # Control group
                        labeltype = "frex", # Label type (frex labels)
                        n = 3, # Number of words to use in label
                        xlim = c(-.3,.2), # Limits of the x-axis
                        verbose.labels = FALSE) # Remove unnecessary information from labels

    There is evidence of significant treatment effects for a number of these topics! Treatment group respondents speak more about violence; the threat to American jobs; and crime and hospitals. Control group respondents speak more about Mexican workers and a topic relating to ‘think, process, everyone’.
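    To make the plotted quantity concrete, the sketch below computes a treatment-minus-control difference in the mean proportion of a single topic, with a confidence interval from a linear regression. The topic proportions are simulated (the vector theta_k and its group means are made-up assumptions); in a real analysis they would come from the fitted model, and estimateEffect() additionally accounts for uncertainty in the estimated proportions, which this simplified version ignores.

```r
set.seed(1)
n <- 200

# Treatment indicator using the same labels as the treat variable
treat <- rep(c("think", "worried"), each = n / 2)

# Simulated proportions for one topic (made-up values):
# control mean of 0.10, treatment mean of 0.18
theta_k <- c(rbeta(n / 2, 2, 18), rbeta(n / 2, 3.6, 16.4))

# Regressing the proportion on treatment gives the difference in means
fit <- lm(theta_k ~ treat)
est <- coef(fit)["treatworried"]
ci  <- confint(fit)["treatworried", ]

# The coefficient matches the raw difference in group means
raw_diff <- mean(theta_k[treat == "worried"]) - mean(theta_k[treat == "think"])

round(c(estimate = unname(est), lower = unname(ci[1]), upper = unname(ci[2])), 3)
```

    An interval that excludes zero corresponds to a topic whose bar does not cross the vertical zero line in the effects plot.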
    5. In order to estimate a treatment effect, you had to make a series of decisions that might have some influence on your estimates. Think about these now: which decisions did you make, and did you have some principled reason for making them? Try replicating your analysis but making different decisions (e.g. change the value of \(K\) in a topic model; use different feature selection decisions; pick different training documents, etc.). What are the consequences of these changes for your final estimates? What does this tell you about the challenges of using quantitative text analysis methods for making causal inferences?

    Even conditional on selecting the STM, there are other choices I could have made – I might have made different feature selection decisions, selected a different value for K, and so on.

    Let’s try changing the number of topics and see what happens to the estimated treatment effects.

    library(stm)
    # Estimate STM
    stm_out_new <- stm(immig_dfm,
                   K = 15,
                   prevalence = ~treat, 
                   verbose = FALSE) # verbose = FALSE prevents stm from printing model estimation updates
    
    # Estimate treatment effects
    stm_effects_new <- estimateEffect(~treat,
                                  stm_out_new,
                                  metadata = docvars(immig_dfm))
    
    plot.estimateEffect(stm_effects_new, # Effects estimated from the 15-topic model
                        model = stm_out_new,
                        covariate = "treat",
                        topics = 1:15, # Plot all 15 topics
                        method = "difference",
                        cov.value1 = "worried",
                        cov.value2 = "think",
                        labeltype = "frex",
                        n = 3,
                        xlim = c(-.3,.2),
                        verbose.labels = FALSE)

    The results are similar but there are some tangible changes. In particular, we now have something that looks more specifically like a ‘language’ topic, which is used more by the treatment group and a ‘schools’ topic which is used more by the control group.

    Again, the key point here is that the estimated treatment effects depend on the representation of the texts that we choose: changing only the number of topics was enough to alter some of them. We should be aware of this issue whenever using a text-based outcome in a causal inference analysis!
    6. Create a plot which illustrates one of the treatment effects that you have estimated.

    7. Upload your answers, code and results on this Moodle page.

    9.3 Homework: Text as Treatment

    What are the features of module descriptions that make modules more popular with students? Is it when they use exciting, dynamic language? Is it when they are especially readable? Is it when they suggest the course is easy? The UCL module catalogue, which we have explored in previous weeks, includes information on the number of students who enrolled on each module in previous years. In this part of the seminar, you will construct a representation of the texts of the module descriptions and use that to predict student enrollments.

    9.3.1 Data

    The module_catalogue.Rdata file, which can be downloaded from the top of the page, contains the following variables:

    library(dplyr) # provides glimpse() and the pipe operator
    
    load("module_catalogue.Rdata")
    glimpse(modules)
    Rows: 6,253
    Columns: 12
    $ teaching_department    <chr> "Greek and Latin", "Greek and Latin", "Bartlett…
    $ level                  <chr> "FHEQ Level 5", "FHEQ Level 4", "FHEQ Level 7",…
    $ intended_teaching_term <chr> "Term 1|Term 2", "Term 1", "Term 1", "Term 2", …
    $ credit_value           <chr> "15", "15", "15", "30", "15", "15", "15", "15",…
    $ mode                   <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
    $ subject                <chr> "Ancient Greek|Ancient Languages and Cultures|C…
    $ keywords               <chr> "ANCIENT GREEK|LANGUAGE", "ANCIENT GREEK|LANGUA…
    $ description            <chr> "Teaching Delivery: This module is taught in 20…
    $ title                  <chr> "Advanced Greek A (GREK0009)", "Greek for Begin…
    $ module_description     <chr> "Teaching Delivery: This module is taught in 20…
    $ n_students             <chr> "2", "15", "22", "151", "117", "68", "0", "23",…
    $ module_leader          <chr> "Dr Fiachra Mac Gorain", "Dr Elena Cagnoli Fiec…

    9.3.2 Tasks

    1. Choose a representation for the module descriptions that you think is likely to be predictive of student numbers on a module. As with the earlier task, this could come from any measurement strategy we have covered on the course. Implement your measurement strategy and use some face validity checks to convince yourself that you are capturing the concept that you intended to capture.

    One option here would be to use a dictionary to measure some aspect of a module and evaluate whether that has an effect on the number of enrolled students. I have done that here, writing a dictionary that is intended to measure whether a module covers introductory material. This is based on the ‘theory’ that introductory modules will be more popular with students than more advanced modules.

    We start by defining the dictionary and applying it to a dfm of the module descriptions.

    # Define dictionary
    intro_words <- c("introductory", "basic", "first_year", "essential", "foundation", "compulsory", "simple", "required", "necessary", "principles", "requirements", "fundamental", "require", "requires", "example", "requirement")
    
    intro_dictionary <- dictionary(list(intro = intro_words))
    
    # Create DFM
    modules_corpus <- modules %>% 
      corpus(text_field = "module_description")
    Warning: NA is replaced by empty string
    modules_dfm <- modules_corpus %>% 
      tokens() %>% 
      tokens_ngrams(1:2) %>% 
      dfm()
    
    # Apply dictionary
    intro_dfm <- modules_dfm %>% 
      dfm_lookup(intro_dictionary)
    
    # Assign dictionary scores to modules data
    modules$intro_dictionary <- as.numeric(intro_dfm[,1])
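    The counting logic behind a dictionary score can be sketched in base R on a pair of invented module descriptions. This is an illustration only: count_matches is a hypothetical helper, the two texts and the three-word dictionary are made up, and the tokenisation is far cruder than quanteda's. It does follow the same convention of joining adjacent words with an underscore, which is why an entry like "first_year" can match.

```r
# A shortened, hypothetical version of the dictionary
intro_words_toy <- c("introductory", "basic", "first_year")

# Two invented module descriptions
texts <- c(
  m1 = "An introductory module covering basic statistics for first year students.",
  m2 = "An advanced research seminar."
)

# Hypothetical helper: count dictionary matches among unigrams and bigrams
count_matches <- function(text, dict) {
  # Lowercase, then split on anything that is not a letter
  unigrams <- strsplit(tolower(text), "[^a-z]+")[[1]]
  unigrams <- unigrams[unigrams != ""]
  # Join adjacent words with "_" to form bigrams
  bigrams <- if (length(unigrams) > 1) {
    paste(head(unigrams, -1), tail(unigrams, -1), sep = "_")
  } else {
    character(0)
  }
  sum(c(unigrams, bigrams) %in% dict)
}

scores <- vapply(texts, count_matches, integer(1), dict = intro_words_toy)
scores
```

    Here m1 scores 3 ("introductory", "basic" and the bigram "first_year") and m2 scores 0.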

    For validation, we can look at the top scoring texts and we can also see how our dictionary score varies with the ‘level’ of the module. See here for a description of the meaning of these levels.

    modules %>%
      group_by(level) %>%
      summarise(intro_mean = mean(intro_dictionary))
    # A tibble: 19 × 2
       level                                  intro_mean
       <chr>                                       <dbl>
     1 FHEQ Level 4                                2.04 
     2 FHEQ Level 4|FHEQ Level 5                   1.5  
     3 FHEQ Level 4|FHEQ Level 7                   1.29 
     4 FHEQ Level 5                                1.24 
     5 FHEQ Level 5|FHEQ Level 4                   2    
     6 FHEQ Level 5|FHEQ Level 6                   1.75 
     7 FHEQ Level 5|FHEQ Level 7                   1.53 
     8 FHEQ Level 6                                0.987
     9 FHEQ Level 6|FHEQ Level 4                   0    
    10 FHEQ Level 6|FHEQ Level 5                   0.6  
    11 FHEQ Level 6|FHEQ Level 5|FHEQ Level 7      0    
    12 FHEQ Level 6|FHEQ Level 7                   0.921
    13 FHEQ Level 7                                1.11 
    14 FHEQ Level 7|FHEQ Level 4                   1.5  
    15 FHEQ Level 7|FHEQ Level 5                   0.407
    16 FHEQ Level 7|FHEQ Level 5|FHEQ Level 6      0    
    17 FHEQ Level 7|FHEQ Level 6                   0.946
    18 FHEQ Level 7|FHEQ Level 6|FHEQ Level 5      2    
    19 FHEQ Level 8                                0.391

    This is broadly encouraging: Level 4 modules have higher average scores on our ‘introductory’ dictionary than any other category, while Level 8 modules have the lowest average score of the single-level categories (a few of the mixed-level categories score zero).

    Do the top scoring texts also make sense?

    modules$module_description[order(modules$intro_dictionary, decreasing = TRUE)[1:2]]
    [1] "Overview:\n\nThis module provides an introduction to Mechanical Engineering, covering fundamental concepts of Thermofluids and Applied Mechanics (Statics).\n\nThe Thermofluids part of the module aims to teach fundamentals of thermofluid sciences. Building on the mathematical skills and physics learning from the A-levels and the concurrent first year Mathematics module, the basic concepts of control volume and control mass are introduced – this teaches students how to analyse systems with and without flows. These fundamental features are then used to perform massand energy balance – both in isothermal and non-isothermal systems – with various levels of assumptions. The energy balance is introduced via the first law of Thermodynamics and solution of several analytical problems drawn from practical engineering applications.\n\nThe Statics part will aim to teach the basic analytical methods, that is, the fundamental concepts and techniques of engineering mechanics (Statics). Building on mathematical skills from A-levels Mathematics (including, for some students, Mechanics modules) and the concurrent first year Mathematics module, basic concepts of Statics are introduced, practiced and applied to simple engineering problems. 
Students will obtain modelling knowledge, tools and experience appropriate for a first year engineering module, providing the foundation for higher level modules.\n\nTopics covered:\n\nIntroduction to Thermofluids\n\nIntroduction to thermodynamics and related concepts of fluid mechanics\n\tAnalysing systems and devices Pressure and hydrostatic head\n\tMass balance and energy in isothermal conditions Flow analysis\n\tPrinciple of energy conservation\n\tFirst Law of Thermodynamics with flow Application of First Law\nApplied Mechanics – Statics\n\nForces and moments\n\tRigid body equilibrium\n\tFriction\n\tAnalysis of structures\n\tDistributed forces and centre of gravity\n\tInternal forces and moments in structures\nLearning outcomes:\n\nUpon completion of this module students will be able to:\n\nDemonstrate knowledge and understanding of the essential facts, concepts, theories and principles underlying fundamental thermofluids and statics.\n\tApply basic scientific principles of thermofluid sciences to solve simple engineering problems involving modelling and analysis of basic engineering systems involving: simple flow systems, using appropriate conservation principles, and applying the principles of dimensional analysis and physical similarity to engineering model testing.\n\tApply basic principles of statics and equilibrium to solve problems of structures, bridges and components under simple loading conditions and with/without friction.\n\tUse principles of equilibrium and control volume analysis to understand practical working of an engine and bridge-like structure; analyse experimental results and draw conclusions, given specific guidance to the appropriate background material answer basic questions on the operations of similar engineering systems.\n\tApply a range of techniques to analyse available evidence and solve simple engineering problems pertaining to thermofluid and equilibrium solid mechanics.\n \n"                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
    [2] "Aims:\n\nIn this foundation module, students will be made aware of the influences which have helped to shape the built environment. This is mostly considered from a technological standpoint but with reference to socio-economic factors, regulation and control where pertinent. The concept of the Building Team and its major players is also discussed.\n\nIt is intended to provide a general background to building and constructional techniques together with an overall philosophy as to why buildings are constructed in the way they are, and how this can affect the management of the process. This will enable students to progress to modern design techniques and realise the evolution of the contemporary built form.\n\nThe building technology is placed within the context of current performance, planning and environmental legislation. Different traditional and modern construction methods, materials and technology are introduced, including basic mechanical and electrical services engineering. Historical development of methods and materials is also discussed.\n\nObjectives: \n\nTo show the diversity of the building process and the wide range of interests involved. 
It assumes that students will learn construction best by starting with small simple buildings (such as low rise domestic dwellings) and then proceeding in later modules – to more complex structures.\n\nTo develop an understanding of the constituent components and materials in low rise domestic construction.\n To appreciate the production process and the work of the building team\n To be aware of the history of techniques and processes used in the construction process\n To understand the influence of style and the technological causes of change on development\n To understand the development of regulation and control of the built environment.\n To develop an understanding of simple building services.\n To understand the organisation and management principles that need to be applied to the construction process.\nLearning Outcomes: \n\nOn the completion of this module students should have an in depth understanding of the processes necessary for the construction of simple domestic buildings, and be capable of identifying construction processes, materials, regulation and management required to successfully complete a construction project. 
In particular, students should be able to:\n\nRecognise and understand the work and responsibilities of members of the building team.\n Understand the purpose of planning and building control legislation\n Gain a working knowledge of simple methods of construction, for all elements of a domestic dwelling.\n Have an understanding of simple services design and installation for domestic dwellings.\n Understand the responsibilities of managing the construction process.\nSyllabus:\n\nThe Building Team\n\nThe roles of the architect, quantity surveyor, engineer, project manager, construction manager and other professional parties\n The role of the contractor\n The role of the client\n Roles of other parties\nConstruction Technology\n\nThe environmental role of buildings\n Definition of primary & secondary elements, and finishes\n Substructure including excavation, simple foundation types & basements\n Superstructure concentrating on simple load bearing structures. Walls, floors, roofs and their functions in the external envelope.\n Performance requirements for doors, windows staircases\n Finishes to walls, roofs, floors and ceilings.\n Defects in buildings\n Services including hot and cold water systems, heating, above and below ground drainage. Simple electrical systems\nRegulation and Control\n\nIntroduction to planning\n Introduction to building regulations and building control\nManagement of the Construction Process\n\nRoles of management in the process\n Impact of choice of site and materials on the building process\n Introduction to planning, simple bar charts and networks.\nReading List\n\nEssential Reading\n\nChudley R. Green R, Hurst M, & Topliss S (2011) Construction Technology. 5th Ed. Routledge.\n\nRiley M, Cotgrove A (2013) Construction Technology 1 House Construction. 
3rd Ed.\n\nBailey G (2017) Lecture Notes\n\nFurther Reading\n\nGreen R, Osbourne D (2014) Mitchells Introduction to Building 3rd Edition Routledge\n\nChudley R, Greeno R (2016) Building Construction Handbook. 11th Ed. Routledge\n\nHall F (2016) Building Services & Equipment Vol 1. Routledge\n\nTrucker R, Alford S, (2014) Building Regulations Explained in Brief. Routledge.\n\nBillington MJ Crooks A Building Regulations Explained & Illustrated (2017) Wiley\n\n\n"

    Yes! These both describe introductory courses.

    1. Estimate the effect of your chosen representation on the number of students enrolled on a module. Is the effect significantly different from zero? Discuss whether this is likely to represent a causal effect.
    model_1 <- lm(n_students ~ intro_dictionary, data = modules)
    
    summary(model_1)
    
    Call:
    lm(formula = n_students ~ intro_dictionary, data = modules)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -129.19  -29.05  -19.95    2.95  891.05 
    
    Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
    (Intercept)       29.0511     1.0185   28.52   <2e-16 ***
    intro_dictionary   5.9495     0.4811   12.37   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 67.1 on 6251 degrees of freedom
    Multiple R-squared:  0.02388,   Adjusted R-squared:  0.02372 
    F-statistic: 152.9 on 1 and 6251 DF,  p-value: < 2.2e-16

    There is clear evidence of a positive association between the dictionary scores and the number of students variable. In particular, each additional word a module description uses from the ‘introduction’ dictionary is associated with about 6 additional students enrolled on the module.

    We should not be confident, however, that this represents the causal effect of a module being introductory, as there may be a large number of confounding factors. For instance, it might be the case that some departments offer more introductory modules and those departments also have higher student numbers. Similarly, other aspects of the module descriptions may confound this effect. For instance, perhaps the descriptions of introductory modules are written in more exciting or less complicated language than those of other modules, and it is this language, rather than the introductory content, that encourages students to enroll.

    In general, as we do not have a randomly assigned treatment here, we should not interpret the naive regression estimate as representative of a true causal effect. In the next question, we will try to control for a number of factors in order to strengthen our confidence in the inferences that we make here.
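
    To make the confounding concern concrete, here is a minimal simulated sketch of the department example above. All of the data and names (`large_dept`, `intro_words`, `true_effect`) are hypothetical and are not taken from the modules data; the numbers are chosen only to illustrate how an omitted confounder can inflate a naive regression estimate.

    ```r
    # Simulated illustration only: none of these values come from the modules data
    set.seed(1)
    n <- 5000

    # Hypothetical confounder: 1 = large department, 0 = small department
    large_dept <- rbinom(n, 1, 0.5)

    # Large departments write descriptions with more "introductory" words
    intro_words <- rpois(n, lambda = 1 + 2 * large_dept)

    # Assumed true effect of each introductory word on enrollment
    true_effect <- 2

    # Large departments also enroll more students, regardless of wording
    n_students <- 20 + true_effect * intro_words + 40 * large_dept + rnorm(n, sd = 10)

    # The naive regression omits the confounder; the adjusted one includes it
    naive    <- unname(coef(lm(n_students ~ intro_words))["intro_words"])
    adjusted <- unname(coef(lm(n_students ~ intro_words + large_dept))["intro_words"])

    round(c(true = true_effect, naive = naive, adjusted = adjusted), 2)
    ```

    In this simulation the naive coefficient is well above the assumed true effect, while the adjusted coefficient is close to it. With real data we never observe the true effect, so we cannot check in this way whether our controls are sufficient.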

    1. If you think it is required on the basis of your answer to question 2, re-estimate the effect of your chosen representation but this time controlling for some other variables. Does this change the effect of your concept of interest?

    I have added controls for a number of factors here, consistent with the discussion above. In particular, in addition to controlling for the department to which the module belongs and the level of the module, I have also added controls for two additional text-based measures.

    First, I have calculated the Flesch reading ease score for each module description. Second, I have applied a second dictionary which tries to capture the use of ‘exciting’ (or, more broadly, enthusiastic) language in each of the module descriptions. Both of these are plausibly confounding variables, as we might expect them to correlate with whether or not a module is introductory and also the number of students we might expect to take the module.
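
    For intuition about what the readability control measures, the standard Flesch reading ease formula can be sketched as below. The helper `flesch_reading_ease()` is a hypothetical name introduced for illustration, and the word, sentence, and syllable counts are made up; in the analysis itself the counting and scoring are done by `textstat_readability()`.

    ```r
    # Hypothetical helper: standard Flesch reading ease formula.
    # Higher scores indicate text that is easier to read.
    flesch_reading_ease <- function(n_words, n_sentences, n_syllables) {
      206.835 -
        1.015 * (n_words / n_sentences) -   # penalty for long sentences
        84.6  * (n_syllables / n_words)     # penalty for long words
    }

    # Made-up counts: 100 words, 5 sentences, 150 syllables
    flesch_reading_ease(n_words = 100, n_sentences = 5, n_syllables = 150)
    # 59.635
    ```

    Because the score falls as sentences and words get longer, a negative coefficient on this variable would mean that easier-to-read descriptions are associated with fewer students, holding the other variables fixed.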

    Once I have constructed these measures for each document, I include them in a regression alongside the intro_dictionary variable that I created above.

    library(quanteda.textstats)
    Warning: package 'quanteda.textstats' was built under R version 4.3.1
    # Calculate readability scores
    modules$readability <- textstat_readability(modules$module_description)$Flesch
    Warning: NA is replaced by empty string
    # Apply "exciting" dictionary
    exciting_words <- c("exciting", "interesting", "fascinating", "extraordinary", "brilliant", "awesome", "innovative", "dynamic", "intriguing", "captivating", "engaging", "absorbing", "compelling", "thought-provoking", "entertaining", "informative", "creative", "original", "inventive", "imaginative", "ingenious", "new", "groundbreaking", "pioneering", "energetic", "active", "lively", "vibrant", "forceful", "powerful", "intense")
    
    exciting_dictionary <- dictionary(list(exciting = exciting_words))
    
    exciting_dfm <- modules_dfm %>% dfm_lookup(exciting_dictionary)
    
    # Assign dictionary scores to modules data
    modules$exciting_dictionary <- as.numeric(exciting_dfm[,1])
    
    
    # Estimate model with controls
    model_2 <- lm(n_students ~ intro_dictionary + readability + exciting_dictionary + level + teaching_department, data = modules)
    
    # Print the first five coefficients, standard errors, p-values, etc
    coef(summary(model_2))[1:5,]
                                      Estimate  Std. Error    t value     Pr(>|t|)
    (Intercept)                    86.99130885  5.61836175 15.4833941 4.468539e-53
    intro_dictionary                2.15644194  0.42863270  5.0309786 5.018428e-07
    readability                    -0.07246599  0.03540012 -2.0470548 4.069513e-02
    exciting_dictionary            -0.69203782  0.71051022 -0.9740012 3.300943e-01
    levelFHEQ Level 4|FHEQ Level 5 58.66619762 38.29615979  1.5319081 1.255966e-01

    The results suggest that including these variables changes the estimated effect of the ‘introductory’ dictionary scores that we calculated previously. In this specification, although the coefficient on the intro_dictionary variable remains significantly different from zero, it is only about a third of the magnitude of the coefficient in the naive regression (2.16 versus 5.95). This implies that the naive estimate was confounded, and that the new control variables capture at least some of that confounding.
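
    The comparison between the two specifications can be checked directly from the printed output. The sketch below copies the intro_dictionary estimates and standard errors from the two regression tables above. The variable names are hypothetical, and the interval uses a normal approximation, which should be very close to the t-based interval given the thousands of residual degrees of freedom.

    ```r
    # Estimates and standard errors copied from the regression output above
    naive_est <- 5.9495       # intro_dictionary, model_1
    naive_se  <- 0.4811
    adj_est   <- 2.15644194   # intro_dictionary, model_2
    adj_se    <- 0.42863270

    # Ratio of the adjusted to the naive coefficient
    ratio <- adj_est / naive_est
    round(ratio, 2)
    # 0.36

    # Approximate 95% confidence intervals (normal approximation)
    z <- qnorm(0.975)
    naive_ci <- naive_est + c(-1, 1) * z * naive_se
    adj_ci   <- adj_est + c(-1, 1) * z * adj_se
    round(rbind(naive = naive_ci, adjusted = adj_ci), 2)
    ```

    With access to the fitted objects, `confint(model_1)` and `confint(model_2)` would give the exact t-based intervals.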

    This example also illustrates the difficulties of making causal inferences with non-randomly assigned treatments. Does the coefficient on the treatment variable in model 2 represent the causal effect of introductory modules on student enrollment? We don’t know! In order for this to be interpreted as a causal effect, we have to be convinced that we have captured all of the potentially confounding factors. It is hard to assess whether that is true here, so we should be very cautious in our interpretation of these results.

    1. Create at least one plot or table which illustrates the results of your analysis. Upload it alongside a description of what you have done, and the interpretation of your result, to this Moodle page.