5 Regression (Specification)

5.1 Overview

In the lecture this week, we discussed the multiple linear regression model, which extends the simple linear regression model introduced in the previous week so that we can look at multiple explanatory variables at once. We then discussed a variety of ways that we can use multiple regression to describe associations between different types of variables, introducing categorical variables as well as interactions and non-linear functions of variables. Finally, we introduced the concept of model fit, focusing in particular on \(R^2\) as a statistic to summarise the predictive performance of our models.

In seminar this week, we will cover the following topics:

  1. Use of the lm() command to fit multiple linear regression models in R.
  2. Use of the screenreg() command to compare differently specified multiple regression models.
  3. Interpretation of categorical variables, interactions between variables and non-linear functions of variables.
  4. \(R^2\) and adjusted-\(R^2\)

Before coming to the seminar

  1. Please read chapter 4, “Prediction” in Quantitative Social Science: An Introduction

5.1.1 Installing and loading packages

This week we will be using some additional functions that do not come preinstalled with R. There are many additional packages for R for many different types of quantitative analysis. Today we will be using the texreg package, which provides helpful functions for presenting the output of regression models.

To get started, you will need to install this package on whatever computer you are using. Note: you only need to install a package on a computer once. Do not run this code every time you run your R script!

install.packages("texreg")

Once installed, you can load the package using the library() function.

library(texreg)

Now you will be able to access the functions you need for the seminar and homework.

5.2 Seminar

Why do the majority of voters in the U.S. and other developed countries oppose increased immigration? According to the conventional wisdom and many economic theories, people simply do not want to face additional competition on the labor market (economic threat hypothesis). Nonetheless, most comprehensive empirical tests have failed to confirm this hypothesis and it appears that people often support policies that are against their personal economic interest. At the same time, there has been growing evidence that immigration attitudes are rather influenced by various deep-rooted ethnic and cultural stereotypes (cultural threat hypothesis). Given the prominence of workers’ economic concerns in the political discourse, how can these findings be reconciled?

This exercise is based in part on Malhotra, N., Margalit, Y. and Mo, C.H., 2013. “Economic Explanations for Opposition to Immigration: Distinguishing between Prevalence and Conditional Impact.” American Journal of Political Science, Vol. 57, No. 2, pp. 391-410.

The authors argue that, while job competition is not a prevalent threat and therefore may not be detected by aggregating survey responses, its conditional impact in selected industries may be quite sizable. To test their hypothesis, they conduct a unique survey of Americans’ attitudes toward H-1B visas. A plurality of H-1B visas in the US are granted to Indian immigrants, who are high skilled but ethnically distinct, which enables the authors to measure a specific skill set (high technology) that is threatened by a particular type of immigrant (H-1B visa holders). The data set immig.csv has the following variables:

Name             Description
age              Age (in years)
female           1 indicates female; 0 indicates male
employed         1 indicates employed; 0 indicates unemployed
nontech.whitcol  1 indicates non-tech white-collar work (e.g., law)
tech.whitcol     1 indicates high-technology work
expl.prejud      Explicit negative stereotypes about Indians (continuous scale, 0-1)
impl.prejud      Implicit bias against Indian Americans (continuous scale, 0-1)
h1bvis.supp      Support for increasing H-1B visas (5-point scale, 0-1)
indimm.supp      Support for increasing Indian immigration (5-point scale, 0-1)

The main outcome of interest (h1bvis.supp) was measured with the following survey item: “Some people have proposed that the U.S. government should increase the number of H-1B visas, which are allowances for U.S. companies to hire workers from foreign countries to work in highly skilled occupations (such as engineering, computer programming, and high-technology). Do you think the U.S. should increase, decrease, or keep about the same number of H-1B visas?” Another outcome (indimm.supp) similarly asked about “the number of immigrants from India.” Both variables have the following response options: 0 = “decrease a great deal”, 0.25 = “decrease a little”, 0.5 = “keep about the same”, 0.75 = “increase a little”, 1 = “increase a great deal”.

To measure explicit stereotypes (expl.prejud), respondents were asked to evaluate Indians on a series of traits: capable, polite, hardworking, hygienic, and trustworthy. All responses were then used to create a scale ranging from 0 (only positive traits of Indians) to 1 (no positive traits of Indians). Implicit bias (impl.prejud) is measured via the Implicit Association Test (IAT), an experimental method designed to gauge the strength of associations linking social categories (e.g., European vs Indian American) to evaluative anchors (e.g., good vs bad). Individuals who are prejudiced against Indians should be quicker at classifying faces and words when European American is paired with good and Indian American with bad than when European American is paired with bad and Indian American with good.

Save the dataset to the same location on your computer as you have done in previous weeks, and then load it into an object called immig as below:

immig <- read.csv("data/immig.csv")

Question 1

Start by examining the distribution of immigration attitudes using table() and prop.table(). What is the proportion of people who are willing to increase the quota for high-skilled foreign professionals (h1bvis.supp) or support immigration from India (indimm.supp)?

Reveal answer

prop.table(table(immig$h1bvis.supp))
##          0       0.25        0.5       0.75          1 
## 0.30748663 0.22727273 0.29857398 0.10249554 0.06417112 
prop.table(table(immig$indimm.supp))
##          0       0.25        0.5       0.75          1 
## 0.28787879 0.18092692 0.39839572 0.09982175 0.03297683

About half of all voters would like to decrease the number of H-1B visas and Indian immigration (30+23=53% and 29+18=47% respectively) and about a third would like to maintain the status quo (30% and 40%). At the same time, only a minority of voters would like to see immigration increased (10+6=16% and 10+3=13%).
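The percentages quoted above come from adding up adjacent response categories. As a minimal sketch, the proportions printed in the output can be collapsed into "decrease", "same" and "increase" shares as follows. The helper name collapse_shares() is made up for illustration and is not part of the seminar code.

```r
# Proportions copied from the prop.table() output above
h1b <- c("0" = 0.30748663, "0.25" = 0.22727273, "0.5" = 0.29857398,
         "0.75" = 0.10249554, "1" = 0.06417112)
ind <- c("0" = 0.28787879, "0.25" = 0.18092692, "0.5" = 0.39839572,
         "0.75" = 0.09982175, "1" = 0.03297683)

# Hypothetical helper: collapse the five response options into three shares
collapse_shares <- function(p) {
  c(decrease = sum(p[c("0", "0.25")]),
    same     = unname(p["0.5"]),
    increase = sum(p[c("0.75", "1")]))
}

# One row per outcome, one column per collapsed category
shares <- rbind(h1bvis.supp = collapse_shares(h1b),
                indimm.supp = collapse_shares(ind))
round(shares, 2)
```

Summing the unrounded proportions and rounding at the end can differ by a percentage point from adding already-rounded figures, which is worth keeping in mind when reporting shares.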

Now compare the distribution of two distinct measures of cultural threat: explicit stereotyping about Indians (expl.prejud) and implicit bias against Indian Americans (impl.prejud). First create a scatterplot with explicit prejudice on the x-axis and implicit bias on the y-axis, then add a linear regression line to it and calculate the correlation coefficient. Based on these results, what can you say about their relationship?

Reveal answer

plot(immig$expl.prejud,immig$impl.prejud, xlab = "Explicit prejudice", ylab = "Implicit prejudice")
fit1 <- lm(impl.prejud ~ expl.prejud , data = immig)
abline(fit1, col = "red")
legend(x='topright', bty = "n", legend=paste('Cor =', round(cor(immig$expl.prejud, immig$impl.prejud, use = "complete.obs"), 2)))

The scatterplot shows that people of low or moderate explicit prejudice are more or less equally likely to have any level of implicit prejudice. However, almost none of the respondents of high explicit prejudice have low implicit prejudice. Overall, while the relationship between implicit and explicit prejudice is positive (as can be expected), the correlation coefficient is very low (<0.1) which may indicate that these are in fact two distinct attitudes. (Or alternatively, that one or both of them are poorly measured!)
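The cor() call above passes use = "complete.obs". The following sketch, which uses a few made-up values rather than the survey data, illustrates what that argument does when one of the variables has a missing entry.

```r
# Made-up values with one missing entry (not the survey data)
x <- c(0.1, 0.4, NA, 0.8, 0.5)
y <- c(0.2, 0.3, 0.6, 0.9, 0.4)

# By default a single missing value makes the whole correlation NA
cor_default <- cor(x, y)

# use = "complete.obs" drops incomplete pairs before computing the correlation
cor_complete <- cor(x, y, use = "complete.obs")

# The same result by hand: keep only pairs where both values are observed
keep <- complete.cases(x, y)
cor_manual <- cor(x[keep], y[keep])

c(default = cor_default, complete = cor_complete, manual = cor_manual)
```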

Question 2

Compute the correlations between all four policy attitude and cultural threat measures. Is cultural threat an important predictor of immigration attitudes as claimed in the literature?

Reveal answer

cor(immig[c("expl.prejud", "impl.prejud", "h1bvis.supp", "indimm.supp")], use = "complete.obs")
##             expl.prejud impl.prejud h1bvis.supp indimm.supp
## expl.prejud  1.00000000  0.06612533  -0.1612061  -0.3221206
## impl.prejud  0.06612533  1.00000000  -0.1133121  -0.1296632
## h1bvis.supp -0.16120610 -0.11331209   1.0000000   0.6107116
## indimm.supp -0.32212057 -0.12966322   0.6107116   1.0000000

Both measures of cultural threat are negatively related (from -0.11 to -0.32) to both measures of immigration support. The correlation is particularly strong (-0.32) when it comes to the link between stereotypes about Indians and attitudes toward Indian immigration.

If the labor market hypothesis is correct, opposition to H-1B visas should also be more pronounced among those who are economically threatened by this policy, such as individuals in the high-technology sector. At the same time, tech workers should not be more or less opposed to general Indian immigration because of any economic considerations. Separately fit regressions predicting H-1B and Indian immigration attitudes on the indicator variable for tech workers (tech.whitcol). Do the results support the hypothesis? Is the relationship different from the one involving cultural threat and, if so, how?

Reveal answer

lm(h1bvis.supp ~ tech.whitcol, data = immig)
lm(indimm.supp ~ tech.whitcol, data = immig)
## Call:
## lm(formula = h1bvis.supp ~ tech.whitcol, data = immig)
## Coefficients:
##  (Intercept)  tech.whitcol  
##      0.34995      -0.05334  
## Call:
## lm(formula = indimm.supp ~ tech.whitcol, data = immig)
## Coefficients:
##  (Intercept)  tech.whitcol  
##     0.352540     -0.005082

Overall, the results provide some support for the labor market hypothesis. As expected, tech workers are slightly more opposed to H-1B visas than those who are not tech workers (by about 0.05 on the 0-1 scale), but they are essentially no more opposed to Indian immigration in general (a difference of about 0.005). This pattern also contrasts with that of cultural threat, which is negatively related to both measures of immigration attitudes.
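When the only explanatory variable is a 0/1 indicator, the intercept is the mean outcome in the group coded 0 and the slope is the difference in means between the two groups. A minimal sketch with made-up data (the data frame toy and its values are purely illustrative):

```r
# Made-up data: support scores for six non-tech and four tech respondents
toy <- data.frame(
  support = c(0.5, 0.25, 0.5, 0, 0.75, 0.25, 0.25, 0, 0.5, 0.25),
  tech    = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
)

fit <- lm(support ~ tech, data = toy)

# Group means computed directly
mean_nontech <- mean(toy$support[toy$tech == 0])
mean_tech    <- mean(toy$support[toy$tech == 1])

# The two vectors below should agree:
# intercept = mean of the 0 group, slope = difference in means
coef(fit)
c(mean_nontech, mean_tech - mean_nontech)
```

Applied to the H-1B output above, this reading gives a mean support of 0.34995 among respondents who are not tech workers and 0.34995 - 0.05334 = 0.29661 among tech workers.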

Question 3

When examining hypotheses, it is always important to have an appropriate comparison group. One might argue that comparing tech workers to everybody else as we did in Question 2 may be problematic due to a variety of confounding variables (such as skill level and employment status). To address this concern, we are going to create a single categorical “factor” variable group which takes a value of Tech WC if someone is employed in tech, Non-tech WC if someone is employed in other “white-collar” jobs (such as law or finance), Other workers if someone is employed in any other sector, and Unemployed if someone is unemployed.

# create factor variable with four named levels, missing for all respondents
immig$group <- factor(NA,levels=c("Tech WC", "Non-tech WC", "Other workers", "Unemployed"))

# fill in values based on existing dummy variables
immig$group[immig$tech.whitcol==1 & immig$nontech.whitcol==0 & immig$employed==1] <- "Tech WC"
immig$group[immig$tech.whitcol==0 & immig$nontech.whitcol==1 & immig$employed==1] <- "Non-tech WC"
immig$group[immig$tech.whitcol==0 & immig$nontech.whitcol==0 & immig$employed==1] <- "Other workers"
immig$group[immig$employed==0] <- "Unemployed"

# count the respondents in each group
table(immig$group)
##       Tech WC   Non-tech WC Other workers    Unemployed 
##            59            58           471           534

Next we will compare support for H-1B visas across these groups by using a linear regression predicting h1bvis.supp using group.

lm(h1bvis.supp ~ group, data = immig)
## Call:
## lm(formula = h1bvis.supp ~ group, data = immig)
## Coefficients:
##        (Intercept)    groupNon-tech WC  groupOther workers     groupUnemployed  
##            0.29661             0.09563             0.04946             0.05217

Interpret the coefficients on all the variables in the model. Is this comparison more or less supportive of the labor market hypothesis than the one in Question 2?

Reveal answer

Overall, the results corroborate the labor market hypothesis: economic threat appears to be even more predictive of H-1B support when we compare tech workers to other high-skilled workers as opposed to people in general. The average support level is 0.297 among tech workers, who are the omitted group from the categorical variable and are thus described by the intercept. In contrast, support for H-1B visas is 0.096 higher among non-tech high-skilled workers, and about 0.05 higher among both low-skilled workers and the unemployed.
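The omitted group is simply the first level of the factor, and each coefficient is a difference from that baseline. The sketch below uses made-up data with the same level names to illustrate this, and shows how relevel() could be used to pick a different baseline. The variable group2 and all the values are hypothetical.

```r
# Made-up data using the same four level names as the seminar variable
toy <- data.frame(
  support = c(0.25, 0.25, 0.5, 0.5, 0.5, 0.75, 0.25, 0.5, 0.25, 0),
  group = factor(
    c("Tech WC", "Tech WC", "Non-tech WC", "Non-tech WC",
      "Other workers", "Other workers", "Other workers",
      "Unemployed", "Unemployed", "Unemployed"),
    levels = c("Tech WC", "Non-tech WC", "Other workers", "Unemployed")
  )
)

# Mean support in each group, computed directly
group_means <- tapply(toy$support, toy$group, mean)

# Baseline is the first level ("Tech WC"): the intercept is its mean,
# and every other coefficient is that group's mean minus the baseline mean
fit_tech <- lm(support ~ group, data = toy)
coef(fit_tech)

# Changing the reference category changes the comparisons, not the fit
toy$group2 <- relevel(toy$group, ref = "Unemployed")
fit_unemp <- lm(support ~ group2, data = toy)
coef(fit_unemp)
```

Both models produce identical fitted values; only the group that the others are compared against changes.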

Question 4

Those who work in the tech sector are disproportionately young and male. To account for this possibility, we fit another linear regression but also include age and female as pre-treatment covariates (in addition to group).

lm(h1bvis.supp ~ group + female + age, data = immig)
## Call:
## lm(formula = h1bvis.supp ~ group + female + age, data = immig)
## Coefficients:
##        (Intercept)    groupNon-tech WC  groupOther workers     groupUnemployed  
##            0.43338             0.13127             0.07598             0.08984  
##             female                 age  
##           -0.07536            -0.00248

It is often useful to be able to compare the regression coefficients from two models with overlapping sets of variables side by side, rather than by scrolling between two different sets of output. There are many functions in R that enable us to make side-by-side comparisons of multiple models. Here we will use screenreg() from the texreg package:

lmfit1 <- lm(h1bvis.supp ~ group, data = immig)
lmfit2 <- lm(h1bvis.supp ~ group + female + age, data = immig)
screenreg(list(lmfit1, lmfit2))

## ============================================
##                     Model 1      Model 2    
## --------------------------------------------
## (Intercept)            0.30 ***     0.43 ***
##                       (0.04)       (0.05)   
## groupNon-tech WC       0.10         0.13 *  
##                       (0.06)       (0.06)   
## groupOther workers     0.05         0.08    
##                       (0.04)       (0.04)   
## groupUnemployed        0.05         0.09 *  
##                       (0.04)       (0.04)   
## female                             -0.08 ***
##                                    (0.02)   
## age                                -0.00 ***
##                                    (0.00)   
## --------------------------------------------
## R^2                    0.00         0.03    
## Adj. R^2              -0.00         0.02    
## Num. obs.           1122         1122       
## ============================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

If you compare these results to the previous regression output from Question 3, you will see that the differences between tech workers and everyone else are similar, but slightly larger once we control for age and female.

Question 5

Next, we fit a linear regression model with all threat indicators (group, expl.prejud, impl.prejud) and compare it to the model with just group.

lmfit3 <- lm(h1bvis.supp ~ group + impl.prejud + expl.prejud, data = immig)
screenreg(list(lmfit1, lmfit3))
## ===========================================
##                     Model 1      Model 2   
## -------------------------------------------
## (Intercept)            0.30 ***    0.51 ***
##                       (0.04)      (0.06)   
## groupNon-tech WC       0.10        0.10    
##                       (0.06)      (0.06)   
## groupOther workers     0.05        0.05    
##                       (0.04)      (0.05)   
## groupUnemployed        0.05        0.04    
##                       (0.04)      (0.05)   
## impl.prejud                       -0.19 ** 
##                                   (0.06)   
## expl.prejud                       -0.26 ***
##                                   (0.06)   
## -------------------------------------------
## R^2                    0.00        0.04    
## Adj. R^2              -0.00        0.03    
## Num. obs.           1122         895       
## ===========================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Note that the bottom of the table shows the \(R^2\) of each model, and all of them are small. The model with all economic and cultural threat factors explains only about 4% of the variation in H-1B visa support. In other words, even after accounting for one’s sector of employment, as well as implicit and explicit prejudice, most of the variation in immigration attitudes remains unexplained. Consequently, we can conclude that, while both are predictive, neither cultural nor economic threat comes close to determining immigration policy preferences.
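To make the two fit statistics at the bottom of the table concrete, here is a minimal sketch that computes \(R^2\) and adjusted \(R^2\) by hand and compares them with the values R reports. The data are simulated for illustration only, and the variable names are made up.

```r
set.seed(1)

# Simulated data: y depends weakly on x1 and not at all on x2
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- 0.5 - 0.2 * x1 + rnorm(n, sd = 0.3)

fit <- lm(y ~ x1 + x2)

rss <- sum(resid(fit)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
k   <- length(coef(fit)) - 1    # number of predictors, excluding the intercept

# R^2: share of the variation in y accounted for by the model
r2 <- 1 - rss / tss

# Adjusted R^2: penalises R^2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Compare with the values stored in the model summary
c(r2 = r2, from_summary = summary(fit)$r.squared)
c(adj_r2 = adj_r2, from_summary = summary(fit)$adj.r.squared)
```

Because of the penalty, adjusted \(R^2\) is never larger than \(R^2\) and can fall slightly below zero when the predictors explain almost nothing, which is how a value such as -0.00 can appear in a regression table.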

Question 6

Besides economic and cultural threat, some scholars also argue that gender is an important predictor of immigration attitudes. While there is some evidence that women are slightly less opposed to immigration than men, it may also be true that gender conditions the effect of other factors such as cultural threat. To see if it is indeed the case, we fit a linear regression of H-1B support on the interaction between gender and implicit prejudice.

lmfit4 <- lm(h1bvis.supp ~ impl.prejud + female, data = immig)
lmfit5 <- lm(h1bvis.supp ~ impl.prejud + female + impl.prejud*female, data = immig)
screenreg(list(lmfit4, lmfit5))
## ============================================
##                     Model 1      Model 2    
## --------------------------------------------
## (Intercept)            0.51 ***     0.60 ***
##                       (0.04)       (0.05)   
## impl.prejud           -0.21 ***    -0.37 ***
##                       (0.06)       (0.09)   
## female                -0.07 ***    -0.21 ** 
##                       (0.02)       (0.07)   
## impl.prejud:female                  0.26 *  
##                                    (0.12)   
## --------------------------------------------
## R^2                    0.03         0.03    
## Adj. R^2               0.02         0.03    
## Num. obs.           1009         1009       
## ============================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

How do we interpret the value of the interaction coefficient in the above regression?

Reveal answer

The non-zero coefficient on the interaction between implicit prejudice and being female means that support for H-1B visas has a different relationship to implicit prejudice for men and for women. To understand the meaning of the interaction coefficient, it is best to consider it in relation to the other coefficients in the model. Here, we are interested in whether the association between implicit prejudice and support for H-1B visas is different for men and women. For men, the coefficient describing this association is -0.3692304. For women, the corresponding association is described by the sum of the coefficient for men (-0.3692304) and the interaction coefficient (0.2558153), which is -0.1134151. So the relationship between implicit prejudice and support for H-1B visas is stronger (more negative) for men than for women.
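The arithmetic above can be written out directly. The second half of the sketch below checks the same logic on simulated data: in a model with an interaction, the slope for women implied by the coefficients should match the slope from a regression fitted to women only. The simulated data frame toy, its variable names and the coefficient values used to generate it are all made up for illustration.

```r
# Coefficients as reported in the answer above
b_impl <- -0.3692304   # slope on implicit prejudice for men (female = 0)
b_int  <-  0.2558153   # interaction: change in that slope for women

slope_men   <- b_impl
slope_women <- b_impl + b_int
c(men = slope_men, women = slope_women)

# Check of the same logic on simulated data (not the survey data)
set.seed(42)
n   <- 400
toy <- data.frame(impl = runif(n), female = rbinom(n, 1, 0.5))
toy$support <- 0.6 - 0.4 * toy$impl - 0.2 * toy$female +
  0.25 * toy$impl * toy$female + rnorm(n, sd = 0.2)

# impl * female expands to impl + female + impl:female
fit_int   <- lm(support ~ impl * female, data = toy)
fit_women <- lm(support ~ impl, data = toy, subset = female == 1)

# Slope for women implied by the interaction model...
implied_women <- unname(coef(fit_int)["impl"] + coef(fit_int)["impl:female"])
# ...and the slope from fitting the model to women only
direct_women <- unname(coef(fit_women)["impl"])
c(implied = implied_women, direct = direct_women)
```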

To see this visually, we create a plot with the predicted level of H-1B support (y-axis) across the range of implicit bias (x-axis) by gender.

h1b.prej.male <- data.frame(impl.prejud = seq(from = 0, to = 1, by = .01), female = 0)
h1b.prej.female <- data.frame(impl.prejud = seq(from = 0, to = 1, by = .01), female = 1)
pred.h1b.prej.male <- predict(lmfit5, newdata = h1b.prej.male)
pred.h1b.prej.female <- predict(lmfit5, newdata = h1b.prej.female)

plot(x = seq(from = 0, to = 1, by = .01), y = pred.h1b.prej.male, type = "l", xlim = c(0, 1), ylim = c(0, 0.6), xlab = "Implicit prejudice", ylab = "Predicted H-1B support")
lines(x = seq(from = 0, to = 1, by = .01), y = pred.h1b.prej.female, lty = "dashed") 
text(0.6, 0.45, "Male") 
text(0.3, 0.3, "Female")

Considering the results, we can conclude that the relationship between H-1B support and implicit bias (cultural threat) is indeed conditional on gender. Women are on average more opposed to this type of immigration, but the link between cultural threat and H-1B support is also weaker among women than men (which is indicated by a less steep regression slope). As a result, according to the model with interaction effects, we see gender differences in immigration support among unprejudiced but not highly prejudiced voters.
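The last point can be checked with a little arithmetic: in the interaction model, the predicted female-minus-male difference at a given level of implicit prejudice is the female coefficient plus the interaction coefficient times that level. The sketch below uses the rounded female coefficient from the table (-0.21) together with the interaction coefficient quoted in the answer, so the results are approximate, and it ignores statistical uncertainty. The function name gap() is made up for illustration.

```r
b_female <- -0.21        # rounded value from the regression table
b_int    <-  0.2558153   # interaction coefficient quoted in the answer

# Predicted difference in H-1B support (women minus men) at prejudice level x
gap <- function(x) b_female + b_int * x

# Gap at low, middling and high implicit prejudice
gap(c(0, 0.5, 1))

# Level of implicit prejudice at which the two predicted lines cross
crossing <- -b_female / b_int
crossing
```

The gap is largest at zero implicit prejudice and shrinks towards zero as prejudice rises, with the two lines crossing at roughly 0.8 on the implicit prejudice scale.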

5.3 Homework

There is no homework associated with this seminar assignment, as you will be completing your midterm assessment.