7.2 Solutions

7.2.1 Part 1

7.2.1.1 Exercise 1

Create a new file called assignment7.R in your PUBL0055 folder and write all the solutions in it.

In RStudio, go to the menu and select File > New File > R Script

Make sure to clear the environment and set the working directory.

rm(list = ls())
setwd("~/PUBL0055")

Go to the menu and select File > Save and name it assignment7.R

Next, we load all the packages we need for these exercises.

library(texreg)
library(lmtest) 
library(sandwich)
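
If these packages are not installed yet, a one-time call to install.packages() fetches them; a minimal sketch:

# one-time setup, only needed if the packages are missing
install.packages(c("texreg", "lmtest", "sandwich"))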

7.2.1.2 Exercise 2

Load the chechen.csv dataset from your PUBL0055 folder.

chechen_data <- read.csv("chechen.csv")

7.2.1.3 Exercise 3

Run a bivariate regression model with diff as the dependent variable and treat as the predictor.

model1 <- lm(diff ~ treat, data = chechen_data)

screenreg(model1)

======================
             Model 1  
----------------------
(Intercept)   -0.10   
              (0.12)  
treat         -0.52 **
              (0.17)  
----------------------
R^2            0.03   
Adj. R^2       0.02   
Num. obs.    318      
RMSE           1.54   
======================
*** p < 0.001, ** p < 0.01, * p < 0.05

The effect of Russian artillery shelling is significant and negative. Therefore, it seems that indiscriminate violence in this case reduces insurgent violence. There are many potential confounders, however, that we will control for in the next exercise.

The interpretation of the magnitude of the effect is as follows: the dependent variable is the change in the number of insurgent attacks, and the independent variable is binary. Villages that were bombarded by Russian artillery therefore saw, on average, about half an attack fewer afterwards.
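
Because treat is binary, the coefficient is simply the difference in mean diff between shelled and unshelled villages; a quick check, assuming treat is coded 0/1:

# mean change in attacks by treatment status; the difference should
# reproduce the treat coefficient of roughly -0.52
group_means <- tapply(chechen_data$diff, chechen_data$treat, mean)
group_means["1"] - group_means["0"]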

7.2.1.4 Exercise 4

Control for additional variables. Discuss your choice of variables and refer to omitted variable bias.

We control for the number of attacks before the village was bombarded (pret). We reason that villages that already saw more insurgent activity were more likely to be bombarded, and might see more insurgent activity after bombardment as well. Furthermore, we control for logged elevation (lelev) because attacks may be more likely in remote mountainous terrain, and it may have been harder for the Russians to control such areas. By a similar argument we control for groznyy and lpop2000, where the reasoning is that urban areas may see more insurgent activity and are harder to control. We also control for swept to account for the possibility that previous sweeps in the area reduced the likelihood of subsequent shelling and insurgent activity.

model2 <- lm(
  diff ~ treat + pret + lelev + groznyy + lpop2000 + swept + factor(disno), 
  data = chechen_data
)

screenreg(list(model1, model2))

======================================
                 Model 1    Model 2   
--------------------------------------
(Intercept)       -0.10      -0.93    
                  (0.12)     (1.64)   
treat             -0.52 **   -0.39 *  
                  (0.17)     (0.16)   
pret                         -0.28 ***
                             (0.03)   
lelev                         0.07    
                             (0.24)   
groznyy                       2.03 ** 
                             (0.67)   
lpop2000                      0.13    
                             (0.07)   
swept                        -0.02    
                             (0.19)   
factor(disno)3               -0.14    
                             (0.52)   
factor(disno)4                0.44    
                             (0.66)   
factor(disno)5               -0.09    
                             (1.11)   
factor(disno)6                0.08    
                             (0.59)   
factor(disno)7               -0.37    
                             (1.41)   
factor(disno)8               -0.74    
                             (0.78)   
factor(disno)9               -0.14    
                             (0.68)   
factor(disno)10              -0.49    
                             (0.51)   
factor(disno)11              -0.22    
                             (0.60)   
factor(disno)14               0.04    
                             (0.53)   
factor(disno)15              -0.14    
                             (0.56)   
--------------------------------------
R^2                0.03       0.32    
Adj. R^2           0.02       0.28    
Num. obs.        318        318       
RMSE               1.54       1.32    
======================================
*** p < 0.001, ** p < 0.01, * p < 0.05

7.2.1.5 Exercise 5

Check for non-linearity and, if there is any, solve the problem.

plot(
  chechen_data$pret, residuals(model2),
  pch = 20,
  frame.plot = FALSE,
  col = "LightSkyBlue"
)

We only have one continuous variable in our model that has not already been transformed to deal with non-linearity: the number of insurgent attacks before shelling, pret. The residual plot does not reveal a non-linear relationship.
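
Had the plot shown curvature, one common remedy would be to add a quadratic term for pret and compare the two specifications; a sketch:

# add a squared term for pret; a significant quadratic coefficient
# would point to a non-linear relationship
model2_sq <- update(model2, . ~ . + I(pret^2))
screenreg(list(model2, model2_sq))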

7.2.1.6 Exercise 6

Check for heteroskedasticity and, if there is any, solve the problem.

bptest(model2) 

    studentized Breusch-Pagan test

data:  model2
BP = 78.61, df = 17, p-value = 6.764e-10

The p-value is far below 0.05, so we reject the null hypothesis of homoskedasticity and re-estimate the standard errors with a heteroskedasticity-robust variance estimator.

model3 <- coeftest(model2, vcov. = vcovHC(model2, type = "HC1"))

screenreg(model3)

==========================
                 Model 1  
--------------------------
(Intercept)      -0.93    
                 (1.25)   
treat            -0.39 *  
                 (0.15)   
pret             -0.28 ***
                 (0.04)   
lelev             0.07    
                 (0.20)   
groznyy           2.03 *  
                 (1.02)   
lpop2000          0.13 *  
                 (0.05)   
swept            -0.02    
                 (0.23)   
factor(disno)3   -0.14    
                 (0.67)   
factor(disno)4    0.44    
                 (0.82)   
factor(disno)5   -0.09    
                 (0.75)   
factor(disno)6    0.08    
                 (0.77)   
factor(disno)7   -0.37    
                 (0.67)   
factor(disno)8   -0.74    
                 (0.84)   
factor(disno)9   -0.14    
                 (0.72)   
factor(disno)10  -0.49    
                 (0.70)   
factor(disno)11  -0.22    
                 (0.73)   
factor(disno)14   0.04    
                 (0.71)   
factor(disno)15  -0.14    
                 (0.71)   
==========================
*** p < 0.001, ** p < 0.01, * p < 0.05

7.2.1.7 Exercise 7

Thoroughly compare the models and interpret the outcome in substantive as well as statistical terms; do not forget to talk about model fit. What are the implications of this?

screenreg(list(model1, model2, model3))

=================================================
                 Model 1    Model 2     Model 3  
-------------------------------------------------
(Intercept)       -0.10      -0.93      -0.93    
                  (0.12)     (1.64)     (1.25)   
treat             -0.52 **   -0.39 *    -0.39 *  
                  (0.17)     (0.16)     (0.15)   
pret                         -0.28 ***  -0.28 ***
                             (0.03)     (0.04)   
lelev                         0.07       0.07    
                             (0.24)     (0.20)   
groznyy                       2.03 **    2.03 *  
                             (0.67)     (1.02)   
lpop2000                      0.13       0.13 *  
                             (0.07)     (0.05)   
swept                        -0.02      -0.02    
                             (0.19)     (0.23)   
factor(disno)3               -0.14      -0.14    
                             (0.52)     (0.67)   
factor(disno)4                0.44       0.44    
                             (0.66)     (0.82)   
factor(disno)5               -0.09      -0.09    
                             (1.11)     (0.75)   
factor(disno)6                0.08       0.08    
                             (0.59)     (0.77)   
factor(disno)7               -0.37      -0.37    
                             (1.41)     (0.67)   
factor(disno)8               -0.74      -0.74    
                             (0.78)     (0.84)   
factor(disno)9               -0.14      -0.14    
                             (0.68)     (0.72)   
factor(disno)10              -0.49      -0.49    
                             (0.51)     (0.70)   
factor(disno)11              -0.22      -0.22    
                             (0.60)     (0.73)   
factor(disno)14               0.04       0.04    
                             (0.53)     (0.71)   
factor(disno)15              -0.14      -0.14    
                             (0.56)     (0.71)   
-------------------------------------------------
R^2                0.03       0.32               
Adj. R^2           0.02       0.28               
Num. obs.        318        318                  
RMSE               1.54       1.32               
=================================================
*** p < 0.001, ** p < 0.01, * p < 0.05

The goal of the analysis was to test whether indiscriminate violence leads to more resistance. The variable of interest is therefore treat, which indicates whether a Chechen village was bombarded or not. In our initial model, we found a significant negative effect. This contradicts the theory, since the effect runs in the opposite direction.

We were concerned about confounding and included a number of additional variables in the model. We controlled for terrain, geographic areas within the country, urban areas, attacks before the artillery bombardment, and prior sweeps. The change in the coefficient is large: the difference is \(-0.52 - (-0.39) = -0.13\), which is 25% of the original effect. Although the effect decreased substantially once we control for confounders, it remains significant. Villages that were bombarded saw \(0.39\) fewer attacks, on average, than villages that were not.

In the model with heteroskedasticity-robust standard errors, pret, groznyy, and lpop2000 are also significant. Insurgent attacks were more numerous in Groznyy. Interestingly, the more attacks there were before the bombardment, the fewer attacks happened afterwards.

Model fit improves substantially, from an \(R^2\) of \(0.03\) in our first model to \(0.32\). Although the second model explains the outcome better, predictive fit is not our main concern; we are interested in the effect of indiscriminate violence. We prefer the larger model because it controls for potential confounders, so our estimate is less likely to be an overestimate.
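
We can also test formally whether the added controls jointly improve on the bivariate model. Since the Breusch-Pagan test in Exercise 6 rejected homoskedasticity, a robust Wald test is a natural choice; a sketch:

# joint test of the added controls, using the same robust
# variance estimator as in Exercise 6
waldtest(model1, model2, vcov = vcovHC(model2, type = "HC1"))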

7.2.2 Part 2

7.2.2.1 Exercise 1

Load who_voted_nazi_in_1932.csv.

nazi_data <- read.csv("who_voted_nazi_in_1932.csv")

7.2.2.2 Exercise 2

Run a bivariate model to test whether blue-collar voters voted for the Nazis.

model1 <- lm(sharenazis ~ shareblue, data = nazi_data)

screenreg(model1)

=======================
             Model 1   
-----------------------
(Intercept)   39.56 ***
              (1.66)   
shareblue      0.07    
              (0.05)   
-----------------------
R^2            0.00    
Adj. R^2       0.00    
Num. obs.    681       
RMSE          10.80    
=======================
*** p < 0.001, ** p < 0.01, * p < 0.05

7.2.2.3 Exercise 3

Our dataset does not contain much information. We only know the shares of different occupational groups in each district: white-collar workers, blue-collar workers, the self-employed, the domestically employed, and the unemployed. We control for these variables.

A model that includes only blue-collar workers omits information about the social makeup of the district that is necessarily related to the share of blue-collar workers. For example, if a district consists mostly of white-collar workers, then, all else equal, it contains fewer blue-collar workers. If shareblue is the only variable in our model, we do not know whether most of the other residents are unemployed, white-collar workers, and so on.

model2 <- lm(
  sharenazis ~ shareblue + sharewhite + shareself + sharedomestic, 
  data = nazi_data
)

7.2.2.4 Exercise 4

Can you fit a model using shareblue, sharewhite, shareself, sharedomestic, and shareunemployed as predictors? Why or why not?

Such a model would suffer from perfect multicollinearity, and whichever variable we entered last into our model would be dropped automatically. The shares of the five groups sum to 100%, so if we know the shares of four groups in a district, we necessarily know the share of the remaining group.

For example, if shareblue = 30, shareunemployed = 15, sharedomestic = 25, and shareself = 20, these four groups together make up 90% of the district. The remaining 10% must be white-collar workers.

Therefore, adding shareblue, shareunemployed, sharedomestic, and shareself to our model already tells us how many white-collar workers there are in a district. Were we to add the variable sharewhite, we would enter the same information twice, leading to perfect multicollinearity.
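
We can see this in R: if we try to fit the full model anyway, lm() detects the exact linear dependence and drops the last variable entered. A sketch, assuming the dataset contains shareunemployed as the exercise suggests:

# all five shares sum to 100, so the design matrix is rank-deficient;
# lm() reports NA for the coefficient on the last variable entered
model_full <- lm(
  sharenazis ~ shareblue + sharewhite + shareself + sharedomestic + shareunemployed,
  data = nazi_data
)
coef(model_full)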

7.2.2.5 Exercise 5

Check for heteroskedasticity and, if there is any, solve the problem.

bptest(model1)

    studentized Breusch-Pagan test

data:  model1
BP = 18.578, df = 1, p-value = 1.631e-05

The test rejects homoskedasticity for model 1, so we compute heteroskedasticity-robust standard errors.

model1_robust <- coeftest(model1, vcov. = vcovHC(model1))

bptest(model2)

    studentized Breusch-Pagan test

data:  model2
BP = 95.811, df = 4, p-value < 2.2e-16

Model 2 also shows strong evidence of heteroskedasticity, so we correct its standard errors as well.

model2_robust <- coeftest(model2, vcov. = vcovHC(model2))

screenreg(
  list(model1, model1_robust, model2, model2_robust),
  custom.model.names = c("Model 1", "Model 1 (robust)", "Model 2", "Model 2 (robust)")
)

========================================================================
               Model 1     Model 1 (robust) Model 2     Model 2 (robust)
------------------------------------------------------------------------
(Intercept)     39.56 ***  39.56 ***         -2.82      -2.82           
                (1.66)     (1.68)            (7.01)     (6.28)          
shareblue        0.07       0.07              0.57 ***   0.57 ***       
                (0.05)     (0.05)            (0.09)     (0.08)          
sharewhite                                    0.31 *     0.31 **        
                                             (0.13)     (0.11)          
shareself                                     1.14 ***   1.14 ***       
                                             (0.19)     (0.23)          
sharedomestic                                 0.08       0.08           
                                             (0.10)     (0.11)          
------------------------------------------------------------------------
R^2              0.00                         0.11                      
Adj. R^2         0.00                         0.10                      
Num. obs.      681                          681                         
RMSE            10.80                        10.24                      
========================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05

Correcting for heteroskedasticity does not substantively change our results: the coefficients are identical and the pattern of significance is essentially unchanged.

7.2.2.6 Exercise 6

Interpret your models.

screenreg(list(model1, model2))

=====================================
               Model 1     Model 2   
-------------------------------------
(Intercept)     39.56 ***   -2.82    
                (1.66)      (7.01)   
shareblue        0.07        0.57 ***
                (0.05)      (0.09)   
sharewhite                   0.31 *  
                            (0.13)   
shareself                    1.14 ***
                            (0.19)   
sharedomestic                0.08    
                            (0.10)   
-------------------------------------
R^2              0.00        0.11    
Adj. R^2         0.00        0.10    
Num. obs.      681         681       
RMSE            10.80       10.24    
=====================================
*** p < 0.001, ** p < 0.01, * p < 0.05

A prominent theory suggests that the Nazis received a lot of their support from blue-collar workers. In model 1, we only include the percentage of blue-collar workers in a district. The variable is insignificant, but we are likely suffering from omitted variable bias because we omit other aspects of the social makeup of the district that are related to the share of blue-collar workers.

We omit one category from our models to avoid perfect multicollinearity. In model 2, it appears that blue-collar workers did support the Nazis, more so than white-collar workers. The Nazis received most of their support from the self-employed, though.
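
To illustrate the magnitudes, we can predict the Nazi vote share for hypothetical district compositions; a sketch with invented shares:

# two invented district compositions (shares in percent):
# one mostly blue-collar, one mostly white-collar
new_districts <- data.frame(
  shareblue     = c(50, 20),
  sharewhite    = c(20, 50),
  shareself     = c(15, 15),
  sharedomestic = c(10, 10)
)
predict(model2, newdata = new_districts)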

7.2.3 Part 3

7.2.3.1 Exercise 1

Load the Massachusetts Test Scores dataset MA_Schools.csv from your PUBL0055 folder.

schools <- read.csv("MA_Schools.csv")

7.2.3.2 Exercise 2

Fit a model to explain the relationship between percentage of English learners and average 8th grade scores.

model1 <- lm(score8 ~ english, data = schools)

screenreg(model1)

=======================
             Model 1   
-----------------------
(Intercept)  702.99 ***
              (1.44)   
english       -3.64 ***
              (0.44)   
-----------------------
R^2            0.28    
Adj. R^2       0.28    
Num. obs.    180       
RMSE          17.89    
=======================
*** p < 0.001, ** p < 0.01, * p < 0.05

7.2.3.3 Exercise 3

Plot the model and the regression line.

plot(
  score8 ~ english,
  data = schools,
  pch = 20,
  frame.plot = FALSE,
  col = "LightSkyBlue"
)

abline(model1, col = "red")

7.2.3.4 Exercise 4

Check for correlation between the variables listed above to see if there are other variables that should be included in the model.

# drop districts with missing 8th-grade scores before computing correlations
schools <- schools[!is.na(schools$score8), ]

cor(schools[, c("score8", "stratio", "english", "lunch", "income")])
            score8    stratio    english      lunch     income
score8   1.0000000 -0.3164449 -0.5312991 -0.8337520  0.7765043
stratio -0.3164449  1.0000000  0.2126209  0.2875544 -0.2616973
english -0.5312991  0.2126209  1.0000000  0.6946009 -0.2510956
lunch   -0.8337520  0.2875544  0.6946009  1.0000000 -0.5738153
income   0.7765043 -0.2616973 -0.2510956 -0.5738153  1.0000000

7.2.3.5 Exercise 5

lunch is strongly correlated with both the percentage of English learners and the response variable, and income is strongly correlated with the response variable. We add both to our model.

model2 <- lm(score8 ~ english + lunch + income, data = schools)

7.2.3.6 Exercise 6

Compare the two models. Did the first model suffer from omitted variable bias?

screenreg(list(model1, model2))

===================================
             Model 1     Model 2   
-----------------------------------
(Intercept)  702.99 ***  678.59 ***
              (1.44)      (3.45)   
english       -3.64 ***   -0.25    
              (0.44)      (0.31)   
lunch                     -0.72 ***
                          (0.07)   
income                     1.70 ***
                          (0.15)   
-----------------------------------
R^2            0.28        0.83    
Adj. R^2       0.28        0.83    
Num. obs.    180         180       
RMSE          17.89        8.80    
===================================
*** p < 0.001, ** p < 0.01, * p < 0.05

The comparison between models 1 and 2 shows that the children's performance is related to poverty rather than language. The two are, however, correlated.
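
We can quantify the omitted variable bias directly. With a single included regressor, the bivariate coefficient on english equals its multiple-regression coefficient plus each omitted variable's coefficient times the slope from regressing that variable on english. A sketch, assuming all regressions use the same 180 districts:

# auxiliary regressions: association of the omitted variables with english
aux_lunch  <- lm(lunch ~ english, data = schools)
aux_income <- lm(income ~ english, data = schools)

# bias in the bivariate english coefficient; adding it to the
# model 2 coefficient (-0.25) should recover roughly -3.64
coef(model2)["lunch"] * coef(aux_lunch)["english"] +
  coef(model2)["income"] * coef(aux_income)["english"]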

7.2.3.7 Exercise 7

Plot the residuals against the fitted values from the second model to visually check for heteroskedasticity.

plot(
  fitted(model2), residuals(model2),
  pch = 20,
  frame.plot = FALSE,
  col = "LightSkyBlue"
)
  
abline(h=0)

The plot looks okay: the residuals are spread fairly evenly around zero across the range of fitted values. We would not conclude that the model violates the constant error variance assumption.

7.2.3.8 Exercise 8

Run the Breusch-Pagan Test to see whether the model suffers from heteroskedastic errors.

bptest(model2)

    studentized Breusch-Pagan test

data:  model2
BP = 2.3913, df = 3, p-value = 0.4953

The Breusch-Pagan test agrees: with a p-value of 0.4953 we cannot reject the null hypothesis of homoskedasticity, so the constant error variance assumption does not appear to be violated.

7.2.3.9 Exercise 9

Correct for heteroskedasticity (if needed) and present the results with corrected standard errors and p-values.

We do not need to do this because the constant error variance assumption is not violated.
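
For completeness, had the test rejected homoskedasticity, the correction would mirror what we did in Parts 1 and 2; a sketch:

# only necessary if bptest() had rejected homoskedasticity
model2_robust <- coeftest(model2, vcov. = vcovHC(model2, type = "HC1"))
screenreg(model2_robust)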