5.2 Solutions

5.2.1 Exercise 1

Create a new file called assignment5.R in your PUBL0055 folder and write all the solutions in it.

In RStudio, go to the menu and select File > New File > R Script

Make sure to clear the environment and set the working directory.

rm(list = ls())
setwd("~/PUBL0055")

Go to the menu and select File > Save and name it assignment5.R

Next, we load all the packages we need for these exercises.

library(texreg)
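If texreg is not yet installed on your machine, install it once before loading it (you only need to do this once, not in every session):

install.packages("texreg")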

5.2.2 Exercise 2

Go to the QoG website and download the QoG Cross-Section Data and the QoG Standard Codebook.

There are two files to download:

  1. The QoG Cross-Section Data (in either .csv or Stata format)
  2. The QoG Standard Codebook

Make sure to save the dataset to your PUBL0055 folder. This is the same folder that you specified when setting the working directory above and where all the other datasets reside.

qog_data <- read.csv("qog_std_cs_jan18.csv")
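If you downloaded the Stata version of the data instead, you could load it with the haven package. A sketch, assuming the .dta file shares its name with the .csv version:

# alternative, only needed if you chose the Stata format
library(haven)
qog_data <- read_dta("qog_std_cs_jan18.dta")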

5.2.3 Exercise 3

Pretend that you are writing a research paper. Select one dependent variable that you want to explain with a statistical model.

Let’s check the dimensions of the dataset.

dim(qog_data)
[1]  194 1882

The dataset has 194 observations and 1882 variables.

We are interested in explaining the quality of government indicator, where larger values indicate better overall governance. Before we go on to check what predicts better governance, let’s rename the variable to something meaningful.

names(qog_data)[names(qog_data) == "icrg_qog"] <- "gov_quality"
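As an aside: with 1882 variables, you may not always remember a variable’s exact name. One way to search the column names is grep() (not required for this exercise):

# list every variable whose name contains "icrg"
grep("icrg", names(qog_data), value = TRUE)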

Now let’s take a look at the variable.

summary(qog_data$gov_quality) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
0.08333 0.38889 0.47222 0.52551 0.63310 0.97222      55 

There are 55 NA’s so let’s remove them first.

qog_data <- qog_data[ !is.na(qog_data$gov_quality), ]

The quality of government variable ranges from 0 to 1. We will re-scale it to run from 0 to 100 to make the results a little easier to interpret.

qog_data$gov_quality <- qog_data$gov_quality * 100
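A quick sanity check never hurts: after rescaling, the values should simply be the original ones multiplied by 100, i.e. running from about 8.3 to 97.2.

# optional check of the rescaled variable
summary(qog_data$gov_quality)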

5.2.4 Exercise 4

From Putnam’s Bowling Alone, we know the theory that social capital matters for successful democratic governance. We therefore choose the social capital indicator bti_sc, where larger values stand for more social capital. We expect better governance in countries where social capital is higher. Before we estimate the model, let’s rename the variable.

names(qog_data)[names(qog_data) == "bti_sc"] <- "social_capital"

We look at the variable and remove any NA’s.

summary(qog_data$social_capital)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00    4.00    5.00    5.21    6.00    9.00      34 
qog_data <- qog_data[ !is.na(qog_data$social_capital), ]
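We can check how many observations remain:

nrow(qog_data)
[1] 105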

Our sample size is down to 105 countries. This is still a decent sample size, but we would like to avoid losing too many observations. Smaller samples are a danger to inference: the increased variance means effects are less likely to be detected, and those that are detected are more likely to be overestimates.

model1 <- lm(gov_quality ~ social_capital, data = qog_data)
screenreg(model1)

==========================
                Model 1   
--------------------------
(Intercept)      26.00 ***
                 (3.74)   
social_capital    3.57 ***
                 (0.68)   
--------------------------
R^2               0.21    
Adj. R^2          0.20    
Num. obs.       105       
RMSE             12.37    
==========================
*** p < 0.001, ** p < 0.01, * p < 0.05

Firstly, the intercept is not meaningful in this case: it is the predicted quality of government for a country with a social capital score of 0, but social capital ranges from 1 to 9, so no such country exists.

The coefficient of social capital is positive and significant. Dividing the estimate by the standard error yields the t-value, which is 3.57/0.68 = 5.25. At a conventional alpha level of 0.05, we reject the null hypothesis when t is more extreme than \(\pm\) 1.96, which is clearly the case here. In line with our expectation, a one-point increase in social capital is associated with a 3.57-point increase in the quality of governance (on a 0 to 100 scale).
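If you would rather not do the division by hand, R reports the t-values directly. The full coefficient table of a fitted model can be extracted like this:

# columns: Estimate, Std. Error, t value, Pr(>|t|)
coef(summary(model1))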

Increasing social capital by one point of its scale is a substantial amount (the scale runs from 1 to 9, so one point is 1/8 or 12.5% of its range). So, if social capital increases by a substantial amount, does governance also go up by a substantial amount? Let’s check the range of quality of governance in our dataset.

summary(qog_data$gov_quality) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.333  36.111  44.444  44.592  53.009  86.111 

A 3.57-point change on a scale that runs from 8.33 to 86.11 in our sample is 3.57/(86.11 - 8.33), or 4.59% of the observed range. That is still a substantial change, considering that we are measuring how well governed an entire country is.

5.2.5 Exercise 5

Add some additional explanatory variables (no more than 5) to your model and justify the choice again.

We have not talked about this in the lecture, and it will not be a hurdle in the midterm. However, it is important that you keep an eye on the sample size. If the sample becomes too small, it is unlikely that you can make meaningful statements about the world. We therefore selected variables with an eye on both their missing values and their theoretical importance. If you see studies that are based on small samples, be suspicious.

Firstly, we use infant mortality as a general indicator of development, in the sense that it reflects whether basic health services can be provided. We expect that worse health conditions negatively affect the capacity to govern, because they point to a basic lack of resources. Secondly, we include a variable that measures conflict intensity, since violent conflict can impede the ability to govern. Finally, we add an index of press freedom. The press is often considered the fourth pillar of a state (besides the legislative, executive, and judicial branches); its function is to check the government, a role best fulfilled by a free press. This should lead to a more accountable government, which we conjecture increases the quality of government.

We rename the variables first.

names(qog_data)[names(qog_data)=="wef_imort"] <- "infant_mortality"
names(qog_data)[names(qog_data)=="fhp_score5"] <- "press_freedom"
names(qog_data)[names(qog_data)=="bti_ci"] <- "conflict_intensity"

Check for missing values.

summary(qog_data$infant_mortality) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2.30    9.05   15.90   26.59   41.05  117.40      10 
summary(qog_data$press_freedom)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.00   42.00   56.00   56.19   70.00   97.00 
summary(qog_data$conflict_intensity)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   4.000   4.829   6.000  10.000 

Only infant mortality has missing values in this sample, so we drop those observations.

qog_data <- qog_data[ !is.na(qog_data$infant_mortality), ]

Note: To make models comparable, they should be based on the exact same sample. Since we’ve dropped some observations because infant_mortality was missing, our dataset is no longer the same as the one we used to estimate our first regression model. So, after dropping the missing values, we have to estimate our first model again.

model1_rerun <- lm(gov_quality ~ social_capital, data = qog_data)

screenreg(list(model1, model1_rerun), custom.model.names = c("Model 1", "Model 1 (rerun)"))

===========================================
                Model 1     Model 1 (rerun)
-------------------------------------------
(Intercept)      26.00 ***  28.94 ***      
                 (3.74)     (4.12)         
social_capital    3.57 ***   3.18 ***      
                 (0.68)     (0.73)         
-------------------------------------------
R^2               0.21       0.17          
Adj. R^2          0.20       0.16          
Num. obs.       105         95             
RMSE             12.37      12.21          
===========================================
*** p < 0.001, ** p < 0.01, * p < 0.05

As you can see, the difference is not small. This is less than ideal: \(R^2\) drops by a lot (from 0.21 to 0.17) and the magnitude of the social capital coefficient also changes. There is not much you can do about this (without advanced statistics) except be transparent about the differences.
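For future reference: if you know all your model variables in advance, you can avoid this problem by dropping incomplete rows in a single step before estimating anything. A sketch using complete.cases() with the variable names as renamed above (shown for illustration; we have already dropped these rows step by step):

# keep only rows that are complete on every model variable
model_vars <- c("gov_quality", "social_capital", "infant_mortality",
                "press_freedom", "conflict_intensity")
qog_data <- qog_data[complete.cases(qog_data[, model_vars]), ]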

Now we add our new control variables to the model.

model2 <- lm(
  gov_quality ~ social_capital + infant_mortality + press_freedom + conflict_intensity, 
  data = qog_data
)

5.2.6 Exercise 6

Produce a regression table of the new model.

screenreg(list(model2))

==============================
                    Model 1   
------------------------------
(Intercept)          54.92 ***
                    (11.70)   
social_capital        1.06    
                     (1.13)   
infant_mortality     -0.26 ***
                     (0.05)   
press_freedom        -0.03    
                     (0.10)   
conflict_intensity   -1.24 *  
                     (0.60)   
------------------------------
R^2                   0.44    
Adj. R^2              0.41    
Num. obs.            95       
RMSE                 10.23    
==============================
*** p < 0.001, ** p < 0.01, * p < 0.05

5.2.7 Exercise 7

Interpret the output of the two models. State your expectations and whether they were met.

screenreg(list(model1_rerun, model2))

=========================================
                    Model 1    Model 2   
-----------------------------------------
(Intercept)         28.94 ***   54.92 ***
                    (4.12)     (11.70)   
social_capital       3.18 ***    1.06    
                    (0.73)      (1.13)   
infant_mortality                -0.26 ***
                                (0.05)   
press_freedom                   -0.03    
                                (0.10)   
conflict_intensity              -1.24 *  
                                (0.60)   
-----------------------------------------
R^2                  0.17        0.44    
Adj. R^2             0.16        0.41    
Num. obs.           95          95       
RMSE                12.21       10.23    
=========================================
*** p < 0.001, ** p < 0.01, * p < 0.05

Our re-run bivariate model explains 17% of the variance in the quality of government, which is not bad for a small model. By adding our three control variables, we increase the explained variance to 44%, a large gain. The new model seems to explain quality of government much better than the first.

The intercept is not meaningful in either model. The coefficient of social capital is no longer significant, so we do not find evidence for an effect of social capital once we control for basic health, press freedom, and conflict intensity. Press freedom also does not reach conventional levels of significance (its coefficient is less than 1.96 standard errors away from zero). We do find that the worse the health conditions, the less well governed a country is. Infant mortality is measured in deaths per 1000 live births; if that number increases by 10, the quality of government index decreases by 2.6 points (10 times the coefficient of -0.26), which corresponds to roughly 3.61% of its range.

We also find evidence of a decline in quality of government as conflict intensity increases. The effect of conflict intensity is statistically significant at the 95% level. Conflict intensity is measured on a scale from 1 to 10, so a one-unit increase in the conflict_intensity variable corresponds to roughly 11% of its range; it is associated with a decrease of 1.24 points in quality of government, which is equivalent to 1.71% of its range.
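If you want to verify these range shares in R rather than by hand, one way is to compare the estimated effects from Model 2 to the observed range of the dependent variable:

# observed range (max - min) of quality of government in the estimation sample
gov_range <- diff(range(qog_data$gov_quality))
100 * (-0.26 * 10) / gov_range  # infant mortality: +10 deaths per 1000, in percent
100 * -1.24 / gov_range         # conflict intensity: +1 scale point, in percent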

5.2.8 Exercise 8

We have two significant variables. It does not make sense to vary the insignificant variables because we cannot reject the null hypothesis that their effect is zero. We pick one of the two statistically significant variables (infant mortality in this case) and vary it from its minimum to its maximum, holding the other covariates at their means. You could do the same for the other significant variable if you wanted to.

summary(qog_data$infant_mortality)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.30    9.05   15.90   26.59   41.05  117.40 

We set our covariates, holding everything except infant mortality at its mean.

covariates <- data.frame(
  social_capital = mean(qog_data$social_capital), 
  infant_mortality = seq(from = 2, to = 120, by = 1),
  press_freedom = mean(qog_data$press_freedom),
  conflict_intensity = mean(qog_data$conflict_intensity)
)

Check that everything looks okay.

head(covariates)
  social_capital infant_mortality press_freedom conflict_intensity
1       5.389474                2      54.37895           4.705263
2       5.389474                3      54.37895           4.705263
3       5.389474                4      54.37895           4.705263
4       5.389474                5      54.37895           4.705263
5       5.389474                6      54.37895           4.705263
6       5.389474                7      54.37895           4.705263
covariates$predicted_quality <- predict(model2, covariates)

Now let’s check our covariates data.frame again.

head(covariates)
  social_capital infant_mortality press_freedom conflict_intensity
1       5.389474                2      54.37895           4.705263
2       5.389474                3      54.37895           4.705263
3       5.389474                4      54.37895           4.705263
4       5.389474                5      54.37895           4.705263
5       5.389474                6      54.37895           4.705263
6       5.389474                7      54.37895           4.705263
  predicted_quality
1          52.48832
2          52.22792
3          51.96752
4          51.70712
5          51.44672
6          51.18632

5.2.9 Exercise 10

Plot the result.

plot(
  predicted_quality ~ infant_mortality, data = covariates,
  type = "l",
  frame.plot = FALSE,
  main = "Effect of infant mortality on quality of government",
  ylab = "Quality of government",
  xlab = "Infant mortality per 1000 live births"
)

5.2.10 Exercise 11

Interpret the plot.

The plot illustrates a strong negative relationship between infant mortality, our proxy for basic health and resource conditions, and the quality of government. At 20 deaths per 1000 live births, we predict quality of government at roughly 48. At 100 deaths per 1000 live births, the prediction drops to around 27. Quality of government thus decreases by 21 points, which amounts to a 43.75% decline relative to the first prediction, a substantial change.
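You can read these two values off the plot, or pull them directly from the covariates data.frame we built above:

# predicted quality of government at 20 and 100 deaths per 1000 live births
covariates$predicted_quality[covariates$infant_mortality %in% c(20, 100)]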

5.2.11 Extra Info (not required for midterm)

Below, we will show you how you could illustrate the confidence interval around the prediction. This is not necessary for the midterm but useful to know. We can tell the predict() function that we’re interested in the standard errors using the se.fit = TRUE argument.

predicted_quality <- predict(model2, newdata = covariates, se.fit = TRUE)

The values returned from the predict() function are not just the fitted values but also include the standard errors. Let’s take a look at what we get back from predict() when using the se.fit option.

names(predicted_quality)
[1] "fit"            "se.fit"         "df"             "residual.scale"

The fit element, as the name suggests, contains the fitted values, and se.fit contains the standard errors. We now have everything we need to calculate confidence intervals for our fitted values.

covariates$predicted_quality <-  predicted_quality$fit

covariates$predicted_quality_lower_bound <- predicted_quality$fit - (1.96 * predicted_quality$se.fit)
covariates$predicted_quality_upper_bound <- predicted_quality$fit + (1.96 * predicted_quality$se.fit)
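As a shortcut, predict() can compute these bounds for you: setting interval = "confidence" returns a matrix with columns fit, lwr, and upr. The bounds differ very slightly from the manual calculation because predict() uses the exact t distribution rather than 1.96.

# equivalent shortcut
ci <- predict(model2, newdata = covariates, interval = "confidence")
head(ci)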

Now, we can draw confidence intervals.

plot(
  predicted_quality ~ infant_mortality, data = covariates,
  type = "l",
  frame.plot = FALSE,
  main = "Effect of infant mortality on quality of government",
  ylab = "Quality of government",
  xlab = "Infant mortality per 1000 live births"
)

lines(predicted_quality_lower_bound ~ infant_mortality, 
      data = covariates, 
      lty = "dashed")

lines(predicted_quality_upper_bound ~ infant_mortality, 
      data = covariates, 
      lty = "dashed")
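If you prefer a shaded band over dashed lines, base R’s polygon() can fill the area between the two bounds. This is a purely optional sketch; run it after the plot() call above and then redraw the prediction line so it sits on top:

# shaded confidence band instead of dashed lines
polygon(
  x = c(covariates$infant_mortality, rev(covariates$infant_mortality)),
  y = c(covariates$predicted_quality_lower_bound,
        rev(covariates$predicted_quality_upper_bound)),
  col = adjustcolor("grey", alpha.f = 0.5),
  border = NA
)
# redraw the prediction line on top of the shading
lines(predicted_quality ~ infant_mortality, data = covariates)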

This was purely for your enjoyment and you won’t be tested on it. It just illustrates the uncertainty of the predictions.