# 5 Multiple linear regression models (I)

## 5.1 Seminar

Create a new script with the following lines at the top, and save it as `seminar5.R`

```
rm(list = ls())
setwd("~/PUBL0055")
```

Next, we need to load a couple of packages. We need a package called `foreign`

for loading the dataset and `texreg`

for printing formatted regression tables.

```
library(foreign)
library(texreg)
```

### 5.1.1 Loading, Understanding and Cleaning our Data

Today, we load the full standard (cross-sectional) dataset from the Quality of Government Institute (this is a newer version that the one we used in week 3). This is a great data source for comparativist political science research. The codebook is available from their main website. You can also find time-series and cross-section data sets on this page.

The dataset is in Stata format (`.dta`

). Loading it requires the `foreign`

package and the `read.dta()`

function which operates similar to `read.csv()`

.

Let’s load the data set

`world_data <- read.dta("http://uclspp.github.io/datasets/data/qog_std_cs_jan15.dta")`

Check the dimensions of the dataset

`dim(world_data)`

`[1] 193 2037`

The dataset contains 2037 variables but we’re only interested in the following:

Variable | Description |
---|---|

wbgi_pse | A measure of political stability (larger values mean more stability) |

lp_lat_abst | Distance to the equator or latitude |

dr_ing | Index for the level of globalization |

ti_cpi | Transparency International’s Corruptions Perceptions Index (larger values mean better quality institutions, i.e. less corruption) |

br_dem | Factor variable stating whether the country is a democracy or not (with labels `1. Democracy` and `0. Dictatorship` ) |

Our dependent variable (also called response or outcome variable) is `wbgi_pse`

. We will rename the variables we care about to something meaningful.

CAUTION: When renaming variables, do not use spaces or special characters in the name. You can, however, use a period (`.`

) or underscore (`_`

) to make the names more readable.

```
names(world_data)[names(world_data) == "cname"] <- "country"
names(world_data)[names(world_data) == "wbgi_pse"] <- "pol_stability"
names(world_data)[names(world_data) == "lp_lat_abst"] <- "latitude"
names(world_data)[names(world_data) == "dr_ig"] <- "globalization"
names(world_data)[names(world_data) == "chga_demo"] <- "democracy"
names(world_data)[names(world_data) == "ti_cpi"] <- "inst_quality"
```

Now Let’s look at some of these variables using the `summary()`

function.

`summary(world_data$pol_stability)`

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.10637 -0.72686 -0.01900 -0.06079 0.78486 1.57240
```

If you think about political stability, and how one could measure it, you know there is an order implicit in the measurement – more or less stability. From there, what you need to know is whether the more or less is ordinal or interval scaled. Checking `pol_stability`

you see a range from roughly -3 to 1.60. The variable is numerical and has decimal places. This tells you that the variable is at least interval scaled. You will not see ordinally scaled variables with decimal places. Examine the summaries of the other variables and determine their level of measurement.

Now let’s look at the variables that we think can explain political stability. We can use the `summary()`

function on more than one variable by combining their names in a vector.

`summary(world_data[c('latitude', 'globalization', 'inst_quality', 'democracy')])`

```
latitude globalization inst_quality democracy
Min. :0.0000 Min. :24.35 Min. :1.010 0. Dictatorship: 74
1st Qu.:0.1444 1st Qu.:45.22 1st Qu.:2.400 1. Democracy :118
Median :0.2444 Median :54.99 Median :3.300 NA's : 1
Mean :0.2865 Mean :57.15 Mean :3.988
3rd Qu.:0.4444 3rd Qu.:68.34 3rd Qu.:5.100
Max. :0.7222 Max. :92.30 Max. :9.300
NA's :12 NA's :12 NA's :12
```

The variables `latitude`

, `globalization`

and `inst_quality`

have 12 missing values each marked as `NA`

. `democracy`

has 1 missing value. Missing values could cause trouble because operations including an `NA`

will produce `NA`

as a result (e.g.: `1 + NA = NA`

). We will drop these missing values from our data set using the `is.na()`

function and square brackets. The exlamation mark in front of `is.na()`

means “not”. So, we keep all rows that are not NA’s on the variable `latitude`

.

`world_data <- world_data[ !is.na(world_data$latitude), ]`

Generally, we want to make sure we drop missing values only from variables that we care about. Now that you have seen how to do this, drop missings from `globalization`

, `inst_quality`

, and `democracy`

yourself.

```
world_data <- world_data[ !is.na(world_data$globalization), ]
world_data <- world_data[ !is.na(world_data$inst_quality), ]
world_data <- world_data[ !is.na(world_data$democracy), ]
```

Let’s check the range of the variable `latitude`

from our summary above. It is between `0`

and `1`

. The codebook clarifies that the latitude of a country’s capital has been divided by `90`

to get a variable that ranges from `0`

to `1`

. This would make interpretation difficult. When interpreting the effect of such a variable a unit change (a change of `1`

) covers the entire range or put differently, it is a change from a country at the equator to a country at one of the poles.

We therefore multiply by `90`

again. This will turn the units of the `latitude`

variable into degrees again which makes interpretation easier.

`world_data$latitude <- world_data$latitude * 90`

### 5.1.2 Estimating a Bivariate Regression

Is there a correlation between the distance of a country to the equator and the level of political stability? Both political stability (dependent variable) and distance to the equator (independent variable) are continuous. Therefore, we will get an idea about the relationship using a scatter plot.

```
plot(pol_stability ~ latitude, data = world_data,
frame.plot = FALSE,
pch = 20,
xlab = "Latitude",
ylab = "Political Stability")
```

Looking at the cloud of points suggests that there might be a positive relationship: increases in our independent variable `latitude`

appear to be associated with increases in the dependent variable `pol_stability`

(the further from the equator, the more stable).

We can fit a line of best fit through the points. To do this we must estimate the bivariate regression model with the `lm()`

function.

`latitude_model <- lm(pol_stability ~ latitude, data = world_data)`

Now we can create a scatterplot with a regression line using the `abline()`

function.

```
plot(pol_stability ~ latitude, data = world_data,
frame.plot = FALSE,
pch = 20,
xlab = "Latitude",
ylab = "Political Stability")
abline(latitude_model, col = "red")
```

We can also view a simple summary of the regression by using the `screenreg`

function:

`screenreg(latitude_model)`

```
=======================
Model 1
-----------------------
(Intercept) -0.58 ***
(0.12)
latitude 0.02 ***
(0.00)
-----------------------
R^2 0.11
Adj. R^2 0.10
Num. obs. 170
RMSE 0.89
=======================
*** p < 0.001, ** p < 0.01, * p < 0.05
```

Thinking back to last week, how can we interpret this regression ouput?

- The coefficient for the variable
`latitude`

(\(\beta_1\)) indicates that a one-unit increase in a country’s latitude is associated with a 0.02 increase in the measure of political stability, on average. Question: Is this association statistically significant at the 95% confidence level? - The coefficient for the
`(intercept)`

term (\(\beta_0\)) indicates that the average level of political stability for a country with a latitude of 0 is -0.58 (where`latitude = 0`

is a country positioned at the equator) - The \(R^2\) of the model is 0.11. This implies that 11% of the variation in the dependent variable (political stability) is explained by the independent variable (latitude) in the model.

### 5.1.3 Multivariate Regression

The regression above suggests that there is a significant association between these variables However, as good social scientistis, we probably do not think that the distance of a country from the equator is a theoretically relevant variable for explaining political stability. This is because there is no plausible causal link between the two. We should therefore consider other variables to include in our model.

We will include the index of globalization (higher values mean more integration with the rest of the world), the quality of institutions, and the indicator for whether the country is a democracy. For all of these variables we can come up with a theoretical story for their effect on political stability.

To specify a *multiple* linear regression model, the only thing we need to change is the `formula`

argument of the `lm()`

function. In particular, if we wish to add additional explanatory variables, the formula argument will take the following form:

`dependent.var ~ independent.var.1 + independent.var.2 + independent.var.3 ...`

In our example here, the model would therefore look like the following:

```
inst_model <- lm(
pol_stability ~ latitude + globalization + inst_quality + democracy,
data = world_data
)
```

Remember, `pol_stability`

is our dependent variable, as before, and now we have four independent variables: `latitude`

, `globalization`

, `democracy`

and `inst_quality`

. Again, just as with the bivariate model, we can view the summarised output of the regression by using `screenreg()`

. As we now have two models (a simple regression model, and a multiple regression model), we can join them together using the `list()`

function, and then put all of that inside `screenreg()`

.

`screenreg(list(latitude_model, inst_model))`

```
=============================================
Model 1 Model 2
---------------------------------------------
(Intercept) -0.58 *** -1.25 ***
(0.12) (0.20)
latitude 0.02 *** 0.00
(0.00) (0.00)
globalization -0.00
(0.01)
inst_quality 0.34 ***
(0.04)
democracy1. Democracy 0.04
(0.11)
---------------------------------------------
R^2 0.11 0.50
Adj. R^2 0.10 0.49
Num. obs. 170 170
RMSE 0.89 0.67
=============================================
*** p < 0.001, ** p < 0.01, * p < 0.05
```

Including the two new predictors leads to substantial changes.

- First, we now explain 50% of the variance of our dependent variable instead of just 11%.
- Second, the effect of the distance to the equator is no longer significant.
- Third, better quality institutions are associated with more political stability. In particular, a one-unit increase in the measure of instituion quality (which ranges from 1 to 10) is associated with a 0.34 increase in the measure for political stability.
- Fourth, there is no significant relationship between globalization and political stability in this data.
- Fifth, there is no significant relationship between democracy and political stability in this data.

### 5.1.4 Predicting outcome conditional on institutional quality

Just as we did with the simple regression model last week, we can use the fitted model object to calculate the fitted values of our dependent variable for different values of our explanatory variables. To do so, we again use the `predict()`

function.

We proceed in three steps.

- We set the values of the covariates for which we would like to produce fitted values.
- You will need to set covariate values for
*every*explanatory variable that you included in your model. - As only one of our variables has a significant relationship with the outcome in the multiple regression model that we estimated above, we are really only interested in that variable (
`inst_quality`

). - Therefore, we will calculate fitted values over the range of
`inst_quality`

, while setting the values of`latitude`

and`globalization`

to their mean values. - As
`democracy`

is a factor variable, we cannot use the mean value. Instead, we will set`democracy`

to be equal to`"1. Democracy"`

which is the label for democratic countries

- You will need to set covariate values for
- We calculate the fitted values.
- We report the results (here we will produce a plot).

For step one, the following code produces a `data.frame`

of new covariate values for which we would like to calculate a fitted value from our model:

```
democracies <- data.frame(
inst_quality = seq(from = 1.4, to = 9.3, by = 1),
globalization = mean(world_data$globalization),
latitude = mean(world_data$latitude),
democracy = "1. Democracy"
)
```

We’ve just created a `data.frame`

of hypothetical democracies with varying level of institutional quality. Let’s see what this `data.frame`

looks like.

`democracies`

```
inst_quality globalization latitude democracy
1 1.4 57.93053 25.78218 1. Democracy
2 2.4 57.93053 25.78218 1. Democracy
3 3.4 57.93053 25.78218 1. Democracy
4 4.4 57.93053 25.78218 1. Democracy
5 5.4 57.93053 25.78218 1. Democracy
6 6.4 57.93053 25.78218 1. Democracy
7 7.4 57.93053 25.78218 1. Democracy
8 8.4 57.93053 25.78218 1. Democracy
```

In this `data.frame`

, we have set the `inst_quality`

variable to vary between 1.40 and 9.30, with increments of 1 unit which represents the range of `inst_quality`

in our dataset.

`min(world_data$inst_quality)`

`[1] 1.4`

`max(world_data$inst_quality)`

`[1] 9.3`

We have set `globalization`

to be equal to the mean value of `globalization`

in the `world_data`

object, and `latitude`

to be equal to the mean value of `latitude`

in the `world_data`

object. Finally, we have set `democracy`

to be equal to `"1. Democracy"`

(the value for democratic countries).

We can now calculate the fitted values for each of these combinations of our explanatory variables using the `predict()`

function.

`democracies$predicted_pol_stability <- predict(inst_model, newdata = democracies)`

We can now look again at the `democracies`

object:

`democracies`

```
inst_quality globalization latitude democracy predicted_pol_stability
1 1.4 57.93053 25.78218 1. Democracy -1.00319887
2 2.4 57.93053 25.78218 1. Democracy -0.66431685
3 3.4 57.93053 25.78218 1. Democracy -0.32543483
4 4.4 57.93053 25.78218 1. Democracy 0.01344719
5 5.4 57.93053 25.78218 1. Democracy 0.35232921
6 6.4 57.93053 25.78218 1. Democracy 0.69121123
7 7.4 57.93053 25.78218 1. Democracy 1.03009325
8 8.4 57.93053 25.78218 1. Democracy 1.36897527
```

Hey presto! Now, for each of our explanatory variable combinations, we have the corresponding fitted values as calculated from our estimated regression.

Finally, we can plot these values:

```
plot(
predicted_pol_stability ~ inst_quality,
data = democracies,
frame.plot = FALSE,
col = "blue",
type = "l",
xlab = "Institution Quality",
ylab = "Fitted value for political stability"
)
```

We could also use the output from our model to plot two separate lines of fitted values: one for democracies, and one for dictatorships. We have already done this for democracies, so the following code constructs a `data.frame`

of fitted values for dictatorships:

```
dictatorships <- data.frame(
inst_quality = seq(from = 1.4, to = 9.3, by = 1),
globalization = mean(world_data$globalization),
latitude = mean(world_data$latitude),
democracy = "0. Dictatorship"
)
dictatorships$predicted_pol_stability <- predict(inst_model, newdata = dictatorships)
```

Now that we have calculated these fitted values, we can add the line for dictatorships to the plot we created above using the `lines()`

function:

```
plot(
predicted_pol_stability ~ inst_quality,
data = democracies,
frame.plot = FALSE,
col = "blue",
type = "l",
xlab = "Institution Quality",
ylab = "Fitted value for political stability"
)
lines(predicted_pol_stability ~ inst_quality,
data = dictatorships,
col = "red")
```

We can see from the plot that the fitted values for democracies (blue line) are almost exactly the same as those for dictatorships (red line). This is reassuring, as the estimated coefficient on the democracy variable was very small (0.04) and was not statistically significantly different from 0. Often, however, it can be very illuminating to construct plots like this where we construct a line to indicate how our predicted values for Y vary across one of our explanatory variables (here, institution quality), and we create different lines for different values of another explanatory variable (here, democracy/dictatorship).

### 5.1.5 Seminar Feedback

We want to make sure the seminars are helping you make the most of this module. This super-short survey aims to gather student feedback to inform us how to better support you in the seminar sessions during the second half of the module. The responses you provide are absolutely anonymous.

### 5.1.6 Exercises

- Create a new file called
`assignment5.R`

in your`PUBL0055`

folder and write all the solutions in it. - Go to the QoG website and download the QoG Cross-Section Data and the QoG Standard Codebook.
- Pretend that you are writing a research paper. Select one dependent variable that you want to explain with a statistical model.
- Run a simple regression (with one explanatory variable) on that dependent variable and justify your choice of explanatory variable. This will require you to think about the outcome that you are trying to explain, and what you think will be a good predictor for that outcome.
- Add some additional explanatory variables (no more than 5) to your model and justify the choice again.
- Produce a regression table of the newer model.
- Interpret the output of the two models. State your expectations and whether they were met.
- Calculate fitted values whilst varying at least one of your continuous explanatory variables.
- Plot the result.
- Interpret the plot.
- Save your script, which should now include the answers to all the exercises.
- Source your script, i.e. run the entire script all at once. Fix the script if you get any error messages.