# 6 Multiple linear regression models (II)

## 6.1 Seminar

Create a new script with the following lines at the top, and save it as `seminar6.R`

```
rm(list = ls())
setwd("~/PUBL0055")
```

We load the required packages for this week:

```
library(foreign)
library(texreg)
```

### 6.1.1 Loading Data

We will use the small version of the Quality of Government data from 2012 again (`QoG2012.csv`

) with four variables:

Variable | Description |
---|---|

`former_col` |
0 = not a former colony 1 = former colony |

`undp_hdi` |
UNDP Human Development Index. Higher values mean better quality of life |

`wbgi_cce` |
Control of corruption. Higher values mean better control of corruption |

`wdi_gdpc` |
GDP per capita in US dollars |

```
world_data <- read.csv("QoG2012.csv")
names(world_data)
```

```
[1] "h_j" "wdi_gdpc" "undp_hdi" "wbgi_cce" "wbgi_pse"
[6] "former_col" "lp_lat_abst"
```

Rename the variables by yourself to:

New Name | Old Name |
---|---|

`human_development` |
`undp_hdi` |

`institutions_quality` |
`wbgi_cce` |

`gdp_capita` |
`wdi_gdpc` |

`names(world_data)`

```
[1] "h_j" "wdi_gdpc" "undp_hdi" "wbgi_cce" "wbgi_pse"
[6] "former_col" "lp_lat_abst"
```

```
names(world_data)[names(world_data)=="undp_hdi"] <- "human_development"
names(world_data)[names(world_data)=="wbgi_cce"] <- "institutions_quality"
names(world_data)[names(world_data)=="wdi_gdpc"] <- "gdp_capita"
names(world_data)
```

```
[1] "h_j" "gdp_capita" "human_development"
[4] "institutions_quality" "wbgi_pse" "former_col"
[7] "lp_lat_abst"
```

Now let’s look at the summary statistics for the entire data set.

`summary(world_data) `

```
h_j gdp_capita human_development institutions_quality
Min. :0.0000 Min. : 226.2 Min. :0.2730 Min. :-1.69953
1st Qu.:0.0000 1st Qu.: 1768.0 1st Qu.:0.5390 1st Qu.:-0.81965
Median :0.0000 Median : 5326.1 Median :0.7510 Median :-0.30476
Mean :0.3787 Mean :10184.1 Mean :0.6982 Mean :-0.05072
3rd Qu.:1.0000 3rd Qu.:12976.5 3rd Qu.:0.8335 3rd Qu.: 0.50649
Max. :1.0000 Max. :63686.7 Max. :0.9560 Max. : 2.44565
NA's :25 NA's :16 NA's :19 NA's :2
wbgi_pse former_col lp_lat_abst
Min. :-2.46746 Min. :0.0000 Min. :0.0000
1st Qu.:-0.72900 1st Qu.:0.0000 1st Qu.:0.1343
Median : 0.02772 Median :1.0000 Median :0.2444
Mean :-0.03957 Mean :0.6289 Mean :0.2829
3rd Qu.: 0.79847 3rd Qu.:1.0000 3rd Qu.:0.4444
Max. : 1.67561 Max. :1.0000 Max. :0.7222
NA's :7
```

We need to remove missing values from `gdp_capita`

, `human_development`

, and `institutions_quality`

. Do so yourself. Do not drop observations that missing values on other observations such as `lp_lat_abst`

. We might throw away useful information when doing so.

```
world_data <- world_data[ !is.na(world_data$gdp_capita), ]
world_data <- world_data[ !is.na(world_data$human_development), ]
world_data <- world_data[ !is.na(world_data$institutions_quality), ]
```

`former_col`

is a categorical variable, let’s see how many observations are in each category:

`table(world_data$former_col) `

```
0 1
61 111
```

Turn this variable into a factor variable and check the result with a frequency table.

```
world_data$former_col <- factor(
world_data$former_col, labels = c("never colonies", "ex colonies")
)
table(world_data$former_col)
```

```
never colonies ex colonies
61 111
```

Now let’s create a scatterplot between `institutions_quality`

and `human_development`

and color the points based on the value of `former_col`

.

We’re also adding a legend to the plot.

```
plot(
human_development ~ institutions_quality,
data = world_data,
frame.plot = FALSE,
col = former_col,
pch = 20,
main = "Relation between institutional quality and hdi by colonial past",
xlab = "quality of institutions",
ylab = "human development index"
)
legend(
"bottomright",
legend = levels(world_data$former_col),
col = c("black", "red"),
pch = 20,
bty = "n" # no box around the legend
)
```

To explain the level of development with quality of institutions is intuitive. We could add the colonial past dummy, to control for potential confounders. Including a dummy gives us the difference between former colonies and not former colonies. It therefore leads to a parallel shift in the regression line. To see some review information about the interpretation of dummy variables, refer to the material at the bottom of page.

### 6.1.2 Interactions: Continuous and Binary

From the plot above, we can tell that the slope of the line (the effect of institutional quality) is probably different in countries that were colonies and those that were not. We say: the effect of institutional quality is *conditional* on colonial past.

To specify an interaction term, we use the asterisk (`*`

)

Example | |
---|---|

`*` |
`A*B` - In addition to the interaction term (A*B), both the constituents (A and B) are automatically included. |

```
model1 <- lm(human_development ~ institutions_quality * former_col, data = world_data)
screenreg( model1 )
```

```
======================================================
Model 1
------------------------------------------------------
(Intercept) 0.79 ***
(0.02)
institutions_quality 0.08 ***
(0.01)
former_colex colonies -0.12 ***
(0.02)
institutions_quality:former_colex colonies 0.05 **
(0.02)
------------------------------------------------------
R^2 0.56
Adj. R^2 0.55
Num. obs. 172
RMSE 0.12
======================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
```

Before interpreting the model output, it can be helpful to plot the fitted values from the model. To do so, we will need to specify the values of all the explanatory variables for which we would like to calculate fitted values. We obviously want to make a comparison between former colonies and never colonies, so we will first set our covariate `former_col`

to countries that weren’t colonized and then second, to ex colonies.

For each of these values of, we will calculate fitted values along the entire range of the quality of institutions variable (i.e. from `-1.7`

to `2.5`

which is roughly the minimum to the maximum of the variable.) We know the range of values for `institutions_quality`

from the summary statistics we obtained after loading the dataset at the begining of the seminar. You can also use the `range()`

function.

`range(world_data$institutions_quality)`

`[1] -1.699529 2.445654`

We now illustrate what the interaction effect does. To anticipate, the effect of the quality of institutions is now conditional on colonial past. That means, the two regression lines will have different slopes.

We make use of the `predict()`

function to draw both regression lines into our plot. First, we need to vary the institutional quality variable from its minimum to its maximum. We use the `seq()`

(sequence) function to create 10 different institutional quality values.

`institutions_seq <- seq(from = -1.7, to = 2.5, length.out = 10)`

Next, we create a covariate dataset and set the `former_col`

variable to “never colonies”.

```
not_colonies <- data.frame(
former_col = "never colonies",
institutions_quality = institutions_seq
)
head(not_colonies)
```

```
former_col institutions_quality
1 never colonies -1.7000000
2 never colonies -1.2333333
3 never colonies -0.7666667
4 never colonies -0.3000000
5 never colonies 0.1666667
6 never colonies 0.6333333
```

We then predict the fitted values.

`yhat1 <- predict(model1, newdata = not_colonies)`

Now let’s do the same for ex colonies.

```
ex_colonies <- data.frame(
former_col = "ex colonies",
institutions_quality = institutions_seq
)
head(ex_colonies)
```

```
former_col institutions_quality
1 ex colonies -1.7000000
2 ex colonies -1.2333333
3 ex colonies -0.7666667
4 ex colonies -0.3000000
5 ex colonies 0.1666667
6 ex colonies 0.6333333
```

`yhat2 <- predict(model1, newdata = ex_colonies)`

We now have the predicted outcomes for varying institutional quality. Once for the countries that were former colonies and once for the countries that were not.

We will re-draw our earlier plot. In addition, right below the `plot()`

function, we use the `lines()`

function to add the two regression lines. The function needs to arguments `x`

and `y`

which represent the coordinates on the respective axes. On the x axis we vary our independent variable quality of institutions. On the y axis, we vary the predicted outcomes.

We add two more arguments to our `lines()`

function. We set the colour is controlled with `col`

which we set to the first and second colours in the colour palette respectively.

```
# main plot
plot(
human_development ~ institutions_quality,
data = world_data,
frame.plot = FALSE,
col = former_col,
pch = 20,
main = "Relation between institutional quality and hdi by colonial past",
xlab = "quality of institutions",
ylab = "human development index"
)
# add the regression line for the countries that weren't colonised
lines(x = institutions_seq, y = yhat1, col = "black")
# add the regression line for the ex colony countries
lines(x = institutions_seq, y = yhat2, col = "red")
# add a legend
legend(
"bottomright",
legend = levels(world_data$former_col),
col = c("black", "red"),
pch = 20,
lty = 1,
bty = "n"
)
```

As you can see, the line is steeper for ex colonies than for countries that were never colonised. That means the effect of institutional quality on human development is conditional on colonial past. Institutional quality matters more in ex colonies.

Let’s examine the effect sizes of institutional quality conditional on colonial past.

There are now two scenarios. If a country was never a colony, R translates the value of the factor variable `former_col`

from `never colony`

to \(0\). From the equation above, it follows that all terms that are multiplied with `former_col`

drop out.

Therefore, the effect of the quality of institutions in never colonies is just the coefficient of `institutions_quality`

\(\beta_1 = 0.08\).

In the second scenario, we are looking at ex colonies. R translates the value of the factor variable `former_col`

from `ex colonies`

to \(1\). In this case none of the terms drop out. From our original equation:

The effect of the quality of institutions is then: \(\beta_1 + \beta_3 = 0.08 + 0.05 = 0.13\).

The regression output, consistent with the plot above, tell us that the effect of the quality of institutions is bigger in ex-colonies than in never-colonies. For never-colonies the effect is \(0.08\) for every unit-increase in institutional quality. For ex-colonies, the corresponding effect is \(0.13\).

The table below summarises the interaction of a continuous variable with a binary variable in the context of our regression model.

Ex Colony | Intercept | Slope |
---|---|---|

0 = never colony | \(\beta_0\) \(= 0.79\) |
\(\beta_1\) \(= 0.08\) |

1 = ex colony | \(\beta_0 + \beta_2\) = \(0.79 + -0.12 = 0.67\) |
\(\beta_1 + \beta_3\) \(= 0.08 + 0.05 = 0.13\) |

### 6.1.3 Non-Linearities

In lecture, we also learnt about various strategies for modelling non-linear relationships using regression. As a working example, let’s suppose we want to illustrate the relationship between GDP per capita and the human development index. We draw a scatter plot to investigate the relationship between the quality of life (hdi) and wealth (gdp/captia).

```
plot(
human_development ~ gdp_capita,
data = world_data,
pch = 20,
frame.plot = FALSE,
col = "grey",
main = "Relationship between the quality of life and wealth",
ylab = "Human development index",
xlab = "GDP per capita"
)
```

It’s easy to see, that the relationship between GDP per captia and the Human Development Index is not linear. Increases in wealth rapidly increase the quality of life in poor societies. The richer the country, the less pronounced the effect of additional wealth. We would mis-specify our model if we do not take the non-linear relationship into account.

Let’s go ahead and fit a mis-specified model:

```
linear.model <- lm(human_development ~ gdp_capita, data = world_data)
screenreg(linear.model)
```

```
=======================
Model 1
-----------------------
(Intercept) 0.59 ***
(0.01)
gdp_capita 0.00 ***
(0.00)
-----------------------
R^2 0.49
Adj. R^2 0.49
Num. obs. 172
RMSE 0.13
=======================
*** p < 0.001, ** p < 0.01, * p < 0.05
```

We detect a significant linear relationship. The effect may look small because the coefficient rounded to two digits is zero. But remember, this is the effect of increasing GDP/capita by \(1\) US dollar on the quality of life. That effect is naturally small but it is probably not small when we increase wealth by \(1000\) US dollars.

However, our model would also entail that for every increase in GDP/capita, the quality of life increases on average by the same amount. We saw from our plot that this is not the case. The effect of GDP/capita on the quality of life is conditional on the level of GDP/capita. If that sounds like an interaction to you, then that is great because, we will model the non-linearity by raising the GDP/capita to a higher power. That is in effect an interaction of the variable with itself. GDP/capita raised to the second power, e.g. is GDP/capita * GDP/capita.

#### 6.1.3.1 Polynomials

We know from school that polynomials like \(X^2\), \(X^3\) and so on are not linear. The relationship depicted in our plot above suggests a quadratic relationship between GDP per capita and the human development index might be appropriate (there is only one obvious “bend” in the data). To incorporate this intuition into our model, we will use the `poly()`

function in our linear model to raise GDP/capita to the second power like so: `poly(gdp_capita, degree = 2).`

```
quadratic.model <- lm(
human_development ~ poly(gdp_capita, degree = 2),
data = world_data
)
screenreg(
list(linear.model, quadratic.model),
custom.model.names = c("Linear model", "Quadratic model")
)
```

```
============================================================
Linear model Quadratic model
------------------------------------------------------------
(Intercept) 0.59 *** 0.70 ***
(0.01) (0.01)
gdp_capita 0.00 ***
(0.00)
poly(gdp_capita, degree = 2)1 1.66 ***
(0.10)
poly(gdp_capita, degree = 2)2 -1.00 ***
(0.10)
------------------------------------------------------------
R^2 0.49 0.67
Adj. R^2 0.49 0.66
Num. obs. 172 172
RMSE 0.13 0.10
============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
```

It is important to note, that in the quadratic model the effect of GDP/capita is no longer easy to interpret. We cannot say for every increase in GDP/capita by one dollar, the quality of life increases on average by some fixed amount. This is because the effect of GDP/capita depends on how rich a country was to begin with. We can, however, interpret the statitical significance of the quadratic term: the coefficient is about 10 times as large in absolute terms as the standard error, meaning we can easily reject the null hypothesis that the relationship between GDP and the HDI is linear. Furthermore, it looks like our model that includes the quadratic term has a much better fit, as the adjusted \(R^2\) has increased by a lot.

We can interpret the effect of wealth (GDP/capita) on the quality of life (human development index) by predicting the fitted values of the human development index given a certain level of GDP/capita. We will vary GDP/captia from its minimum in the data to its maximum and the plot the results which is a good way to illustrate a non-linear relationship.

Step 1: We find the minimum and maximum values of GDP/capita.

`range(world_data$gdp_capita)`

`[1] 226.235 63686.676`

Step 2: We predict fitted values for varying levels of GDP/captia (let’s create 100 predictions).

We create our sequence of 100 GDP/capita values

`gdp_seq <- seq(from = 226, to = 63686, length.out = 100)`

We set our covariate values (here we only have one covariate: GDP/captia)

`x <- data.frame(gdp_capita = gdp_seq)`

We predict the outcome (human development index) for each of the 100 GDP levels

`y_hat <- predict(quadratic.model, newdata = x)`

Step 3: Now that we have created our predictions. We plot again and then we add the `linear.model`

using `abline`

and we add our non-linear version `quadratic.model`

using the `lines()`

function.

```
plot(
human_development ~ gdp_capita,
data = world_data,
pch = 20,
frame.plot = FALSE,
col = "grey",
main = "Relationship between the quality of life and wealth",
ylab = "Human development index",
xlab = "GDP per capita"
)
# the linear model
abline(linear.model, col = "black")
# better model
lines(x = gdp_seq, y = y_hat, col = "red")
```

We could go further than this quadratic specification, and instead estimate a `cubic.model`

with GDP/capita raised to the power of three. Do this now and present your results visually.

Estimate cubic model

`cubic.model <- lm(human_development ~ poly(gdp_capita, 3), data = world_data)`

We predict the outcome (human development index) for each of the 100 GDP levels

`y_hat2 <- predict(cubic.model, newdata = x)`

Next, we plot the linear and quadratic models

```
plot(
human_development ~ gdp_capita,
data = world_data,
pch = 20,
frame.plot = FALSE,
col = "grey",
main = "Relationship between the quality of life and wealth",
ylab = "Human development index",
xlab = "GDP per capita"
)
# the linear model
abline(linear.model, col = "black")
# quadratic model
lines(x = gdp_seq, y = y_hat, col = "red")
# cubic model
lines(x = gdp_seq, y = y_hat2, col = "green")
```

The cubic model provides a similar story to the quadratic model (the relationship between GDP/capita and HDI is steep at first, and then flatter at the higher levels of GDP/captia), but it looks somewhat strange, particularly at the higher levels of GDP/capita. In particular, it seems that the few extreme X values are causing a strange shape: the cubic is being wagged around by its tail. This is a common problem with polynomial regression models, and so we will now consider an alternative.

#### 6.1.3.2 Log-transformations

Many non-linear relationships actually do look linear on the log scale. We can illustrate this by taking the natural logarithm of GDP/captia and plot the relationship between quality of life and our transformed GDP variable.

Note: Some of you will remember from your school calculators that you have an ln button and a log button where ln takes the natural logarithm and log takes the logarithm with base 10. The natural logarithm represents relations that occur frequently in the world and R takes the natural logarithm with the `log()`

function by default.

Below, we plot the same plot from before but we wrap `gdp_capita`

in the `log()`

function which log-transforms the variable.

```
plot(
human_development ~ log(gdp_capita),
data = world_data,
pch = 20,
frame.plot = FALSE,
col = "grey",
main = "Relationship between the quality of life and wealth on the log scale",
ylab = "Human development index",
xlab = "Logged gdp/capita"
)
```

As you can see, the relationship now looks much closer to linear and we get the best fit to the data (as measured by adjusted \(R^2\)) if we run our model with log-transformed gdp.

`log.model <- lm(human_development ~ log(gdp_capita), data = world_data)`

Let’s check our model

```
screenreg(
list(linear.model, quadratic.model, cubic.model, log.model),
custom.model.names = c("Linear Model", "Quadratic Model", "Cubic Model", "Log Model")
)
```

```
=====================================================================================
Linear Model Quadratic Model Cubic Model Log Model
-------------------------------------------------------------------------------------
(Intercept) 0.59 *** 0.70 *** 0.70 *** -0.36 ***
(0.01) (0.01) (0.01) (0.04)
gdp_capita 0.00 ***
(0.00)
poly(gdp_capita, degree = 2)1 1.66 ***
(0.10)
poly(gdp_capita, degree = 2)2 -1.00 ***
(0.10)
poly(gdp_capita, 3)1 1.66 ***
(0.09)
poly(gdp_capita, 3)2 -1.00 ***
(0.09)
poly(gdp_capita, 3)3 0.65 ***
(0.09)
log(gdp_capita) 0.12 ***
(0.00)
-------------------------------------------------------------------------------------
R^2 0.49 0.67 0.74 0.81
Adj. R^2 0.49 0.66 0.74 0.81
Num. obs. 172 172 172 172
RMSE 0.13 0.10 0.09 0.08
=====================================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
```

Polynomials can be useful for modelling non-linearities. However, for each additional polynomial term we add, we also add an additional parameter that needs to be estimated. This reduces the degrees of freedom of the model. If we can get a linear relationship on the log scale, one advantage is that we use the same number of parameters as in the original linear model.

Furthermore, we gain interpretability. The relationship is linear on the log scale of gdp/capita. This means we can interpret the effect of gdp/captia as: For an increase of gdp/captia by one percent, the quality of life increases by \(\frac{\hat{\beta}}{100} = \frac{0.12}{100}\) points on average. The effect is very large because `human_development`

only varies from \(0\) to \(1\).

The adjusted \(R^2\) also suggests that the log model provides the best fit to the data. To illustrate that this is the case, we return to our plot and show the model fit graphically.

Get the fitted values for the log model.

`y_hat3 <- predict(log.model, newdata = x)`

Create a plot showing the fitted values

```
plot(
human_development ~ gdp_capita,
data = world_data,
pch = 20,
frame.plot = FALSE,
col = "grey",
main = "Relationship between the quality of life and wealth",
ylab = "Human development index",
xlab = "GDP per capita"
)
# Linear model
abline(linear.model, col = "black")
# Quadratic model
lines(x = gdp_seq, y = y_hat, col = "red")
# Cubic model
lines(x = gdp_seq, y = y_hat2, col = "green")
# Log model
lines(x = gdp_seq, y = y_hat3, col = "blue")
# Add a legend
legend("bottomright",
legend = c("Linear", "Quadratic", "Cubic", "Log"),
lty = 1,
col = c("black", "red", "green", "blue")
)
```

The blue line shows the log-transformed model. It clearly fits the data best.

### 6.1.4 Exercises

- Create a new file called
`assignment6.R`

in your`PUBL0055`

folder and write all the solutions in it. - Using the
`quadratic.model`

that we estimated above, where we included the square of GDP/capita, what is the effect of:- an increase of GDP/capita from 5000 to 15000?
- an increase of GDP/capita from 25000 to 35000?

- You can see that the curve in our quadratic plot curves down when countries become very rich. Speculate whether that results make sense and what the reason for this might be.
- Raise GDP/captia to the highest power using the
`poly()`

that improves model fit according to adjusted \(R^2\).- Does your new model solve the potentially artificial down-curve for rich countries?
- Does the new model improve upon the old model?
- Plot the new model.

- Estimate a model where
`wbgi_pse`

(political stability) is the response variable and`h_j`

and`former_col`

are the explanatory variables. Include an interaction between your explanatory variables. What is the marginal effect of:- An independent judiciary when the country is a former colony?
- An independent judiciary when the country was not colonized?

- Run a model on the human development index (
`hdi`

), interacting an independent judiciary (`h_j`

) and`institutions_quality`

. What is the effect of quality of institutions:- In countries without an independent judiciary?
- When there is an independent judiciary?
- Illustrate your results.
- Does the interaction improve model fit?

- Save your script, which should now include the answers to all the exercises.
- Source your script, i.e. run the entire script all at once. Fix the script if you get any error messages.

### 6.1.5 Extra Info: Dummy Variables Repetition

The variable `ex_colony`

is binary. It takes on two values: “never colonies” or “ex colonies”. The first value, “never colonies” is the baseline. When the variable takes on the value “ex colonies”, we end up with an intercept shift. Consequently, we get a second parallel regression line.

```
model1 <- lm(human_development ~ institutions_quality + ex_colony, data = world_data)
screenreg(model1)
```

```
================================
Model 1
--------------------------------
(Intercept) 0.77 ***
(0.02)
institutions_quality 0.10 ***
(0.01)
ex_colonyex colonies -0.11 ***
(0.02)
--------------------------------
R^2 0.54
Adj. R^2 0.54
Num. obs. 172
RMSE 0.12
================================
*** p < 0.001, ** p < 0.01, * p < 0.05
```

#### 6.1.5.1 The Effect of a Dummy Variable

Ex Colony | Intercept | Slope |
---|---|---|

0 = not a former colony | `Intercept` + (`ex_colonyex colonies` * 0)= 0.77 + (-0.11 * 0) = 0.77 |
`institutions_quality` = 0.10 |

1 = former colony | `Intercept` + (`ex_colonyex colonies` * 1)= 0.77 + (-0.11 * 1) = 0.66 |
`institutions_quality` = 0.10 |

To illustrate the effect of a dummy we can access the coefficients of our `lm`

model directly with the `coefficients()`

function:

```
model_coef <- coefficients(model1)
model_coef
```

```
(Intercept) institutions_quality ex_colonyex colonies
0.7734319 0.1021146 -0.1140925
```

We can use the square brackets `[ ]`

to access individual coefficients.

To illustrate the effect of a dummy we draw a scatterplot and then use `abline()`

to draw two regression lines, one with `ex_colonyex colonies == "never colony"`

and another with `ex_colony == "former colony"`

.

Instead of passing the model as the first argument to `abline()`

, we can just pass the intercept and slope as two separate arguments.

```
plot(
human_development ~ institutions_quality,
data = world_data,
frame.plot = FALSE,
col = ex_colony,
pch = 20,
xlab = "Quality of institutions",
ylab = "Human development index",
main = "Effect of a binary variable"
)
# the regression line when ex_colony = never colony
abline(model_coef[1], model_coef[2], col = "black")
# the regression line when ex_colony = ex colony
abline(model_coef[1] + model_coef[3], model_coef[2], col = "red")
# add a legend to the plot
legend(
"bottomright",
legend = levels(world_data$ex_colony),
col = c("black", "red"),
lty = 1,
pch = 20,
bty = "n"
)
```