4.2 Solutions

4.2.1 Exercise 1

Create a new file called assignment4.R in your PUBL0055 folder and write all the solutions in it.

In RStudio, go to the menu and select File > New File > R Script

Make sure to clear the environment and set the working directory.

rm(list = ls())

Go to the menu and select File > Save and name it assignment4.R

Next, we load all the packages we need for these exercises. The functions screenreg() and htmlreg() used below come from the texreg package.

library(texreg)


4.2.2 Exercise 2

Load the non-western foreigners dataset from week 2.


4.2.3 Exercise 3

Estimate a model that explains subjective number of immigrants per 100 British citizens using only one independent variable. Justify your choice. (You do not have to pick the best variable but try to make a reasonable argument why more of x should lead to more/less of y).

model1 <- lm(IMMBRIT ~ RAge, data = fdata)

We use age as the predictor variable. We argue that older people are less positive towards immigration and tend to overestimate the number of immigrants, thereby overstating the problem. We therefore expect a positive relationship between age and the perceived number of immigrants.

4.2.4 Exercise 4

Plot a scatterplot of the relationship and add the regression line to the plot.

plot(
  IMMBRIT ~ RAge, data = fdata,
  xlab = "Age",
  ylab = "Immigration Perception",
  frame.plot = FALSE,
  pch = 20,
  col = "LightSkyBlue"
)

abline(model1, col = "red")

The regression line slopes downward ever so slightly, pointing towards a tiny negative relationship, contrary to our expectation. The residuals are very large. The connection between age and immigration perception appears to be weak at best.

4.2.5 Exercise 5

Interpret the regression output and try to imagine that you are communicating your results to someone who does not know anything about statistics.


             Model 1    
(Intercept)    31.38 ***
RAge           -0.05    
R^2             0.00    
Adj. R^2        0.00    
Num. obs.    1049       
RMSE           21.06    
*** p < 0.001, ** p < 0.01, * p < 0.05

We cannot, with sufficient confidence, rule out that age and immigration perception are unrelated (\(p > 0.05\)). Furthermore, \(R^2\) indicates that our model does a terrible job of predicting the perception of immigration. We could predict the outcome (perception of immigration) equally well without our model.

Suppose we predict the mean of the perception of immigration IMMBRIT for all 1049 respondents. The quality of that prediction (in statistics jargon, the naive guess) would be as good as the predictions we get from our model.
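The comparison with the naive guess can be checked directly. The following is a minimal sketch on simulated data, not on the survey data: the variables x and y are hypothetical stand-ins, and y is generated independently of x. It compares the root mean squared error of predicting the mean for everyone with that of a one-predictor regression.

```r
# Simulated stand-in data (hypothetical, NOT the survey data):
# the outcome y is generated independently of the predictor x.
set.seed(1)
n <- 1049
x <- runif(n, min = 18, max = 90)
y <- rnorm(n, mean = 30, sd = 21)

# Naive guess: predict the sample mean for every respondent.
rmse_naive <- sqrt(mean((y - mean(y))^2))

# One-predictor linear model.
fit <- lm(y ~ x)
rmse_model <- sqrt(mean(residuals(fit)^2))

r_squared <- summary(fit)$r.squared

# With an unrelated predictor the two errors are almost identical
# and R^2 is close to zero.
print(c(naive = rmse_naive, model = rmse_model, r_squared = r_squared))
```

Here both errors are computed by dividing by the number of observations, so \(R^2\) equals one minus the ratio of the squared model error to the squared naive error.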

4.2.6 Exercise 6

Estimate another model (i.e. choose a different independent variable) on the same dependent variable. Justify the choice.

model2 <- lm(IMMBRIT ~ HHInc, data = fdata)

We choose income as the predictor in our second model. We conjecture that, on average, wealthier people are more educated and hence have a more realistic view of immigration. Furthermore, they tend to face less competition from immigrants and therefore tend not to exaggerate the level of immigration. We thus expect that the wealthier the respondent, the lower the respondent’s estimate of immigration.

4.2.7 Exercise 7

Interpret the new regression output.


             Model 1    
(Intercept)    43.12 ***
HHInc          -1.47 ***
R^2             0.10    
Adj. R^2        0.10    
Num. obs.    1049       
RMSE           19.94    
*** p < 0.001, ** p < 0.01, * p < 0.05

In line with our expectation, wealthier people perceive immigration to be lower than poorer people do. The relationship is statistically significant at the 0.001 level (three stars), and hence also at the conventional 0.05 level. Our model explains 10% of the variance in the perception of immigration. Considering that this model is very small (we use only one predictor variable), we do quite well at predicting the outcome.
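To make the size of the coefficient concrete, we can plug values into the fitted equation. The sketch below uses the intercept and slope from the table above. The two HHInc values are hypothetical and chosen only for illustration, since the coding of the variable is not described in this section.

```r
# Coefficients as reported in the regression table above.
intercept <- 43.12
slope     <- -1.47

# Hypothetical HHInc values chosen only for illustration; the actual
# coding of the variable is not described in this section.
hhinc <- c(1, 10)

# Predicted subjective number of immigrants per 100 British citizens.
predicted <- intercept + slope * hhinc
print(predicted)

# Difference in predictions between the two illustrative values.
print(diff(predicted))
```

Under these assumed values, the predicted estimate falls from 41.65 to 28.42, a difference of 13.23.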

4.2.8 Exercise 8

Compare the two models and explain which one you would choose.

screenreg(list(model1, model2))

             Model 1      Model 2    
(Intercept)    31.38 ***    43.12 ***
               (1.95)       (1.41)   
RAge           -0.05                 
HHInc                       -1.47 ***
R^2             0.00         0.10    
Adj. R^2        0.00         0.10    
Num. obs.    1049         1049       
RMSE           21.06        19.94    
*** p < 0.001, ** p < 0.01, * p < 0.05

Model 2 is clearly superior to model 1 in terms of explaining the phenomenon we are interested in, the perception of immigration: it has the higher \(R^2\) and the lower RMSE. Furthermore, we learn nothing about potential causes of overestimating immigration from model 1, whereas from model 2 we do.
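The same comparison can be made without reading the numbers off a table. The following sketch uses simulated data (the variables age, income and y are hypothetical stand-ins, not the survey data) and a small helper, fit_stats(), introduced here only for illustration. It collects \(R^2\), adjusted \(R^2\) and the residual standard error from each model.

```r
# Simulated stand-in data (hypothetical, NOT the survey data).
set.seed(2)
n <- 500
age    <- runif(n, min = 18, max = 90)      # unrelated to y by construction
income <- sample(1:17, n, replace = TRUE)   # assumed scale, for illustration
y      <- 45 - 1.5 * income + rnorm(n, sd = 20)

m1 <- lm(y ~ age)
m2 <- lm(y ~ income)

# Helper (hypothetical name) that pulls fit statistics from summary().
fit_stats <- function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, sigma = s$sigma)
}

comparison <- rbind(model1 = fit_stats(m1), model2 = fit_stats(m2))
print(round(comparison, 2))
```

The model with the higher \(R^2\) and the lower residual standard error fits the data better, which is the pattern we see for model 2 in the table above.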

4.2.9 Exercise 9

htmlreg(list(model1, model2), file = "regression_model.doc")

4.2.10 Exercise 10

Consider the following table. This analysis asks whether individuals who have spent longer in education have higher yearly earnings. The analysis is based on a sample of 300 individuals. The dependent variable in this analysis is the yearly income of the individual in UK pounds (earnings). The independent variable measures the number of years the individual spent in full-time education (education).

             Model 1     
(Intercept)   3663.85    
education     1270.81 ***
R^2              0.17    
Adj. R^2         0.17    
Num. obs.      300       
RMSE         14018.52    
*** p < 0.001, ** p < 0.01, * p < 0.05

  1. Interpret the coefficient on the education variable.

    For each additional year of education we expect earnings to go up by 1270.81 pounds on average.

  2. Using the values given in the table, calculate the test-statistic

    We compute the t-statistic using the formula:

    \[ t = \frac{\bar{Y}_{H_A}-\mu_{H_0}}{\sigma_{\bar{Y}_{H_A}}} \]

    So, we take the alternative hypothesis (our estimate of the effect of education) minus the mean under the null hypothesis and divide the result by the standard error of our estimate. Unless stated otherwise, the null hypothesis is that there is no effect of education on income, i.e. the null is zero.

    The coefficient estimate here is 1270.81 and its standard error is 160.97.

    The t-statistic is 1270.81 / 160.97 = 7.89

  3. Can we reject the null hypothesis of no effect at the 95% confidence level? (Just looking at the stars is not sufficient here! How can we work out the result of the hypothesis test?)

    We have 300 observations in our sample and, because we estimate two parameters (the intercept and the slope), 298 degrees of freedom. A t distribution with 298 degrees of freedom is well approximated by the standard normal distribution. Under the standard normal distribution, \(95\%\) of values lie within \(1.96\) standard deviations of the mean. Our t-statistic is more extreme than that: our estimate is 7.89 standard errors away from zero, the value under the null hypothesis. It is very unlikely that we would observe such an extreme value by chance if the null hypothesis (no relationship between education and income) were true. We therefore reject the null hypothesis at the 0.05 level.
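The calculation in parts 2 and 3 can be reproduced in R. This is a minimal sketch using only the numbers quoted above. The variable names are our own, and we use the exact t distribution with 298 degrees of freedom rather than the normal approximation.

```r
# Values taken from the solution text above.
estimate  <- 1270.81   # coefficient on education
std_error <- 160.97    # its standard error
n_obs     <- 300
n_params  <- 2         # intercept and slope

df <- n_obs - n_params

# t-statistic under the null hypothesis of no effect (null value = 0).
t_stat <- (estimate - 0) / std_error

# Two-sided critical value at the 0.05 level and the matching p-value.
t_crit  <- qt(0.975, df = df)
p_value <- 2 * pt(-abs(t_stat), df = df)

# We reject the null if the t-statistic is more extreme than the critical value.
reject_null <- abs(t_stat) > t_crit

print(round(c(t_stat = t_stat, t_crit = t_crit), 2))
print(p_value)
print(reject_null)
```

The critical value from the t distribution is only slightly larger than 1.96, so the normal approximation leads to the same decision.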