4 Bivariate linear regression models

4.1 Seminar

Create a new script with the following lines at the top, and save it as seminar4.R

rm(list = ls())

4.1.1 Packages

We will introduce the use of packages in this week’s seminar. The package texreg makes it easy to produce publication-quality output from our regression models. We’ll discuss this package in more detail as we go along. For now, let’s load the package with the library() function.

library(texreg)


We will use a dataset collected by the US Census Bureau that contains several socioeconomic indicators.

communities <- read.csv("communities.csv")

The dataset includes 38 variables but we’re only interested in a handful at the moment.

Variable        Description
PctUnemployed   proportion of citizens in each community who are unemployed
PctNotHSGrad    proportion of citizens in each community who failed to finish high-school
population      proportion of adult population living in cities

The first six rows of these three variables look like this:

  PctUnemployed PctNotHSGrad population
1          0.27         0.18       0.19
2          0.27         0.24       0.00
3          0.36         0.43       0.00
4          0.33         0.25       0.04
5          0.12         0.30       0.01
6          0.10         0.12       0.02

If we summarize PctUnemployed and PctNotHSGrad with the summary() function, we will see that both are measured as proportions (they vary between 0 and 1):

summary(communities$PctUnemployed)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.2200  0.3200  0.3635  0.4800  1.0000 

summary(communities$PctNotHSGrad)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.2300  0.3600  0.3833  0.5100  1.0000 

It will be a little easier to interpret the regression output if we convert these to percentages rather than proportions. We can do this with the following lines of code:

communities$PctUnemployed <- communities$PctUnemployed * 100
communities$PctNotHSGrad <- communities$PctNotHSGrad * 100

We can begin by drawing a scatterplot with the percentage of unemployed people on the y-axis and the percentage of adults without high-school education on the x-axis.

plot(
  PctUnemployed ~ PctNotHSGrad, data = communities,
  xlab = "Adults without high school education (%)",
  ylab = "Unemployment (%)",
  frame.plot = FALSE,
  pch = 20,
  col = "LightSkyBlue"
)

From looking at the plot, what is the association between the unemployment rate and lack of high-school level education?

In order to answer that question empirically, we will run a linear regression using the lm() function in R. The lm() function needs to know a) the relationship we’re trying to model and b) the dataset for our observations. The two arguments we need to provide to the lm() function are described below.

  • formula: The formula describes the relationship between the dependent and independent variables, for example dependent.variable ~ independent.variable. In our case, we’d like to model the relationship using the formula PctUnemployed ~ PctNotHSGrad.
  • data: This is simply the name of the dataset that contains the variables of interest. In our case, this is the dataset called communities.

For more information on how the lm() function works, type help(lm) in R.

model1 <- lm(PctUnemployed ~ PctNotHSGrad, data = communities)
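As a rough illustration of how the formula and data arguments fit together, the sketch below fits the same kind of bivariate model to simulated data, in case communities.csv is not at hand. The data frame toy and its variables x and y are made up for this example; coef() is the standard R function for extracting the estimated coefficients from a fitted model.

```r
# Simulated data (hypothetical; not the communities dataset)
set.seed(1)
toy <- data.frame(x = runif(200, min = 0, max = 100))
toy$y <- 8 + 0.75 * toy$x + rnorm(200, sd = 10)

# Same pattern as above: dependent ~ independent, data = <data frame>
toy_model <- lm(y ~ x, data = toy)

# coef() returns a named vector with the intercept and the slope on x
toy_coefs <- coef(toy_model)
print(toy_coefs)
```

Because the data were simulated with a slope of 0.75, the estimated slope should land close to that value, though not exactly on it.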

4.1.2 Interpreting Regression Output

The lm() function has modeled the relationship between PctUnemployed and PctNotHSGrad and we’ve saved it in an object called model1. Let’s use the summary() function to see what this linear model looks like.

summary(model1)


The output from summary() might seem overwhelming at first so let’s break it down one item at a time.

  1. formula: The formula describes the relationship between the dependent and independent variables.
  2. residuals: The differences between the observed values and the predicted values are called residuals.
  3. coefficients: The coefficients for all the independent variables and the intercept. Using the coefficients we can write down the relationship between the dependent and the independent variables as:

     PctUnemployed = 7.90 + ( 0.74 * PctNotHSGrad )

     This tells us that for each one-unit increase in PctNotHSGrad, the predicted value of PctUnemployed increases by 0.74 units (here, percentage points).
  4. standard error: The standard error estimates the standard deviation of the sampling distribution of the coefficients in our model. We can think of the standard error as the measure of precision for the estimated coefficients.
  5. t-statistic: The t-statistic is obtained by dividing each coefficient by its standard error.
  6. p-value: The p-value for each of the coefficients in the model. Recall that according to the null hypothesis, the value of the coefficient of interest is zero. The p-value tells us whether we can reject the null hypothesis at a given alpha level.
  7. \(R^2\) and adj-\(R^2\): These tell us how much of the variance in the dependent variable is accounted for by the independent variable. The adjusted \(R^2\) is never larger than \(R^2\), as it takes into account the number of independent variables and the degrees of freedom.
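The definitions above for residuals and the t-statistic can be checked by hand. The following is a minimal sketch that uses R's built-in cars dataset as a stand-in for our data (an assumption made so that the example runs without communities.csv); the object names m, ctab, t_manual and res_manual are chosen only for illustration.

```r
# Built-in 'cars' data (speed, dist) stands in for the communities data
m <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- coef(summary(m))

# t-statistic = each estimate divided by its own standard error
t_manual <- ctab[, "Estimate"] / ctab[, "Std. Error"]

# Residuals = observed values minus fitted (predicted) values
res_manual <- cars$dist - fitted(m)

# Both comparisons should report TRUE
all.equal(t_manual, ctab[, "t value"])
all.equal(unname(res_manual), unname(residuals(m)))
```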

Now let’s add a regression line to the scatter plot using the abline() function.

First we run the same plot() function as before, then we overlay a line with abline():

plot(
  PctUnemployed ~ PctNotHSGrad, data = communities,
  xlab = "Adults without high school education (%)",
  ylab = "Unemployment (%)",
  frame.plot = FALSE,
  pch = 20,
  col = "LightSkyBlue"
)

abline(model1, lwd = 3, col = "red")

We can see by looking at the regression line that it matches the coefficients we estimated above. For example, when PctNotHSGrad is equal to zero (i.e. where the line intersects the Y-axis), the predicted value for PctUnemployed seems to be above 0 but below 10. This is good, as the intercept coefficient we estimated in the regression was 7.90.

Similarly, the coefficient for the variable PctNotHSGrad was estimated to be 0.74, which implies that a one point increase in the percentage of citizens with no high-school education is associated with about 0.74 of a point increase in the percentage of citizens who are unemployed. The line in the plot seems to reflect this: it is upward sloping, so that higher levels of the no high-school variable are associated with higher levels of unemployment, but the relationship is not quite 1-to-1. That is, for each additional percentage point of citizens without high school education, the percentage of citizens who are unemployed increases by a little less than one point.
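To see where the "0.74 per point" reading comes from, we can plug two neighbouring values of PctNotHSGrad into the estimated equation PctUnemployed = 7.90 + ( 0.74 * PctNotHSGrad ) and take the difference. The values 10 and 11 are arbitrary choices for illustration, and the coefficients are the rounded ones reported above:

\[(7.90 + 0.74 * 11) - (7.90 + 0.74 * 10) = 16.04 - 15.30 = 0.74\]

The intercept cancels, so the difference between the two predictions is the slope, whichever pair of values one point apart we pick.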

While the summary() function provides a slew of information about a fitted regression model, we often need to present our findings in easy-to-read tables similar to what you see in journal publications. The texreg package we loaded earlier allows us to do just that.

Let’s take a look at how to display the output of a regression model on the screen using the screenreg() function from texreg.

screenreg(model1)


              Model 1    
(Intercept)      7.90 ***
PctNotHSGrad     0.74 ***
R^2              0.55    
Adj. R^2         0.55    
Num. obs.     1994       
RMSE            13.52    
*** p < 0.001, ** p < 0.01, * p < 0.05

Here, the output includes some of the most salient details we need for interpretation. We can see the coefficient for the PctNotHSGrad variable, and the estimated coefficient for the intercept. Below these numbers, in brackets, we can see the standard errors. The table also reports the \(R^2\), the adjusted \(R^2\), the number of observations (\(n\)) and the root-mean-squared-error (\(RMSE\)).

One thing to note is that the table does not include either t-statistics or p-values for the estimated coefficients. Instead, the table employs a common device of using stars to denote whether a variable is statistically significant at a given alpha level.

  • *** indicates that the coefficient is significant at the 99.9% confidence level (alpha = 0.001)
  • ** indicates that the coefficient is significant at the 99% confidence level (alpha = 0.01)
  • * indicates that the coefficient is significant at the 95% confidence level (alpha = 0.05)
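The star rule in the list above can be written out as a small function. This is only a sketch of the thresholds shown in the table footer; p_to_stars() is a hypothetical helper written for this example and is not part of texreg.

```r
# Hypothetical helper: map p-values to the star codes used in the table
p_to_stars <- function(p) {
  ifelse(p < 0.001, "***",
  ifelse(p < 0.01,  "**",
  ifelse(p < 0.05,  "*", "")))
}

# Four example p-values, one for each case
p_to_stars(c(0.0004, 0.003, 0.02, 0.3))
# returns "***" "**" "*" ""
```

Note that the thresholds are checked from the strictest to the most lenient, so each coefficient gets the largest number of stars it qualifies for.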

Returning to our example, are there other variables that might affect the unemployment rate in our dataset? For example, is the unemployment rate higher in rural areas? To answer this question, we can swap PctNotHSGrad for a different independent variable. Let’s use the variable population, which measures the proportion of adults who live in cities (rather than rural areas). Again, we can transform this proportion to a percentage with the following code:

communities$population <- communities$population * 100

Let’s fit a linear model using population as the independent variable:

model2 <- lm(PctUnemployed ~ population, data = communities)

summary(model2)

Call:
lm(formula = PctUnemployed ~ population, data = communities)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.252 -14.715  -3.946  11.054  64.980 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 35.02042    0.49206  71.171  < 2e-16 ***
population   0.23139    0.03532   6.552  7.2e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.01 on 1992 degrees of freedom
Multiple R-squared:  0.0211,    Adjusted R-squared:  0.02061 
F-statistic: 42.93 on 1 and 1992 DF,  p-value: 7.201e-11
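The numbers in this output are linked to one another, and we can reproduce several of them by hand. The sketch below copies the estimate, standard error and degrees of freedom for population from the output above; the variable names are chosen for this example. Because the copied inputs are rounded, the results agree with the printed values only up to rounding.

```r
# Numbers copied from the summary(model2) output above
estimate  <- 0.23139
std_error <- 0.03532
df_resid  <- 1992

# t-statistic: the coefficient divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with 1992 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# With a single independent variable, the F-statistic equals t squared
f_stat <- t_stat^2

round(t_stat, 2)   # 6.55, matching the reported t value of 6.552
round(f_stat, 1)   # 42.9, matching the reported F-statistic of 42.93
p_value            # a very small number, far below 0.001
```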

We can show the regression line from model2 just like we did with our first model.

plot(
  PctUnemployed ~ population, data = communities,
  xlab = "Adults living in cities (%)",
  ylab = "Unemployment (%)",
  frame.plot = FALSE,
  pch = 20,
  col = "LightSkyBlue"
)

abline(model2, lwd = 2, col = "red")

So we now have two models! Often, we will want to compare two estimated models side-by-side. We might want to say how the coefficients for the independent variables we included differ in model1 and model2, for example. Or we may want to ask: Does model2 offer a better fit than model1?

It is often useful to print the salient details from the estimated models side-by-side. We can do this by using the screenreg() function.

screenreg(list(model1, model2))

              Model 1      Model 2    
(Intercept)      7.90 ***    35.02 ***
                (0.65)       (0.49)   
PctNotHSGrad     0.74 ***             
population                    0.23 ***
R^2              0.55         0.02    
Adj. R^2         0.55         0.02    
Num. obs.     1994         1994       
RMSE            13.52        20.01    
*** p < 0.001, ** p < 0.01, * p < 0.05

What does this table tell us?

  • The first column replicates the results from our first model. We can see that a one point increase in the percentage of citizens without high-school education is associated with an increase of 0.74 percentage points of unemployment, on average.
  • The second column gives us the results from the second model. Here, a one point increase in the percentage of citizens who live in cities is associated with an increase of 0.23 percentage points of unemployment, on average.
  • We can also compare the \(R^2\) values from the two models. The \(R^2\) for model1 is 0.55 and for model2 is 0.02. This suggests that the model with PctNotHSGrad as the explanatory variable explains about 55.30% of the variation in unemployment. The model with population as the explanatory variable, on the other hand, explains just 2.11% of the variation in unemployment.
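To make the fit statistics in the table less of a black box, the sketch below computes \(R^2\), the adjusted \(R^2\) and the residual standard error by hand and compares them with what summary() stores. It uses R's built-in cars dataset as a stand-in (an assumption so that the example runs on its own), and the object names are illustrative. The RMSE row appears to correspond to the residual standard error: for model2 both are reported as 20.01.

```r
# Built-in 'cars' data as a stand-in (not the communities data)
m <- lm(dist ~ speed, data = cars)

rss <- sum(residuals(m)^2)                    # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
n   <- nrow(cars)                             # number of observations
k   <- 1                                      # number of independent variables

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
rse    <- sqrt(rss / (n - k - 1))             # residual standard error

# Each comparison should report TRUE
all.equal(r2, summary(m)$r.squared)
all.equal(adj_r2, summary(m)$adj.r.squared)
all.equal(rse, summary(m)$sigma)
```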

Finally, and this is something that might help with your coursework, let’s save the same output as a Microsoft Word document using htmlreg().

htmlreg(list(model1, model2), file = "regression_model.doc")

If you’re using a Mac, you might want to save the file as .html if the Word document isn’t formatted correctly.

htmlreg(list(model1, model2), file = "regression_model.html")

4.1.3 Fitted values

Once we have estimated a regression model, we can use that model to produce fitted values. Fitted values represent our “best guess” for the value of our dependent variable for a specific value of our independent variable.

Let’s calculate the fitted values manually and then we’ll show you how to do it in R. The fitted value formula is:

\[\hat{Y}_{i} = \hat{\beta}_0 + \hat{\beta}_1 * X_i\]

Let’s say that, on the basis of model1 we would like to know what the unemployment rate is likely to be for a community where the percentage of adults without a high-school education is equal to 10%. We can substitute in the relevant coefficients from model1 and the value for our X variable (10 in this case), and we get:

\[\hat{Y}_{i} = 7.9 + 0.74 * 10 = 15.3\]

To calculate fitted values in R, we use the predict() function.

The predict function takes two main arguments.

  • object: The model object that we would like to use to produce fitted values. Here, we would like to base the analysis on model1, so we specify object = model1.
  • newdata: This is an optional argument which we use to specify the values of our independent variable(s) that we would like fitted values for. If we leave this argument empty, R will automatically calculate fitted values for all of the observations in the data that we used to estimate the original model. If we include this argument, we need to provide a data.frame which has a variable with the same name as the independent variable in our model. Here, we specify newdata = data.frame(PctNotHSGrad = 10), as we would like the fitted value for a community where 10% of adults did not complete high-school.

predict(model1, newdata = data.frame(PctNotHSGrad = 10))

This is the same as the result we obtained when we calculated the fitted value manually. The good thing about the predict() function, however, is that we will be able to use it for all the models we study on this course, and it can be useful for calculating many different fitted values.
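The equivalence between the manual calculation and predict() can be sketched on any fitted model. The example below again uses R's built-in cars dataset as a stand-in, with speed = 10 as an arbitrary value of the independent variable; the object names manual, via_predict and many are chosen for illustration.

```r
# Built-in 'cars' data as a stand-in (not the communities data)
m <- lm(dist ~ speed, data = cars)

# Manual fitted value at speed = 10: intercept + slope * 10
manual <- coef(m)[["(Intercept)"]] + coef(m)[["speed"]] * 10

# Same value via predict(); the column name must match the model's variable
via_predict <- predict(m, newdata = data.frame(speed = 10))

# Several fitted values at once: one row of newdata per prediction
many <- predict(m, newdata = data.frame(speed = c(5, 10, 15)))

# Should report TRUE
all.equal(manual, unname(via_predict))
```

Passing a data.frame with several rows is what makes predict() convenient when we want fitted values across a range of the independent variable.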

4.1.4 Additional Resources

4.1.5 Exercises

  1. Create a new file called assignment4.R in your PUBL0055 folder and write all the solutions in it.
  2. Load the non-western foreigners dataset from week 2.
  3. Estimate a model that explains subjective number of immigrants per 100 British citizens using only one independent variable. Justify your choice. (You do not have to pick the best variable but try to make a reasonable argument why more of x should lead to more/less of y).
  4. Plot a scatterplot of the relationship and add the regression line to the plot.
  5. Interpret the regression output and try to imagine that you are communicating your results to someone who does not know anything about statistics.
  6. Estimate another model (i.e. choose a different independent variable) on the same dependent variable. Justify the choice.
  7. Interpret the new regression output.
  8. Compare the two models and explain which one you would choose.
  9. Produce a table with both models next to each other in some text document. You can use texreg from the seminar, do it manually, or use something else.
  10. Consider the following table. This analysis asks whether individuals who have spent longer in education have higher yearly earnings. The analysis is based on a sample of 300 individuals. The dependent variable in this analysis is the yearly income of the individual in UK pounds (earnings). The independent variable measures the number of years the individual spent in full-time education (education).

                 Model 1     
    (Intercept)   3663.85    
    education     1270.81 ***
    R^2              0.17    
    Adj. R^2         0.17    
    Num. obs.      300       
    RMSE         14018.52    
    *** p < 0.001, ** p < 0.01, * p < 0.05
    1. Interpret the coefficient on the education variable.
    2. Using the values given in the table, calculate the test statistic.
    3. Can we reject the null hypothesis of no effect at the 95% confidence level? (Just looking at the stars is not sufficient here! How can we work out the result of the hypothesis test?)
  11. Save your script, which should now include the answers to all the exercises.
  12. Source your script, i.e. run the entire script all at once. Fix the script if you get any error messages.