10 Binary Dependent Variable Models

10.1 Overview

In the lecture this week we covered models for binary dependent variables. We learned about the advantages and disadvantages of the linear probability model as a method for analysing dichotomous outcome variables. We saw that while linear regression makes it easy to interpret the marginal effects of our explanatory variables on the probability of our outcome variable, the fitted values that linear regression produces can be problematic, as they can be greater than 1 or less than 0. We then learned about an alternative model - the binary logistic regression model - which, rather than modelling the probability of the outcome directly, instead describes the effects of our explanatory variables on the log-odds of our outcome occurring. We saw how to interpret the coefficients for such a model, and noted that the magnitudes of the effects of our X variables are a little hard to interpret given that they are expressed on the log-odds scale. As a consequence, we saw that a more straightforward way of interpreting the results of these models is to calculate predicted probabilities from the model for various values of our X variables. Finally, we saw that in terms of interpretation, the logistic regression model’s standard errors, test-statistics and p-values are all very similar to those we have studied for the linear regression model.
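
A quick way to see why modelling the log-odds avoids the bounded-outcome problem is to note that the inverse logit function (available in base R as plogis()) maps any value on the log-odds scale into the (0, 1) probability interval:

```r
# The inverse logit function, p = exp(x) / (1 + exp(x)), converts
# log-odds into probabilities; in base R it is plogis()
plogis(0)   # log-odds of 0 correspond to a probability of 0.5
plogis(5)   # large positive log-odds get close to, but never reach, 1
plogis(-5)  # large negative log-odds get close to, but never reach, 0
```

However extreme the value on the log-odds scale, the resulting probability always stays strictly between 0 and 1.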

In seminar this week, we will:

  1. Implement some binary logistic regression models
  2. Interpret the resulting coefficients
  3. Calculate some fitted probabilities

Before coming to the seminar

  1. Please read…

10.2 Seminar

In a paper entitled “The Spousal Bump: Do Cross-Ethnic Marriages Increase Political Support in Multiethnic Democracies?”, a group of political scientists investigate whether African politicians are able to increase their electoral prospects by emphasising their partners’ ethnicity. As ethnicity is a key driver of vote choice in many African countries, the authors suggest that by appealing to a coethnic bond through their spouse, presidential candidates are able to send credible signals of multi-ethnic coalition building before an election. They therefore suggest that cross-ethnic marriages can be used as a tool by political leaders to shore up support in multi-ethnic elections.

We will be using data from the Afrobarometer (a public attitude survey on democracy and governance in more than 35 countries in Africa) to investigate whether a political candidate can utilize his wife’s ethnicity to garner coethnic support.1 We will focus only on African democracies where the president and wife are not of the same ethnicity (i.e., the president and wife are not coethnic with one another). We will investigate whether voters are more likely to vote for the president when the voter shares the same ethnicity as the president’s wife.

The data file for this seminar is afb_class.csv, which is a CSV file. Store this file in your data folder as you have done in previous weeks. Then load the data into R. You will also need the texreg package again this week.

library(texreg) 
afb <- read.csv("data/afb_class.csv")

The table below gives an overview of the variables included in the data.

Variable Description
country A character variable indicating the country of the respondent
wifecoethnic 1 if respondent is same ethnicity as president’s wife, and 0 otherwise
oppcoethnic 1 if respondent is same ethnicity as main presidential opponent, and 0 otherwise
ethnicpercent Respondent’s ethnic group fraction in respondent country
distance Distance between respondent’s home and the home city of the president (measured in hundreds of miles)
vote 1 if respondent would vote for the president, 0 otherwise

10.2.1 Linear regression with a Binary Dependent Variable

Before moving on to the new model, we can illustrate some of the shortcomings of the linear regression model when working with binary outcome variables.

Question 1

Run a linear regression model (here, a linear probability model) with vote as the dependent variable and distance as the only independent variable. Interpret the coefficient on the distance variable.

Reveal answer

linear_model <- lm(vote ~ distance, data = afb)
screenreg(linear_model)
## 
## ========================
##              Model 1    
## ------------------------
## (Intercept)     1.33 ***
##                (0.01)   
## distance       -0.06 ***
##                (0.00)   
## ------------------------
## R^2             0.70    
## Adj. R^2        0.70    
## Num. obs.    4552       
## ========================
## *** p < 0.001; ** p < 0.01; * p < 0.05

The model shows that there is a strong and significant bivariate relationship between distance and vote choice for the president. Specifically, the model suggests that increasing the distance between the respondent and the president’s home city by 100 miles decreases the probability that the respondent will vote for the president by 6 percentage points on average.

Question 2

As discussed in lecture, the linear probability model (LPM) can lead to some odd conclusions with regard to fitted values. Plot the two variables used in the regression above, and add the estimated regression line. What does this plot tell you about the limitations of the LPM?

Reveal answer

plot(
  vote ~ distance, 
  data = afb,
  col = "gray",
  pch = 1,
  xlab = "Distance", 
  ylab = "Vote for the president",
  ylim = c(-.5, 1.5),
  frame.plot = FALSE
)

abline(linear_model, col = "red", lwd = 2)

Because the functional form of the regression model is linear, the estimated relationship suggests that respondents with a distance value greater than about 23 have a negative predicted probability of voting for the president, and respondents with a distance value less than about 5 have a predicted probability of voting for the president that is greater than 1. This is clearly unsatisfactory, as probabilities must always lie between 0 and 1!
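
We can sketch where these thresholds come from with a quick back-of-the-envelope calculation, using the rounded coefficients reported in the regression table above (small differences from the values quoted in the text are due to rounding):

```r
# Fitted values from the LPM are p_hat = b0 + b1 * distance; here we use
# the rounded coefficients from the table above (an approximation)
b0 <- 1.33
b1 <- -0.06

-b0 / b1       # distance at which the fitted value crosses 0: ~22.2
(1 - b0) / b1  # distance at which the fitted value crosses 1: ~5.5
```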

10.2.2 Logistic Regression Model

To overcome the issue revealed above, we can use a different regression model: the logistic regression model. To estimate this model, we use the generalized linear model function glm(). The syntax is very similar to the lm regression function that we are already familiar with, but there is an additional argument that we need to specify (the family argument) in order to tell R that we would like to estimate a logistic regression model.

Argument Description
formula Describes the relationship between the dependent and independent variables we wish to estimate, for example dependent.variable ~ independent.variable.
data The name of the dataset that contains the variable of interest.
family The family argument provides a description of the type of regression model we would like to estimate. For a binary logistic regression model we use family = binomial(link = "logit").

Question 3a

Estimate a logistic regression model using the afb data, where the outcome variable is vote and the explanatory variables are wifecoethnic and distance.

Reveal answer

logit_model <- glm(
  vote ~ wifecoethnic + distance,
  data = afb,
  family = binomial(link = "logit")
)
screenreg(logit_model)
## 
## ===========================
##                 Model 1    
## ---------------------------
## (Intercept)       11.66 ***
##                   (0.42)   
## wifecoethnic      -1.36 ***
##                   (0.17)   
## distance          -0.79 ***
##                   (0.03)   
## ---------------------------
## AIC             1364.65    
## BIC             1383.92    
## Log Likelihood  -679.33    
## Deviance        1358.65    
## Num. obs.       4552       
## ===========================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Question 3b

Interpret the output of the logistic regression model, focusing on the coefficients associated with the wifecoethnic and distance variables.

Reveal answer

Interpreting the output of a logistic regression model is less straightforward than for the linear model, because the coefficients no longer describe the effect of a unit change in X on Y. Instead, the direct interpretation of the coefficient is: a one unit change in X is associated with a \(\hat{\beta}\) change in the log-odds of Y, holding constant other variables. Here, the coefficient on wifecoethnic is equal to -1.36, implying that the log-odds of voting for the president are 1.36 lower when the respondent has the same ethnicity as the president’s wife, holding constant distance.
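
A common way to make such coefficients slightly more tangible is to exponentiate them, which turns differences in log-odds into odds ratios. This is just an illustration of the transformation applied to the rounded coefficient above; predicted probabilities remain the clearest way to convey substantive effects.

```r
# Exponentiating a logit coefficient converts it to an odds ratio
# (using the rounded coefficient on wifecoethnic from the table above)
exp(-1.36)  # ~0.26: coethnics of the wife have about a quarter of the
            # odds of voting for the president, holding distance constant
```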

The interpretation of the significance of the coefficients remains unchanged from the linear regression model. For example, the standard error for the coefficient on wifecoethnic is 0.17, and the test statistic is therefore -1.36/0.17 = -8. The absolute value of this test statistic is much greater than the critical value at any conventionally used \(\alpha\)-level, and so we can conclude that this result is statistically significant.

Differences in the log-odds, however, are difficult to interpret substantively. Therefore, the main approach to describing the substantive relationships that emerge from a logistic regression model is to calculate predicted probabilities.

10.2.3 Predicted probabilities

We can use the predict() function to calculate fitted values for the logistic regression model, just as we did for the linear model. Here, however, we need to take into account the fact that we model the log-odds that \(Y = 1\), rather than the probability that \(Y=1\). The predict() function will therefore, by default, give us predictions for Y on the log-odds scale. To get predictions on the probability scale, we need to add an additional argument to predict(): we set the type argument to type = "response".
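
To see what the type = "response" argument is doing, we can sketch the calculation by hand: predict() first computes the linear predictor on the log-odds scale, and type = "response" then passes it through the inverse logit function. The sketch below uses the rounded coefficients from the regression table above, so it will only approximately match predictions from the fitted model.

```r
# Linear predictor (log-odds scale) for a respondent with
# wifecoethnic = 1 and distance = 10, using the rounded coefficients
log_odds <- 11.66 + (-1.36 * 1) + (-0.79 * 10)
log_odds          # ~2.4

# The inverse logit converts the log-odds into a probability; this is
# what predict(..., type = "response") returns
plogis(log_odds)  # ~0.917
```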

Question 4

Calculate the predicted probability of voting for the president for a respondent who shares the same ethnicity as the president’s wife, and lives 1000 miles from the president’s home city. How does this predicted probability compare to the predicted probability of voting for the president for a respondent who is not coethnic with the president’s wife, and lives 1000 miles from the president’s home city?

Reveal answer

pred_prob_1 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 1, distance = 10), 
  type = "response"
)
pred_prob_1
##       1 
## 0.91698
pred_prob_2 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 0, distance = 10), 
  type = "response"
)
pred_prob_2
##         1 
## 0.9773743

Comparing the two predicted probabilities, the model tells us that for respondents who live 1000 miles from the president’s home city, sharing the ethnicity of the president’s wife decreases the probability of voting for the president by about 6 points:

pred_prob_1 - pred_prob_2
##          1 
## -0.0603943

Question 5

What do the results from question 4 tell you about the research question of interest here? Do they support the assertion that cross-ethnic marriages increase political support?

Reveal answer

They do not! Notice that this “finding” is the opposite of the prediction from the theory: respondents do not seem to be more likely to vote for the president if they share the same ethnicity as the president’s wife. Of course, here we are dealing with a very simple model with many possible confounding variables that we have not included in the model.

Question 6a

The logistic regression model implies a non-linear relationship between the X variables and the outcome. To see this more clearly, calculate the probability of voting for the president over the entire range of the distance variable. Provide a plot with distance on the X-axis, and the predicted probabilities on the Y-axis. Interpret your results

Reveal answer

## Set the values for the explanatory variables
wifecoethnic_profiles <- data.frame(
  distance = seq(from = 0, to = 34, by = .5),
  wifecoethnic = 1
)
head(wifecoethnic_profiles)
##   distance wifecoethnic
## 1      0.0            1
## 2      0.5            1
## 3      1.0            1
## 4      1.5            1
## 5      2.0            1
## 6      2.5            1

Here, we have set the distance variable to vary between 0 and 34, with increments of .5 units and we have set wifecoethnic to be equal to 1. We have then put all of these values into a new data.frame called wifecoethnic_profiles which we will pass to the predict() function.

wifecoethnic_profiles$predicted_probs <- predict(
  logit_model, newdata = wifecoethnic_profiles, 
  type = "response"
)

Finally, we can plot these values:

plot(
  predicted_probs ~ distance, 
  data = wifecoethnic_profiles,
  xlab = "Distance",
  ylab = "Probability of voting for the president", 
  col = "gray", 
  type = "l", # type = "l" will produce a line plot, rather than the default scatter plot
  frame.plot = FALSE, 
  lwd = 3 # lwd = 3 will increase the thickness of the line on the plot
)

The plot nicely illustrates the non-linear functional form of the logistic regression model. As desired, all of the predicted probabilities now vary between 0 and 1, and the line takes on a distinctive “S” shape. It is clear from this plot that X (distance) is non-linearly related to the probability that \(Y=1\) (\(P(Y = 1) = \pi\)): the same change in X results in different changes in \(\pi\) depending on which values of X we consider. For example:

  • Increasing distance from 5 to 10 leads to a decrease in \(\pi\) of only a very small amount
  • Increasing distance from 10 to 15 leads to a decrease in \(\pi\) of a very large amount

This is why we are unable to interpret the \(\beta\) coefficients from the logistic model as constant increases or decreases in \(\pi\) given a change in X. For any given change in X, the amount that \(\pi\) will change will depend on the starting value of X that we are considering.
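
This point can be verified numerically. The sketch below uses the rounded coefficients from the regression table above and holds wifecoethnic at 1: the same 5-unit change in distance produces very different changes in the predicted probability depending on where on the curve we start.

```r
# Approximate predicted probability for wifecoethnic = 1 at a given
# distance, using the rounded coefficients from the table above
p_hat <- function(distance) plogis(11.66 - 1.36 - 0.79 * distance)

p_hat(5) - p_hat(10)   # small change: both probabilities are near 1
p_hat(10) - p_hat(15)  # much larger change on the steep part of the curve
```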

Question 6b

Calculate the difference in predicted probability between respondents who share the ethnic group of the president’s wife and respondents who come from different ethnic groups than the president’s wife, at different values of the distance variable. Calculate this difference for distance = 10 and distance = 12:

Reveal answer

coethnic_dist_10 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 1, distance = 10), 
  type = "response"
)

not_coethnic_dist_10 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 0, distance = 10), 
  type = "response"
)

coethnic_dist_10 - not_coethnic_dist_10
##          1 
## -0.0603943
coethnic_dist_12 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 1, distance = 12), 
  type = "response"
)

not_coethnic_dist_12 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 0, distance = 12), 
  type = "response"
)

coethnic_dist_12 - not_coethnic_dist_12
##          1 
## -0.2042512

Again, these results reveal the non-linear nature of the logistic regression model. The results indicate that, when comparing respondents with values of 10 on the distance variable, coethnics of the president’s wife are 6 points less likely to vote for the president than non-coethnics. However, when comparing respondents with values of 12 on the distance variable, coethnics are 20 points less likely to vote for the president. This illustrates that in a multiple logistic regression model, the change in \(\pi\) in response to exactly the same change in one X variable depends on the values at which the other X variables are fixed.


  1. All the candidates in this dataset are male, and all have female partners.↩︎