# 10 Binary Dependent Variable Models

## 10.1 Overview

In the **lecture** this week we covered models for binary dependent variables. We learned about the advantages and disadvantages of the linear probability model as a method for analysing dichotomous outcome variables. We saw that while linear regression makes it easy to interpret the marginal effects of our explanatory variables on the probability of our outcome variable, the fitted values that linear regression produces can be problematic, as they can be greater than 1 or less than 0. We then learned about an alternative model - the binary logistic regression model - which, rather than modelling the probability of the outcome directly, instead describes the effects of our explanatory variables on the *log-odds* of our outcome occurring. We saw how to interpret the coefficients of such a model, and noted that the magnitudes of the effects of our X variables are a little hard to interpret given that they are expressed on the log-odds scale. As a consequence, we saw that a more straightforward way of interpreting the results of these models is to calculate predicted probabilities from the model for various values of our X variables. Finally, we saw that in terms of interpretation, the logistic regression model’s standard errors, test statistics and p-values are all very similar to those we have studied for the linear regression model.

In **seminar** this week, we will:

- Implement some binary logistic regression models
- Interpret the resulting coefficients
- Calculate some fitted probabilities

**Before coming to the seminar**

- Please read…

## 10.2 Seminar

In a paper entitled “The Spousal Bump: Do Cross-Ethnic Marriages Increase Political Support in Multiethnic Democracies?”, a group of political scientists investigate whether African politicians are able to increase their electoral prospects by emphasising their partners’ ethnicity. As ethnicity is a key driver of vote choice in many African countries, the authors suggest that by appealing to a coethnic bond through their spouse, presidential candidates are able to send credible signals of multi-ethnic coalition building before an election. Cross-ethnic marriages, they argue, can therefore be used as a tool by political leaders to shore up support in multi-ethnic elections.

We will be using data from the Afrobarometer (a public attitude survey on democracy and governance in more than 35 countries in Africa) to investigate whether a political candidate can utilize his wife’s ethnicity to garner coethnic support.^{2} We will focus only on African democracies where the president and wife are not of the same ethnicity (i.e., the president and wife are not coethnic with one another). We will investigate whether voters are more likely to vote for the president when the voter shares the same ethnicity as the president’s wife.

The data file for this seminar is `afb_class.csv`, which is a CSV file. Store this file in your `data` folder as you have done in previous weeks. Then load the data into R. You will also need the `texreg` package again this week.

The table below gives an overview of the variables included in the data.

Variable | Description |
---|---|
`country` | A character variable indicating the country of the respondent |
`wifecoethnic` | `1` if respondent is same ethnicity as president’s wife, and `0` otherwise |
`oppcoethnic` | `1` if respondent is same ethnicity as main presidential opponent, and `0` otherwise |
`ethnicpercent` | Respondent’s ethnic group fraction in respondent’s country |
`distance` | Distance between respondent’s home and the home city of the president (measured in hundreds of miles) |
`vote` | `1` if respondent would vote for the president, `0` otherwise |

### 10.2.1 Linear regression with a Binary Dependent Variable

Before moving on to the new model, we can illustrate some of the shortcomings of the linear regression model when working with binary outcome variables.

**Question 1**

Run a linear regression model (here, a *linear probability model*) with `vote` as the dependent variable and `distance` as the only independent variable. Interpret the coefficient on the `distance` variable.

## Reveal answer

```
##
## ========================
## Model 1
## ------------------------
## (Intercept) 1.33 ***
## (0.01)
## distance -0.06 ***
## (0.00)
## ------------------------
## R^2 0.70
## Adj. R^2 0.70
## Num. obs. 4552
## ========================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```

The model shows that there is a strong and significant bivariate relationship between distance and vote choice for the president. Specifically, the model suggests that increasing the distance between the respondent and the president’s home city by 100 miles decreases the probability that the respondent will vote for the president by 6 percentage points on average.

**Question 2**

As discussed in lecture, the linear probability model (LPM) can lead to some odd conclusions with regard to fitted values. Plot the two variables used in the regression above, and add the estimated regression line. What does this plot tell you about the limitations of the LPM?

## Reveal answer

```
# Fit the linear probability model from Question 1
linear_model <- lm(vote ~ distance, data = afb)

plot(
  vote ~ distance,
  data = afb,
  col = "gray",
  pch = 1,
  xlab = "Distance",
  ylab = "Vote for the president",
  ylim = c(-.5, 1.5),
  frame.plot = FALSE
)
abline(linear_model, col = "red", lwd = 2)
```

Because the functional form of the regression model is linear, the estimated relationship suggests that respondents with a `distance` value greater than about 23 have a *negative* probability of voting for the president, and respondents with a `distance` value less than about 5 have a probability of voting for the president that is *greater* than 1. This is clearly unsatisfactory, as probabilities must always be between 0 and 1!
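These boundary-crossing fitted values can be checked directly from the regression output. A minimal sketch, using the rounded estimates from the table above (intercept 1.33, slope -0.06), so the exact cut-points differ slightly from those quoted:

```r
# Fitted values from the reported LPM: P(vote) = 1.33 - 0.06 * distance
b0 <- 1.33
b1 <- -0.06
lpm_fitted <- function(d) b0 + b1 * d

lpm_fitted(0)   # greater than 1: an impossible "probability"
lpm_fitted(30)  # less than 0: also impossible

# Distances at which the fitted line crosses 1 and 0
(1 - b0) / b1   # line crosses 1 at distance 5.5
-b0 / b1        # line crosses 0 at distance ~22.2
```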

### 10.2.2 Logistic Regression Model

To overcome the issue revealed above, we can use a different regression model: the logistic regression model. To estimate this model, we use the generalized linear model function `glm()`. The syntax is very similar to the `lm` regression function that we are already familiar with, but there is an additional argument that we need to specify (the `family` argument) in order to tell R that we would like to estimate a logistic regression model.

Argument | Description |
---|---|
`formula` | Describes the relationship between the dependent and independent variables we wish to estimate, for example `dependent.variable ~ independent.variable`. |
`data` | The name of the dataset that contains the variables of interest. |
`family` | The `family` argument provides a description of the type of regression model we would like to estimate. For a binary logistic regression model we use `family = binomial(link = "logit")`. |

**Question 3a**

Estimate a logistic regression model using the `afb` data, where the outcome variable is `vote` and the explanatory variables are `wifecoethnic` and `distance`.

## Reveal answer

```
library(texreg) # needed for screenreg()

logit_model <- glm(
  vote ~ wifecoethnic + distance,
  data = afb,
  family = binomial(link = "logit")
)
screenreg(logit_model)
```

```
##
## ===========================
## Model 1
## ---------------------------
## (Intercept) 11.66 ***
## (0.42)
## wifecoethnic -1.36 ***
## (0.17)
## distance -0.79 ***
## (0.03)
## ---------------------------
## AIC 1364.65
## BIC 1383.92
## Log Likelihood -679.33
## Deviance 1358.65
## Num. obs. 4552
## ===========================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```

**Question 3b**

Interpret the output of the logistic regression model, focusing on the coefficients associated with the `wifecoethnic` and `distance` variables.
## Reveal answer

Interpreting the output of a logistic regression model is less straightforward than for the linear model, because the coefficients no longer describe the effect of a unit change in X on Y. Instead, the direct interpretation of a coefficient is: a one unit change in X is associated with a \(\hat{\beta}\) change in the log-odds of Y, holding constant other variables. Here, the coefficient on `wifecoethnic` is equal to -1.36, implying that the log-odds of voting for the president are 1.36 *lower* when the respondent has the same ethnicity as the president’s wife, holding constant distance.

The interpretation of the significance of the coefficients remains unchanged from the linear regression model. For example, the standard error for the coefficient on `wifecoethnic` is 0.17, and the test statistic is therefore -1.36/0.17 = -8. The absolute value of this test statistic is much greater than the critical value for any conventionally used \(\alpha\)-level, and so we can be confident that this result is statistically significant.

Differences in the log-odds, however, are difficult to interpret substantively. Therefore, the main approach to describing the substantive relationships that emerge from a logistic regression model is to calculate predicted probabilities.
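Before turning to predicted probabilities, one common intermediate step is to exponentiate a coefficient, which converts it from the log-odds scale to an odds ratio. A quick sketch using the rounded estimates reported above:

```r
# Coefficient and standard error for wifecoethnic (rounded, from the output)
b_wife <- -1.36
se_wife <- 0.17

# Exponentiating a log-odds coefficient gives an odds ratio:
# the odds of voting for the president are about a quarter as large
# for coethnics of the president's wife, holding distance constant
exp(b_wife)  # ~0.26

# The test statistic discussed above: estimate / standard error
b_wife / se_wife  # -8
```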

### 10.2.3 Predicted probabilities

We can use the `predict()` function to calculate fitted values for the logistic regression model, just as we did for the linear model. Here, however, we need to take into account the fact that we model the *log-odds* that \(Y = 1\), rather than the *probability* that \(Y=1\). The `predict()` function will therefore, by default, give us predictions for Y on the log-odds scale. To get predictions on the probability scale, we need to add an additional argument to `predict()`: we set the `type` argument to `type = "response"`.
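Under the hood, `type = "response"` simply applies the inverse logit (logistic) function to the linear predictor. A sketch using `plogis()`, R’s built-in inverse logit, with the rounded coefficients from the output above (so the result matches the `predict()` values only approximately):

```r
# Rounded coefficients from the fitted model output
b0 <- 11.66
b_wife <- -1.36
b_dist <- -0.79

# Linear predictor (log-odds) for a coethnic respondent at distance = 10
eta <- b0 + b_wife * 1 + b_dist * 10

# plogis() is the inverse logit: exp(eta) / (1 + exp(eta))
plogis(eta)  # roughly 0.917, close to predict(..., type = "response")
```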

**Question 4** Calculate the predicted probability of voting for the president for a respondent who shares the same ethnicity as the president’s wife, and lives 1000 miles from the president’s home city. How does this predicted probability compare to the predicted probability of voting for the president for a respondent who *is not* coethnic with the president’s wife, and lives 1000 miles from the president’s home city?

## Reveal answer

```
pred_prob_1 <- predict(
logit_model,
newdata = data.frame(wifecoethnic = 1, distance = 10),
type = "response"
)
pred_prob_1
```

```
## 1
## 0.91698
```

```
pred_prob_2 <- predict(
logit_model,
newdata = data.frame(wifecoethnic = 0, distance = 10),
type = "response"
)
pred_prob_2
```

```
## 1
## 0.9773743
```

Comparing the two predicted probabilities, the model tells us that for respondents who live 1000 miles from the president’s home city, sharing the ethnicity of the president’s wife *decreases* the probability of voting for the president by about 6 percentage points:

```
## 1
## -0.0603943
```

**Question 5**

What do the results from question 4 tell you about the research question of interest here? Do they support the assertion that cross-ethnic marriages increase political support?

## Reveal answer

They do not! Notice that this “finding” is the opposite of the prediction from the theory: respondents do not seem to be more likely to vote for the president if they share the same ethnicity as the president’s wife. Of course, here we are dealing with a very simple model with many possible confounding variables that we have not included in the model.

**Question 6a**

The logistic regression model implies a *non-linear* relationship between the X variables and the outcome. To see this more clearly, calculate the probability of voting for the president over the entire range of the `distance` variable. Provide a plot with `distance` on the X-axis and the predicted probabilities on the Y-axis. Interpret your results.

## Reveal answer

```
## Set the values for the explanatory variables
wifecoethnic_profiles <- data.frame(
distance = seq(from = 0, to = 34, by = .5),
wifecoethnic = 1
)
head(wifecoethnic_profiles)
```

```
## distance wifecoethnic
## 1 0.0 1
## 2 0.5 1
## 3 1.0 1
## 4 1.5 1
## 5 2.0 1
## 6 2.5 1
```

Here, we have set the `distance` variable to vary between 0 and 34, with increments of .5 units, and we have set `wifecoethnic` to be equal to 1. We have then put all of these values into a new `data.frame` called `wifecoethnic_profiles`, which we will pass to the `predict()` function.

```
wifecoethnic_profiles$predicted_probs <- predict(
logit_model, newdata = wifecoethnic_profiles,
type = "response"
)
```

Finally, we can plot these values:

```
plot(
predicted_probs ~ distance,
data = wifecoethnic_profiles,
xlab = "Distance",
ylab = "Probability of voting for the president",
col = "gray",
type = "l", # type = "l" will produce a line plot, rather than the default scatter plot
frame.plot = FALSE,
lwd = 3 # lwd = 3 will increase the thickness of the line on the plot
)
```

The plot nicely illustrates the non-linear functional form of the logistic regression model. As desired, all of the predicted probabilities now vary between 0 and 1, and the line takes on a distinctive “S” shape. It is clear from this plot that X (`distance`) is non-linearly related to the probability that \(Y=1\) (\(P(Y = 1) = \pi\)): the same change in X results in different changes in \(\pi\) depending on which values of X we consider. For example:

- Increasing `distance` from 5 to 10 leads to a decrease in \(\pi\) of only a very small amount
- Increasing `distance` from 10 to 15 leads to a decrease in \(\pi\) of a very large amount

This is why we are unable to interpret the \(\beta\) coefficients from the logistic model as constant increases or decreases in \(\pi\) given a change in X. For any given change in X, the amount that \(\pi\) will change will depend on the starting value of X that we are considering.
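The point above can be verified numerically with the rounded coefficients from the model output: an identical 5-unit change in `distance` produces very different changes in \(\pi\) depending on the starting point. A sketch, holding `wifecoethnic` at 1:

```r
# Rounded coefficients from the fitted model output
b0 <- 11.66
b_wife <- -1.36
b_dist <- -0.79

# Predicted probability for a coethnic respondent at a given distance
pi_hat <- function(d) plogis(b0 + b_wife * 1 + b_dist * d)

pi_hat(5) - pi_hat(10)   # small decrease: both points sit on the flat upper arm
pi_hat(10) - pi_hat(15)  # much larger decrease: this range spans the steep middle
```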

**Question 6b**

Calculate the difference in predicted probability between respondents who share the ethnic group of the president’s wife and respondents who come from different ethnic groups than the president’s wife, *at different values of the distance variable*. Calculate this difference for `distance = 10` and `distance = 12`:

## Reveal answer

```
coethnic_dist_10 <- predict(
logit_model,
newdata = data.frame(wifecoethnic = 1, distance = 10),
type = "response"
)
not_coethnic_dist_10 <- predict(
logit_model,
newdata = data.frame(wifecoethnic = 0, distance = 10),
type = "response"
)
coethnic_dist_10 - not_coethnic_dist_10
```

```
## 1
## -0.0603943
```

```
coethnic_dist_12 <- predict(
logit_model,
newdata = data.frame(wifecoethnic = 1, distance = 12),
type = "response"
)
not_coethnic_dist_12 <- predict(
logit_model,
newdata = data.frame(wifecoethnic = 0, distance = 12),
type = "response"
)
coethnic_dist_12 - not_coethnic_dist_12
```

```
## 1
## -0.2042512
```

Again, these results reveal the non-linear nature of the logistic regression model. The results indicate that, when comparing respondents with values of 10 on the distance variable, coethnics of the president’s wife are 6 percentage points less likely to vote for the president than non-coethnics. However, when comparing respondents with values of 12 on the distance variable, coethnics are 20 points less likely to vote for the president. This illustrates that in a multiple logistic regression model, the change in \(\pi\) in response to exactly the same change in one X variable depends on the values at which the other X variables are fixed.

^{2} All the candidates in this dataset are male, and all have female partners.