9 Binary Dependent Variable Models


9.1 Seminar

Create a new script with the following lines at the top, and save it as seminar9.R

rm(list = ls())
setwd("~/PUBL0055")

9.1.1 Required Packages

Let’s start by loading the required packages:

library(texreg) 

9.1.2 Loading Data

We will be using data from the Afrobarometer (a public attitude survey on democracy and governance in more than 35 countries in Africa) to investigate whether a political candidate can utilize his wife’s ethnicity to garner coethnic support. This is linked to a broad research theory which implies that voters in Africa prefer to vote for candidates of their own ethnic group, and that if a presidential candidate marries a non-coethnic wife, this may broaden their electoral appeal to other ethnic groups.

We will focus only African democracies where the president and wife are not of the same ethnicity are considered (i.e., the president and wife are not coethnic with one another). We will investigate whether voters are more likely to vote for the a president when the voter shares the same ethnicity as the president’s wife.

First, let’s load the data:

afb <- read.csv("afb_class.csv")

The table below gives an overview of the variables included in the data.

Variable Description
country A character variable indicating the country of the respondent
wifecoethnic 1 if respondent is same ethnicity as president’s wife, and 0 otherwise
oppcoethnic 1 if respondent is same ethnicity as main presidential opponent, and 0 otherwise
ethnicpercent Respondent’s ethnic group fraction in respondent country
distance Distance between respondent’s home and the home city of the president (measured in hundreds of miles)
vote 1 if respondent would vote for the president, 0 otherwise

Now take a look at the first few observations to see what the dataset looks like

head(afb)
  country wifecoethnic oppcoethnic ethnicpercent vote  distance
1   benin            1           1     0.4173623    1  5.449764
2   benin            1           1     0.4173623    0 18.510687
3   benin            1           1     0.4173623    0 26.370297
4   benin            1           1     0.4173623    0 21.012341
5   benin            1           1     0.4173623    0 22.238316
6   benin            1           1     0.4173623    1  1.822167

9.1.3 Linear regression with a Binary Dependent Variable

Before moving on to the new model, we can illustrate some of the shortcomings of the linear regression model when working with binary outcome variables.

Let’s run a linear regression (here, a linear probability model) where vote is our dependent variable, and distance is our independent variable. Try doing this yourself before revealing the solution code below.

linear_model <- lm(vote ~ distance, data = afb)
screenreg(linear_model)

========================
             Model 1    
------------------------
(Intercept)     1.33 ***
               (0.01)   
distance       -0.06 ***
               (0.00)   
------------------------
R^2             0.70    
Adj. R^2        0.70    
Num. obs.    4552       
RMSE            0.27    
========================
*** p < 0.001, ** p < 0.01, * p < 0.05

The model shows that there is a strong and significant bivariate relationship between distance and vote choice for the president. Specifically, the model suggest that increasing the distance between the respondent and the president’s home city by 100 miles decreases the probability that the respondent will vote for the president by 6 percentage points on average.

As we discussed in lecture, however, the linear probability model can lead to some odd conclusions with regard to fitted values. Let’s plot the two variables above, and include the estimated regression line.

plot(
  vote ~ distance, 
  data = afb,
  col = "LightSkyBlue",
  xlab = "Distance", 
  ylab = "Vote for the president",
  ylim = c(-.5, 1.5),
  frame.plot = FALSE
)

abline(linear_model, col = "red")

As we can see from the plot, because the functional form of the linear probability model is linear, the estimated relationsip suggests that for respondents with a distance value greater than about 23 have a negative probability of voting for the president, and respondents with a distance value less than about 5 have a probability of voting for the president that is greater than 1. This is clearly unsatisfactory, as probabilities must always be between 0 and 1!

Now create a plot of the residuals from the model on the Y-axis, and the distance variable on the X-axis. Try this yourself before revealing the code below.

plot(
  x = afb$distance, 
  y = residuals(linear_model), 
  col = "LightSkyBlue",
  xlab = "Distance", 
  ylab = "Residuals",
  frame.plot = FALSE
)

abline(h = 0, col = "red")

We can see from this plot that the linear probability model is always heteroskedastic – the errors are not evenly distributed around zero for all values of distance.

Because of these problems, we now move to implement a limited dependent varible model – the logistic regression model – instead.

9.1.4 Logistic Regression Model

We use the generalized linear model function glm() to estimate a logistic regression. The syntax is very similar to the lm regression function that we are already familiar with, but there is an additional argument that we need to specify (the family argument) in order to tell R that we would like to estimate a logistic regression model.

Argument Description
formula As before, the formula describes the relationship between the dependent and independent variables, for example dependent.variable ~ independent.variable
In our case, we will use the formula: vote ~ wifecoethnic + distance
data Again as before, this is simply the name of the dataset that contains the variable of interest. In our case, this is the dataset called afb.
family The family argument provides a description of the error distribution and link function to be used in the model. For our purposes, we would like to estimate a binary logistic regression model and so we set family = binomial(link = "logit")

We tell glm() that we’ve binary dependent variable and we want to use the logistic link function using the family = binomial(link = "logit") argument:

logit_model <- glm(
  vote ~ wifecoethnic + distance,
  data = afb,
  family = binomial(link = "logit")
)
screenreg(logit_model)

===========================
                Model 1    
---------------------------
(Intercept)       11.66 ***
                  (0.42)   
wifecoethnic      -1.36 ***
                  (0.17)   
distance          -0.79 ***
                  (0.03)   
---------------------------
AIC             1364.65    
BIC             1383.92    
Log Likelihood  -679.33    
Deviance        1358.65    
Num. obs.       4552       
===========================
*** p < 0.001, ** p < 0.01, * p < 0.05

Interpreting the output of a logistic regression model is less straightforward than for the linear model, because the coefficients no longer describe the effect of a unit change in X on Y. Instead, the direct interpretation of the coefficient is: a one unit change in X is associated with a \(\hat{\beta}\) change in the log-odds of Y, holding constant other variables. Here, the coefficient on wifecoethnic is equal to -1.36, implying that the log-odds of voting for the president are 1.36 lower when the respondent has the same ethnicity as the president’s wife, holding constant distance.

The interpretation of the significance of the coefficients remains unchanged from the linear regression model. For example, the standard error for the coefficient on wifecoethnic is 0.17, and the test statistic is therefore -1.36/0.17 = -8. This is much greater than the critical value of any conventionally used \(\alpha\)-level and so we can be sure that this result is statistically significant.

9.1.5 Predicted probabilities

Differences in the log-odds, however, are difficult to interpret substantively. Therefore, the main approach to describing the substantive relationships that emerge from a logistic regression model is to calculate predicted probabilities.

We can use the predict() function to calculate fitted values for the logistic regression model, just as we did for the linear model. Here, however, we need to take into account the fact that we model the log-odds that \(Y = 1\), rather than the probability that \(Y=1\). The predict() function will therefore, by default, give us predictions for Y on the log-odds scale. To get predictions on the probability scale, we need to add an additional argument to predict(): we set the type argument to type = "response".

For example, if we would like to calculate the predicted probability of voting for the president for a respondent who shares the same ethnicity as the president’s wife, and lives 1000 miles from the president’s home city, we could use:

pred_prob_1 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 1, distance = 10), 
  type = "response"
)
pred_prob_1
      1 
0.91698 

How does this predicted probability compare to the predicted probability of voting for the president for a respondent who is not coethnic with the president’s wife, and lives 1000 miles from the president’s home city? We can just use the predict() function again, but specify a different value for wifecoethnic to use for the prediction:

pred_prob_2 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 0, distance = 10), 
  type = "response"
)
pred_prob_2
        1 
0.9773743 

Comparing the two predicted probabilities, the model tells us that for respondents who live 100 miles from the president’s home city, sharing the ethnicity of the president’s wife decreases the probability of voting for the president by about 6 points:

pred_prob_1 - pred_prob_2
         1 
-0.0603943 

Notice that this “finding” is the opposite of the prediction from the theory: respondents do not seem to be more likely to vote for the president if they share the same ethnicity as the president’s wife. Of course, here we are dealing with a very simple model with many possible confounding variables that we have not included in the model!

As discussed in lecture, the logistic model implies a non-linear relationship between the X variables and the outcome. To see this more clearly, we can calulate the probability of voting for the president over the entire range of the distance variable.

## Set the values for the explanatory variables
wifecoethnic_profiles <- data.frame(
  distance = seq(from = 0, to = 34, by = .5),
  wifecoethnic = 1
)
head(wifecoethnic_profiles)
  distance wifecoethnic
1      0.0            1
2      0.5            1
3      1.0            1
4      1.5            1
5      2.0            1
6      2.5            1

Here, we have set the distance variable to vary between 0 and 34, with increments of .5 units and we have set wifecoethnic to be equal to 1. We have then put all of these values into a new data.frame called wifecoethnic_profiles which we will pass to the predict() function.

wifecoethnic_profiles$predicted_probs <- predict(
  logit_model, newdata = wifecoethnic_profiles, 
  type = "response"
)

Finally, we can plot these values:

plot(
  predicted_probs ~ distance, 
  data = wifecoethnic_profiles,
  xlab = "Distance",
  ylab = "Probability of voting for the president", 
  col = "LightSkyBlue", 
  type = "l", # type = "l" will produce a line plot, rather than the default scatter plot
  frame.plot = FALSE, 
  lwd = 3 # lwd = 3 will increase the thinkness of the line on the plot
)

The plot nicely illustrates the non-linear functional form of the logistic regression model. As desired, all of the predicted probabilities now vary between 0 and 1, as the line takes on a distinctive “S” shape. It is clear from this plot that X (distance) is non-linearly related to the probability that \(Y=1\) (\(P(Y = 1) = \pi\)): the same change in X results in difference changes in \(\pi\) depending on which values of X we consider. For example:

  • Increasing distance from 5 to 10 leads to a decrease in \(\pi\) of only a very small amount
  • Increasing distance from 10 to 15 leads to a decrease in \(\pi\) of a very large amount

This is why we are unable to interpret the \(\beta\) coefficients from the logistic model as constant increases or decreases in \(\pi\) given a change in X. For any given change in X, the amount that \(\pi\) will change will depend on the starting value of X that we are considering.

Finally, we can also see the non-linearity inherent in the logistic model by calculating the difference in predicted probability between respondents who share the ethnic group of the president’s wife and respondents who come from different ethnic groups than the president’s wife, at different values of the distance variable. Let’s calculate this difference for distance = 10 and distance = 12:

coethnic_dist_10 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 1, distance = 10), 
  type = "response"
)

not_coethnic_dist_10 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 0, distance = 10), 
  type = "response"
)

coethnic_dist_10 - not_coethnic_dist_10
         1 
-0.0603943 
coethnic_dist_12 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 1, distance = 12), 
  type = "response"
)

not_coethnic_dist_12 <- predict(
  logit_model, 
  newdata = data.frame(wifecoethnic = 0, distance = 12), 
  type = "response"
)

coethnic_dist_12 - not_coethnic_dist_12
         1 
-0.2042512 

The results indicating that, when comparing respondents with values of 10 on the district variable, coethnics of the president’s wife are 6 points less likey to vote for the president than non-coethnics. However, when comparing respondents with values of 12 on the distance variable, coethnics are 20 points less likely to vote for the president. Therefore, this illustrates that in a multiple logistic regression model the change in \(\pi\) in response to even exactly the same change in one X variable depends on the values at which the other X variables are fixed.

9.1.6 Predicted Probabilities and Predictive Power

To assess the predictive power of our model we will check the percentage of cases that it correctly predicted. Let’s look at our dependent variable first and find out what the naive guess is.

mean(afb$vote)
[1] 0.5463533

The mean is \(0.55\). That means that 55% of the respondents said that they would vote for for the president. If we had no model and were to predict the outcome for a random respondent from our sample, the prediction that we should make is: the voter will vote for the president. The naive guess is therefore \(1\) - the best prediction without a statistical model. If we were to predict \(1\) for every respondent in the sample based on the naive model, then we would have a percent of correctly predicted cases of 55%, the same as the mean of the outcome variable.

Given that we can correctly predict 55% of observations without any model at all, we should hope that the statistical model that we estimated earlier would do better than this.

We will now estimate predicted probabilities, expected values and estimate our rate of correctly predicted cases for the model estimated above.

Step 1: Predicted probabilities.

We use the predict() function and set type = "response". We attach the predicted probabilities for every respondent to the original dataset.

afb$pps <- predict(logit_model, newdata = afb, type = "response")

Step 2: Expected values.

Now, that we have predicted probabilities of voting for the president, we will assign everyone in our dataset who has a predicted probability greater than \(0.5\) with an expected value of 1 and everyone else get’s a \(0\).

afb$evs <- ifelse(afb$pps > 0.5, yes = 1, no = 0)

Step 3: Compute percentage correctly predicted.

For every respondent in our dataset, we have made a prediction based on our model logit_model. We will compare our predictions to the actual outcomes now in a cross-table (sometimes called confusion matrix).

confusion <- table(actual = afb$vote, expected.value = afb$evs)
confusion
      expected.value
actual    0    1
     0 1928  137
     1  140 2347

The observations on the diagonal are our correct predictions where we predicted a vote for the president and the respondent actually did vote for the president, or where we predicted a vote against the president and the respondent actually did vote against the president.

For 140 respondents, we predict that they will not vote for the president but they respond saying that they will vote for the president. These are what we call false negatives.

For 137 respondents, we predict that they will vote for the president but they respond saying that they will not vote for the president. These are what we call false positives.

To compute the percentage of correctly classified cases, we now divide the elements on the diagonal of the confusion matrix by all respondents. We use the diag() function and the sum() function to do this. The former takes the diagonal elements of the table and the latter takes a sum.

sum(diag(confusion)) / sum(confusion)
[1] 0.9391476

We correctly classify 94% of the respondents. This is a large improvement on the naive guess calculation, which correctly classifies only 55% of the respondents. We can also use this type of calculation to decide between models: first, calculate the percentage correctly predicted for one model, then for a second model. The model that has a higher percentage correctly predicted has greater explanatory power than the model with the lower percentage correctly predicted.

9.1.7 Exercises

We will be using data from a post-referendum survey among people living in Switzerland. The survey was carried out right after the vote on an initiative whether to ban military assault rifles in private households or not. If you would like to get some more background you may consult the wikipedia article on the referendum, here.

  1. Create a new file called assignment9.R in your PUBL0055 folder and write all the solutions in it.
  2. Load the Swiss gun control survey dataset SwissData2011.dta from your PUBL0055 folder or from the datasets repository:

    library(foreign)
    swiss <- read.dta("https://uclspp.github.io/datasets/data/SwissData2011.dta")

    Codebook:

    Variable Description
    VoteYes 1 if someone voted yes and is 0 if someone voted no.
    participation 1 for people that voted and 0 otherwise.
    male 1 for men and 0 for women.
    age Age in years.
    LeftRight Left-Right self placement where low values indicate that a respondent is more to the left.
    GovTrust Trust in government. Little or no trust is -1, neither YES nor NO is 0, and +1 if somebody trusts the government.
    ReligFreq How frequently does a respondent attend a religious service? Never (0), only on special occasions (1), several times a year (2), once a month (3), and once a week (4).
    university Binary indicator of whether respondent has a university degree (1) or not (0).
    party Indicates which party a respondent supports. Liberals (1), Christian Democrats (2), Social Democrats (3), Conservative Right (4), and Greens (5).
    income Income measured in ten different brackets (higher values indicate higher income). You may assume that this variable is an interval type measure.
    german Binary indicator of whether respondent’s mother tongue is German (1) or not (0)
    suburb Binary indicator of whether respondent lives in a suburban neighborhood (1) or not (0)
    urban Binary indicator of whether respondent lives in a city (1) or not (0)
    cars Number of cars the respondent’s household owns.
    old_voter Variable indicating whether a respondent is older than 60 years (1) or not (0).
    cantonnr Variable indicating in which of the 26 Swiss cantons a respondent lives.
    nodenomination Share of citizens in a canton that do not have a denomination.
    urbanization Share of citizens in a canton that live in urban areas.
  3. Create plots of age vs VoteYes and LeftRight vs VoteYes. Add a regression line (from a linear probability model) to the plots
  4. Estimate a logistic regression model which includes age and LeftRight as predictors, and VoteYes as the dependent variable
  5. What is the direction of the relationship between LeftRight on VoteYes? Is this association significant?
  6. What is the direction of the relationship between age on VoteYes? Is this association significant?
  7. What is the effect on the probability of a “yes” vote of moving from a left-right self placement of 5 to a self placement of 6 for an individual who is 44 years old?
  8. Calculate and plot the predicted probability of voting yes across the range of the age variable for individuals who have a left-right self placement of 5. Do the same across the range of the LeftRight variable for individuals with an age of 50.
  9. Looking at the other variables in the swiss data, which do you think might be important determinants of a “yes” vote in the referendum? Write a short paragraph justifying the importance of 3 of the predictors in theoretical terms. Include the additional explanatory variables you have selected in a new model that also includes age and LeftRight.
  10. Provide some predicted probabilities from the model that illustrate the substantive importance of the new variables that you have added.
  11. Compare the percentage correctly predicted from the simple model that you estimated with just LeftRight and age as predictors to the percentage correctly predicted from the model with the additional explanatory variables that you included in the question above. Which has the greater explanatory power?
  12. Save your script, which should now include the answers to all the exercises.
  13. Source your script, i.e. run the entire script all at once. Fix the script if you get any error messages.