5.2 Solutions


5.2.1 Exercise

This question relates to the College dataset from the ISLR package.

  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(ISLR)
library(leaps)
library(gam)
library(gbm)
library(glmnet)
library(randomForest)
set.seed(12345)
train <- sample(nrow(College) * 0.7)
train_set <- College[train, ]
test_set <- College[-train, ]
forward_subset <- regsubsets(Outstate ~ ., data = train_set, nvmax = ncol(College)-1, method = "forward")
model_summary <- summary(forward_subset)

# Plot a model-selection metric against subset size, mark the value one standard
# error from the best metric (red), and mark the smallest subset within that
# threshold (green). Use reverse = TRUE for metrics where higher is better.
plot_metric <- function(metric, yaxis_label, reverse = FALSE) {
  plot(metric, xlab = "Number of Variables", ylab = yaxis_label, xaxt = "n", type = "l")
  axis(side = 1, at = 1:length(metric))
  
  if (reverse) {
    metric_1se <- max(metric) - (sd(metric) / sqrt(length(metric)))
    min_subset <- which(metric > metric_1se)
  } else {
    metric_1se <- min(metric) + (sd(metric) / sqrt(length(metric)))
    min_subset <- which(metric < metric_1se)
  }
  
  abline(h = metric_1se, col = "red", lty = 2)      # 1-SE threshold
  abline(v = min_subset[1], col = "green", lty = 2) # smallest subset within 1 SE
}

par(mfrow=c(1, 3))
  
plot_metric(model_summary$cp, "Cp")
plot_metric(model_summary$bic, "BIC")
plot_metric(model_summary$adjr2, "Adjusted R2", reverse = TRUE) # higher values are better

The red dashed line marks the value one standard error from the best metric (Cp, BIC, Adjusted R2); the green dashed line marks the smallest subset whose metric falls within that threshold. Cp and Adjusted R2 point to a subset with 7 variables, while BIC points to 6, so we proceed with the more parsimonious six-variable model.
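
To read the chosen subset sizes off programmatically rather than from the plots, here is a minimal sketch reusing the same one-standard-error heuristic as plot_metric (pick_size_1se is a helper name introduced here, not part of the original code):

# Sketch: smallest subset size whose metric is within one SE of the best value
pick_size_1se <- function(metric, reverse = FALSE) {
  se <- sd(metric) / sqrt(length(metric))
  if (reverse) which(metric > max(metric) - se)[1] else which(metric < min(metric) + se)[1]
}
sapply(list(Cp = model_summary$cp, BIC = model_summary$bic), pick_size_1se)
pick_size_1se(model_summary$adjr2, reverse = TRUE)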

coef(forward_subset, 6)
  (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
-3769.0587788  2748.6944010     0.8999634    38.5143460    44.4889713 
       Expend     Grad.Rate 
    0.2543900    31.2043096 
  2. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
gam_model <- gam(Outstate ~ Private + s(Room.Board, df=2) + s(perc.alumni, df=2) + 
                   s(PhD, df=2) + s(Expend, df=2) + s(Grad.Rate, df=2), data=train_set)
par(mfrow=c(2, 3))
plot(gam_model, se=TRUE, col="blue")

  3. Evaluate the model obtained on the test set, and explain the results obtained.

We re-use the helper functions for calculating RMSE and R^2 from last week:

calc_mse <- function(y, y_hat) {
  return(mean((y - y_hat)^2))
}

calc_rmse <- function(y, y_hat) {
  return(sqrt(calc_mse(y, y_hat)))
}

calc_r2 <- function(y, y_hat) {
  y_bar <- mean(y)
  rss <- sum((y - y_hat)^2)
  tss <- sum((y - y_bar)^2)
  return(1 - (rss / tss))
}
gam_predictions <- predict(gam_model, test_set)

gam_rmse <- calc_rmse(test_set$Outstate, gam_predictions)
cat("RMSE:", gam_rmse, "\n")
RMSE: 1984.385 
gam_r2 <- calc_r2(test_set$Outstate, gam_predictions)
cat(" R^2:", gam_r2, "\n")
 R^2: 0.7614328 

We obtain a test RMSE of about 1984.4 and an R^2 of about 0.76 using the GAM with six predictors, so the model explains roughly three quarters of the variance in out-of-state tuition on the held-out data.
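
For context (a sketch not included in the original output), a linear model with the same six predictors can be evaluated on the same test set; lm_model and lm_predictions are names introduced here:

# Sketch: ordinary least squares with the same six predictors, for comparison with the GAM
lm_model <- lm(Outstate ~ Private + Room.Board + perc.alumni + PhD + Expend + Grad.Rate,
               data = train_set)
lm_predictions <- predict(lm_model, test_set)
calc_rmse(test_set$Outstate, lm_predictions)
calc_r2(test_set$Outstate, lm_predictions)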

  4. For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam_model)

Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(perc.alumni, 
    df = 2) + s(PhD, df = 2) + s(Expend, df = 2) + s(Grad.Rate, 
    df = 2), data = train_set)
Deviance Residuals:
      Min        1Q    Median        3Q       Max 
-7164.185 -1192.389     9.746  1195.918  8668.434 

(Dispersion Parameter for gaussian family taken to be 3515849)

    Null Deviance: 8614032615 on 542 degrees of freedom
Residual Deviance: 1866916246 on 531.0002 degrees of freedom
AIC: 9739.358 

Number of Local Scoring Iterations: 2 

Anova for Parametric Effects
                        Df     Sum Sq    Mean Sq F value    Pr(>F)    
Private                  1 1968096326 1968096326 559.778 < 2.2e-16 ***
s(Room.Board, df = 2)    1 1852547004 1852547004 526.913 < 2.2e-16 ***
s(perc.alumni, df = 2)   1  739433651  739433651 210.314 < 2.2e-16 ***
s(PhD, df = 2)           1  408910513  408910513 116.305 < 2.2e-16 ***
s(Expend, df = 2)        1  669471699  669471699 190.415 < 2.2e-16 ***
s(Grad.Rate, df = 2)     1  105813593  105813593  30.096 6.372e-08 ***
Residuals              531 1866916246    3515849                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
                       Npar Df Npar F     Pr(F)    
(Intercept)                                        
Private                                            
s(Room.Board, df = 2)        1  5.624 0.0180708 *  
s(perc.alumni, df = 2)       1  0.517 0.4725226    
s(PhD, df = 2)               1 11.780 0.0006455 ***
s(Expend, df = 2)            1 70.804 4.441e-16 ***
s(Grad.Rate, df = 2)         1  3.159 0.0761031 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The non-parametric ANOVA tests show strong evidence of a non-linear relationship between the response and Expend (p < 1e-15) and PhD (p ≈ 0.0006), moderate evidence for Room.Board (p ≈ 0.018), only weak evidence for Grad.Rate (p ≈ 0.076), and no evidence of non-linearity for perc.alumni.
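
As a cross-check (a sketch, not part of the original analysis), nested GAMs can be compared with an F-test, here for the strongest case, Expend; gam_linear_expend is a name introduced for this comparison:

# Sketch: same model but with a linear Expend term, compared to the fitted GAM
gam_linear_expend <- gam(Outstate ~ Private + s(Room.Board, df=2) + s(perc.alumni, df=2) +
                           s(PhD, df=2) + Expend + s(Grad.Rate, df=2), data=train_set)
anova(gam_linear_expend, gam_model, test = "F")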

5.2.2 Exercise

Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?

We use the Weekly stock market data set from the ISLR package and predict the market Direction (up or down) from the lag and volume variables.

data(Weekly)
Weekly$Direction <- Weekly$Direction == "Up" # recode Direction as logical (TRUE = "Up"), a 0/1 response for gbm's bernoulli distribution
Weekly <- subset(Weekly, select = -c(Year, Today)) # drop Year and Today columns

set.seed(12345)
train <- sample(nrow(Weekly) * 0.7)
train_set <- Weekly[train, ]
test_set <- Weekly[-train, ]
# show confusion matrix from predicted class and observed class
show_model_performance <- function(predicted_status, observed_status) {
  confusion_matrix <- table(predicted_status, 
                            observed_status, 
                            dnn = c("Predicted Status", "Observed Status"))
  print(confusion_matrix)
  
  error_rate <- mean(predicted_status != observed_status)

  cat("\n") # \n means newline so it just prints a blank line
  cat("         Error Rate:", 100 * error_rate, "%\n")
  cat("Correctly Predicted:", 100 * (1-error_rate), "%\n")
  cat("False Positive Rate:", 100 * confusion_matrix[2,1] / sum(confusion_matrix[,1]), "%\n")
  cat("False Negative Rate:", 100 * confusion_matrix[1,2] / sum(confusion_matrix[,2]), "%\n")
}

Logistic regression

logit_model <- glm(Direction ~ ., data = Weekly, family = binomial, subset = train)

predicted_direction <- predict(logit_model, test_set, type="response") > 0.5
show_model_performance(predicted_direction, test_set$Direction)
                Observed Status
Predicted Status FALSE TRUE
           FALSE   128  157
           TRUE     18   24

         Error Rate: 53.51682 %
Correctly Predicted: 46.48318 %
False Positive Rate: 12.32877 %
False Negative Rate: 86.74033 %

Boosting

boost_model <- gbm(Direction ~ ., data = train_set, n.trees = 5000, distribution = "bernoulli")

# predict.gbm returns values on the log-odds (link) scale by default,
# so this 0.5 cut-off is applied on that scale
predicted_direction <- predict(boost_model, test_set, n.trees = 5000) > 0.5
show_model_performance(predicted_direction, test_set$Direction)
                Observed Status
Predicted Status FALSE TRUE
           FALSE   116  143
           TRUE     30   38

         Error Rate: 52.9052 %
Correctly Predicted: 47.0948 %
False Positive Rate: 20.54795 %
False Negative Rate: 79.00552 %
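
Note that for a Bernoulli model, predict.gbm returns values on the log-odds (link) scale by default, so the 0.5 cut-off above acts on that scale. A sketch of the probability-scale alternative (predicted_prob is a name introduced here; results are not shown):

# Sketch: request probabilities explicitly and threshold at 0.5
predicted_prob <- predict(boost_model, test_set, n.trees = 5000, type = "response")
show_model_performance(predicted_prob > 0.5, test_set$Direction)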

Bagging

Weekly$Direction <- as.factor(Weekly$Direction) # randomForest needs a factor response for classification
# mtry = all predictors, i.e. bagging; fit on Weekly with subset = train since train_set still holds the logical Direction
bag_model <- randomForest(Direction ~ ., data = Weekly, subset = train, mtry = ncol(Weekly) - 1)

predicted_direction <- predict(bag_model, test_set) 
show_model_performance(predicted_direction, test_set$Direction)
                Observed Status
Predicted Status FALSE TRUE
           FALSE    18   23
           TRUE    128  158

         Error Rate: 46.17737 %
Correctly Predicted: 53.82263 %
False Positive Rate: 87.67123 %
False Negative Rate: 12.70718 %

Random Forests

rf_model <- randomForest(Direction ~ ., data = Weekly, subset = train, mtry = 2) # mtry = 2 ~ sqrt(number of predictors)

predicted_direction <- predict(rf_model, test_set) 
show_model_performance(predicted_direction, test_set$Direction)
                Observed Status
Predicted Status FALSE TRUE
           FALSE    21   25
           TRUE    125  156

         Error Rate: 45.87156 %
Correctly Predicted: 54.12844 %
False Positive Rate: 85.61644 %
False Negative Rate: 13.81215 %

Comparing the four methods on the same test split, random forests achieve the lowest error rate (about 45.9%), followed closely by bagging (about 46.2%), while boosting (about 52.9%) and logistic regression (about 53.5%) do worse than a coin flip on this split. None of the approaches predicts the weekly market direction well, but the tree ensembles clearly outperform logistic regression here.

5.2.3 Exercise

We now use boosting to predict Salary in the Hitters dataset, which is part of the ISLR package.

  1. Remove the observations for whom the salary information is unknown, and then log-transform the salaries.
sum(is.na(Hitters$Salary))
[1] 59
Hitters <- Hitters[-which(is.na(Hitters$Salary)), ]
sum(is.na(Hitters$Salary))
[1] 0
Hitters$Salary <- log(Hitters$Salary)
  2. Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
train <- 1:200
train_set <- Hitters[train, ]
test_set <- Hitters[-train, ]
  3. Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter \(\lambda\). Produce a plot with different shrinkage values on the \(x\)-axis and the corresponding training set MSE on the \(y\)-axis.
set.seed(12345)
lambdas <- 10 ^ seq(-10, -0.2, length = 100)
train_errors <- rep(NA, length(lambdas))
test_errors <- rep(NA, length(lambdas))

for (i in 1:length(lambdas)) {
  boost_model <- gbm(Salary ~ . , data = train_set, distribution = "gaussian", 
                       n.trees = 1000, shrinkage = lambdas[i])
  
  train_predictions <- predict(boost_model, train_set, n.trees = 1000)
  train_errors[i] <- calc_mse(train_set$Salary, train_predictions)
  
  test_predictions <- predict(boost_model, test_set, n.trees = 1000)
  test_errors[i] <- calc_mse(test_set$Salary, test_predictions)
}

plot(lambdas, train_errors, xlab = "Shrinkage", ylab = "Train MSE", col = "blue", type = "b", pch = 20)

  4. Produce a plot with different shrinkage values on the \(x\)-axis and the corresponding test set MSE on the \(y\)-axis.
plot(lambdas, test_errors, xlab = "Shrinkage", ylab = "Test MSE", col = "red", type = "b", pch = 20)

min(test_errors)
[1] 0.256307
boost_lambda_min <- lambdas[which.min(test_errors)]
boost_lambda_min
[1] 0.05141752

The minimum test MSE is obtained at \(\lambda \approx 0.0514\).

  5. Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in our discussions of regression models.

OLS

ols_model <- lm(Salary ~ . , data = train_set)
ols_predictions <- predict(ols_model, test_set)
calc_mse(test_set$Salary, ols_predictions)
[1] 0.4917959

Lasso

set.seed(12345)

train_matrix <- model.matrix(Salary ~ . , data = train_set)
test_matrix <- model.matrix(Salary ~ . , data = test_set)

lasso_model <- glmnet(train_matrix, train_set$Salary, alpha = 1)
lasso_predictions <- predict(lasso_model, s=0.01, newx=test_matrix)

calc_mse(test_set$Salary, lasso_predictions)
[1] 0.4700537
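
The lasso penalty above is fixed at s = 0.01; a sketch of a cross-validated choice instead (using cv.glmnet's default 10-fold cross-validation; lasso_cv is a name introduced here):

# Sketch: choose the lasso penalty by cross-validation rather than fixing s = 0.01
lasso_cv <- cv.glmnet(train_matrix, train_set$Salary, alpha = 1)
lasso_cv_predictions <- predict(lasso_cv, s = "lambda.min", newx = test_matrix)
calc_mse(test_set$Salary, lasso_cv_predictions)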

Both the linear model (test MSE ≈ 0.49) and the lasso (≈ 0.47) have a higher test MSE than boosting (≈ 0.26).

  6. Which variables appear to be the most important predictors in the boosted model?
boost_model <- gbm(Salary ~ . , data = train_set, distribution="gaussian", 
                   n.trees = 1000, shrinkage = boost_lambda_min)
summary(boost_model)

                var    rel.inf
CAtBat       CAtBat 17.1188801
CHits         CHits  9.7853653
CWalks       CWalks  9.0182320
CRBI           CRBI  8.3641950
Years         Years  6.9200672
PutOuts     PutOuts  6.8526315
CRuns         CRuns  5.9824324
Walks         Walks  5.9709905
CHmRun       CHmRun  5.7374204
Hits           Hits  4.5490252
RBI             RBI  3.6855364
Assists     Assists  3.3276611
AtBat         AtBat  3.1707344
HmRun         HmRun  3.0619600
Errors       Errors  2.6103787
Runs           Runs  2.4470517
Division   Division  0.6477526
League       League  0.5066292
NewLeague NewLeague  0.2430562

CAtBat, CHits and CWalks are the three most important variables, in that order.
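
For the top variable, a partial dependence plot sketches how the predicted log-salary varies with CAtBat (not part of the original output):

# Sketch: partial dependence of predicted log-salary on CAtBat
plot(boost_model, i.var = "CAtBat", n.trees = 1000)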

  7. Now apply bagging to the training set. What is the test set MSE for this approach?
set.seed(12345)
rf_model <- randomForest(Salary ~ . , data = train_set, ntree = 500, mtry = ncol(train_set)-1)
rf_predictions <- predict(rf_model, test_set)

rf_mse <- calc_mse(test_set$Salary, rf_predictions)
rf_mse
[1] 0.2291436

The test MSE for bagging is about \(0.229\), which is slightly lower than the best test MSE for boosting (\(0.256\)).