4.2 Solutions
You will need the core library for the course textbook (ISLR), plus a few others used in these solutions. Let's load all the packages upfront. NOTE: we use message=FALSE
to suppress package start-up messages.
library(ISLR)
library(MASS)
library(glmnet)
library(leaps)
library(pls)
4.2.1 Exercise
In the lab session for this topic (Sections 5.3.2 and 5.3.3 in James et al.), we saw that the cv.glm() function can be used to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in Section 5.1.5 (equation 5.4, page 184).
- Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
logit_model <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(logit_model)
Call:
glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.623 -1.261 1.001 1.083 1.506
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.22122 0.06147 3.599 0.000319 ***
Lag1 -0.03872 0.02622 -1.477 0.139672
Lag2 0.06025 0.02655 2.270 0.023232 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1488.2 on 1086 degrees of freedom
AIC: 1494.2
Number of Fisher Scoring iterations: 4
- Fit a logistic regression model that predicts Direction using Lag1 and Lag2, using all but the first observation.
logit_model <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1,], family = binomial)
summary(logit_model)
Call:
glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly[-1,
])
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6258 -1.2617 0.9999 1.0819 1.5071
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.22324 0.06150 3.630 0.000283 ***
Lag1 -0.03843 0.02622 -1.466 0.142683
Lag2 0.06085 0.02656 2.291 0.021971 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1494.6 on 1087 degrees of freedom
Residual deviance: 1486.5 on 1085 degrees of freedom
AIC: 1492.5
Number of Fisher Scoring iterations: 4
- Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(Direction = "Up" | Lag1, Lag2) > 0.5. Was this observation correctly classified?
Rather than repeating the code that converts predicted probabilities into Up/Down labels, we'll define a helper function similar to the one we used last week.
# get prediction as Up/Down direction - same function as what we used last week
predict_glm_direction <- function(model, newdata = NULL) {
  predictions <- predict(model, newdata, type = "response")
  return(ifelse(predictions < 0.5, "Down", "Up"))
}
predicted_direction <- predict_glm_direction(logit_model, Weekly[1,])
The prediction was Up but the true direction was Down, so this observation was not correctly classified.
- Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:
  i. Fit a logistic regression model using all but the i-th observation to predict Direction using Lag1 and Lag2.
  ii. Compute the posterior probability of the market moving up for the i-th observation.
  iii. Use the posterior probability for the i-th observation in order to predict whether or not the market moves up.
  iv. Determine whether or not an error was made in predicting the direction for the i-th observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.
prediction_errors <- rep(NA, nrow(Weekly))
for (i in 1:nrow(Weekly)) {
  logit_model <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
  predicted_direction <- predict_glm_direction(logit_model, Weekly[i, ])
  prediction_errors[i] <- ifelse(predicted_direction != Weekly$Direction[i], 1, 0)
}
- Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.
mean(prediction_errors)
[1] 0.4499541
LOOCV estimates a test error rate of 0.4499541. At roughly 45%, this is only slightly better than random guessing, so Lag1 and Lag2 appear to have limited power for predicting Direction.
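As a sanity check (not required by the exercise), the same estimate can be obtained from cv.glm() in the boot package with a 0/1 cost function, as in the ISLR lab. This is a sketch; it refits the model n times, so it takes a little while to run.
library(boot)
# 0/1 loss: an error whenever the predicted probability falls on the wrong side of 0.5
cost_fn <- function(truth, prob) mean(abs(truth - prob) > 0.5)
full_model <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
# K defaults to n (i.e. LOOCV); delta[1] should match mean(prediction_errors)
cv.glm(Weekly, full_model, cost = cost_fn)$delta[1]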
4.2.2 Exercise
In this exercise, we will predict the number of applications received using the other variables in the College
data set.
- Split the data set into a training set and a test set.
Load and split the College
data.
# split the dataset into a training and a test set according to split_ratio.
# Optionally set the random seed before splitting.
# returns a list of two elements:
# - first element is the training subset
# - second element is the test subset
sample_split <- function(dataset, split_ratio, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  train_subset <- sample(nrow(dataset), size = floor(nrow(dataset) * split_ratio))
  return(list(
    train = dataset[train_subset, ],
    test = dataset[-train_subset, ]
  ))
}
college_split <- sample_split(College, 0.7, seed = 12345)
- Fit a linear model using least squares on the training set, and report the test error obtained.
ols_model <- lm(Apps ~ . , data = college_split$train)
ols_predictions <- predict(ols_model, college_split$test)
Since we need to calculate the test RMSE throughout this exercise, let's write a function to do that.
# root mean squared error between the observed values y and the predictions y_hat
calc_rmse <- function(y, y_hat) {
  return(sqrt(mean((y - y_hat)^2)))
}
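For reference, the quantity this returns is
\[\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}\]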
Now we can calculate the test RMSE for the OLS model:
ols_rmse <- calc_rmse(college_split$test$Apps, ols_predictions)
Test RMSE for OLS is: 1188.832
- Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
Pick \(\lambda\) using the training subset and report the error on the test subset.
train_matrix <- model.matrix(Apps ~ . , data = college_split$train)
test_matrix <- model.matrix(Apps ~ . , data = college_split$test)
grid <- 10 ^ seq(4, -2, length = 100)
ridge_model <- cv.glmnet(train_matrix, college_split$train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
ridge_predictions <- predict(ridge_model, test_matrix, s = ridge_model$lambda.min)
ridge_rmse <- calc_rmse(college_split$test$Apps, ridge_predictions)
Test RMSE for ridge model is: 1188.8145824
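The exercise also asks which \(\lambda\) was chosen by cross-validation; cv.glmnet() stores it as lambda.min, the value minimizing the mean cross-validated error.
# lambda selected by cross-validation for the ridge fit
ridge_model$lambda.min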
- Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
Pick \(\lambda\) using the training subset and report the error on the test subset.
lasso_model <- cv.glmnet(train_matrix, college_split$train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
lasso_predictions <- predict(lasso_model, test_matrix, s = lasso_model$lambda.min)
lasso_rmse <- calc_rmse(college_split$test$Apps, lasso_predictions)
Test RMSE for lasso model is: 1188.7994726
predict(lasso_model, s = lasso_model$lambda.min, type = "coefficients")
19 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -601.38112596
(Intercept) .
PrivateYes -321.41076249
Accept 1.70685797
Enroll -1.35810665
Top10perc 45.45111099
Top25perc -15.88692306
F.Undergrad 0.09983028
P.Undergrad 0.01567394
Outstate -0.09222852
Room.Board 0.11816525
Books -0.04056782
Personal 0.05792203
PhD -5.56419852
Terminal -5.43103094
S.F.Ratio 21.59623616
perc.alumni 1.87814752
Expend 0.10806715
Grad.Rate 8.13596816
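The exercise also asks for the number of non-zero coefficient estimates. At lambda.min the lasso keeps every predictor (only the redundant second intercept row is zero), which we can confirm by counting; a small sketch based on the coefficient matrix above:
# count the non-zero coefficients at lambda.min, dropping the two intercept rows
lasso_coefs <- predict(lasso_model, s = lasso_model$lambda.min, type = "coefficients")
sum(lasso_coefs[-(1:2), ] != 0)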
- Fit a PCR model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.
pcr_model <- pcr(Apps ~ . , data = college_split$train, scale = TRUE, validation = "CV")
validationplot(pcr_model, val.type = "MSEP")
pcr_predictions <- predict(pcr_model, college_split$test, ncomp = 10)  # M = 10 components, based on the validation plot
pcr_rmse <- calc_rmse(college_split$test$Apps, pcr_predictions)
Test RMSE for PCR model is: 1556.8207088
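To report the value of \(M\) selected by cross-validation explicitly, rather than reading it off the validation plot, we can inspect the cross-validated RMSEP values; this is a sketch using the RMSEP() helper from the pls package.
# cross-validated RMSEP for each number of components (the first entry is the intercept-only model)
pcr_cv <- RMSEP(pcr_model, estimate = "CV")
# position of the minimum, minus 1 to account for the intercept-only entry
which.min(pcr_cv$val) - 1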
- Fit a PLS model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.
pls_model <- plsr(Apps ~ . , data = college_split$train, scale = TRUE, validation = "CV")
validationplot(pls_model, val.type = "MSEP")
pls_predictions <- predict(pls_model, college_split$test, ncomp = 10)  # M = 10 components, based on the validation plot
pls_rmse <- calc_rmse(college_split$test$Apps, pls_predictions)
Test RMSE for PLS model is: 1178.3609154
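The same check as above, applied to the PLS fit:
pls_cv <- RMSEP(pls_model, estimate = "CV")
which.min(pls_cv$val) - 1  # number of components with the lowest cross-validated RMSEP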
- Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Let’s write a function to compute the \(R^2\) so we can compare the fit for various models. Recall that the \(R^2\) is given by:
\[R^2 = \frac{\mathrm{TSS - RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}\]
calc_r2 <- function(y, y_hat) {
y_bar <- mean(y)
rss <- sum((y - y_hat)^2)
tss <- sum((y - y_bar)^2)
return(1 - (rss / tss))
}
Results for OLS, ridge, and lasso are very similar. At the cross-validated \(\lambda\), the lasso does not drop any predictors here; it only shrinks the coefficients slightly (see the coefficient estimates above). Here are the test \(R^2\) values for all models.
ols_r2 <- calc_r2(college_split$test$Apps, ols_predictions)
ridge_r2 <- calc_r2(college_split$test$Apps, ridge_predictions)
lasso_r2 <- calc_r2(college_split$test$Apps, lasso_predictions)
pcr_r2 <- calc_r2(college_split$test$Apps, pcr_predictions)
pls_r2 <- calc_r2(college_split$test$Apps, pls_predictions)
barplot(c(ols_r2, ridge_r2, lasso_r2, pcr_r2, pls_r2),
names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
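For reference, the numeric values behind the bar plot (output not shown; the exact numbers depend on the random split):
round(c(OLS = ols_r2, Ridge = ridge_r2, Lasso = lasso_r2, PCR = pcr_r2, PLS = pls_r2), 3)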
The plot shows that the test \(R^2\) is around 0.9 for all models except PCR, with PLS having a slightly higher test \(R^2\) than the others; PCR's test \(R^2\) is below 0.8. All models except PCR therefore predict the number of college applications with reasonably high accuracy.
4.2.3 Exercise
We will now try to predict per capita crime rate in the Boston
data set.
- Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
Best subset selection
The leaps package does not provide a predict() method for regsubsets objects, so we first write one ourselves.
predict.regsubsets <- function(object, newdata, id, ...) {
  # recover the model formula from the original regsubsets() call
  formula <- as.formula(object$call[[2]])
  # build the design matrix for the new data and multiply by the coefficients
  # of the id-th model (the best model with id predictors)
  model_matrix <- model.matrix(formula, newdata)
  coefi <- coef(object, id = id)
  model_matrix[, names(coefi)] %*% coefi
}
Now run 10-fold cross-validation over all candidate subset sizes.
set.seed(12345)
k <- 10
p <- ncol(Boston) - 1
folds <- sample(rep(1:k, length = nrow(Boston)))
cv_errors <- matrix(NA, k, p)
for (i in 1:k) {
  best_fit <- regsubsets(crim ~ . , data = Boston[folds != i, ], nvmax = p)
  for (j in 1:p) {
    prediction <- predict(best_fit, Boston[folds == i, ], id = j)
    cv_errors[i, j] <- mean((Boston$crim[folds == i] - prediction)^2)
  }
}
cv_rmse <- sqrt(apply(cv_errors, 2, mean))
plot(cv_rmse, type = "b")
which.min(cv_rmse)
[1] 12
min(cv_rmse)
[1] 6.55376
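Cross-validation therefore favours a 12-predictor model. To see which predictor is dropped, we can refit best subset selection on the full Boston data and inspect the 12-variable model; a small sketch:
# refit on the full data set and look at the coefficients of the best 12-predictor model
best_full <- regsubsets(crim ~ . , data = Boston, nvmax = p)
coef(best_full, id = 12)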
Lasso
x <- model.matrix(crim ~ . -1, data = Boston)
cv_lasso <- cv.glmnet(x, Boston$crim, type.measure = "mse", alpha = 1)
plot(cv_lasso)
coef(cv_lasso)
14 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.4186414
zn .
indus .
chas .
nox .
rm .
age .
dis .
rad 0.2298449
tax .
ptratio .
black .
lstat .
medv .
sqrt(cv_lasso$cvm[cv_lasso$lambda == cv_lasso$lambda.1se])
[1] 7.568515
Ridge regression
x <- model.matrix(crim ~ . -1, data = Boston)
cv_ridge <- cv.glmnet(x, Boston$crim, type.measure = "mse", alpha = 0)
plot(cv_ridge)
coef(cv_ridge)
14 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.761587688
zn -0.002865786
indus 0.026514040
chas -0.142145708
nox 1.684835359
rm -0.130992458
age 0.005612849
dis -0.084630196
rad 0.039946869
tax 0.001830673
ptratio 0.063756646
black -0.002278831
lstat 0.031682619
medv -0.020839190
sqrt(cv_ridge$cvm[cv_ridge$lambda == cv_ridge$lambda.1se])
[1] 7.760891
PCR
pcr_model <- pcr(crim ~ . , data = Boston, scale = TRUE, validation = "CV")
summary(pcr_model)
Data: X dimension: 506 13
Y dimension: 506 1
Fit method: svdpc
Number of components considered: 13
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 8.61 7.270 7.278 6.832 6.824 6.855 6.908
adjCV 8.61 7.264 7.272 6.824 6.813 6.846 6.895
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 6.901 6.770 6.800 6.795 6.811 6.784 6.705
adjCV 6.887 6.756 6.785 6.780 6.795 6.765 6.686
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 47.70 60.36 69.67 76.45 82.99 88.00 91.14
crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
X 93.45 95.40 97.04 98.46 99.52 100.0
crim 42.47 42.55 42.78 43.04 44.13 45.4
The 13-component PCR fit has the lowest CV/adjCV RMSEP.
- Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
See the cross-validated RMSE estimates reported above for each of the selected models.
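For convenience, those estimates can be collected in one place; a sketch reusing the objects created above (the PCR value comes from its cross-validated RMSEP):
# cross-validated RMSE estimates for the models fitted above
c(best_subset = min(cv_rmse),
  lasso       = sqrt(cv_lasso$cvm[cv_lasso$lambda == cv_lasso$lambda.1se]),
  ridge       = sqrt(cv_ridge$cvm[cv_ridge$lambda == cv_ridge$lambda.1se]),
  pcr         = min(RMSEP(pcr_model, estimate = "CV")$val))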
- Does your chosen model involve all of the features in the data set? Why or why not?
No. I would choose the 12-predictor best subset model: it has the lowest cross-validated RMSE (about 6.55, slightly better than the 13-component PCR fit) and it is a simpler model. It therefore does not involve all of the features; one of the 13 predictors is dropped because including it does not improve the cross-validated error.