5 Nonlinear Models and Tree-based Methods

Datathon 1

Your first datathon exercise is due in two weeks.

5.1 Seminar

You will need to load the ISLR package, which accompanies the course textbook:

library(ISLR)

5.1.1 Exercise

This question relates to the College dataset from the ISLR package.

Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
Evaluate the fitted GAM on the test set, and explain the results.
For which variables, if any, is there evidence of a non-linear relationship with the response?
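
The steps above can be sketched in R as follows. This is one possible workflow, not a model answer. It assumes the `leaps` and `gam` packages are installed, and the 50/50 split, the seed, the choice of a six-variable model, and the 4 degrees of freedom per smooth term are all illustrative assumptions that you should revisit for your own split.

```r
library(ISLR)   # College data
library(leaps)  # regsubsets() for stepwise selection
library(gam)    # gam() and s() for smoothing-spline terms

set.seed(1)  # arbitrary seed, for reproducibility only
n <- nrow(College)
train <- sample(n, floor(n / 2))  # assumption: 50/50 train/test split
college_train <- College[train, ]
college_test  <- College[-train, ]

# Forward stepwise selection on the training set (17 candidate predictors)
fit_fwd <- regsubsets(Outstate ~ ., data = college_train,
                      nvmax = 17, method = "forward")
fwd_summary <- summary(fit_fwd)

# Inspect BIC by model size to judge where the improvement levels off
plot(fwd_summary$bic, type = "b",
     xlab = "Number of predictors", ylab = "BIC")

# Assumption: keep the six-variable model; adjust after reading the plot
sel <- names(coef(fit_fwd, id = 6))[-1]             # drop the intercept
sel <- unique(sub("^PrivateYes$", "Private", sel))  # dummy name -> factor name

# Smooth every numeric predictor; the factor Private enters linearly
terms_rhs <- ifelse(sel == "Private", sel, paste0("s(", sel, ", df = 4)"))
gam_formula <- as.formula(paste("Outstate ~", paste(terms_rhs, collapse = " + ")))
fit_gam <- gam(gam_formula, data = college_train)

# One panel per term, with standard-error bands
par(mfrow = c(2, 3))
plot(fit_gam, se = TRUE, col = "blue")

# Test-set performance: MSE and R-squared
pred <- predict(fit_gam, newdata = college_test)
test_mse <- mean((college_test$Outstate - pred)^2)
test_r2 <- 1 - test_mse /
  mean((college_test$Outstate - mean(college_test$Outstate))^2)

# The "Anova for Nonparametric Effects" table tests each smooth term
# against a linear fit; small p-values suggest a non-linear relationship
summary(fit_gam)
```

Building the formula from the selected names avoids hard-coding a variable list, since the predictors chosen by forward selection can change with the random split.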

5.1.2 Exercise

Apply boosting, bagging, and random forests to a dataset of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?
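
As a starting point, the comparison might look like the sketch below. The `Boston` data from the `MASS` package is used purely as an example of a "dataset of your choice", and the tuning values (500 or 1,000 trees, interaction depth 4, shrinkage 0.01, a 50/50 split) are assumptions rather than recommendations. It assumes the `randomForest` and `gbm` packages are installed.

```r
library(MASS)          # Boston data (an arbitrary choice for illustration)
library(randomForest)  # bagging and random forests
library(gbm)           # boosting

set.seed(1)  # arbitrary seed, for reproducibility only
train <- sample(nrow(Boston), floor(nrow(Boston) / 2))
boston_train <- Boston[train, ]
boston_test  <- Boston[-train, ]
p <- ncol(Boston) - 1  # number of predictors

# Helper: test-set mean squared error for a vector of predictions
mse <- function(pred) mean((boston_test$medv - pred)^2)

# Baseline: ordinary linear regression
fit_lm <- lm(medv ~ ., data = boston_train)

# Bagging is a random forest that considers all p predictors at each split
fit_bag <- randomForest(medv ~ ., data = boston_train, mtry = p, ntree = 500)

# Random forest: a subset of roughly p/3 predictors at each split
fit_rf <- randomForest(medv ~ ., data = boston_train,
                       mtry = floor(p / 3), ntree = 500)

# Boosting with squared-error loss
fit_boost <- gbm(medv ~ ., data = boston_train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 4, shrinkage = 0.01)

results <- c(
  linear        = mse(predict(fit_lm, boston_test)),
  bagging       = mse(predict(fit_bag, boston_test)),
  random_forest = mse(predict(fit_rf, boston_test)),
  boosting      = mse(predict(fit_boost, boston_test, n.trees = 1000))
)
round(results, 2)
```

For a classification problem, the same structure applies with `glm(..., family = binomial)` as the baseline, a factor response for `randomForest()`, and `distribution = "bernoulli"` with a 0/1 response for `gbm()`; the test error rate then replaces the MSE.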

5.1.3 Exercise

We now use boosting to predict Salary in the Hitters dataset, which is part of the ISLR package.

Remove the observations for which the salary information is unknown, and then log-transform the salaries.
Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter \(\lambda\). Produce a plot with different shrinkage values on the \(x\)-axis and the corresponding training set MSE on the \(y\)-axis.
Produce a plot with different shrinkage values on the \(x\)-axis and the corresponding test set MSE on the \(y\)-axis.
Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches covered in our earlier discussions of regression models.
Which variables appear to be the most important predictors in the boosted model?
Now apply bagging to the training set. What is the test set MSE for this approach?
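
One way to organise these steps is sketched below. The grid of shrinkage values, the seed, and the 500 trees used for bagging are assumptions; the 1,000 boosting trees and the 200-observation training set come from the exercise. Only a linear regression baseline is shown, so a second regression approach is left for you to add. The sketch assumes the `gbm` and `randomForest` packages are installed.

```r
library(ISLR)          # Hitters data
library(gbm)           # boosting
library(randomForest)  # bagging

# Drop players with missing salary, then log-transform the response
hitters <- Hitters[!is.na(Hitters$Salary), ]
hitters$Salary <- log(hitters$Salary)

# First 200 observations for training, the remainder for testing
train <- 1:200
hit_train <- hitters[train, ]
hit_test  <- hitters[-train, ]

# Assumed grid of shrinkage values, evenly spaced on the log10 scale
lambdas <- 10^seq(-3, -0.5, by = 0.25)
train_mse <- numeric(length(lambdas))
test_mse  <- numeric(length(lambdas))

set.seed(1)  # arbitrary seed; gbm subsamples the data at each iteration
for (i in seq_along(lambdas)) {
  fit <- gbm(Salary ~ ., data = hit_train, distribution = "gaussian",
             n.trees = 1000, shrinkage = lambdas[i])
  train_mse[i] <- mean((hit_train$Salary -
                        predict(fit, hit_train, n.trees = 1000))^2)
  test_mse[i]  <- mean((hit_test$Salary -
                        predict(fit, hit_test, n.trees = 1000))^2)
}

# Training and test MSE against shrinkage (log-scaled x-axis)
plot(lambdas, train_mse, type = "b", log = "x",
     xlab = "Shrinkage", ylab = "Training MSE")
plot(lambdas, test_mse, type = "b", log = "x",
     xlab = "Shrinkage", ylab = "Test MSE")

# Refit at the shrinkage value with the lowest test MSE
best_lambda <- lambdas[which.min(test_mse)]
fit_best <- gbm(Salary ~ ., data = hit_train, distribution = "gaussian",
                n.trees = 1000, shrinkage = best_lambda)

# Relative influence of each predictor, largest first
influence <- summary(fit_best, plotit = FALSE)
head(influence)

# Baseline for comparison: linear regression on all predictors
fit_lm <- lm(Salary ~ ., data = hit_train)
lm_mse <- mean((hit_test$Salary - predict(fit_lm, hit_test))^2)

# Bagging: a random forest that tries every predictor at each split
fit_bag <- randomForest(Salary ~ ., data = hit_train,
                        mtry = ncol(hit_train) - 1, ntree = 500)
bag_mse <- mean((hit_test$Salary - predict(fit_bag, hit_test))^2)

c(boosting = min(test_mse), linear = lm_mse, bagging = bag_mse)
```

Choosing the shrinkage value by its test MSE, as done here for brevity, makes the reported boosting error optimistic; cross-validation on the training set (for example via the `cv.folds` argument of `gbm()`) would be a cleaner way to pick it.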