# 5 Nonlinear Models and Tree-based Methods

#### Datathon 1

Your first datathon exercise is due in two weeks.

## 5.1 Seminar

You will need to load the package from the course textbook:

`library(ISLR)`

### 5.1.1 Exercise

This question relates to the `College` dataset from the `ISLR` package.

- Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
- Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
- Evaluate the model obtained on the test set, and explain the results obtained.
- For which variables, if any, is there evidence of a non-linear relationship with the response?
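
As a starting point, the steps in this exercise could be sketched as follows. This is a minimal sketch, not a model answer: the `leaps` and `gam` packages, the 50/50 split, the seed, the choice of a six-variable model, and `df = 4` for every smooth term are all assumptions made for illustration, and the names `college_train`, `college_test`, `selected` and `gam_fit` are hypothetical.

```r
library(ISLR)   # College data
library(leaps)  # regsubsets() for forward stepwise selection (assumed package)
library(gam)    # gam() and s() smoothing-spline terms (assumed package)

set.seed(1)                                           # arbitrary seed
train <- sample(nrow(College), nrow(College) %/% 2)   # assumed 50/50 split
college_train <- College[train, ]
college_test  <- College[-train, ]

# Forward stepwise selection over all 17 predictors, on the training set only
fwd <- regsubsets(Outstate ~ ., data = college_train,
                  nvmax = 17, method = "forward")
plot(summary(fwd)$bic, type = "b",
     xlab = "Number of predictors", ylab = "BIC")

# Take the six-variable model (an illustrative size; pick yours from the plot)
selected <- names(coef(fwd, id = 6))[-1]                # drop the intercept
selected <- sub("^PrivateYes$", "Private", selected)    # dummy back to factor

# Build the GAM formula: smooth terms for numeric predictors, factor as-is
rhs <- ifelse(selected == "Private", selected,
              paste0("s(", selected, ", df = 4)"))
gam_formula <- as.formula(paste("Outstate ~", paste(rhs, collapse = " + ")))

# bquote() places the formula itself in the stored call, so that
# predict() on new data does not depend on the variable gam_formula
gam_fit <- eval(bquote(gam(.(gam_formula), data = college_train)))

# Fitted component functions with standard-error bands
par(mfrow = c(2, 3))
plot(gam_fit, se = TRUE, col = "blue")

# Test-set mean squared error
gam_pred <- predict(gam_fit, newdata = college_test)
gam_mse  <- mean((college_test$Outstate - gam_pred)^2)
gam_mse

# The "Anova for Nonparametric Effects" table tests each smooth term
# against a linear one
summary(gam_fit)
```

Small p-values in the nonparametric ANOVA table suggest that a linear term is inadequate for that predictor, which is one way to approach the last question. The selected variables will change with the seed and the split.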

### 5.1.2 Exercise

Apply boosting, bagging, and random forests to a dataset of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?
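
One possible layout for such a comparison is sketched below. The `Carseats` data from `ISLR` (predicting `Sales`) is an arbitrary choice made only for illustration, as are the `randomForest` and `gbm` packages, the 50/50 split, and every tuning value shown; the helper `mse()` and the object names are hypothetical.

```r
library(ISLR)          # Carseats data (arbitrary illustrative choice)
library(randomForest)  # bagging and random forests (assumed package)
library(gbm)           # boosting (assumed package)

set.seed(1)                              # arbitrary seed
train <- sample(nrow(Carseats), 200)     # 400 rows, assumed 50/50 split
cs_train <- Carseats[train, ]
cs_test  <- Carseats[-train, ]
p <- ncol(Carseats) - 1                  # number of predictors

# Test-set mean squared error for a vector of predictions
mse <- function(pred) mean((cs_test$Sales - pred)^2)

# Baseline: linear regression on all predictors
lm_fit <- lm(Sales ~ ., data = cs_train)

# Bagging is a random forest that considers all p predictors at each split
bag_fit <- randomForest(Sales ~ ., data = cs_train, mtry = p, ntree = 500)

# Random forest: roughly p/3 predictors per split for regression
rf_fit <- randomForest(Sales ~ ., data = cs_train,
                       mtry = floor(p / 3), ntree = 500)

# Boosting with illustrative (untuned) settings
boost_fit <- gbm(Sales ~ ., data = cs_train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)

results <- c(
  linear   = mse(predict(lm_fit, cs_test)),
  bagging  = mse(predict(bag_fit, cs_test)),
  forest   = mse(predict(rf_fit, cs_test)),
  boosting = mse(predict(boost_fit, cs_test, n.trees = 1000))
)
round(results, 2)
```

For a classification problem, the same structure applies with logistic regression as the baseline and test error rate in place of MSE. The ranking of the methods depends on the dataset and on how carefully each one is tuned.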

### 5.1.3 Exercise

We now use boosting to predict `Salary` in the `Hitters` dataset, which is part of the `ISLR` package.

- Remove the observations for which the salary information is unknown, and then log-transform the salaries.
- Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
- Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter \(\lambda\). Produce a plot with different shrinkage values on the \(x\)-axis and the corresponding training set MSE on the \(y\)-axis.
- Produce a plot with different shrinkage values on the \(x\)-axis and the corresponding test set MSE on the \(y\)-axis.
- Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in our discussions of regression models.
- Which variables appear to be the most important predictors in the boosted model?
- Now apply bagging to the training set. What is the test set MSE for this approach?
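
The workflow above could be sketched roughly as follows. This is a hedged outline rather than a reference solution: the `gbm` and `randomForest` packages, the grid of shrinkage values, the seed, the fixed shrinkage of 0.01 used for variable importance, and all object names are assumptions. Only a linear regression baseline is shown; a second regression approach would follow the same pattern.

```r
library(ISLR)          # Hitters data
library(gbm)           # boosting (assumed package)
library(randomForest)  # bagging via mtry = number of predictors (assumed)

# Drop players with unknown salary, then log-transform
hitters <- Hitters[!is.na(Hitters$Salary), ]
hitters$Salary <- log(hitters$Salary)

hit_train <- hitters[1:200, ]
hit_test  <- hitters[-(1:200), ]

set.seed(1)                               # arbitrary seed
lambdas <- 10^seq(-3, -0.5, by = 0.25)    # illustrative shrinkage grid
train_mse <- numeric(length(lambdas))
test_mse  <- numeric(length(lambdas))

# One boosted model with 1,000 trees per shrinkage value
for (i in seq_along(lambdas)) {
  fit <- gbm(Salary ~ ., data = hit_train, distribution = "gaussian",
             n.trees = 1000, shrinkage = lambdas[i])
  train_pred <- predict(fit, hit_train, n.trees = 1000)
  test_pred  <- predict(fit, hit_test,  n.trees = 1000)
  train_mse[i] <- mean((hit_train$Salary - train_pred)^2)
  test_mse[i]  <- mean((hit_test$Salary  - test_pred)^2)
}

plot(lambdas, train_mse, type = "b", log = "x",
     xlab = "Shrinkage", ylab = "Training MSE")
plot(lambdas, test_mse, type = "b", log = "x",
     xlab = "Shrinkage", ylab = "Test MSE")

# Linear regression baseline on the same split
lm_fit <- lm(Salary ~ ., data = hit_train)
lm_mse <- mean((hit_test$Salary - predict(lm_fit, hit_test))^2)

# Relative influence of each predictor at one illustrative shrinkage value
boost_fit <- gbm(Salary ~ ., data = hit_train, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.01)
head(summary(boost_fit, plotit = FALSE))

# Bagging: a random forest that tries every predictor at each split
bag_fit <- randomForest(Salary ~ ., data = hit_train,
                        mtry = ncol(hit_train) - 1, ntree = 500)
bag_mse <- mean((hit_test$Salary - predict(bag_fit, hit_test))^2)

c(boosting_best = min(test_mse), linear = lm_mse, bagging = bag_mse)
```

Training MSE generally keeps falling as the shrinkage value grows, while test MSE usually bottoms out at an intermediate value. Note that picking the shrinkage value with the lowest test MSE, as in the last line, gives an optimistic figure; cross-validation on the training set is the more careful choice.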