1 Introduction to Statistical Learning

Assignments for the course focus on the practical aspects of the concepts covered in the lectures and are based on the material in James et al. You will start working on each assignment in the lab session after the lecture, but you may need to finish it after class.

You will have four days to work on the assignments, submitting knitted HTML files with your solutions via Moodle by midnight on Saturday. We will subsequently open up our solutions to the problem sets.

If you haven’t done so yet, complete the Data Camp tutorials:

Data Camp R tutorials
Data Camp R Markdown tutorials. You can complete the free first chapter.
After the tutorials, complete the introduction to R lab session in James et al. (Chapter 2.3). That will help you tackle the following exercises:

1.1 Seminar

1.1.1 Exercise

This exercise relates to the College dataset from the main course textbook (James et al. 2013). You can download the dataset from https://uclspp.github.io/datasets. The dataset contains a number of variables for 777 different universities and colleges in the US.

The variables are:

Private : Public/private indicator
Apps : Number of applications received
Accept : Number of applicants accepted
Enroll : Number of new students enrolled
Top10perc : New students from top 10% of high school class
Top25perc : New students from top 25% of high school class
F.Undergrad : Number of full-time undergraduates
P.Undergrad : Number of part-time undergraduates
Outstate : Out-of-state tuition
Room.Board : Room and board costs
Books : Estimated book costs
Personal : Estimated personal spending
PhD : Percent of faculty with Ph.D.’s
Terminal : Percent of faculty with terminal degree
S.F.Ratio : Student/faculty ratio
perc.alumni : Percent of alumni who donate
Expend : Instructional expenditure per student
Grad.Rate : Graduation rate

Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that your working directory is set to the location of the data file. Alternatively, you can load the data into R directly from the website, using:

college <- read.csv("https://uclspp.github.io/datasets/data/College.csv")

Look at the data using the View() function.

View(college)

You’ll notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following:

rownames(college) <- college[, 1]

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

college <- college[, -1] 
View(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
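
The row-name steps above can be tried on a small made-up data frame before running them on the real data. This is only a sketch: the object toy and its values are hypothetical and are not part of the College dataset.

```r
# Hypothetical toy data (not the College dataset)
toy <- data.frame(
  name   = c("College A", "College B", "College C"),
  Apps   = c(1200, 850, 4300),
  Accept = c(900, 600, 2100)
)

rownames(toy) <- toy[, 1]   # use the first column as row names
toy <- toy[, -1]            # drop the now-redundant name column

toy["College B", "Apps"]    # rows can now be looked up by name
colMeans(toy)               # calculations only see the data columns
```

Because the names live in the row names rather than in a column, functions such as colMeans() and summary() operate on the remaining data columns only.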

Use the summary() function to produce a numerical summary of the variables in the dataset.
Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

Continue exploring the data, and provide a brief summary of what you discover.
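
As a rough illustration of the binning and plotting steps in this exercise, the sketch below uses simulated data rather than the College dataset. The data frame toy and its values are assumptions made up for the example; ifelse() is used here as a compact alternative to the rep() approach shown earlier, not as a replacement for it.

```r
set.seed(1)  # make the simulated numbers reproducible

# Hypothetical toy data: column names mirror College, values are simulated
top10 <- round(runif(40, min = 5, max = 95))
toy <- data.frame(
  Top10perc = top10,
  Outstate  = 5000 + 100 * top10 + rnorm(40, sd = 1500)
)

# Bin a numeric variable into a two-level factor in one step
toy$Elite <- factor(ifelse(toy$Top10perc > 50, "Yes", "No"))
summary(toy$Elite)  # number of observations per level

# With a factor as the first argument, plot() draws side-by-side boxplots
plot(toy$Elite, toy$Outstate, xlab = "Elite", ylab = "Outstate")

# 2 x 2 grid of histograms with different requested bin counts
old_par <- par(mfrow = c(2, 2))
for (b in c(5, 10, 20, 40)) {
  # breaks is a suggestion; R rounds it to "pretty" cut points
  hist(toy$Outstate, breaks = b, main = paste("breaks =", b), xlab = "Outstate")
}
par(old_par)  # restore the previous plot layout
```

Saving the value returned by par() and restoring it afterwards keeps later plots from being squeezed into the same four-panel layout.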

1.1.2 Exercise

This exercise involves the Auto dataset from the textbook, which you can download from https://uclspp.github.io/datasets. Load the dataset as the first step of the exercise, and make sure that the missing values have been removed from the data. Hint: We used the relevant command in the Introduction to R session in class today. Go back and look up how to read in a CSV file.
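
One possible way to handle missing values when reading a CSV file is sketched below on a tiny inline example. It assumes that missing entries are coded as "?", as in the textbook's raw Auto file; check how they are coded in the file you download, since that is not stated here. The object toy_auto and its values are made up.

```r
# Hypothetical inline CSV standing in for a downloaded file
csv_text <- "mpg,horsepower,name
18,130,car a
25,?,car b
31,65,car c"

# Assumption: "?" marks a missing value; na.strings turns it into NA
toy_auto <- read.csv(text = csv_text, na.strings = "?")

colSums(is.na(toy_auto))       # missing values per column
toy_auto <- na.omit(toy_auto)  # drop every row containing an NA
nrow(toy_auto)                 # rows that remain
```

Without na.strings, a column containing "?" would be read as text rather than as numbers, so it is worth checking str() of the loaded data before computing summaries.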

Which of the predictors are quantitative, and which are qualitative?
What is the range of each quantitative predictor? You can answer this using the range() function.
What are the mean and standard deviation of each quantitative predictor?
Now remove the 10th through 85th observations. What are the range, mean, and standard deviation of each predictor in the subset of the data that remains?
Using the full dataset, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
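
The per-column summaries and the row removal asked for above follow a common pattern, sketched here on a made-up data frame. The object toy and its columns are hypothetical; the same calls should carry over to any data frame with numeric columns.

```r
# Hypothetical toy data with two numeric columns and one text column
toy <- data.frame(
  x     = c(1, 2, 3, 4, 5, 6),
  y     = c(10, 20, 30, 40, 50, 60),
  label = c("a", "b", "c", "d", "e", "f")
)

# Pick out the numeric columns, then apply a function to each of them
num_cols <- sapply(toy, is.numeric)
sapply(toy[, num_cols], range)  # one column of (min, max) per variable
sapply(toy[, num_cols], mean)
sapply(toy[, num_cols], sd)

# Negative indices drop rows; the parentheses are needed because
# -2:4 would be read as the sequence from -2 to 4
subset_toy <- toy[-(2:4), ]
sapply(subset_toy[, num_cols], range)
```

The same negative-index idea applies to a longer range of observations, as long as the range is wrapped in parentheses before the minus sign.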

1.1.3 Exercise

This exercise involves the Boston housing dataset.

To begin, load in the Boston dataset. The Boston dataset is part of the MASS library in R.

library(MASS)

Now the dataset is contained in the object Boston.

Boston

Read about the dataset:

?Boston

How many rows are in this dataset? How many columns? What do the rows and columns represent?

Make some pairwise scatterplots of the predictors (columns) in this dataset. Describe your findings.
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
How many of the suburbs in this dataset bound the Charles River?
What is the median pupil-teacher ratio among the towns in this dataset?
Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
In this dataset, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
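
Several of the questions above come down to counting rows that meet a condition or locating the row with the smallest value. The sketch below shows those idioms on a made-up data frame. It borrows a few column names from Boston (chas, rm, ptratio, medv), but the object toy and all of its values are hypothetical, so it does not answer the exercise.

```r
# Hypothetical toy data with Boston-style column names
toy <- data.frame(
  chas    = c(0, 1, 0, 0, 1),
  rm      = c(6.2, 7.4, 8.1, 5.9, 6.8),
  ptratio = c(15.3, 17.8, 19.2, 21.0, 18.5),
  medv    = c(24.0, 34.7, 50.0, 13.5, 22.1)
)

dim(toy)             # number of rows, then number of columns
sum(toy$chas == 1)   # TRUE counts as 1, so sum() counts matching rows
sum(toy$rm > 7)
median(toy$ptratio)

lowest <- which.min(toy$medv)  # index of the row with the smallest medv
toy[lowest, ]                  # all values for that row
sapply(toy, range)             # overall ranges, for comparison
```

Comparing the single row returned by which.min() against the output of sapply(toy, range) is one way to see where that observation sits relative to the rest of the data.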