1 Introduction to Statistical Learning
Assignments for the course focus on practical aspects of the concepts covered in the lectures. Assignments are based on the material covered in James et al. You will start working on each assignment in the lab session after the lecture, but may need to finish it after class.
You will have four days to work on the assignments, submitting knitted HTML files with your solutions via Moodle by midnight on Saturday. We will subsequently open up our solutions to the problem sets.
If you haven’t done so yet, complete the Data Camp tutorials:
- Data Camp R tutorials
- Data Camp R Markdown tutorials. You can complete the free first chapter.
- After the tutorials, complete the introduction to R lab session in James et al. (Chapter 2.3). That will help you tackle the following exercises:
1.1 Seminar
1.1.1 Exercise
This exercise relates to the College dataset from the main course textbook (James et al. 2013). You can download the dataset from https://uclspp.github.io/datasets. The dataset contains a number of variables for 777 different universities and colleges in the US.
The variables are:
- Private: Public/private indicator
- Apps: Number of applications received
- Accept: Number of applicants accepted
- Enroll: Number of new students enrolled
- Top10perc: New students from top 10% of high school class
- Top25perc: New students from top 25% of high school class
- F.Undergrad: Number of full-time undergraduates
- P.Undergrad: Number of part-time undergraduates
- Outstate: Out-of-state tuition
- Room.Board: Room and board costs
- Books: Estimated book costs
- Personal: Estimated personal spending
- PhD: Percent of faculty with Ph.D.’s
- Terminal: Percent of faculty with terminal degree
- S.F.Ratio: Student/faculty ratio
- perc.alumni: Percent of alumni who donate
- Expend: Instructional expenditure per student
- Grad.Rate: Graduation rate
- Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data, or you can load it in R directly from the website, using:
college <- read.csv("https://uclspp.github.io/datasets/data/College.csv")
- Look at the data using the View() function.
View(college)
You’ll notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following:
rownames(college) <- college[, 1]
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try
college <- college[, -1]
View(college)
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
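The two row-name steps above can be tried on a small example first. This is a minimal sketch on a made-up data frame (the object toy and all its values are hypothetical, not taken from the College data):

```r
# Toy stand-in for the college data; the names and values are made up.
toy <- data.frame(
  Name    = c("College A", "College B", "College C"),
  Private = c("Yes", "No", "Yes"),
  Apps    = c(1200, 5400, 800)
)

# Use the first column as row names, then drop it from the data.
rownames(toy) <- toy[, 1]
toy <- toy[, -1]

print(toy)
rownames(toy)             # the names are still available for lookups
toy["College B", "Apps"]  # index a row by its name
```

If you prefer, read.csv() also accepts a row.names argument, so passing row.names = 1 when loading should achieve the same result in one step.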
- Use the summary() function to produce a numerical summary of the variables in the dataset.
- Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
- Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
- Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
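As a rough illustration of the binning idea, the sketch below builds the same kind of two-level factor with ifelse() instead of rep() and indexing. It runs on invented data (the toy data frame and its values are assumptions for illustration), so it does not answer the exercise itself:

```r
# Toy data; the values are invented for illustration only.
toy <- data.frame(
  Top10perc = c(12, 55, 78, 30, 51, 9),
  Outstate  = c(7000, 18000, 21000, 9500, 16000, 6500)
)

# ifelse() does the binning in one step; factor() makes it qualitative.
toy$Elite <- factor(ifelse(toy$Top10perc > 50, "Yes", "No"))

summary(toy$Elite)  # counts per level

# With a factor as the first argument, plot() draws side-by-side boxplots.
plot(toy$Elite, toy$Outstate, xlab = "Elite", ylab = "Outstate")
```

Note that plot() only produces boxplots when the grouping variable is a factor; a character column would need to be converted with as.factor() first.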
- Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
- Continue exploring the data, and provide a brief summary of what you discover.
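The par(mfrow = c(2,2)) layout mentioned in the histogram step can be sketched as follows. The data here are simulated (x is a hypothetical stand-in for any quantitative variable), and the bin counts are arbitrary choices:

```r
set.seed(1)
x <- rnorm(200)  # made-up data standing in for a quantitative variable

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid; par() returns the old settings
for (b in c(5, 10, 20, 40)) {
  hist(x, breaks = b, main = paste("breaks =", b))
}
par(old_par)  # restore the single-plot layout
```

When breaks is a single number, hist() treats it as a suggestion and may pick a nearby "pretty" number of bins, so the panels will not necessarily show exactly 5, 10, 20 and 40 bars.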
1.1.2 Exercise
This exercise involves the Auto dataset from the textbook, which you can download from https://uclspp.github.io/datasets. Make sure that the missing values have been removed from the data. You should load that dataset as the first step of the exercise. Hint: We used the command for that in the Introduction to R session in class today. Go back and look up how to read in a csv file.
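One possible way to handle missing values at load time is sketched below. It reads a few lines of inline CSV text rather than the real file, and it assumes that missing entries are marked with "?"; that marker is an assumption, so inspect your own file before relying on it:

```r
# Inline CSV text standing in for a downloaded file; "?" as the
# missing-value marker is an assumption, so check your own file.
csv_text <- "mpg,horsepower,name
18,130,car a
25,?,car b
31,65,car c"

toy <- read.csv(text = csv_text, na.strings = "?")
colSums(is.na(toy))   # missing values per column
toy <- na.omit(toy)   # drop rows with any missing value
nrow(toy)
```

Declaring the marker through na.strings also keeps the affected column numeric; without it, R would read the whole column as text.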
- Which of the predictors are quantitative, and which are qualitative?
- What is the range of each quantitative predictor? You can answer this using the range() function.
- What is the mean and standard deviation of each quantitative predictor?
- Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
- Using the full dataset, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
- Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
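The summary statistics and the row removal asked for above follow a common pattern, sketched here on mtcars, a dataset that ships with R. It is used purely as a stand-in; the column selection and the row range 10 to 20 are arbitrary choices for illustration, not the ones the exercise asks for:

```r
# mtcars ships with R and is used here only as a stand-in dataset.
quant <- mtcars[, c("mpg", "disp", "hp", "wt")]

sapply(quant, range)  # min and max, one column per variable
sapply(quant, mean)
sapply(quant, sd)

# A negative index drops rows: here rows 10 through 20 are removed.
subset_quant <- quant[-(10:20), ]
nrow(subset_quant)
sapply(subset_quant, range)
```

sapply() applies the function to each column in turn, which avoids calling range(), mean() and sd() once per variable by hand.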
1.1.3 Exercise
This exercise involves the Boston housing dataset.
- To begin, load in the Boston dataset. The Boston dataset is part of the MASS library in R.
library(MASS)
Now the dataset is contained in the object Boston.
Boston
Read about the dataset:
?Boston
- How many rows are in this dataset? How many columns? What do the rows and columns represent?
- Make some pairwise scatterplots of the predictors (columns) in this dataset. Describe your findings.
- Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
- Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
- How many of the suburbs in this dataset bound the Charles river?
- What is the median pupil-teacher ratio among the towns in this dataset?
- Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
- In this dataset, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
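Several of the questions above reduce to counting rows that meet a condition or locating the row with an extreme value. A minimal sketch of those patterns on a made-up data frame (the object toy, its column names and its values are all hypothetical and unrelated to the Boston data):

```r
# Toy data frame; column names and values are made up for illustration.
toy <- data.frame(
  river = c(0, 1, 0, 1, 0),
  rooms = c(5.9, 7.2, 8.1, 6.4, 7.8),
  value = c(21.0, 34.5, 50.0, 18.2, 41.3)
)

dim(toy)                     # rows, then columns
sum(toy$river == 1)          # count rows meeting a condition
median(toy$rooms)
toy[which.min(toy$value), ]  # the full row with the smallest value
sum(toy$rooms > 7)
```

sum() works for counting because a logical comparison yields TRUE/FALSE values, which R treats as 1 and 0 when adding them up.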