Chapter 7, Question 5

Consider two curves, \(\hat{g}_1\) and \(\hat{g}_2\), defined by
\[\hat{g}_1 = \underset{g}{\operatorname{arg min}}\left(\sum_{i=1}^{n}\left(y_i - g\left(x_i\right)\right)^2 + \lambda\int \left[ g^{(3)}(x) \right]^2 \, dx\right),\]
\[\hat{g}_2 = \underset{g}{\operatorname{arg min}}\left(\sum_{i=1}^{n}\left(y_i - g\left(x_i\right)\right)^2 + \lambda\int \left[ g^{(4)}(x) \right]^2 \, dx\right),\]
where \(g^{(m)}\) represents the \(m\)th derivative of \(g\).

As \(\lambda \rightarrow \infty\), the penalty term dominates the criterion for \(\hat{g}_1\), forcing \(\int \left[ g^{(3)}(x) \right]^2 \, dx\) toward zero; i.e., the third derivative of \(g\) must vanish everywhere, which restricts \(\hat{g}_1\) to quadratic functions.

As \(\lambda \rightarrow \infty\), the penalty term likewise forces the fourth derivative of \(g\) to vanish for \(\hat{g}_2\), which restricts \(\hat{g}_2\) to cubic functions … which are, of course, more flexible than quadratic functions.
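To see why a vanishing derivative pins down the polynomial degree, integrate the constraint repeatedly:

\[g^{(3)}(x) = 0 \;\Rightarrow\; g(x) = c_2 x^2 + c_1 x + c_0,\]
\[g^{(4)}(x) = 0 \;\Rightarrow\; g(x) = c_3 x^3 + c_2 x^2 + c_1 x + c_0.\]

Each integration introduces one constant \(c_j\), so a zero \(m\)th derivative leaves a polynomial of degree at most \(m - 1\).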

5(a) As \(\lambda \rightarrow \infty\), will \(\hat{g}_1\) or \(\hat{g}_2\) have the smaller training RSS?

\(\hat{g}_2\) will have the smaller training RSS: the cubic family contains the quadratic family, so \(\hat{g}_2\) has more flexibility with which to fit the training data.

5(b) As \(\lambda \rightarrow \infty\), will \(\hat{g}_1\) or \(\hat{g}_2\) have the smaller test RSS?

We don’t know. If the additional complexity is not needed, \(\hat{g}_1\) will have the smaller test RSS (as the extra flexibility of \(\hat{g}_2\) will allow it to fit noise in the training data); however, if the additional complexity is needed, \(\hat{g}_2\) will have the smaller test RSS.

5(c) For \(\lambda = 0\), will \(\hat{g}_1\) or \(\hat{g}_2\) have the smaller training and test RSS?

They will have the same training and test RSS: the penalty term vanishes for both, so the two minimization problems are identical and yield the same fitted function.
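Concretely, at \(\lambda = 0\) both estimators solve

\[\hat{g}_1 = \hat{g}_2 = \underset{g}{\operatorname{arg min}} \sum_{i=1}^{n}\left(y_i - g\left(x_i\right)\right)^2.\]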

Chapter 7, Question 10

This question relates to the College data set.

10(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

set.seed(2^17-1)
library(ISLR)
response.index = which(colnames(College) == "Outstate")
# Encode Private as 0/1 so the predictors form a numeric matrix
College$Private = as.integer(College$Private == "Yes")
n.tot = nrow(College)
shuffle = sample(1:n.tot)
n.trn = round(0.8 * n.tot)  # 80/20 train/test split
trn_X = College[shuffle[1:n.trn],-response.index]
trn_y = College[shuffle[1:n.trn],response.index]
tst_X = College[shuffle[(n.trn+1):n.tot],-response.index]
tst_y = College[shuffle[(n.trn+1):n.tot],response.index]
# Drop Bennington College, an outlier, from the training set
index.outlier = which(row.names(trn_X) == "Bennington College")
trn_X = trn_X[-index.outlier,]
trn_y = trn_y[-index.outlier]

start.time = Sys.time()
library(leaps)
max.p = ncol(trn_X)
subsets = regsubsets(trn_X, trn_y, nvmax = max.p, method = "forward")

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
results = array(0, c(max.p, 3))
colnames(results) = list("Variable Count", "RMSE", "RMSE SD")
best.loss = Inf
best.subset = NA
model.glm = NA
for (i in 1:max.p) {
    # Predictors chosen by forward stepwise selection at subset size i
    selected = which(summary(subsets)$outmat[i,] == "*")
    # Estimate out-of-sample RMSE with 5-fold CV repeated 30 times
    trControl = trainControl("repeatedcv", number = 5, repeats = 30)
    model = train(as.matrix(trn_X[,selected]), trn_y, method = "glm", trControl = trControl)
    results[i,1] = i
    results[i,2] = model$results[,"RMSE"]
    results[i,3] = model$results[,"RMSESD"]
    if (results[i,2] < best.loss) {
        best.loss = results[i,2]
        best.subset = selected
        model.glm = model
    }
}
Sys.time() - start.time
## Time difference of 26.72828 secs
results[order(results[,2]),]
##       Variable Count     RMSE   RMSE SD
##  [1,]             13 1916.919 110.27078
##  [2,]             16 1928.467 121.08360
##  [3,]             15 1929.307  98.51426
##  [4,]             14 1932.964 113.34158
##  [5,]             12 1934.180 120.95794
##  [6,]             17 1936.913 108.48919
##  [7,]             10 1952.546 107.25833
##  [8,]              8 1954.368 104.00023
##  [9,]              9 1954.414 112.64903
## [10,]             11 1956.754 119.15694
## [11,]              7 1960.415 104.51674
## [12,]              6 1975.722 115.65625
## [13,]              5 2010.516 107.09283
## [14,]              4 2076.958 118.94260
## [15,]              3 2196.761 128.01232
## [16,]              2 2447.043 199.59457
## [17,]              1 2855.820 202.50988
summary(model.glm$finalModel)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5738.2  -1319.2     39.6   1261.1   5400.4  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.098e+03  8.144e+02  -1.349  0.17791    
## Private      1.866e+03  2.728e+02   6.838 1.96e-11 ***
## Apps        -3.343e-01  7.156e-02  -4.671 3.69e-06 ***
## Accept       7.654e-01  1.264e-01   6.056 2.45e-09 ***
## Top10perc    2.168e+01  7.154e+00   3.031  0.00254 ** 
## F.Undergrad -1.726e-01  3.811e-02  -4.527 7.20e-06 ***
## Room.Board   8.648e-01  8.994e-02   9.615  < 2e-16 ***
## Personal    -1.746e-01  1.211e-01  -1.442  0.14982    
## PhD          1.262e+01  8.834e+00   1.429  0.15346    
## Terminal     2.222e+01  9.747e+00   2.280  0.02297 *  
## S.F.Ratio   -8.772e+01  2.864e+01  -3.063  0.00229 ** 
## perc.alumni  3.635e+01  8.167e+00   4.451 1.02e-05 ***
## Expend       2.208e-01  2.651e-02   8.330 5.41e-16 ***
## Grad.Rate    2.687e+01  5.923e+00   4.536 6.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3519661)
## 
##     Null deviance: 9703371152  on 620  degrees of freedom
## Residual deviance: 2136434009  on 607  degrees of freedom
## AIC: 11139
## 
## Number of Fisher Scoring iterations: 2
mean((tst_y - predict(model.glm, newdata = tst_X[,best.subset]))^2)
## [1] 4678346

10(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

We use mgcv::gam (caret method = “gam”) because it automates the search for each smooth term’s degrees of freedom, providing a reasonable model without a lot of fuss. The cross-validation results indicate that the generalized additive model (GAM) has a lower RMSE than the generalized linear model (GLM). This is likely caused by non-linear effects associated with predictors such as Expend.

set.seed(2^17-1)
trControl = trainControl("repeatedcv", number = 50, repeats = 3)
start.time = Sys.time()
model.gam = train(trn_X[,best.subset], trn_y, method = "gam", trControl = trControl)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-15. For overview type 'help("mgcv-package")'.
Sys.time() - start.time
## Time difference of 22.90797 mins
model.gam$results
##   select method     RMSE  Rsquared   RMSESD RsquaredSD
## 1  FALSE GCV.Cp 1813.664 0.8005432 437.8747  0.1110292
## 2   TRUE GCV.Cp 1863.093 0.7907566 579.5910  0.1326873
par(mfrow = c(3, 4))
plot(model.gam$finalModel)

10(c) Evaluate the model obtained on the test set, and explain the results obtained.

The test mean squared error is substantially lower for the GAM than for the GLM, which is what the cross-validation results predicted.

mean((tst_y - predict(model.glm, newdata = tst_X[,best.subset]))^2)
## [1] 4678346
mean((tst_y - predict(model.gam, newdata = tst_X[,best.subset]))^2)
## [1] 3548246

10(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

There are five predictors for which the effective degrees of freedom (edf) exceed one and the approximate p-value is below 0.05: Grad.Rate, Accept, Apps, F.Undergrad, and Expend. The non-linearity is also visible in the corresponding component plots for the GAM (for example, the last row).
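One way to pull these out of the fitted model programmatically — a sketch assuming mgcv’s summary object, which exposes the smooth-term significance table as s.table:

s.tab = summary(model.gam$finalModel)$s.table  # columns: edf, Ref.df, F, p-value
rownames(s.tab)[s.tab[, "edf"] > 1 & s.tab[, "p-value"] < 0.05]

Per the table above, this selects s(Grad.Rate), s(Accept), s(Apps), s(F.Undergrad), and s(Expend).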

summary(model.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## .outcome ~ Private + s(perc.alumni) + s(Terminal) + s(Top10perc) + 
##     s(PhD) + s(Grad.Rate) + s(S.F.Ratio) + s(Personal) + s(Room.Board) + 
##     s(Accept) + s(Apps) + s(F.Undergrad) + s(Expend)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9286.1      216.5  42.882  < 2e-16 ***
## Private       1579.5      281.7   5.606 3.21e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F  p-value    
## s(perc.alumni) 1.000  1.000 15.517 9.15e-05 ***
## s(Terminal)    1.000  1.000  1.216 0.270597    
## s(Top10perc)   1.000  1.000  3.923 0.048099 *  
## s(PhD)         1.000  1.000  0.179 0.672289    
## s(Grad.Rate)   4.840  5.977  4.119 0.000497 ***
## s(S.F.Ratio)   3.056  3.934  1.248 0.243799    
## s(Personal)    4.002  4.961  1.660 0.138985    
## s(Room.Board)  1.000  1.000 67.183 1.37e-15 ***
## s(Accept)      6.378  7.281  4.877 2.75e-05 ***
## s(Apps)        6.473  7.479  3.785 0.002109 ** 
## s(F.Undergrad) 8.716  8.968  7.550 1.56e-10 ***
## s(Expend)      4.886  6.016 20.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 109/110
## R-sq.(adj) =  0.823   Deviance explained = 83.5%
## GCV = 2.993e+06  Scale est. = 2.7744e+06  n = 621

Kaggle: https://inclass.kaggle.com/c/ml210-songs

setwd("C:/Data/songs")
trn_X = read.csv("trn_X.csv", header = F, colClasses = "numeric", nrows = 463715)
trn_y = scan("trn_y.txt")
tst_X = read.csv("tst_X.csv", header = F, colClasses = "numeric", nrows = 51630)

trn_X_scaled = scale(trn_X, center = T, scale = T)
tst_X_scaled = scale(tst_X,
                     center = attr(trn_X_scaled, 'scaled:center'),
                     scale = attr(trn_X_scaled, 'scaled:scale'))
trn_y_scaled = scale(trn_y, center = T, scale = T)
# The Kaggle test labels are not available, so there is no tst_y to scale;
# the predictions below are un-scaled using the training-set center and scale.

# install.packages("drat", repos = "https://cran.rstudio.com")
# drat:::addRepo("dmlc")
# install.packages("mxnet")
# I manually selected "USA (CA 1) [https]" to install the mxnet package
library(mxnet)
## Init Rcpp
set.seed(2^17-1)
start.time = Sys.time()
mx.ctx.default()
## $device
## [1] "cpu"
## 
## $device_id
## [1] 0
## 
## $device_typeid
## [1] 1
## 
## attr(,"class")
## [1] "MXContext"
model = mx.mlp(data = trn_X_scaled, label = trn_y_scaled[,1],
               activation = "relu", hidden_node = c(64, 8),
               out_activation = "rmse", out_node = 1,
               num.round = 25, dropout = 0.05, learning.rate = 0.005,
               eval.metric = mx.metric.rmse, array.layout = "rowmajor")
## Start training with 1 devices
## [1] Train-rmse=0.994770753836239
## [2] Train-rmse=0.994734635131947
## [3] Train-rmse=0.994733421499247
## [4] Train-rmse=0.994731051375813
## [5] Train-rmse=0.994725721985384
## [6] Train-rmse=0.994712740057668
## [7] Train-rmse=0.994674308952976
## [8] Train-rmse=0.99450993281246
## [9] Train-rmse=0.992518993096862
## [10] Train-rmse=0.928476486706392
## [11] Train-rmse=0.844029341654699
## [12] Train-rmse=0.831400107387829
## [13] Train-rmse=0.82553420909734
## [14] Train-rmse=0.8205184491803
## [15] Train-rmse=0.816636934950976
## [16] Train-rmse=0.813486491999772
## [17] Train-rmse=0.810143144865991
## [18] Train-rmse=0.808194093550201
## [19] Train-rmse=0.806381493488886
## [20] Train-rmse=0.804182090064129
## [21] Train-rmse=0.802801889157572
## [22] Train-rmse=0.801194429234067
## [23] Train-rmse=0.799734148731052
## [24] Train-rmse=0.798136217018226
## [25] Train-rmse=0.797003522701925
Sys.time() - start.time
## Time difference of 4.120491 mins
dim(model$arg.params$fullyconnected0_weight)
## [1] 90 64
dim(model$arg.params$fullyconnected1_weight)
## [1] 64  8
dim(model$arg.params$fullyconnected2_weight)
## [1] 8 1
predictions.scaled = predict(model, tst_X_scaled, array.layout = "rowmajor")[1,]
predictions = attr(trn_y_scaled, 'scaled:scale') * predictions.scaled + attr(trn_y_scaled, 'scaled:center')
output = data.frame(Index = 1:length(predictions), Prediction = predictions)
write.csv(output, "predictions.csv", quote=F, row.names = F)