Chapter 8, Question 3

Consider the Gini index, classification error, and cross-entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
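For reference, writing \(p = \hat{p}_{m1}\), the two-class versions of the three node-impurity measures are: classification error \(E = 1 - \max(p, 1 - p)\), Gini index \(G = 2p(1 - p)\), and cross-entropy \(D = -p \log p - (1 - p)\log(1 - p)\). The code below computes the entropy with \(\log_2\), which only rescales that curve.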

probabilityPositive = seq(0, 1, by = 0.001)
probabilityNegative = 1 - probabilityPositive
# classification error: 1 - max(p, 1 - p) = min(p, 1 - p)
classificationError = pmin(probabilityPositive, probabilityNegative)
# Gini index: sum over classes of p_k * (1 - p_k)
gini = probabilityPositive * (1 - probabilityPositive) + probabilityNegative * (1 - probabilityNegative)
# cross-entropy, using the convention 0 * log2(0) = 0 to avoid NaN at the endpoints
xlogx = function(p) ifelse(p > 0, p * log2(p), 0)
entropy = - (xlogx(probabilityPositive) + xlogx(probabilityNegative))
plot(probabilityPositive, entropy, type = "l", xlab = "Probability( Class = Positive )", ylab = "Measure")
lines(probabilityPositive, gini, lty = "dotted", col = "green3")
lines(probabilityPositive, classificationError, lty = "dashed", col = "red")
legend("topright",
       legend = c("Entropy", "Gini", "Error"),
       col = c("black", "green3", "red"),
       lty = c("solid", "dotted", "dashed"))

Chapter 8, Question 10

We now use boosting to predict Salary in the Hitters data set.

10(a): Remove the observations for whom the salary information is unknown, and then log-transform the salaries.

set.seed(2^17-1)
options(width = 120)
library(ISLR)
Hitters = na.omit(Hitters)
Hitters$Salary = log(Hitters$Salary)

10(b): Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.

trn = Hitters[1:200,]
tst = Hitters[201:nrow(Hitters),]

10(c): Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter \(\lambda\). Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.

start.time = Sys.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# hold n.trees at 1,000 (as the question specifies) and search the remaining tuning parameters
tuneGrid = expand.grid(n.trees = 1000,
                       interaction.depth = c(1, 2, 3),
                       shrinkage = c(0.001, 0.01, 0.1),
                       n.minobsinnode = c(5, 10))
trControl = trainControl("repeatedcv", number = 5, repeats = 10)
model.glm = train(Salary ~ ., data = trn, method = "glm", trControl = trControl)
model.glm
## Generalized Linear Model 
## 
## 200 samples
##  19 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 160, 161, 159, 160, 160, 160, ... 
## Resampling results:
## 
##   RMSE       Rsquared 
##   0.6448545  0.5289125
## 
## 
trControl = trainControl("repeatedcv", number = 5, repeats = 10,
                         index = model.glm$control$index)

model.gbm = train(Salary ~ ., data = trn, method = "gbm",
                  trControl = trControl, tuneGrid = tuneGrid, verbose = F)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
Sys.time() - start.time
## Time difference of 1.477675 mins
model.gbm
## Stochastic Gradient Boosting 
## 
## 200 samples
##  19 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 160, 161, 159, 160, 160, 160, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  RMSE       Rsquared 
##   0.001      1                   5              0.6110911  0.6989227
##   0.001      1                  10              0.6107355  0.6996871
##   0.001      2                   5              0.5707869  0.7448904
##   0.001      2                  10              0.5674597  0.7488053
##   0.001      3                   5              0.5584758  0.7517077
##   0.001      3                  10              0.5520426  0.7591397
##   0.010      1                   5              0.4591868  0.7512597
##   0.010      1                  10              0.4537569  0.7565791
##   0.010      2                   5              0.4538172  0.7550325
##   0.010      2                  10              0.4446295  0.7646382
##   0.010      3                   5              0.4570267  0.7515417
##   0.010      3                  10              0.4468397  0.7619217
##   0.100      1                   5              0.5146992  0.6960220
##   0.100      1                  10              0.4974284  0.7137445
##   0.100      2                   5              0.5156581  0.6943758
##   0.100      2                  10              0.5152792  0.6972687
##   0.100      3                   5              0.5123096  0.6977470
##   0.100      3                  10              0.5143367  0.6967009
## 
## Tuning parameter 'n.trees' was held constant at a value of 1000
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth = 2, shrinkage = 0.01 and n.minobsinnode
##  = 10.
plot(model.gbm)
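Note that the question literally asks for the training set MSE at each shrinkage value, whereas the caret output above reports cross-validated RMSE. A minimal sketch of the training-MSE plot, fitting gbm directly (gbm was loaded by caret above; the other tuning values are held at the caret-selected settings):

shrinkages = c(0.001, 0.01, 0.1)
trainMSE = sapply(shrinkages, function(s) {
    fit = gbm(Salary ~ ., data = trn, distribution = "gaussian", n.trees = 1000,
              interaction.depth = 2, shrinkage = s, n.minobsinnode = 10)
    # training MSE: predict on the same observations the model was fit to
    mean((trn$Salary - predict(fit, trn, n.trees = 1000))^2)
})
plot(shrinkages, trainMSE, log = "x", xlab = "Shrinkage", ylab = "Training MSE")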

10(d): Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

# refit at shrinkage 0.001 and 0.100; the caret model above is already the shrinkage = 0.010 fit
model.gbm.001 = gbm(Salary ~ ., data = trn, distribution = "gaussian", n.trees = 1000,
                    interaction.depth = 2, shrinkage = 0.001, n.minobsinnode = 10)
model.gbm.100 = gbm(Salary ~ ., data = trn, distribution = "gaussian", n.trees = 1000,
                    interaction.depth = 2, shrinkage = 0.100, n.minobsinnode = 10)
mse.gbm.001 = mean((tst$Salary - predict(model.gbm.001, tst, n.trees = 1000))^2)
mse.gbm.010 = mean((tst$Salary - predict(model.gbm, tst, n.trees = 1000))^2)
mse.gbm.100 = mean((tst$Salary - predict(model.gbm.100, tst, n.trees = 1000))^2)

plot(c(.001, .010, .100), c(mse.gbm.001, mse.gbm.010, mse.gbm.100), xlab = "Shrinkage", ylab = "MSE")

10(e): Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.

model.glmnet = train(Salary ~ ., data = trn, method = "glmnet", trControl = trControl)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
model.glm$results[,"RMSE"]
## [1] 0.6448545
min(model.glmnet$results[,"RMSE"])
## [1] 0.6275017
min(model.gbm$results[,"RMSE"])
## [1] 0.4446295
mean((tst$Salary - predict(model.glm, tst))^2)
## [1] 0.4917959
mean((tst$Salary - predict(model.glmnet, tst))^2)
## [1] 0.4541935
mean((tst$Salary - predict(model.gbm, tst))^2)
## [1] 0.2782975
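Boosting clearly wins on both measures: its cross-validated RMSE (0.445) and test MSE (0.278) are well below those of the linear model (0.645 and 0.492) and the elastic net (0.628 and 0.454).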

10(f): Which variables appear to be the most important predictors in the boosted model?

varImp(model.gbm)
## gbm variable importance
## 
##            Overall
## CAtBat     100.000
## CRBI        43.135
## CWalks      42.319
## CRuns       40.303
## CHits       36.997
## Years       21.613
## CHmRun      16.986
## Walks       16.134
## PutOuts     15.060
## AtBat       12.413
## RBI         11.404
## Hits         9.693
## Assists      9.666
## HmRun        7.338
## Errors       7.184
## Runs         5.669
## NewLeagueN   1.690
## DivisionW    1.476
## LeagueN      0.000
varImp(model.gbm, scale = F)
## gbm variable importance
## 
##             Overall
## CAtBat     1159.031
## CRBI        504.483
## CWalks      495.083
## CRuns       471.887
## CHits       433.822
## Years       256.744
## CHmRun      203.493
## Walks       193.686
## PutOuts     181.317
## AtBat       150.850
## RBI         139.239
## Hits        119.540
## Assists     119.233
## HmRun        92.440
## Errors       90.664
## Runs         73.226
## NewLeagueN   27.421
## DivisionW    24.954
## LeagueN       7.969
# varImp's default scaling is a min-max rescaling of the raw importances onto [0, 100]:
importance = varImp(model.gbm, scale = F)$importance
100 * (importance - min(importance)) / diff(range(importance))
##               Overall
## AtBat       12.412979
## Hits         9.692842
## HmRun        7.338475
## Runs         5.669295
## RBI         11.404235
## Walks       16.134347
## Years       21.612638
## CAtBat     100.000000
## CHits       36.996557
## CHmRun      16.986357
## CRuns       40.303490
## CRBI        43.135277
## CWalks      42.318651
## LeagueN      0.000000
## DivisionW    1.475546
## PutOuts     15.059841
## Assists      9.666148
## Errors       7.184233
## NewLeagueN   1.689900
# gbm's own summary reports relative influence, normalized to sum to 100
summary(model.gbm$finalModel)

##                   var    rel.inf
## CAtBat         CAtBat 24.4259352
## CRBI             CRBI 10.6316986
## CWalks         CWalks 10.4336016
## CRuns           CRuns  9.9447641
## CHits           CHits  9.1425687
## Years           Years  5.4107397
## CHmRun         CHmRun  4.2884971
## Walks           Walks  4.0818167
## PutOuts       PutOuts  3.8211631
## AtBat           AtBat  3.1790877
## RBI               RBI  2.9343869
## Hits             Hits  2.5192374
## Assists       Assists  2.5127620
## HmRun           HmRun  1.9481154
## Errors         Errors  1.9106994
## Runs             Runs  1.5432059
## NewLeagueN NewLeagueN  0.5778847
## DivisionW   DivisionW  0.5258869
## LeagueN       LeagueN  0.1679491
# the same relative-influence numbers, recovered from the raw varImp scores:
100 * importance / sum(importance)
##               Overall
## AtBat       3.1790877
## Hits        2.5192374
## HmRun       1.9481154
## Runs        1.5432059
## RBI         2.9343869
## Walks       4.0818167
## Years       5.4107397
## CAtBat     24.4259352
## CHits       9.1425687
## CHmRun      4.2884971
## CRuns       9.9447641
## CRBI       10.6316986
## CWalks     10.4336016
## LeagueN     0.1679491
## DivisionW   0.5258869
## PutOuts     3.8211631
## Assists     2.5127620
## Errors      1.9106994
## NewLeagueN  0.5778847
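The career-total statistics dominate: CAtBat is by far the most important predictor, followed by CRBI, CWalks, CRuns, and CHits.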

10(g): Now apply bagging to the training set. What is the test set MSE for this approach?

model.rf = train(Salary ~ ., data = trn, method = "rf", trControl = trControl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
model.rf
## Random Forest 
## 
## 200 samples
##  19 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 160, 161, 159, 160, 160, 160, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared 
##    2    0.4420958  0.7659045
##   10    0.4466129  0.7597085
##   19    0.4518038  0.7547292
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 2.
model.glm$results[,"RMSE"]
## [1] 0.6448545
min(model.glmnet$results[,"RMSE"])
## [1] 0.6275017
min(model.gbm$results[,"RMSE"])
## [1] 0.4446295
min(model.rf$results[,"RMSE"])
## [1] 0.4420958
mean((tst$Salary - predict(model.glm, tst))^2)
## [1] 0.4917959
mean((tst$Salary - predict(model.glmnet, tst))^2)
## [1] 0.4541935
mean((tst$Salary - predict(model.gbm, tst))^2)
## [1] 0.2782975
mean((tst$Salary - predict(model.rf, tst))^2)
## [1] 0.2152051
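Strictly speaking, bagging is the special case mtry = p, i.e., the mtry = 19 row above (CV RMSE 0.4518); caret instead selected the random forest with mtry = 2. A minimal sketch that fixes mtry at 19 and computes the corresponding test MSE:

# bagging = a random forest that considers all 19 predictors at every split
model.bag = train(Salary ~ ., data = trn, method = "rf",
                  trControl = trControl, tuneGrid = data.frame(mtry = 19))
mean((tst$Salary - predict(model.bag, tst))^2)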

Kaggle: https://inclass.kaggle.com/c/ml210-marketing

setwd("C:/Data/Marketing")
set.seed(2^17-1)
start.time = Sys.time()

trn_X = read.csv("trn_X.csv", header = F)
trn_y = factor(scan("trn_y.txt", what = "character"), levels = c("no", "yes"))
tst_X = read.csv("tst_X.csv", header = F)

namelist = c("age", "job", "marital", "education", "default", "balance",
             "housing", "loan", "contact", "day_of_week", "month",
             "duration", "campaign", "pdays", "previous", "poutcome")

colnames(trn_X) = namelist
colnames(tst_X) = namelist

# full dummy coding with no intercept; drop column 13 (the "unknown" job dummy)
# so that level acts as the baseline
trn_mm = model.matrix(~ 0 + ., data = trn_X)
trn_mm = trn_mm[,-13]
tst_mm = model.matrix(~ 0 + ., data = tst_X)
tst_mm = tst_mm[,-13]

library(caret)
CustomLogLossFunction = function(data, lev = NULL, model = NULL) {
    actuals = as.integer(data$obs == "yes")
    predictions = data$yes
    epsilon = 1e-15
    # clip predictions to [epsilon, 1 - epsilon] to avoid log(0) = -Inf
    bounded.predictions = pmin(pmax(predictions, epsilon), 1 - epsilon)
    logLoss = -mean(log(actuals * bounded.predictions + (1 - actuals) * (1 - bounded.predictions)))
    return(c(CustomLogLoss = logLoss))
}
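A quick sanity check on a two-row toy frame (hypothetical values; with classProbs = TRUE, caret passes the summary function a data frame containing an obs factor and one probability column per class):

toy = data.frame(obs = factor(c("yes", "no"), levels = c("no", "yes")),
                 yes = c(0.9, 0.2))
CustomLogLossFunction(toy)  # -mean(log(c(0.9, 0.8))) ~ 0.1643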

trControl = trainControl("cv", number = 5, summaryFunction = CustomLogLossFunction,
                         classProbs = T)
model.glm = train(trn_mm, trn_y, method = "glmnet", metric = "CustomLogLoss", maximize = F,
                  trControl = trControl)
model.glm
## glmnet 
## 
## 36169 samples
##    42 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 28935, 28934, 28935, 28936, 28936 
## Resampling results across tuning parameters:
## 
##   alpha  lambda       CustomLogLoss
##   0.10   0.000258182  0.2373916    
##   0.10   0.002581820  0.2375670    
##   0.10   0.025818199  0.2449992    
##   0.55   0.000258182  0.2373835    
##   0.55   0.002581820  0.2379925    
##   0.55   0.025818199  0.2560206    
##   1.00   0.000258182  0.2373810    
##   1.00   0.002581820  0.2387474    
##   1.00   0.025818199  0.2657210    
## 
## CustomLogLoss was used to select the optimal model using  the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.000258182.
trControl = trainControl("cv", number = 5, summaryFunction = CustomLogLossFunction,
                         classProbs = T, index = model.glm$control$index)
tuneGrid = expand.grid(eta = c(0.05, 0.1),
                       max_depth = c(5, 6),
                       gamma = 0,
                       colsample_bytree = c(0.8, 0.9, 1.0),
                       min_child_weight = c(1),
                       subsample = c(0.8, 0.9, 1.0),
                       nrounds = c(250, 500))
model = train(trn_mm, trn_y, method = "xgbTree", metric = "CustomLogLoss", maximize = F,
              trControl = trControl, tuneGrid = tuneGrid)
## Loading required package: xgboost
predictions = predict(model, tst_mm, type = "prob")[,"yes"]
output = data.frame(Index = seq_along(predictions), Prediction = predictions)
write.csv(output, "predictions.csv", quote = F, row.names = F)

model$results[order(model$results[,"CustomLogLoss"]),]
##     eta max_depth gamma colsample_bytree min_child_weight subsample nrounds CustomLogLoss CustomLogLossSD
## 21 0.05         6     0              0.8                1       0.9     250     0.1961200     0.003947348
## 4  0.05         5     0              0.8                1       0.9     500     0.1962825     0.004018891
## 10 0.05         5     0              0.9                1       0.9     500     0.1962839     0.004972018
## 27 0.05         6     0              0.9                1       0.9     250     0.1965048     0.004127283
## 31 0.05         6     0              1.0                1       0.8     250     0.1965163     0.004555495
## 19 0.05         6     0              0.8                1       0.8     250     0.1965775     0.003867908
## 22 0.05         6     0              0.8                1       0.9     500     0.1966390     0.004561713
## 25 0.05         6     0              0.9                1       0.8     250     0.1966530     0.004462809
## 33 0.05         6     0              1.0                1       0.9     250     0.1967031     0.004460382
## 2  0.05         5     0              0.8                1       0.8     500     0.1968749     0.004005561
## 23 0.05         6     0              0.8                1       1.0     250     0.1969208     0.004053932
## 16 0.05         5     0              1.0                1       0.9     500     0.1970726     0.004716050
## 14 0.05         5     0              1.0                1       0.8     500     0.1972043     0.003976818
## 28 0.05         6     0              0.9                1       0.9     500     0.1972898     0.004565770
## 6  0.05         5     0              0.8                1       1.0     500     0.1973854     0.004328404
## 45 0.10         5     0              0.9                1       0.9     250     0.1974004     0.004293112
## 8  0.05         5     0              0.9                1       0.8     500     0.1974027     0.004790353
## 20 0.05         6     0              0.8                1       0.8     500     0.1974878     0.004408071
## 47 0.10         5     0              0.9                1       1.0     250     0.1975298     0.004870309
## 24 0.05         6     0              0.8                1       1.0     500     0.1975538     0.003925460
## 41 0.10         5     0              0.8                1       1.0     250     0.1976280     0.004506649
## 12 0.05         5     0              0.9                1       1.0     500     0.1976469     0.003694515
## 9  0.05         5     0              0.9                1       0.9     250     0.1977615     0.004356007
## 26 0.05         6     0              0.9                1       0.8     500     0.1977764     0.004526176
## 29 0.05         6     0              0.9                1       1.0     250     0.1977894     0.004148830
## 3  0.05         5     0              0.8                1       0.9     250     0.1978130     0.003576741
## 39 0.10         5     0              0.8                1       0.9     250     0.1979003     0.003921870
## 34 0.05         6     0              1.0                1       0.9     500     0.1979420     0.004713222
## 51 0.10         5     0              1.0                1       0.9     250     0.1979480     0.004667295
## 7  0.05         5     0              0.9                1       0.8     250     0.1980275     0.004215175
## 1  0.05         5     0              0.8                1       0.8     250     0.1980441     0.003727029
## 13 0.05         5     0              1.0                1       0.8     250     0.1980642     0.004050311
## 59 0.10         6     0              0.8                1       1.0     250     0.1981048     0.004489015
## 15 0.05         5     0              1.0                1       0.9     250     0.1981381     0.004106602
## 35 0.05         6     0              1.0                1       1.0     250     0.1981432     0.004356976
## 65 0.10         6     0              0.9                1       1.0     250     0.1981555     0.004245985
## 30 0.05         6     0              0.9                1       1.0     500     0.1981709     0.004417330
## 57 0.10         6     0              0.8                1       0.9     250     0.1982087     0.004431710
## 53 0.10         5     0              1.0                1       1.0     250     0.1982168     0.004692512
## 32 0.05         6     0              1.0                1       0.8     500     0.1982673     0.004584904
## 43 0.10         5     0              0.9                1       0.8     250     0.1983039     0.004312819
## 18 0.05         5     0              1.0                1       1.0     500     0.1983492     0.004614575
## 36 0.05         6     0              1.0                1       1.0     500     0.1985754     0.004890971
## 49 0.10         5     0              1.0                1       0.8     250     0.1986049     0.005039759
## 5  0.05         5     0              0.8                1       1.0     250     0.1986469     0.004366705
## 55 0.10         6     0              0.8                1       0.8     250     0.1986630     0.005366151
## 63 0.10         6     0              0.9                1       0.9     250     0.1986832     0.004299088
## 37 0.10         5     0              0.8                1       0.8     250     0.1988187     0.004239453
## 11 0.05         5     0              0.9                1       1.0     250     0.1989999     0.003538617
## 71 0.10         6     0              1.0                1       1.0     250     0.1990619     0.005220631
## 61 0.10         6     0              0.9                1       0.8     250     0.1990620     0.004672558
## 69 0.10         6     0              1.0                1       0.9     250     0.1990955     0.004914598
## 17 0.05         5     0              1.0                1       1.0     250     0.1993693     0.004780659
## 42 0.10         5     0              0.8                1       1.0     500     0.2003288     0.004581048
## 48 0.10         5     0              0.9                1       1.0     500     0.2003340     0.005428104
## 67 0.10         6     0              1.0                1       0.8     250     0.2007773     0.005021787
## 40 0.10         5     0              0.8                1       0.9     500     0.2013073     0.004333997
## 46 0.10         5     0              0.9                1       0.9     500     0.2014448     0.004639352
## 54 0.10         5     0              1.0                1       1.0     500     0.2016392     0.004782396
## 52 0.10         5     0              1.0                1       0.9     500     0.2017155     0.004969808
## 38 0.10         5     0              0.8                1       0.8     500     0.2025511     0.004023601
## 44 0.10         5     0              0.9                1       0.8     500     0.2029201     0.005051892
## 50 0.10         5     0              1.0                1       0.8     500     0.2036573     0.005714756
## 60 0.10         6     0              0.8                1       1.0     500     0.2040666     0.004931650
## 66 0.10         6     0              0.9                1       1.0     500     0.2046592     0.004998282
## 72 0.10         6     0              1.0                1       1.0     500     0.2056291     0.005173365
## 64 0.10         6     0              0.9                1       0.9     500     0.2056323     0.005011829
## 58 0.10         6     0              0.8                1       0.9     500     0.2057152     0.005158491
## 62 0.10         6     0              0.9                1       0.8     500     0.2063959     0.005183844
## 70 0.10         6     0              1.0                1       0.9     500     0.2064732     0.005413799
## 56 0.10         6     0              0.8                1       0.8     500     0.2069423     0.005691156
## 68 0.10         6     0              1.0                1       0.8     500     0.2100585     0.006044542
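The best configuration (eta = 0.05, max_depth = 6, colsample_bytree = 0.8, subsample = 0.9, nrounds = 250) achieves a cross-validated log loss of 0.1961, a solid improvement over the glmnet baseline's 0.2374.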
varImp(model)
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 42)
## 
##                 Overall
## duration        100.000
## poutcomesuccess  25.341
## pdays            14.976
## day_of_week      14.670
## contactunknown   13.986
## age              13.813
## balance          11.219
## housingyes       10.864
## monthmar          4.751
## monthjun          4.509
## campaign          4.046
## monthoct          3.983
## monthjul          3.685
## monthaug          3.600
## monthmay          3.259
## monthnov          2.988
## monthfeb          2.794
## monthsep          2.654
## previous          2.393
## monthjan          1.789
varImp(model, scale = F)
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 42)
## 
##                  Overall
## duration        0.387315
## poutcomesuccess 0.098354
## pdays           0.058236
## day_of_week     0.057054
## contactunknown  0.054406
## age             0.053736
## balance         0.043695
## housingyes      0.042322
## monthmar        0.018661
## monthjun        0.017727
## campaign        0.015933
## monthoct        0.015689
## monthjul        0.014536
## monthaug        0.014207
## monthmay        0.012887
## monthnov        0.011839
## monthfeb        0.011088
## monthsep        0.010546
## previous        0.009537
## monthjan        0.007196
# as before, varImp's scaled output is a min-max rescaling of the raw scores onto [0, 100]:
importance = varImp(model, scale = F)$importance
100 * (importance - min(importance)) / diff(range(importance))
##                         Overall
## duration           100.00000000
## poutcomesuccess     25.34101490
## pdays               14.97569567
## day_of_week         14.67031410
## contactunknown      13.98608801
## age                 13.81292352
## balance             11.21865236
## housingyes          10.86406866
## monthmar             4.75072211
## monthjun             4.50932928
## campaign             4.04586567
## monthoct             3.98287955
## monthjul             3.68504641
## monthaug             3.59983221
## monthmay             3.25896262
## monthnov             2.98817531
## monthfeb             2.79401185
## monthsep             2.65409430
## previous             2.39337984
## monthjan             1.78860879
## loanyes              1.71918505
## maritalmarried       1.12142954
## educationtertiary    1.00080508
## jobblue-collar       0.80608135
## monthdec             0.75634613
## maritalsingle        0.73217859
## contacttelephone     0.56052484
## jobstudent           0.50573787
## educationsecondary   0.48039295
## poutcomeother        0.42308689
## jobmanagement        0.41840088
## jobtechnician        0.34606502
## jobadmin.            0.29019644
## jobhousemaid         0.21837264
## jobretired           0.21406637
## educationunknown     0.19051189
## jobservices          0.09043452
## jobentrepreneur      0.07554184
## jobself-employed     0.06841997
## defaultyes           0.04029796
## jobunemployed        0.02149651
## poutcomeunknown      0.00000000
print(Sys.time() - start.time)
## Time difference of 38.95821 mins