Consider the Gini index, classification error, and cross-entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
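In the two-class case, writing \(p = \hat{p}_{m1}\), the three quantities plotted below are the classification error \(E = 1 - \max(p, 1 - p)\), the Gini index \(G = 2p(1 - p)\), and the cross-entropy \(D = -p \log_2 p - (1 - p) \log_2 (1 - p)\); the code uses base-2 logarithms, and a different base would only rescale the entropy curve.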
probabilityPositive = seq(0.0, 1.0, .001)
probabilityNegative = 1 - probabilityPositive
# Classification error: 1 - max(p, 1 - p), i.e. min(p, 1 - p)
classificationError = probabilityPositive
classificationError[probabilityPositive > 0.5] = 1 - probabilityPositive[probabilityPositive > 0.5]
# Gini index: sum over the two classes of p * (1 - p)
gini = probabilityPositive * (1 - probabilityPositive) + probabilityNegative * (1 - probabilityNegative)
# Cross-entropy (base 2); the endpoints evaluate to NaN (0 * log2(0)) and are simply not drawn
entropy = - (probabilityPositive * log2(probabilityPositive) + probabilityNegative * log2(probabilityNegative))
plot(probabilityPositive, entropy, type = "l", xlab = "Probability( Class = Positive )", ylab = "Measure")
lines(probabilityPositive, gini, lty = "dotted", col = "green3")
lines(probabilityPositive, classificationError, lty = "dashed", col = "red")
legend("topright",
       legend = c("Entropy", "Gini", "Error"),
       col = c("black", "green3", "red"),
       lty = c("solid", "dotted", "dashed"))
We now use boosting to predict Salary in the Hitters data set.
10(a): Remove the observations for whom the salary information is unknown, and then log-transform the salaries.
set.seed(2^17-1)
options(width = 120)
library(ISLR)
# Drop observations with missing Salary, then log-transform
Hitters = na.omit(Hitters)
Hitters$Salary = log(Hitters$Salary)
10(b): Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
trn = Hitters[1:200,]
tst = Hitters[201:nrow(Hitters),]
10(c): Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter \(\lambda\). Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.
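The prompt asks for training-set MSE as a function of the shrinkage parameter \(\lambda\); here is a minimal sketch of that plot, assuming the gbm package is used directly (the interaction depth of 2 and the grid of \(\lambda\) values are illustrative choices). The caret-based tuning that follows instead estimates out-of-sample RMSE by repeated cross-validation, which is more useful for actually choosing \(\lambda\).
library(gbm)
# Illustrative grid of shrinkage values, equally spaced on a log scale
shrinkage.grid = 10^seq(-3, 0, by = 0.25)
training.mse = sapply(shrinkage.grid, function(lambda) {
  # Fit 1,000 boosted trees at this shrinkage and record the training-set MSE
  fit = gbm(Salary ~ ., data = trn, distribution = "gaussian",
            n.trees = 1000, interaction.depth = 2, shrinkage = lambda)
  mean((trn$Salary - predict(fit, trn, n.trees = 1000))^2)
})
plot(shrinkage.grid, training.mse, log = "x", type = "b",
     xlab = "Shrinkage", ylab = "Training set MSE")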
start.time = Sys.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
tuneGrid = expand.grid(n.trees = 1000,
interaction.depth = c(1, 2, 3),
shrinkage = c(0.001, 0.01, 0.1),
n.minobsinnode = c(5, 10))
trControl = trainControl("repeatedcv", number = 5, repeats = 10)
model.glm = train(Salary ~ ., data = trn, method = "glm", trControl = trControl)
model.glm
## Generalized Linear Model
##
## 200 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 160, 161, 159, 160, 160, 160, ...
## Resampling results:
##
## RMSE Rsquared
## 0.6448545 0.5289125
##
##
trControl = trainControl("repeatedcv", number = 5, repeats = 10,
index = model.glm$control$index)
model.gbm = train(Salary ~ ., data = trn, method = "gbm",
trControl = trControl, tuneGrid = tuneGrid, verbose = F)
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
Sys.time() - start.time
## Time difference of 1.477675 mins
model.gbm
## Stochastic Gradient Boosting
##
## 200 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 160, 161, 159, 160, 160, 160, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode RMSE Rsquared
## 0.001 1 5 0.6110911 0.6989227
## 0.001 1 10 0.6107355 0.6996871
## 0.001 2 5 0.5707869 0.7448904
## 0.001 2 10 0.5674597 0.7488053
## 0.001 3 5 0.5584758 0.7517077
## 0.001 3 10 0.5520426 0.7591397
## 0.010 1 5 0.4591868 0.7512597
## 0.010 1 10 0.4537569 0.7565791
## 0.010 2 5 0.4538172 0.7550325
## 0.010 2 10 0.4446295 0.7646382
## 0.010 3 5 0.4570267 0.7515417
## 0.010 3 10 0.4468397 0.7619217
## 0.100 1 5 0.5146992 0.6960220
## 0.100 1 10 0.4974284 0.7137445
## 0.100 2 5 0.5156581 0.6943758
## 0.100 2 10 0.5152792 0.6972687
## 0.100 3 5 0.5123096 0.6977470
## 0.100 3 10 0.5143367 0.6967009
##
## Tuning parameter 'n.trees' was held constant at a value of 1000
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth = 2, shrinkage = 0.01 and n.minobsinnode
## = 10.
plot(model.gbm)
10(d): Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.
model.gbm.001 = gbm(Salary ~ ., data = trn, n.trees = 1000, interaction.depth = 2, shrinkage = 0.001, n.minobsinnode = 10)
## Distribution not specified, assuming gaussian ...
model.gbm.100 = gbm(Salary ~ ., data = trn, n.trees = 1000, interaction.depth = 2, shrinkage = 0.100, n.minobsinnode = 10)
## Distribution not specified, assuming gaussian ...
mse.gbm.001 = mean((tst$Salary - predict(model.gbm.001, tst, n.trees = 1000))^2)
mse.gbm.010 = mean((tst$Salary - predict(model.gbm, tst, n.trees = 1000))^2)
mse.gbm.100 = mean((tst$Salary - predict(model.gbm.100, tst, n.trees = 1000))^2)
plot(c(.001, .010, .100), c(mse.gbm.001, mse.gbm.010, mse.gbm.100), xlab = "Shrinkage", ylab = "MSE")
10(e): Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.
model.glmnet = train(Salary ~ ., data = trn, method = "glmnet", trControl = trControl)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
model.glm$results[,"RMSE"]
## [1] 0.6448545
min(model.glmnet$results[,"RMSE"])
## [1] 0.6275017
min(model.gbm$results[,"RMSE"])
## [1] 0.4446295
mean((tst$Salary - predict(model.glm, tst))^2)
## [1] 0.4917959
mean((tst$Salary - predict(model.glmnet, tst))^2)
## [1] 0.4541935
mean((tst$Salary - predict(model.gbm, tst))^2)
## [1] 0.2782975
10(f): Which variables appear to be the most important predictors in the boosted model?
varImp(model.gbm)
## gbm variable importance
##
## Overall
## CAtBat 100.000
## CRBI 43.135
## CWalks 42.319
## CRuns 40.303
## CHits 36.997
## Years 21.613
## CHmRun 16.986
## Walks 16.134
## PutOuts 15.060
## AtBat 12.413
## RBI 11.404
## Hits 9.693
## Assists 9.666
## HmRun 7.338
## Errors 7.184
## Runs 5.669
## NewLeagueN 1.690
## DivisionW 1.476
## LeagueN 0.000
varImp(model.gbm, scale = F)
## gbm variable importance
##
## Overall
## CAtBat 1159.031
## CRBI 504.483
## CWalks 495.083
## CRuns 471.887
## CHits 433.822
## Years 256.744
## CHmRun 203.493
## Walks 193.686
## PutOuts 181.317
## AtBat 150.850
## RBI 139.239
## Hits 119.540
## Assists 119.233
## HmRun 92.440
## Errors 90.664
## Runs 73.226
## NewLeagueN 27.421
## DivisionW 24.954
## LeagueN 7.969
importance = varImp(model.gbm, scale = F)$importance
# caret's scaled importance above is a min-max rescaling of the unscaled values to [0, 100]
100 * (importance - min(importance)) / diff(range(importance))
## Overall
## AtBat 12.412979
## Hits 9.692842
## HmRun 7.338475
## Runs 5.669295
## RBI 11.404235
## Walks 16.134347
## Years 21.612638
## CAtBat 100.000000
## CHits 36.996557
## CHmRun 16.986357
## CRuns 40.303490
## CRBI 43.135277
## CWalks 42.318651
## LeagueN 0.000000
## DivisionW 1.475546
## PutOuts 15.059841
## Assists 9.666148
## Errors 7.184233
## NewLeagueN 1.689900
summary(model.gbm$finalModel)
## var rel.inf
## CAtBat CAtBat 24.4259352
## CRBI CRBI 10.6316986
## CWalks CWalks 10.4336016
## CRuns CRuns 9.9447641
## CHits CHits 9.1425687
## Years Years 5.4107397
## CHmRun CHmRun 4.2884971
## Walks Walks 4.0818167
## PutOuts PutOuts 3.8211631
## AtBat AtBat 3.1790877
## RBI RBI 2.9343869
## Hits Hits 2.5192374
## Assists Assists 2.5127620
## HmRun HmRun 1.9481154
## Errors Errors 1.9106994
## Runs Runs 1.5432059
## NewLeagueN NewLeagueN 0.5778847
## DivisionW DivisionW 0.5258869
## LeagueN LeagueN 0.1679491
# Normalizing the unscaled values to sum to 100 reproduces gbm's relative influence above
100 * importance / sum(importance)
## Overall
## AtBat 3.1790877
## Hits 2.5192374
## HmRun 1.9481154
## Runs 1.5432059
## RBI 2.9343869
## Walks 4.0818167
## Years 5.4107397
## CAtBat 24.4259352
## CHits 9.1425687
## CHmRun 4.2884971
## CRuns 9.9447641
## CRBI 10.6316986
## CWalks 10.4336016
## LeagueN 0.1679491
## DivisionW 0.5258869
## PutOuts 3.8211631
## Assists 2.5127620
## Errors 1.9106994
## NewLeagueN 0.5778847
10(g): Now apply bagging to the training set. What is the test set MSE for this approach?
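Bagging is the special case of a random forest in which mtry equals the number of predictors (19 here), so it corresponds to the mtry = 19 row in the tuning results below. A minimal sketch that fits it directly, assuming the randomForest package with its default number of trees:
library(randomForest)
# Bagging: consider all 19 predictors at every split
model.bag = randomForest(Salary ~ ., data = trn, mtry = 19, importance = TRUE)
mean((tst$Salary - predict(model.bag, tst))^2)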
model.rf = train(Salary ~ ., data = trn, method = "rf", trControl = trControl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
model.rf
## Random Forest
##
## 200 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 160, 161, 159, 160, 160, 160, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.4420958 0.7659045
## 10 0.4466129 0.7597085
## 19 0.4518038 0.7547292
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
model.glm$results[,"RMSE"]
## [1] 0.6448545
min(model.glmnet$results[,"RMSE"])
## [1] 0.6275017
min(model.gbm$results[,"RMSE"])
## [1] 0.4446295
min(model.rf$results[,"RMSE"])
## [1] 0.4420958
mean((tst$Salary - predict(model.glm, tst))^2)
## [1] 0.4917959
mean((tst$Salary - predict(model.glmnet, tst))^2)
## [1] 0.4541935
mean((tst$Salary - predict(model.gbm, tst))^2)
## [1] 0.2782975
mean((tst$Salary - predict(model.rf, tst))^2)
## [1] 0.2152051
setwd("C:/Data/Marketing")
set.seed(2^17-1)
start.time = Sys.time()
trn_X = read.csv("trn_X.csv", header = F)
trn_y = factor(scan("trn_y.txt", what = "character"), levels = c("no", "yes"))
tst_X = read.csv("tst_X.csv", header = F)
namelist = c("age", "job", "marital", "education", "default", "balance",
"housing", "loan", "contact", "day_of_week", "month",
"duration", "campaign", "pdays", "previous", "poutcome")
colnames(trn_X) = namelist
colnames(tst_X) = namelist
# Expand the categorical predictors into dummy variables (no intercept column)
trn_mm = model.matrix(~ 0 + ., data = trn_X)
trn_mm = trn_mm[,-13]   # drop column 13 (the jobunknown dummy)
tst_mm = model.matrix(~ 0 + ., data = tst_X)
tst_mm = tst_mm[,-13]
library(caret)
# Custom summary function for caret: binary log loss (cross-entropy) on the "yes" class probability
CustomLogLossFunction = function(data, lev = NULL, model = NULL) {
  actuals = as.integer(data$obs == "yes")
  predictions = data$yes
  epsilon = 1e-15
  # Clamp predictions to [epsilon, 1 - epsilon] to avoid log(0) = -Inf
  bounded.predictions = pmin(pmax(predictions, epsilon), 1 - epsilon)
  logLoss = - mean(log(actuals * bounded.predictions + (1 - actuals) * (1 - bounded.predictions)))
  return(c(CustomLogLoss = logLoss))
}
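Because the actuals are coded 0/1, the term inside the log is \(\hat{p}_i\) when \(y_i = 1\) and \(1 - \hat{p}_i\) when \(y_i = 0\), so the function returns the usual binary log loss \(-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i)\right]\), evaluated on the clamped probabilities.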
trControl = trainControl("cv", number = 5, summaryFunction = CustomLogLossFunction,
classProbs = T)
model.glm = train(trn_mm, trn_y, method = "glmnet", metric = "CustomLogLoss", maximize = F,
trControl = trControl)
model.glm
## glmnet
##
## 36169 samples
## 42 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 28935, 28934, 28935, 28936, 28936
## Resampling results across tuning parameters:
##
## alpha lambda CustomLogLoss
## 0.10 0.000258182 0.2373916
## 0.10 0.002581820 0.2375670
## 0.10 0.025818199 0.2449992
## 0.55 0.000258182 0.2373835
## 0.55 0.002581820 0.2379925
## 0.55 0.025818199 0.2560206
## 1.00 0.000258182 0.2373810
## 1.00 0.002581820 0.2387474
## 1.00 0.025818199 0.2657210
##
## CustomLogLoss was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.000258182.
trControl = trainControl("cv", number = 5, summaryFunction = CustomLogLossFunction,
classProbs = T, index = model.glm$control$index)
tuneGrid = expand.grid(eta = c(0.05, 0.1),
max_depth = c(5, 6),
gamma = 0,
colsample_bytree = c(0.8, 0.9, 1.0),
min_child_weight = c(1),
subsample = c(0.8, 0.9, 1.0),
nrounds = c(250, 500))
model = train(trn_mm, trn_y, method = "xgbTree", metric = "CustomLogLoss", maximize = F,
trControl = trControl, tuneGrid = tuneGrid)
## Loading required package: xgboost
predictions = predict(model, tst_mm, type = "prob")[,"yes"]
output = data.frame(Index = 1:length(predictions), Prediction = predictions)
write.csv(output, "predictions.csv", quote=F, row.names = F)
model$results[order(model$results[,"CustomLogLoss"]),]
## eta max_depth gamma colsample_bytree min_child_weight subsample nrounds CustomLogLoss CustomLogLossSD
## 21 0.05 6 0 0.8 1 0.9 250 0.1961200 0.003947348
## 4 0.05 5 0 0.8 1 0.9 500 0.1962825 0.004018891
## 10 0.05 5 0 0.9 1 0.9 500 0.1962839 0.004972018
## 27 0.05 6 0 0.9 1 0.9 250 0.1965048 0.004127283
## 31 0.05 6 0 1.0 1 0.8 250 0.1965163 0.004555495
## 19 0.05 6 0 0.8 1 0.8 250 0.1965775 0.003867908
## 22 0.05 6 0 0.8 1 0.9 500 0.1966390 0.004561713
## 25 0.05 6 0 0.9 1 0.8 250 0.1966530 0.004462809
## 33 0.05 6 0 1.0 1 0.9 250 0.1967031 0.004460382
## 2 0.05 5 0 0.8 1 0.8 500 0.1968749 0.004005561
## 23 0.05 6 0 0.8 1 1.0 250 0.1969208 0.004053932
## 16 0.05 5 0 1.0 1 0.9 500 0.1970726 0.004716050
## 14 0.05 5 0 1.0 1 0.8 500 0.1972043 0.003976818
## 28 0.05 6 0 0.9 1 0.9 500 0.1972898 0.004565770
## 6 0.05 5 0 0.8 1 1.0 500 0.1973854 0.004328404
## 45 0.10 5 0 0.9 1 0.9 250 0.1974004 0.004293112
## 8 0.05 5 0 0.9 1 0.8 500 0.1974027 0.004790353
## 20 0.05 6 0 0.8 1 0.8 500 0.1974878 0.004408071
## 47 0.10 5 0 0.9 1 1.0 250 0.1975298 0.004870309
## 24 0.05 6 0 0.8 1 1.0 500 0.1975538 0.003925460
## 41 0.10 5 0 0.8 1 1.0 250 0.1976280 0.004506649
## 12 0.05 5 0 0.9 1 1.0 500 0.1976469 0.003694515
## 9 0.05 5 0 0.9 1 0.9 250 0.1977615 0.004356007
## 26 0.05 6 0 0.9 1 0.8 500 0.1977764 0.004526176
## 29 0.05 6 0 0.9 1 1.0 250 0.1977894 0.004148830
## 3 0.05 5 0 0.8 1 0.9 250 0.1978130 0.003576741
## 39 0.10 5 0 0.8 1 0.9 250 0.1979003 0.003921870
## 34 0.05 6 0 1.0 1 0.9 500 0.1979420 0.004713222
## 51 0.10 5 0 1.0 1 0.9 250 0.1979480 0.004667295
## 7 0.05 5 0 0.9 1 0.8 250 0.1980275 0.004215175
## 1 0.05 5 0 0.8 1 0.8 250 0.1980441 0.003727029
## 13 0.05 5 0 1.0 1 0.8 250 0.1980642 0.004050311
## 59 0.10 6 0 0.8 1 1.0 250 0.1981048 0.004489015
## 15 0.05 5 0 1.0 1 0.9 250 0.1981381 0.004106602
## 35 0.05 6 0 1.0 1 1.0 250 0.1981432 0.004356976
## 65 0.10 6 0 0.9 1 1.0 250 0.1981555 0.004245985
## 30 0.05 6 0 0.9 1 1.0 500 0.1981709 0.004417330
## 57 0.10 6 0 0.8 1 0.9 250 0.1982087 0.004431710
## 53 0.10 5 0 1.0 1 1.0 250 0.1982168 0.004692512
## 32 0.05 6 0 1.0 1 0.8 500 0.1982673 0.004584904
## 43 0.10 5 0 0.9 1 0.8 250 0.1983039 0.004312819
## 18 0.05 5 0 1.0 1 1.0 500 0.1983492 0.004614575
## 36 0.05 6 0 1.0 1 1.0 500 0.1985754 0.004890971
## 49 0.10 5 0 1.0 1 0.8 250 0.1986049 0.005039759
## 5 0.05 5 0 0.8 1 1.0 250 0.1986469 0.004366705
## 55 0.10 6 0 0.8 1 0.8 250 0.1986630 0.005366151
## 63 0.10 6 0 0.9 1 0.9 250 0.1986832 0.004299088
## 37 0.10 5 0 0.8 1 0.8 250 0.1988187 0.004239453
## 11 0.05 5 0 0.9 1 1.0 250 0.1989999 0.003538617
## 71 0.10 6 0 1.0 1 1.0 250 0.1990619 0.005220631
## 61 0.10 6 0 0.9 1 0.8 250 0.1990620 0.004672558
## 69 0.10 6 0 1.0 1 0.9 250 0.1990955 0.004914598
## 17 0.05 5 0 1.0 1 1.0 250 0.1993693 0.004780659
## 42 0.10 5 0 0.8 1 1.0 500 0.2003288 0.004581048
## 48 0.10 5 0 0.9 1 1.0 500 0.2003340 0.005428104
## 67 0.10 6 0 1.0 1 0.8 250 0.2007773 0.005021787
## 40 0.10 5 0 0.8 1 0.9 500 0.2013073 0.004333997
## 46 0.10 5 0 0.9 1 0.9 500 0.2014448 0.004639352
## 54 0.10 5 0 1.0 1 1.0 500 0.2016392 0.004782396
## 52 0.10 5 0 1.0 1 0.9 500 0.2017155 0.004969808
## 38 0.10 5 0 0.8 1 0.8 500 0.2025511 0.004023601
## 44 0.10 5 0 0.9 1 0.8 500 0.2029201 0.005051892
## 50 0.10 5 0 1.0 1 0.8 500 0.2036573 0.005714756
## 60 0.10 6 0 0.8 1 1.0 500 0.2040666 0.004931650
## 66 0.10 6 0 0.9 1 1.0 500 0.2046592 0.004998282
## 72 0.10 6 0 1.0 1 1.0 500 0.2056291 0.005173365
## 64 0.10 6 0 0.9 1 0.9 500 0.2056323 0.005011829
## 58 0.10 6 0 0.8 1 0.9 500 0.2057152 0.005158491
## 62 0.10 6 0 0.9 1 0.8 500 0.2063959 0.005183844
## 70 0.10 6 0 1.0 1 0.9 500 0.2064732 0.005413799
## 56 0.10 6 0 0.8 1 0.8 500 0.2069423 0.005691156
## 68 0.10 6 0 1.0 1 0.8 500 0.2100585 0.006044542
varImp(model)
## xgbTree variable importance
##
## only 20 most important variables shown (out of 42)
##
## Overall
## duration 100.000
## poutcomesuccess 25.341
## pdays 14.976
## day_of_week 14.670
## contactunknown 13.986
## age 13.813
## balance 11.219
## housingyes 10.864
## monthmar 4.751
## monthjun 4.509
## campaign 4.046
## monthoct 3.983
## monthjul 3.685
## monthaug 3.600
## monthmay 3.259
## monthnov 2.988
## monthfeb 2.794
## monthsep 2.654
## previous 2.393
## monthjan 1.789
varImp(model, scale = F)
## xgbTree variable importance
##
## only 20 most important variables shown (out of 42)
##
## Overall
## duration 0.387315
## poutcomesuccess 0.098354
## pdays 0.058236
## day_of_week 0.057054
## contactunknown 0.054406
## age 0.053736
## balance 0.043695
## housingyes 0.042322
## monthmar 0.018661
## monthjun 0.017727
## campaign 0.015933
## monthoct 0.015689
## monthjul 0.014536
## monthaug 0.014207
## monthmay 0.012887
## monthnov 0.011839
## monthfeb 0.011088
## monthsep 0.010546
## previous 0.009537
## monthjan 0.007196
importance = varImp(model, scale = F)$importance
100 * (importance - min(importance)) / diff(range(importance))
## Overall
## duration 100.00000000
## poutcomesuccess 25.34101490
## pdays 14.97569567
## day_of_week 14.67031410
## contactunknown 13.98608801
## age 13.81292352
## balance 11.21865236
## housingyes 10.86406866
## monthmar 4.75072211
## monthjun 4.50932928
## campaign 4.04586567
## monthoct 3.98287955
## monthjul 3.68504641
## monthaug 3.59983221
## monthmay 3.25896262
## monthnov 2.98817531
## monthfeb 2.79401185
## monthsep 2.65409430
## previous 2.39337984
## monthjan 1.78860879
## loanyes 1.71918505
## maritalmarried 1.12142954
## educationtertiary 1.00080508
## jobblue-collar 0.80608135
## monthdec 0.75634613
## maritalsingle 0.73217859
## contacttelephone 0.56052484
## jobstudent 0.50573787
## educationsecondary 0.48039295
## poutcomeother 0.42308689
## jobmanagement 0.41840088
## jobtechnician 0.34606502
## jobadmin. 0.29019644
## jobhousemaid 0.21837264
## jobretired 0.21406637
## educationunknown 0.19051189
## jobservices 0.09043452
## jobentrepreneur 0.07554184
## jobself-employed 0.06841997
## defaultyes 0.04029796
## jobunemployed 0.02149651
## poutcomeunknown 0.00000000
print(Sys.time() - start.time)
## Time difference of 38.95821 mins