Here we explore the maximal margin classifier on a toy data set.
3(a) We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class label.
Obs. | \(X_1\) | \(X_2\) | Y |
---|---|---|---|
1 | 3 | 4 | Red |
2 | 2 | 2 | Red |
3 | 4 | 4 | Red |
4 | 1 | 4 | Red |
5 | 2 | 1 | Blue |
6 | 4 | 3 | Blue |
7 | 4 | 1 | Blue |
Sketch the observations.
# Toy data: 7 observations in 2 dimensions; y = 1 for Red, 0 for Blue
X = matrix(c(3, 4, 2, 2, 4, 4, 1, 4, 2, 1, 4, 3, 4, 1),
           nrow = 7, byrow = T)
y = c(1, 1, 1, 1, 0, 0, 0)
plot(X, col = c(rep("red", 4), rep("blue", 3)),
     pch = c(rep(16, 4), rep(17, 3)),
     xlim = c(0, 5), ylim = c(0, 5))
3(b) Sketch the optimal separating hyperplane, and provide the equation for this hyperplane (of the form (9.1)).
The maximal margin hyperplane passes through the midpoints between the nearest Red and Blue observations, namely (2, 1.5) and (4, 3.5). Using these two points, we derive the slope as (3.5 - 1.5) / (4 - 2) = 2 / 2 = 1.
Using the point (4, 3.5), we derive the intercept as follows …
y - 3.5 = slope * (x - 4)
y - 3.5 = 1 * (x - 4)
y - 3.5 = x - 4
y = x - 4 + 3.5
y = x - 0.5
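As a quick check (a small sketch not in the original output), the two anchor points are the midpoints between the closest Red/Blue pairs of observations:
# Midpoints between the closest Red and Blue observations; the maximal margin
# hyperplane passes through both of these points
colMeans(rbind(c(2, 2), c(2, 1)))   # observations 2 and 5 -> (2, 1.5)
colMeans(rbind(c(4, 4), c(4, 3)))   # observations 3 and 6 -> (4, 3.5)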
plot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
slope = 1
intercept = - 0.5
abline(intercept, slope)
# Solve for (beta1, beta2) with beta1 * X1 + beta2 * X2 = intercept at both anchor points;
# with beta0 = -intercept this gives beta0 + beta1 * X1 + beta2 * X2 = 0 along the hyperplane
beta = c(-intercept, solve(matrix(c(2, 1.5, 4, 3.5), nrow = 2, byrow = T), c(intercept, intercept)))
beta
## [1] 0.5 -1.0 1.0
c(1, 2, 1.5) %*% beta
## [,1]
## [1,] 0
c(1, 4, 3.5) %*% beta
## [,1]
## [1,] 0
3(c) Describe the classification rule for the maximal margin classifier. It should be something along the lines of “Classify to Red if \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0\), and classify to Blue otherwise.” Provide the values for \(\beta_0\), \(\beta_1\), and \(\beta_2\).
Classify to Red if \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0\), and classify to Blue otherwise.
\(\beta_0\) = 0.5
\(\beta_1\) = -1
\(\beta_2\) = 1
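As a sanity check (a small sketch, output not shown), applying this rule to all seven observations recovers the class labels:
# Compute beta0 + beta1 * X1 + beta2 * X2 for every observation and threshold at 0
scores = cbind(1, X) %*% beta
ifelse(scores > 0, "Red", "Blue")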
3(d) On your sketch, indicate the margin for the maximal margin hyperplane.
# Signed distance from each support vector to the hyperplane:
# (beta0 + beta1 * X1 + beta2 * X2) / ||(beta1, beta2)||
c(1, 2, 2) %*% beta / sqrt(t(beta[2:3]) %*% beta[2:3])
## [,1]
## [1,] 0.3535534
c(1, 4, 4) %*% beta / sqrt(t(beta[2:3]) %*% beta[2:3])
## [,1]
## [1,] 0.3535534
c(1, 2, 1) %*% beta / sqrt(t(beta[2:3]) %*% beta[2:3])
## [,1]
## [1,] -0.3535534
c(1, 4, 3) %*% beta / sqrt(t(beta[2:3]) %*% beta[2:3])
## [,1]
## [1,] -0.3535534
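Each of these four points is equidistant from the hyperplane, so the margin on each side is \(0.5 / \sqrt{\beta_1^2 + \beta_2^2} = 0.5 / \sqrt{2} \approx 0.354\):
# Margin on each side: |score at a support vector| / ||(beta1, beta2)||
0.5 / sqrt(sum(beta[2:3]^2))   # ~0.3536, matching the four distances above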
library(MASS)
eqscplot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
abline(intercept, slope)
# Draw the perpendicular segment from each support vector to the hyperplane
# (foot of the perpendicular from point p onto beta[1] + beta[2] * x + beta[3] * y = 0)
support_points = list(c(2, 2), c(4, 4), c(2, 1), c(4, 3))
support_colors = c("red", "red", "blue", "blue")
denom = beta[2]^2 + beta[3]^2
for (i in 1:4) {
  point = support_points[[i]]
  foot_x = (beta[3] * ( beta[3] * point[1] - beta[2] * point[2]) - beta[1] * beta[2]) / denom
  foot_y = (beta[2] * (-beta[3] * point[1] + beta[2] * point[2]) - beta[1] * beta[3]) / denom
  lines(c(point[1], foot_x), c(point[2], foot_y), col = support_colors[i], lty = "solid")
}
3(e) Indicate the support vectors for the maximal margin classifier.
plot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
abline(intercept, slope)
points(c(2, 4), c(2, 4), pch = 8, col = "red")
points(c(2, 4), c(1, 3), pch = 8, col = "blue")
3(f) Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.
The maximal margin hyperplane is determined solely by the support vectors. Observation 7 at (4, 1) is not a support vector, so as long as a slight movement keeps it farther from the hyperplane than the support vectors (and on the Blue side), the maximal margin hyperplane is unaffected.
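A quick sketch of this claim using e1071::svm with a very large cost to approximate the maximal margin classifier; the nudged location for observation 7 is made up for illustration:
# Refit with observation 7 nudged but still well inside the Blue region;
# the support vectors (observations 2, 3, 5, 6) and the boundary should not change
library(e1071)
X_moved = X
X_moved[7, ] = c(4.5, 0.5)
fit_moved = svm(X_moved, factor(y), kernel = "linear", cost = 1e5, scale = FALSE)
fit_moved$index   # indices of the support vectors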
3(g) Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.
Using the points (1, 0) and (5, 5), we derive the slope as (5 - 0) / (5 - 1) = 5 / 4 = 1.25.
Using the point (5, 5), we derive the intercept as follows …
y - 5 = slope * (x - 5)
y - 5 = 1.25 * (x - 5)
y - 5 = 1.25 * x - 6.25
y = 1.25 * x - 1.25
plot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
abline(intercept, slope)
points(c(2, 4), c(2, 4), pch = 8, col = "red")
points(c(2, 4), c(1, 3), pch = 8, col = "blue")
slope2 = 1.25
intercept2 = - 1.25
abline(intercept2, slope2, lty = "dotted", col = "red")
# Same construction as before, using the points (2, 1.25) and (4, 3.75) on the new line
beta2 = c(-intercept2, solve(matrix(c(2, 1.25, 4, 3.75), nrow = 2, byrow = T), c(intercept2, intercept2)))
beta2
## [1] 1.25 -1.25 1.00
c(1, 4, 4) %*% beta2 / sqrt(t(beta2[2:3]) %*% beta2[2:3])
## [,1]
## [1,] 0.1561738
c(1, 2, 1) %*% beta2 / sqrt(t(beta2[2:3]) %*% beta2[2:3])
## [,1]
## [1,] -0.1561738
3(h) Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.
plot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
abline(intercept, slope)
points(c(2, 4), c(2, 4), pch = 8, col = "red")
points(c(2, 4), c(1, 3), pch = 8, col = "blue")
abline(intercept2, slope2, lty = "dotted", col = "red")
# A Red observation at (4, 2), deep inside the Blue region, makes the classes inseparable
points(4, 2, col = "red", pch = 16)
This problem involves the OJ data set, which is part of the ISLR package.
8(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(123)
library(ISLR)
library(e1071)
index = sample(1:nrow(OJ))
trn = OJ[index[1:800],]
tst = OJ[index[801:length(index)],]
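A quick check of the split (a sketch, output not shown; the OJ data set has 1070 rows, so the test set has 270 observations):
# Verify the split sizes and the class balance in the training set
dim(trn)
dim(tst)
table(trn$Purchase)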
8(b) Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
model1 = svm(Purchase ~ ., data = trn, cost = 0.01)
summary(model1)
##
## Call:
## svm(formula = Purchase ~ ., data = trn, cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.01
## gamma: 0.05555556
##
## Number of Support Vectors: 635
##
## ( 316 319 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
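With cost = 0.01 the margin is wide, so 635 of the 800 training observations (roughly evenly split between CH and MM) are support vectors. Note also that svm() defaults to a radial kernel, so the fit above is not strictly a support vector classifier; a minimal sketch of the linear-kernel call the exercise describes (model1_linear is a hypothetical name, results not shown):
# Support vector classifier = SVM with a linear kernel, cost = 0.01
model1_linear = svm(Purchase ~ ., data = trn, kernel = "linear", cost = 0.01)
summary(model1_linear)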
8(c) What are the training and test error rates?
mean(predict(model1, trn) != trn$Purchase)
## [1] 0.395
mean(predict(model1, tst) != tst$Purchase)
## [1] 0.3740741
8(d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# 10-fold cross-validation repeated 5 times, over cost values 0.01, 0.1, 1, and 10
trControl = trainControl("repeatedcv", number = 10, repeats = 5)
model2 = train(Purchase ~ ., data = trn, method = "svmLinear", trControl = trControl,
               tuneGrid = data.frame(C = c(0.01, 0.1, 1, 10)))
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
model2
## Support Vector Machines with Linear Kernel
##
## 800 samples
## 17 predictor
## 2 classes: 'CH', 'MM'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 721, 720, 720, 721, 720, 720, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.01 0.8265472 0.6318991
## 0.10 0.8300099 0.6400439
## 1.00 0.8297537 0.6392243
## 10.00 0.8307412 0.6410906
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 10.
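Here caret::train() with repeated cross-validation is used in place of e1071::tune(); an equivalent sketch with tune(), as the exercise text suggests, might look like this (tune_out is a hypothetical name, output not shown):
# Cross-validate the cost parameter with e1071::tune()
tune_out = tune(svm, Purchase ~ ., data = trn, kernel = "linear",
                ranges = list(cost = c(0.01, 0.1, 1, 10)))
summary(tune_out)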
# Reuse the same cross-validation folds for the remaining models so the comparisons are paired
trControl = trainControl("repeatedcv", number = 10, repeats = 5, index = model2$control$index)
8(e) Compute the training and test error rates using this new value for cost.
mean(predict(model2, trn) != trn$Purchase)
## [1] 0.15875
mean(predict(model2, tst) != tst$Purchase)
## [1] 0.1666667
8(f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.
# Hold sigma at the gamma value used by the e1071 fit above and tune cost only
model3 = train(Purchase ~ ., data = trn, method = "svmRadial", trControl = trControl,
               tuneGrid = expand.grid(sigma = model1$gamma, C = c(0.01, 0.1, 1, 10)))
model3
## Support Vector Machines with Radial Basis Function Kernel
##
## 800 samples
## 17 predictor
## 2 classes: 'CH', 'MM'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 721, 720, 720, 721, 720, 720, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.01 0.6050080 0.0000000
## 0.10 0.8120679 0.5964606
## 1.00 0.8210157 0.6177806
## 10.00 0.8112277 0.5963457
##
## Tuning parameter 'sigma' was held constant at a value of 0.05555556
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05555556 and C = 1.
mean(predict(model3, trn) != trn$Purchase)
## [1] 0.15625
mean(predict(model3, tst) != tst$Purchase)
## [1] 0.1703704
8(g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree = 2.
model4 = train(Purchase ~ ., data = trn, method = "svmPoly", trControl = trControl,
tuneGrid = expand.grid(scale = c(0.001, 0.01, 0.1), degree = 2, C = c(0.01, 0.1, 1, 10)))
model4
## Support Vector Machines with Polynomial Kernel
##
## 800 samples
## 17 predictor
## 2 classes: 'CH', 'MM'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 721, 720, 720, 721, 720, 720, ...
## Resampling results across tuning parameters:
##
## scale C Accuracy Kappa
## 0.001 0.01 0.6050080 0.0000000
## 0.001 0.10 0.6050080 0.0000000
## 0.001 1.00 0.8085428 0.5849145
## 0.001 10.00 0.8285317 0.6370273
## 0.010 0.01 0.6050080 0.0000000
## 0.010 0.10 0.8062990 0.5795615
## 0.010 1.00 0.8270409 0.6333583
## 0.010 10.00 0.8232532 0.6250304
## 0.100 0.01 0.8042862 0.5737344
## 0.100 0.10 0.8215221 0.6201319
## 0.100 1.00 0.8142523 0.6042239
## 0.100 10.00 0.8079802 0.5911289
##
## Tuning parameter 'degree' was held constant at a value of 2
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 0.001 and C
## = 10.
mean(predict(model4, trn) != trn$Purchase)
## [1] 0.16375
mean(predict(model4, tst) != tst$Purchase)
## [1] 0.1703704
8(h) Overall, which approach seems to give the best results on this data?
The support vector classifier (linear kernel, method "svmLinear") with cost = 10 gave the best test set performance, with a test error rate of about 16.7%, versus about 17.0% for the radial and polynomial kernels.
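Summarizing the three tuned models on the held-out data (a sketch that simply recomputes the test error rates reported above):
# Test error rates for the linear, radial, and polynomial SVMs
sapply(list(linear = model2, radial = model3, poly = model4),
       function(m) mean(predict(m, tst) != tst$Purchase))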
The remaining code addresses a separate, larger task: binary text classification on a sparse Reuters data set stored in libsvm format, using an ensemble of a linear SVM, gradient-boosted trees (xgboost), and a random forest, whose predicted probabilities are averaged to produce a submission file.
start.time = Sys.time()
setwd("C:/Data/Reuters")
set.seed(2^17-1)
library(e1071)
trn = read.matrix.csr("trn.dat")   # sparse training data in libsvm format
tst = read.matrix.csr("tst.dat")
y = factor(trn$y, levels = c(0, 1), labels = c("no", "yes"))
# Model 1: linear SVM with class-probability estimates enabled
model1 = svm(trn$x, y, kernel = "linear", cost = 1.0, probability = T)
predictions1 = predict(model1, newdata = tst$x, probability = T)
probabilities1 = attributes(predictions1)$probabilities[,"yes"]
stop.time = Sys.time()
print(stop.time - start.time)
## Time difference of 17.84765 mins
library(xgboost)
trn = xgb.DMatrix("trn.dat")
## [21:36:00] 20242x47237 matrix with 1498952 entries loaded from trn.dat
tst = xgb.DMatrix("tst.dat")
## [21:36:00] 40484x47237 matrix with 2982217 entries loaded from tst.dat
# Model 2: gradient-boosted trees, 300 rounds with a logistic objective on the sparse matrix
model2 = xgboost(trn, params = list(objective = "binary:logistic", eta = 0.1, max.depth = 16, eval_metric = "auc"), nround = 300)
## [1] train-auc:0.970756
## [2] train-auc:0.978411
## [3] train-auc:0.982390
## [4] train-auc:0.985174
## [5] train-auc:0.988158
## ... (rounds 6-299 omitted; train-auc climbs steadily toward 1)
## [300] train-auc:0.999999
# predict() for a binary:logistic xgboost model returns class probabilities directly
predictions2 = predict(model2, tst)
probabilities2 = predictions2
stop.time = Sys.time()
print(stop.time - start.time)
## Time difference of 19.45295 mins
library(e1071)
# preprocessor: http://cross-entropy.net/ML210/FilterSparseData.py.txt
trn = read.matrix.csr("trn.new")
library(SparseM)
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
trn_X = as.matrix(trn$x)
y = factor(trn$y, levels = c(0, 1), labels = c("no", "yes"))
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Model 3: random forest, drawing a balanced sample of 2000 observations per class for each tree
model3 = randomForest(trn_X, y, ntree = 150, sampsize = c(2000, 2000), do.trace = 1)
## ntree OOB 1 2
## 1: 20.06% 20.34% 19.81%
## 2: 19.24% 20.10% 18.44%
## 3: 15.83% 15.45% 16.18%
## 4: 14.45% 13.75% 15.10%
## 5: 12.35% 11.49% 13.16%
## ... (trees 6-149 omitted; the out-of-bag error declines steadily)
## 150: 4.21% 3.68% 4.70%
rm(trn_X)
tst = read.matrix.csr("tst.new")
# Densify and score the test set in two halves to limit memory use
tst_X1 = tst$x[1:20242,]
tst_X = as.matrix(tst_X1)
predictions3a = predict(model3, tst_X, type = "prob")[,2]
rm(tst_X)
tst_X2 = tst$x[20243:40484,]
tst_X = as.matrix(tst_X2)
predictions3b = predict(model3, tst_X, type = "prob")[,2]
rm(tst_X)
probabilities3 = c(predictions3a, predictions3b)
stop.time = Sys.time()
print(stop.time - start.time)
## Time difference of 55.4218 mins
# Final prediction: simple average of the three models' predicted probabilities
probabilities = (probabilities1 + probabilities2 + probabilities3) / 3
predictions = probabilities
output = data.frame(Index = 1:length(predictions), Prediction = predictions)
write.csv(output, "predictions.csv", quote = F, row.names = F)
print(Sys.time() - start.time)
## Time difference of 55.42461 mins