Here we explore the maximal margin classifier on a toy data set.
3(a) We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class label.
Obs. | \(X_1\) | \(X_2\) | Y |
---|---|---|---|
1 | 3 | 4 | Red |
2 | 2 | 2 | Red |
3 | 4 | 4 | Red |
4 | 1 | 4 | Red |
5 | 2 | 1 | Blue |
6 | 4 | 3 | Blue |
7 | 4 | 1 | Blue |
Sketch the observations.
# Toy data: 7 observations in 2 dimensions; y = 1 for Red, 0 for Blue
X = matrix(c(3, 4, 2, 2, 4, 4, 1, 4, 2, 1, 4, 3, 4, 1),
           nrow = 7, byrow = T)
y = c(1, 1, 1, 1, 0, 0, 0)
plot(X, col = c(rep("red", 4), rep("blue", 3)),
     pch = c(rep(16, 4), rep(17, 3)),
     xlim = c(0, 5), ylim = c(0, 5))
3(b) Sketch the optimal separating hyperplane, and provide the equation for this hyperplane (of the form (9.1)).
The maximal margin hyperplane passes through the midpoints between the nearest Red and Blue observations, namely (2, 1.5) and (4, 3.5). Using these two points, we derive the slope as (3.5 - 1.5) / (4 - 2) = 2 / 2 = 1.
Using the point (4, 3.5), we derive the intercept as follows …
y - 3.5 = slope * (x - 4)
y - 3.5 = 1 * (x - 4)
y - 3.5 = x - 4
y = x - 4 + 3.5
y = x - 0.5
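As a quick check (a small sketch not in the original output), the two anchor points are the midpoints between the closest Red/Blue pairs of observations:
# Midpoints between the closest Red and Blue observations; the maximal margin
# hyperplane passes through both of these points
colMeans(rbind(c(2, 2), c(2, 1)))   # observations 2 and 5 -> (2, 1.5)
colMeans(rbind(c(4, 4), c(4, 3)))   # observations 3 and 6 -> (4, 3.5)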
plot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
slope = 1
intercept = - 0.5
abline(intercept, slope)
# Solve for (beta1, beta2) with beta1 * X1 + beta2 * X2 = intercept at both anchor points;
# with beta0 = -intercept this gives beta0 + beta1 * X1 + beta2 * X2 = 0 along the hyperplane
beta = c(-intercept, solve(matrix(c(2, 1.5, 4, 3.5), nrow = 2, byrow = T), c(intercept, intercept)))
beta
## [1] 0.5 -1.0 1.0
c(1, 2, 1.5) %*% beta
## [,1]
## [1,] 0
c(1, 4, 3.5) %*% beta
## [,1]
## [1,] 0
3(c) Describe the classification rule for the maximal margin classifier. It should be something along the lines of “Classify to Red if \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0\), and classify to Blue otherwise.” Provide the values for \(\beta_0\), \(\beta_1\), and \(\beta_2\).
Classify to Red if \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0\), and classify to Blue otherwise.
\(\beta_0\) = 0.5
\(\beta_1\) = -1
\(\beta_2\) = 1
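As a sanity check (a small sketch, output not shown), applying this rule to all seven observations recovers the class labels:
# Compute beta0 + beta1 * X1 + beta2 * X2 for every observation and threshold at 0
scores = cbind(1, X) %*% beta
ifelse(scores > 0, "Red", "Blue")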
3(d) On your sketch, indicate the margin for the maximal margin hyperplane.
# Signed distance from each support vector to the hyperplane:
# (beta0 + beta1 * X1 + beta2 * X2) / ||(beta1, beta2)||
c(1, 2, 2) %*% beta / sqrt(t(beta[2:3]) %*% beta[2:3])
## [,1]
## [1,] 0.3535534
c(1, 4, 4) %*% beta / sqrt(t(beta[2:3]) %*% beta[2:3])
## [,1]
## [1,] 0.3535534
c(1, 2, 1) %*% beta / sqrt(t(beta[2:3]) %*% beta[2:3])
## [,1]
## [1,] -0.3535534
c(1, 4, 3) %*% beta / sqrt(t(beta[2:3]) %*% beta[2:3])
## [,1]
## [1,] -0.3535534
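Each of these four points is equidistant from the hyperplane, so the margin on each side is \(0.5 / \sqrt{\beta_1^2 + \beta_2^2} = 0.5 / \sqrt{2} \approx 0.354\):
# Margin on each side: |score at a support vector| / ||(beta1, beta2)||
0.5 / sqrt(sum(beta[2:3]^2))   # ~0.3536, matching the four distances above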
library(MASS)
eqscplot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
abline(intercept, slope)
# Draw the perpendicular segment from each support vector to the hyperplane
# (foot of the perpendicular from point p onto beta[1] + beta[2] * x + beta[3] * y = 0)
support_points = list(c(2, 2), c(4, 4), c(2, 1), c(4, 3))
support_colors = c("red", "red", "blue", "blue")
denom = beta[2]^2 + beta[3]^2
for (i in 1:4) {
  point = support_points[[i]]
  foot_x = (beta[3] * ( beta[3] * point[1] - beta[2] * point[2]) - beta[1] * beta[2]) / denom
  foot_y = (beta[2] * (-beta[3] * point[1] + beta[2] * point[2]) - beta[1] * beta[3]) / denom
  lines(c(point[1], foot_x), c(point[2], foot_y), col = support_colors[i], lty = "solid")
}
3(e) Indicate the support vectors for the maximal margin classifier.
plot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
abline(intercept, slope)
points(c(2, 4), c(2, 4), pch = 8, col = "red")
points(c(2, 4), c(1, 3), pch = 8, col = "blue")
3(f) Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.
The maximal margin hyperplane is determined solely by the support vectors. Observation 7 at (4, 1) is not a support vector, so as long as a slight movement keeps it farther from the hyperplane than the support vectors (and on the Blue side), the maximal margin hyperplane is unaffected.
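A quick sketch of this claim using e1071::svm with a very large cost to approximate the maximal margin classifier; the nudged location for observation 7 is made up for illustration:
# Refit with observation 7 nudged but still well inside the Blue region;
# the support vectors (observations 2, 3, 5, 6) and the boundary should not change
library(e1071)
X_moved = X
X_moved[7, ] = c(4.5, 0.5)
fit_moved = svm(X_moved, factor(y), kernel = "linear", cost = 1e5, scale = FALSE)
fit_moved$index   # indices of the support vectors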
3(g) Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.
Using the points (1, 0) and (5, 5), we derive the slope as (5 - 0) / (5 - 1) = 5 / 4 = 1.25.
Using the point (5, 5), we derive the intercept as follows …
y - 5 = slope * (x - 5)
y - 5 = 1.25 * (x - 5)
y - 5 = 1.25 * x - 6.25
y = 1.25 * x - 1.25
plot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
abline(intercept, slope)
points(c(2, 4), c(2, 4), pch = 8, col = "red")
points(c(2, 4), c(1, 3), pch = 8, col = "blue")
slope2 = 1.25
intercept2 = - 1.25
abline(intercept2, slope2, lty = "dotted", col = "red")
# Same construction as before, using the points (2, 1.25) and (4, 3.75) on the new line
beta2 = c(-intercept2, solve(matrix(c(2, 1.25, 4, 3.75), nrow = 2, byrow = T), c(intercept2, intercept2)))
beta2
## [1] 1.25 -1.25 1.00
c(1, 4, 4) %*% beta2 / sqrt(t(beta2[2:3]) %*% beta2[2:3])
## [,1]
## [1,] 0.1561738
c(1, 2, 1) %*% beta2 / sqrt(t(beta2[2:3]) %*% beta2[2:3])
## [,1]
## [1,] -0.1561738
3(h) Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.
plot(X, col = c(rep("red", 4), rep("blue", 3)),
pch = c(rep(16, 4), rep(17, 3)),
xlim = c(0, 5), ylim = c(0, 5))
abline(intercept, slope)
points(c(2, 4), c(2, 4), pch = 8, col = "red")
points(c(2, 4), c(1, 3), pch = 8, col = "blue")
abline(intercept2, slope2, lty = "dotted", col = "red")
# A Red observation at (4, 2), deep inside the Blue region, makes the classes inseparable
points(4, 2, col = "red", pch = 16)
This problem involves the OJ data set, which is part of the ISLR package.
8(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(123)
library(ISLR)
library(e1071)
index = sample(1:nrow(OJ))
trn = OJ[index[1:800],]
tst = OJ[index[801:length(index)],]
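A quick check of the split (a sketch, output not shown; the OJ data set has 1070 rows, so the test set has 270 observations):
# Verify the split sizes and the class balance in the training set
dim(trn)
dim(tst)
table(trn$Purchase)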
8(b) Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
model1 = svm(Purchase ~ ., data = trn, cost = 0.01)
summary(model1)
##
## Call:
## svm(formula = Purchase ~ ., data = trn, cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.01
## gamma: 0.05555556
##
## Number of Support Vectors: 635
##
## ( 316 319 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
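With cost = 0.01 the margin is wide, so 635 of the 800 training observations (roughly evenly split between CH and MM) are support vectors. Note also that svm() defaults to a radial kernel, so the fit above is not strictly a support vector classifier; a minimal sketch of the linear-kernel call the exercise describes (model1_linear is a hypothetical name, results not shown):
# Support vector classifier = SVM with a linear kernel, cost = 0.01
model1_linear = svm(Purchase ~ ., data = trn, kernel = "linear", cost = 0.01)
summary(model1_linear)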
8(c) What are the training and test error rates?
mean(predict(model1, trn) != trn$Purchase)
## [1] 0.395
mean(predict(model1, tst) != tst$Purchase)
## [1] 0.3740741
8(d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# 10-fold cross-validation repeated 5 times, over cost values 0.01, 0.1, 1, and 10
trControl = trainControl("repeatedcv", number = 10, repeats = 5)
model2 = train(Purchase ~ ., data = trn, method = "svmLinear", trControl = trControl,
               tuneGrid = data.frame(C = c(0.01, 0.1, 1, 10)))
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
model2
## Support Vector Machines with Linear Kernel
##
## 800 samples
## 17 predictor
## 2 classes: 'CH', 'MM'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 721, 720, 720, 721, 720, 720, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.01 0.8265472 0.6318991
## 0.10 0.8300099 0.6400439
## 1.00 0.8297537 0.6392243
## 10.00 0.8307412 0.6410906
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 10.
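Here caret::train() with repeated cross-validation is used in place of e1071::tune(); an equivalent sketch with tune(), as the exercise text suggests, might look like this (tune_out is a hypothetical name, output not shown):
# Cross-validate the cost parameter with e1071::tune()
tune_out = tune(svm, Purchase ~ ., data = trn, kernel = "linear",
                ranges = list(cost = c(0.01, 0.1, 1, 10)))
summary(tune_out)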
# Reuse the same cross-validation folds for the remaining models so the comparisons are paired
trControl = trainControl("repeatedcv", number = 10, repeats = 5, index = model2$control$index)
8(e) Compute the training and test error rates using this new value for cost.
mean(predict(model2, trn) != trn$Purchase)
## [1] 0.15875
mean(predict(model2, tst) != tst$Purchase)
## [1] 0.1666667
8(f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.
# Hold sigma at the gamma value used by the e1071 fit above and tune cost only
model3 = train(Purchase ~ ., data = trn, method = "svmRadial", trControl = trControl,
               tuneGrid = expand.grid(sigma = model1$gamma, C = c(0.01, 0.1, 1, 10)))
model3
## Support Vector Machines with Radial Basis Function Kernel
##
## 800 samples
## 17 predictor
## 2 classes: 'CH', 'MM'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 721, 720, 720, 721, 720, 720, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.01 0.6050080 0.0000000
## 0.10 0.8120679 0.5964606
## 1.00 0.8210157 0.6177806
## 10.00 0.8112277 0.5963457
##
## Tuning parameter 'sigma' was held constant at a value of 0.05555556
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05555556 and C = 1.
mean(predict(model3, trn) != trn$Purchase)
## [1] 0.15625
mean(predict(model3, tst) != tst$Purchase)
## [1] 0.1703704
8(g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree = 2.
model4 = train(Purchase ~ ., data = trn, method = "svmPoly", trControl = trControl,
tuneGrid = expand.grid(scale = c(0.001, 0.01, 0.1), degree = 2, C = c(0.01, 0.1, 1, 10)))
model4
## Support Vector Machines with Polynomial Kernel
##
## 800 samples
## 17 predictor
## 2 classes: 'CH', 'MM'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 721, 720, 720, 721, 720, 720, ...
## Resampling results across tuning parameters:
##
## scale C Accuracy Kappa
## 0.001 0.01 0.6050080 0.0000000
## 0.001 0.10 0.6050080 0.0000000
## 0.001 1.00 0.8085428 0.5849145
## 0.001 10.00 0.8285317 0.6370273
## 0.010 0.01 0.6050080 0.0000000
## 0.010 0.10 0.8062990 0.5795615
## 0.010 1.00 0.8270409 0.6333583
## 0.010 10.00 0.8232532 0.6250304
## 0.100 0.01 0.8042862 0.5737344
## 0.100 0.10 0.8215221 0.6201319
## 0.100 1.00 0.8142523 0.6042239
## 0.100 10.00 0.8079802 0.5911289
##
## Tuning parameter 'degree' was held constant at a value of 2
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 0.001 and C
## = 10.
mean(predict(model4, trn) != trn$Purchase)
## [1] 0.16375
mean(predict(model4, tst) != tst$Purchase)
## [1] 0.1703704
8(h) Overall, which approach seems to give the best results on this data?
The support vector classifier (linear kernel, method "svmLinear") with cost = 10 gave the best test set performance, with a test error rate of about 16.7%, versus about 17.0% for the radial and polynomial kernels.
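Summarizing the three tuned models on the held-out data (a sketch that simply recomputes the test error rates reported above):
# Test error rates for the linear, radial, and polynomial SVMs
sapply(list(linear = model2, radial = model3, poly = model4),
       function(m) mean(predict(m, tst) != tst$Purchase))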
The remaining code addresses a separate, larger task: binary text classification on a sparse Reuters data set stored in libsvm format, using an ensemble of a linear SVM, gradient-boosted trees (xgboost), and a random forest, whose predicted probabilities are averaged to produce a submission file.
start.time = Sys.time()
setwd("C:/Data/Reuters")
set.seed(2^17-1)
library(e1071)
trn = read.matrix.csr("trn.dat")   # sparse training data in libsvm format
tst = read.matrix.csr("tst.dat")
y = factor(trn$y, levels = c(0, 1), labels = c("no", "yes"))
# Model 1: linear SVM with class-probability estimates enabled
model1 = svm(trn$x, y, kernel = "linear", cost = 1.0, probability = T)
predictions1 = predict(model1, newdata = tst$x, probability = T)
probabilities1 = attributes(predictions1)$probabilities[,"yes"]
stop.time = Sys.time()
print(stop.time - start.time)
## Time difference of 17.84765 mins
library(xgboost)
trn = xgb.DMatrix("trn.dat")
## [21:36:00] 20242x47237 matrix with 1498952 entries loaded from trn.dat
tst = xgb.DMatrix("tst.dat")
## [21:36:00] 40484x47237 matrix with 2982217 entries loaded from tst.dat
# Model 2: gradient-boosted trees, 300 rounds with a logistic objective on the sparse matrix
model2 = xgboost(trn, params = list(objective = "binary:logistic", eta = 0.1, max.depth = 16, eval_metric = "auc"), nround = 300)
## [1] train-auc:0.970756
## [2] train-auc:0.978411
## [3] train-auc:0.982390
## [4] train-auc:0.985174
## [5] train-auc:0.988158
## ... (rounds 6-299 omitted; train-auc climbs steadily toward 1)
## [300] train-auc:0.999999
# predict() for a binary:logistic xgboost model returns class probabilities directly
predictions2 = predict(model2, tst)
probabilities2 = predictions2
stop.time = Sys.time()
print(stop.time - start.time)
## Time difference of 19.45295 mins
library(e1071)
# preprocessor: http://cross-entropy.net/ML210/FilterSparseData.py.txt
trn = read.matrix.csr("trn.new")
library(SparseM)
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
trn_X = as.matrix(trn$x)
y = factor(trn$y, levels = c(0, 1), labels = c("no", "yes"))
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Model 3: random forest, drawing a balanced sample of 2000 observations per class for each tree
model3 = randomForest(trn_X, y, ntree = 150, sampsize = c(2000, 2000), do.trace = 1)
## ntree OOB 1 2
## 1: 20.06% 20.34% 19.81%
## 2: 19.24% 20.10% 18.44%
## 3: 15.83% 15.45% 16.18%
## 4: 14.45% 13.75% 15.10%
## 5: 12.35% 11.49% 13.16%
## ... (trees 6-149 omitted; the out-of-bag error declines steadily)
## 150: 4.21% 3.68% 4.70%
rm(trn_X)
tst = read.matrix.csr("tst.new")
# Densify and score the test set in two halves to limit memory use
tst_X1 = tst$x[1:20242,]
tst_X = as.matrix(tst_X1)
predictions3a = predict(model3, tst_X, type = "prob")[,2]
rm(tst_X)
tst_X2 = tst$x[20243:40484,]
tst_X = as.matrix(tst_X2)
predictions3b = predict(model3, tst_X, type = "prob")[,2]
rm(tst_X)
probabilities3 = c(predictions3a, predictions3b)
stop.time = Sys.time()
print(stop.time - start.time)
## Time difference of 55.4218 mins
# Final prediction: simple average of the three models' predicted probabilities
probabilities = (probabilities1 + probabilities2 + probabilities3) / 3
predictions = probabilities
output = data.frame(Index = 1:length(predictions), Prediction = predictions)
write.csv(output, "predictions.csv", quote = F, row.names = F)
print(Sys.time() - start.time)
## Time difference of 55.42461 mins