For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
2(a) The lasso, relative to least squares, is:
2(a)i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
2(a)ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
2(a)iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
2(a)iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Regarding i: the lasso is less flexible than least squares, and in any case increasing flexibility increases variance rather than decreasing it.
Regarding iv: decreasing flexibility decreases variance rather than increasing it.
Statement iii applies to the lasso: shrinking the coefficient estimates adds bias but reduces variance, so prediction accuracy improves when the increase in bias is smaller than the decrease in variance.
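For reference, this trade-off follows from the usual decomposition of expected test MSE at a point $x_0$ (ISLR equation 2.7):

$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)$$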
2(b) Repeat (a) for ridge regression relative to least squares.
Statement iii also applies to ridge regression: like the lasso, it is less flexible than least squares and trades an increase in bias for a decrease in variance.
2(c) Repeat (a) for non-linear methods relative to least squares.
Statement ii applies to non-linear methods: they are more flexible than least squares, so they reduce bias at the cost of higher variance and improve prediction accuracy when the increase in variance is less than the decrease in bias.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
9(a) Split the data set into a training set and a test set.
set.seed(2^17-1)
library(ISLR)
n.tot = nrow(College)
shuffle = sample(1:n.tot) # random permutation of the row indices
n.trn = round(0.8 * n.tot) # 80% of the observations for training
trn = College[shuffle[1:n.trn],]
tst = College[shuffle[(n.trn+1):n.tot],]
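As a quick sanity check (a sketch, not part of the original run), the two pieces should partition the data:
stopifnot(nrow(trn) + nrow(tst) == n.tot) # every row used exactly once
stopifnot(length(intersect(rownames(trn), rownames(tst))) == 0) # no overlap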
9(b) Fit a linear model using least squares on the training set, and report the test error obtained.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
trControl = trainControl("repeatedcv", number = 5, repeats = 50)
model.glm = train(Apps ~ ., data = trn, method = "glm", trControl = trControl)
model.glm
## Generalized Linear Model
##
## 622 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 50 times)
## Summary of sample sizes: 497, 499, 497, 497, 498, 496, ...
## Resampling results:
##
## RMSE Rsquared
## 1130.731 0.9188572
##
##
mean((tst$Apps - predict(model.glm, newdata = tst))^2)
## [1] 1247649
The test RMSE is sqrt(1247649), about 1117 applications, in line with the cross-validated RMSE of roughly 1131.
9(c) Fit a ridge regression model on the training set, with lambda chosen by cross-validation. Report the test error obtained.
9(d) Fit a lasso model on the training set, with lambda chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
Parts (c) and (d) are handled together below: a single glmnet tuning grid spans ridge regression (alpha = 0) through the lasso (alpha = 1).
model.glmnet = train(Apps ~ ., data = trn, method = "glmnet", trControl = trControl,
                     tuneGrid = expand.grid(alpha = seq(0, 1, 0.1), lambda = c(1, 10, 100)))
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
model.glmnet # alpha = 0 for ridge regression, alpha = 1 for lasso
## glmnet
##
## 622 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 50 times)
## Summary of sample sizes: 498, 496, 499, 498, 497, 498, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared
## 0.0 1 1229.021 0.9099608
## 0.0 10 1229.021 0.9099608
## 0.0 100 1229.021 0.9099608
## 0.1 1 1134.487 0.9172683
## 0.1 10 1134.804 0.9174515
## 0.1 100 1156.368 0.9179035
## 0.2 1 1133.915 0.9173020
## 0.2 10 1134.008 0.9175924
## 0.2 100 1153.589 0.9181616
## 0.3 1 1133.352 0.9173404
## 0.3 10 1133.226 0.9177263
## 0.3 100 1151.810 0.9181914
## 0.4 1 1132.827 0.9173748
## 0.4 10 1132.475 0.9178497
## 0.4 100 1152.478 0.9177699
## 0.5 1 1132.509 0.9173871
## 0.5 10 1131.735 0.9179661
## 0.5 100 1154.180 0.9171267
## 0.6 1 1132.429 0.9173738
## 0.6 10 1131.018 0.9180686
## 0.6 100 1154.635 0.9166191
## 0.7 1 1132.350 0.9173631
## 0.7 10 1130.272 0.9181694
## 0.7 100 1153.106 0.9163848
## 0.8 1 1132.185 0.9173623
## 0.8 10 1129.556 0.9182535
## 0.8 100 1150.472 0.9163189
## 0.9 1 1132.014 0.9173683
## 0.9 10 1128.805 0.9183462
## 0.9 100 1147.584 0.9162722
## 1.0 1 1131.715 0.9173789
## 1.0 10 1128.126 0.9184219
## 1.0 100 1145.034 0.9161823
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 10.
plot(model.glmnet)
mean((tst$Apps - predict(model.glmnet, newdata = tst))^2)
## [1] 1242704
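Cross-validation selected alpha = 1 and lambda = 10, i.e. the lasso. Part (d) also asks for the number of non-zero coefficient estimates; a minimal sketch of extracting that count from the fitted model (s = 10 is the selected lambda):
beta = coef(model.glmnet$finalModel, s = 10) # coefficients at the chosen lambda
sum(beta != 0) - 1 # number of non-zero coefficients, excluding the intercept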
9(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
model.pcr = train(Apps ~ ., data = trn, method = "pcr", trControl = trControl,
                  tuneGrid = expand.grid(ncomp = 1:(ncol(trn)-1)))
## Loading required package: pls
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
model.pcr
## Principal Component Analysis
##
## 622 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 50 times)
## Summary of sample sizes: 498, 496, 498, 497, 499, 499, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared
## 1 3801.294 0.02097751
## 2 1815.887 0.79782108
## 3 1791.105 0.80350868
## 4 1757.499 0.81013587
## 5 1277.678 0.89843786
## 6 1194.210 0.90676880
## 7 1195.847 0.90645382
## 8 1197.849 0.90479897
## 9 1199.573 0.90449384
## 10 1168.731 0.90949458
## 11 1146.874 0.91298671
## 12 1148.171 0.91278272
## 13 1154.656 0.91235840
## 14 1153.297 0.91267723
## 15 1130.702 0.91620221
## 16 1134.364 0.91562785
## 17 1128.717 0.91690167
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 17.
mean((tst$Apps - predict(model.pcr, newdata = tst))^2)
## [1] 1247649
Cross-validation selected ncomp = 17, i.e. all of the components, so the PCR fit is equivalent to the full least-squares fit and the test MSE matches part (b) exactly.
9(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
model.pls = train(Apps ~ ., data = trn, method = "pls", trControl = trControl,
                  tuneGrid = expand.grid(ncomp = 1:(ncol(trn)-1)))
model.pls
## Partial Least Squares
##
## 622 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 50 times)
## Summary of sample sizes: 497, 498, 498, 498, 497, 497, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared
## 1 1778.728 0.8092939
## 2 1636.100 0.8386257
## 3 1416.261 0.8790881
## 4 1224.444 0.9069937
## 5 1192.859 0.9101475
## 6 1193.877 0.9095076
## 7 1194.862 0.9083396
## 8 1189.284 0.9089712
## 9 1180.327 0.9105525
## 10 1141.227 0.9165210
## 11 1135.013 0.9176930
## 12 1131.452 0.9187883
## 13 1122.976 0.9196680
## 14 1123.472 0.9195098
## 15 1124.836 0.9193441
## 16 1125.491 0.9192675
## 17 1120.640 0.9203231
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 17.
mean((tst$Apps - predict(model.pls, newdata = tst))^2)
## [1] 1247649
Here too cross-validation selected M = 17, so the PLS fit reduces to least squares and the test MSE again matches part (b).
9(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
results = data.frame(
row.names = c("glm", "glmnet", "pcr", "pls"),
cv.rmse = c(
min(model.glm$results[,"RMSE"]),
min(model.glmnet$results[,"RMSE"]),
min(model.pcr$results[,"RMSE"]),
min(model.pls$results[,"RMSE"])
),
cv.rmse.sd = c( # note: min() of the SD column need not come from the min-RMSE row
min(model.glm$results[,"RMSESD"]),
min(model.glmnet$results[,"RMSESD"]),
min(model.pcr$results[,"RMSESD"]),
min(model.pls$results[,"RMSESD"])
),
tst.mse = c(
mean((tst$Apps - predict(model.glm, newdata = tst))^2),
mean((tst$Apps - predict(model.glmnet, newdata = tst))^2),
mean((tst$Apps - predict(model.pcr, newdata = tst))^2),
mean((tst$Apps - predict(model.pls, newdata = tst))^2)
)
)
results
##         cv.rmse cv.rmse.sd tst.mse
## glm    1130.731   239.3209 1247649
## glmnet 1128.126   229.0373 1242704
## pcr    1128.717   234.4598 1247649
## pls    1120.640   235.2747 1247649
The cross-validation error is similar across these models. The PLS model has the lowest cross-validation error; however, the cross-validation error for the regularized glmnet model is within one standard error of the lowest (and the lasso fit is less complex than the full PLS model).
Taking the square root of the lasso's test MSE (sqrt(1242704) ≈ 1115), we expect the root-mean-squared error for its predictions to be about 1115 applications.
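A minimal sketch of the one-standard-error comparison just described (assuming 5 folds x 50 repeats = 250 resamples, so the standard error of the mean RMSE is the SD divided by sqrt(250)):
best = which.min(results$cv.rmse)
threshold = results$cv.rmse[best] + results$cv.rmse.sd[best] / sqrt(250)
results$cv.rmse <= threshold # TRUE for models within one standard error of the best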
set.seed(2^17-1)
setwd("C:/Data/Twitter")
start.time = Sys.time()
trn_X = read.csv("trn_X.csv", header = F)
trn_y = scan("trn_y.txt")
tst_X = read.csv("tst_X.csv", header = F)
# vif() will go nuts: several features are almost perfectly collinear!
relationships = array(0, c(choose(ncol(trn_X), 2), 4))
colnames(relationships) = c("var i", "var j", "correlation", "% equal")
index = 1
for (i in 1:(ncol(trn_X) - 1)) {
  for (j in (i+1):ncol(trn_X)) {
    r = abs(cor(trn_X[,i], trn_X[,j]))
    relationships[index,1] = i
    relationships[index,2] = j
    relationships[index,3] = r
    relationships[index,4] = mean(trn_X[,i] == trn_X[,j])
    index = index + 1
  }
}
relationships[relationships[,"correlation"] > 0.98,]
## var i var j correlation % equal
## [1,] 1 29 0.9966277 0.6583386
## [2,] 1 71 0.9999940 0.8379700
## [3,] 2 30 0.9967411 0.6641543
## [4,] 2 72 0.9999946 0.8461174
## [5,] 3 31 0.9970468 0.6487544
## [6,] 3 73 0.9999953 0.8425306
## [7,] 4 32 0.9972989 0.6354736
## [8,] 4 74 0.9999959 0.8340129
## [9,] 5 33 0.9975812 0.6228890
## [10,] 5 75 0.9999962 0.8254608
## [11,] 6 34 0.9977051 0.6055551
## [12,] 6 76 0.9999966 0.8177660
## [13,] 7 35 0.9977106 0.6058088
## [14,] 7 77 0.9999964 0.8133665
## [15,] 22 43 0.9871472 0.8379700
## [16,] 23 44 0.9897143 0.8461174
## [17,] 24 45 0.9926254 0.8425306
## [18,] 25 46 0.9913779 0.8340129
## [19,] 26 47 0.9902731 0.8254608
## [20,] 27 48 0.9834683 0.8177660
## [21,] 28 49 0.9822105 0.8133665
## [22,] 29 71 0.9968107 0.6678028
## [23,] 30 72 0.9969087 0.6732688
## [24,] 31 73 0.9971936 0.6570184
## [25,] 32 74 0.9974353 0.6434736
## [26,] 33 75 0.9977039 0.6312319
## [27,] 34 76 0.9978209 0.6140763
## [28,] 35 77 0.9978283 0.6148238
# redundant feature sets!
# 1, 29, 71
# 2, 30, 72
# 3, 31, 73
# 4, 32, 74
# 5, 33, 75
# 6, 34, 76
# 7, 35, 77
# 22, 43
# 23, 44
# 24, 45
# 25, 46
# 26, 47
# 27, 48
# 28, 49
trn_X = trn_X[, c(1:28, 36:42, 50:70)]
tst_X = tst_X[, c(1:28, 36:42, 50:70)]
trn = data.frame(trn_X, y = trn_y)
model = lm(y ~ ., data = trn)
library(car)
outliers = as.integer(names(outlierTest(model)$rstudent)) # Bonferroni outlier test on studentized residuals
outliers
## [1] 34642 63575 122239 185980 206771 218091 245314 255108 271907 273154
leverage = which(cooks.distance(model) > 1) # influential points: Cook's distance above 1
leverage
## 29342 34642 63575 66084 89884 122239 123068 123989 134191 138117
## 29342 34642 63575 66084 89884 122239 123068 123989 134191 138117
## 177483 185980 204689 266127 271907 273154
## 177483 185980 204689 266127 271907 273154
remove = unique(sort(c(outliers, leverage)))
trn_X = trn_X[-remove,]
trn_y = trn_y[-remove]
# unnormalized Haar wavelet matrix
H = matrix(c( 1, 1, 1, 1, 1, 1, 1, 1, # sum
1, 1, 1, 1, -1, -1, -1, -1, # difference of two halves
1, 1, -1, -1, 0, 0, 0, 0, # ...
0, 0, 0, 0, 1, 1, -1, -1,
1, -1, 0, 0, 0, 0, 0, 0,
0, 0, 1, -1, 0, 0, 0, 0,
0, 0, 0, 0, 1, -1, 0, 0,
0, 0, 0, 0, 0, 0, 1, -1),
nrow = 8, byrow = T)
# each feature group has 7 values, so drop the first row and column of H,
# applying the transform as if each 7-vector carried a leading zero
Hp = H[2:8,2:8]
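A quick check (a sketch, not part of the original run) that the rows of H are mutually orthogonal, so the full 8-point transform is invertible and loses no information:
G = H %*% t(H) # Gram matrix; off-diagonal entries should all be zero
stopifnot(all(G[upper.tri(G)] == 0))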
# apply the 7-point transform to each group of 7 raw features, then append each group's median
trn_X_new = cbind(as.matrix(trn_X[, 0+1:7]) %*% t(Hp),
as.matrix(trn_X[, 7+1:7]) %*% t(Hp),
as.matrix(trn_X[,14+1:7]) %*% t(Hp),
as.matrix(trn_X[,21+1:7]) %*% t(Hp),
as.matrix(trn_X[,28+1:7]) %*% t(Hp),
as.matrix(trn_X[,35+1:7]) %*% t(Hp),
as.matrix(trn_X[,42+1:7]) %*% t(Hp),
as.matrix(trn_X[,49+1:7]) %*% t(Hp),
apply(trn_X[, 0+1:7], 1, median),
apply(trn_X[, 7+1:7], 1, median),
apply(trn_X[,14+1:7], 1, median),
apply(trn_X[,21+1:7], 1, median),
apply(trn_X[,28+1:7], 1, median),
apply(trn_X[,35+1:7], 1, median),
apply(trn_X[,42+1:7], 1, median),
apply(trn_X[,49+1:7], 1, median))
tst_X_new = cbind(as.matrix(tst_X[, 0+1:7]) %*% t(Hp),
as.matrix(tst_X[, 7+1:7]) %*% t(Hp),
as.matrix(tst_X[,14+1:7]) %*% t(Hp),
as.matrix(tst_X[,21+1:7]) %*% t(Hp),
as.matrix(tst_X[,28+1:7]) %*% t(Hp),
as.matrix(tst_X[,35+1:7]) %*% t(Hp),
as.matrix(tst_X[,42+1:7]) %*% t(Hp),
as.matrix(tst_X[,49+1:7]) %*% t(Hp),
apply(tst_X[, 0+1:7], 1, median),
apply(tst_X[, 7+1:7], 1, median),
apply(tst_X[,14+1:7], 1, median),
apply(tst_X[,21+1:7], 1, median),
apply(tst_X[,28+1:7], 1, median),
apply(tst_X[,35+1:7], 1, median),
apply(tst_X[,42+1:7], 1, median),
apply(tst_X[,49+1:7], 1, median))
vars = c("NCD", "AI", "ASNA", "BL", "ASNAC", "AT", "NA", "ADL")
type = c("H1", "H2", "H3", "H4", "H5", "H6", "H7", "M")
namelist = c(paste(vars[1], type, sep = "_"),
paste(vars[2], type, sep = "_"),
paste(vars[3], type, sep = "_"),
paste(vars[4], type, sep = "_"),
paste(vars[5], type, sep = "_"),
paste(vars[6], type, sep = "_"),
paste(vars[7], type, sep = "_"),
paste(vars[8], type, sep = "_"))
colnames(trn_X_new) = namelist
colnames(tst_X_new) = namelist
max.p = ncol(trn_X_new) / 2
library(leaps)
subsets = regsubsets(trn_X_new, trn_y, nvmax = max.p, method = "backward")
p = max.p
results = array(0, c(p, 28))
colnames(results) = list("Variable Count", "RMSE", "RMSE SD",
"RMSE R01", "RMSE R02", "RMSE R03", "RMSE R04", "RMSE R05",
"RMSE R06", "RMSE R07", "RMSE R08", "RMSE R09", "RMSE R10",
"RMSE R11", "RMSE R12", "RMSE R13", "RMSE R14", "RMSE R15",
"RMSE R16", "RMSE R17", "RMSE R18", "RMSE R19", "RMSE R20",
"RMSE R21", "RMSE R22", "RMSE R23", "RMSE R24", "RMSE R25")
best.loss = Inf
best.subset = NA
best.model = NA
start.time = Sys.time()
library(caret)
trControl = trainControl("repeatedcv", number = 5, repeats = 5)
for (i in 1:p) {
  selected = which(summary(subsets)$outmat[i,] == "*")
  model = train(as.matrix(trn_X_new[,selected]), trn_y, method = "glm", trControl = trControl)
  results[i,1] = i
  results[i,2] = model$results[,"RMSE"]
  results[i,3] = model$results[,"RMSESD"]
  results[i,4:28] = model$resample[,"RMSE"]
  if (results[i,2] < best.loss) {
    best.loss = results[i,2]
    best.subset = selected
    best.model = model
  }
  if (i == 1) {
    # reuse the first model's fold assignments so every later model is scored on identical folds
    trControl = trainControl("repeatedcv", number = 5, repeats = 5, index = model$control$index)
  }
}
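An equivalent way to pin the folds (a sketch; caret's createMultiFolds() builds the same kind of index list up front):
folds = createMultiFolds(trn_y, k = 5, times = 5)
trControl = trainControl("repeatedcv", number = 5, repeats = 5, index = folds)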
results = results[order(results[,2]),]
results[,1:3]
## Variable Count RMSE RMSE SD
## [1,] 23 124.0509 2.850015
## [2,] 25 124.0515 2.825771
## [3,] 24 124.0549 2.823891
## [4,] 26 124.0633 2.829938
## [5,] 27 124.0652 2.835105
## [6,] 22 124.0711 2.902515
## [7,] 28 124.0755 2.839648
## [8,] 32 124.0887 2.824201
## [9,] 29 124.0895 2.831793
## [10,] 31 124.0895 2.825100
## [11,] 30 124.0933 2.830076
## [12,] 21 124.1285 2.912591
## [13,] 20 124.1992 2.941170
## [14,] 19 124.2875 2.974100
## [15,] 18 124.3825 2.972516
## [16,] 17 124.4857 3.018326
## [17,] 16 124.4980 3.025227
## [18,] 15 124.6198 3.108199
## [19,] 14 124.7121 3.160405
## [20,] 13 124.9464 3.084340
## [21,] 12 125.1789 3.128515
## [22,] 11 125.7637 3.150650
## [23,] 10 127.4467 3.349085
## [24,] 9 129.8103 3.535201
## [25,] 8 130.5367 3.672816
## [26,] 7 133.5558 3.579327
## [27,] 6 137.1651 3.735532
## [28,] 5 140.8570 4.252061
## [29,] 4 147.7805 5.507917
## [30,] 3 164.6736 8.504385
## [31,] 2 178.6657 8.131986
## [32,] 1 241.3267 10.763414
resample.count = nrow(best.model$resample)
best.model$results
## parameter RMSE Rsquared RMSESD RsquaredSD
## 1 none 124.0509 0.9534889 2.850015 0.003240858
best.model$resample
## RMSE Rsquared Resample
## 1 124.5772 0.9524852 Fold1.Rep1
## 2 127.1163 0.9563523 Fold2.Rep1
## 3 120.2275 0.9519246 Fold3.Rep1
## 4 121.2666 0.9559187 Fold4.Rep1
## 5 127.1506 0.9513295 Fold5.Rep1
## 6 125.6395 0.9516227 Fold1.Rep2
## 7 123.1399 0.9532161 Fold2.Rep2
## 8 122.6864 0.9561940 Fold3.Rep2
## 9 122.1094 0.9501842 Fold4.Rep2
## 10 127.5716 0.9566635 Fold5.Rep2
## 11 126.5753 0.9539469 Fold1.Rep3
## 12 121.2510 0.9480023 Fold2.Rep3
## 13 123.1746 0.9555970 Fold3.Rep3
## 14 126.1254 0.9483468 Fold4.Rep3
## 15 123.0001 0.9602151 Fold5.Rep3
## 16 129.7803 0.9500548 Fold1.Rep4
## 17 122.2231 0.9540476 Fold2.Rep4
## 18 122.9460 0.9525887 Fold3.Rep4
## 19 121.0668 0.9608974 Fold4.Rep4
## 20 123.5350 0.9500228 Fold5.Rep4
## 21 127.8841 0.9557397 Fold1.Rep5
## 22 121.5952 0.9517692 Fold2.Rep5
## 23 118.2577 0.9521097 Fold3.Rep5
## 24 125.3027 0.9547981 Fold4.Rep5
## 25 127.0693 0.9531948 Fold5.Rep5
mean(best.model$resample[,"RMSE"])
## [1] 124.0509
sd(best.model$resample[,"RMSE"])
## [1] 2.850015
# standard error of the mean: check out "?oneSE"
sd(best.model$resample[,"RMSE"]) / sqrt(resample.count)
## [1] 0.5700031
# two-sample (Welch) t-test, treating the two models' resamples as independent
t.test(results[1,4:28], results[2,4:28])
##
## Welch Two Sample t-test
##
## data: results[1, 4:28] and results[2, 4:28]
## t = -0.00079593, df = 47.996, p-value = 0.9994
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.614548 1.613270
## sample estimates:
## mean of x mean of y
## 124.0509 124.0515
difference = results[1,"RMSE"] - results[2,"RMSE"]
difference.stderr = sqrt((results[1,"RMSE SD"]^2) / resample.count +
(results[2,"RMSE SD"]^2) / resample.count)
test.statistic = difference / difference.stderr
test.statistic
## RMSE
## -0.0007959327
df = (difference.stderr^4) /
((results[1,"RMSE SD"]^4) / ((resample.count^2)*(resample.count-1)) +
(results[2,"RMSE SD"]^4) / ((resample.count^2)*(resample.count-1)))
df
## RMSE SD
## 47.9965
2*pt(-abs(test.statistic), df)
## RMSE
## 0.9993682
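For reference, the hand computation above follows Welch's t statistic with the Welch-Satterthwaite degrees of freedom (here $n_1 = n_2 = 25$ resamples):

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad \nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$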
# paired (one-sample) t-test on fold-wise differences: valid here because index in trainControl() gave both models identical folds
difference = results[1,4:28] - results[2,4:28]
t.test(difference)
##
## One Sample t-test
##
## data: difference
## t = -0.035759, df = 24, p-value = 0.9718
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.03751298 0.03623521
## sample estimates:
## mean of x
## -0.000638883
mean.difference = mean(difference)
difference.stderr = sqrt(var(difference) / resample.count)
test.statistic = mean.difference / difference.stderr
test.statistic
## [1] -0.03575924
df = resample.count - 1
df
## [1] 24
2*pt(-abs(test.statistic), df)
## [1] 0.9717701
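The paired version tests the mean fold-wise difference directly, with $d_i$ the per-fold RMSE differences and $n = 25$:

$$t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad \mathrm{df} = n - 1$$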
# comparing the t and normal distributions
x = seq(-4, 4, 0.001)
plot(x, dt(x, 30), typ = "l", lty = "dotted", xlab = "", ylab = "")
lines(x, dnorm(x))
lines(x, dt(x, 5), lty="dotted", col = "red")
legend("topleft", legend = c("t(5)", "t(30)", "norm(0,1)"),
lty = c("dotted", "dotted", "solid"),
col = c("red", "black", "black"))
tuneGrid = expand.grid(alpha = c(0, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 1),
lambda = c(0, 0.1, 1, 10, 100))
model = train(as.matrix(trn_X_new[,best.subset]), trn_y, method = "glmnet",
              trControl = trControl, tuneGrid = tuneGrid)
model
## glmnet
##
## 291605 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 233284, 233285, 233284, 233283, 233284, 233286, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared
## 0.00 0.0 133.1263 0.9469664
## 0.00 0.1 133.1263 0.9469664
## 0.00 1.0 133.1263 0.9469664
## 0.00 10.0 133.1263 0.9469664
## 0.00 100.0 139.4779 0.9422774
## 0.10 0.0 124.5517 0.9531257
## 0.10 0.1 124.5517 0.9531257
## 0.10 1.0 124.5642 0.9531170
## 0.10 10.0 127.8952 0.9506725
## 0.10 100.0 146.4523 0.9375175
## 0.25 0.0 124.2576 0.9533400
## 0.25 0.1 124.2576 0.9533400
## 0.25 1.0 124.6625 0.9530466
## 0.25 10.0 128.7765 0.9500575
## 0.25 100.0 151.5284 0.9360753
## 0.50 0.0 124.3212 0.9532924
## 0.50 0.1 124.3212 0.9532924
## 0.50 1.0 125.0899 0.9527347
## 0.50 10.0 130.8518 0.9485646
## 0.50 100.0 163.3900 0.9324531
## 0.75 0.0 124.5133 0.9531590
## 0.75 0.1 124.5133 0.9531590
## 0.75 1.0 125.4925 0.9524360
## 0.75 10.0 133.8025 0.9463458
## 0.75 100.0 177.2371 0.9283531
## 0.90 0.0 124.4005 0.9532382
## 0.90 0.1 124.4054 0.9532352
## 0.90 1.0 125.7694 0.9522303
## 0.90 10.0 135.0048 0.9454623
## 0.90 100.0 187.9930 0.9229503
## 0.95 0.0 124.3848 0.9532486
## 0.95 0.1 124.3924 0.9532433
## 0.95 1.0 125.8666 0.9521577
## 0.95 10.0 135.3254 0.9452351
## 0.95 100.0 192.1861 0.9201155
## 1.00 0.0 124.4168 0.9532236
## 1.00 0.1 124.4243 0.9532192
## 1.00 1.0 125.9606 0.9520874
## 1.00 10.0 135.6534 0.9450020
## 1.00 100.0 195.7072 0.9179372
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.25 and lambda = 0.1.
predictions = predict(model, tst_X_new[, best.subset])
predictions[predictions < 0] = 0.0 # the target is non-negative, so clamp negative predictions
output = data.frame(Index = 1:length(predictions), Prediction = predictions)
write.csv(output, "predictions.csv", quote=F, row.names = F)
Sys.time() - start.time
## Time difference of 18.7484 mins
lambda = model$results[which.min(model$results[,"RMSE"]),"lambda"]
coef(model$finalModel, s = lambda)
## 24 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.166374e+00
## NCD_H1 -2.395074e-01
## NCD_H2 -1.960525e-01
## NCD_H3 -2.254694e-01
## NCD_H4 -4.295290e-01
## NCD_H7 -3.657048e-01
## NCD_M -1.395812e-01
## AI_H2 -7.892024e-02
## AI_H3 1.248439e-01
## AI_H4 -1.681806e-01
## AI_H7 5.151869e+04
## AI_M 3.740595e+04
## ASNA_H1 9.012873e+04
## ASNA_H2 -2.326409e+04
## ASNA_H3 8.965083e+04
## ASNA_H5 1.512967e+05
## BL_M 4.176294e+04
## ASNAC_H1 -1.397524e+05
## AT_H3 1.601273e-01
## AT_H5 5.447866e-02
## AT_H6 6.638809e-02
## AT_H7 1.189471e-01
## ADL_H1 1.739765e-01
## ADL_H3 1.795228e+05