We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations.
2(a) What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.
The probability that the jth observation is chosen as the first bootstrap observation is 1/n, so the probability that the first bootstrap observation is not the jth observation is 1 - 1/n.
2(b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?
Because sampling is done with replacement, the second draw is again uniform over all n observations: the probability that the jth observation is chosen as the second bootstrap observation is 1/n, so the probability that the second bootstrap observation is not the jth observation is 1 - 1/n.
2(c) Argue that the probability that the jth observation is not in the bootstrap sample is (1 - 1/n)^n.
As we are sampling with replacement to generate the bootstrap sample, the n draws are made independently of one another, and by parts (a) and (b) each individual draw fails to select the jth observation with probability 1 - 1/n. The probability that the jth observation is absent from the entire bootstrap sample is therefore the product of these n individual probabilities.
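In equation form,

\[
\Pr(\text{observation } j \text{ is not in the bootstrap sample})
= \prod_{i=1}^{n} \Pr(\text{observation } j \text{ is not the } i\text{th draw})
= \left(1 - \frac{1}{n}\right)^{n}.
\]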
2(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?
For n = 5, 1 - (1 - 1/n)^n = 1 - (1 - 1/5)^5 = 0.67232
2(e) When n = 100, what is the probability that the jth observation is in the bootstrap sample?
For n = 100, 1 - (1 - 1/n)^n = 1 - (1 - 1/100)^100 = 0.6339677
2(f) When n = 10,000, what is the probability that the jth observation is in the bootstrap sample?
For n = 10000, 1 - (1 - 1/n)^n = 1 - (1 - 1/10000)^10000 = 0.632139
2(g) Create a plot that displays, for each integer value of n from 1 to 100,000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe.
n = 1:100000
# Probability that observation j appears in a bootstrap sample of size n,
# plotted against n on a log-scaled horizontal axis
plot(n, 1 - (1 - 1/n)^n, type = "l", log = "x")
abline(h = 1 - exp(-1), lty = "dotted")   # limiting value 1 - exp(-1)
The limit of (1 - 1/n)^n as n goes to infinity is exp(-1); so as n gets larger we see 1 - (1 - 1/n)^n approach 1 - exp(-1), which is approximately equal to 0.632.
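As a quick numerical check (a small sketch, not part of the original output; n.check is just a throwaway name), the probability can be tabulated for a few sample sizes alongside the limiting value:

n.check = c(5, 100, 10000, 1e6)
data.frame(n = n.check, probability = 1 - (1 - 1/n.check)^n.check, limit = 1 - exp(-1))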
2(h) We will now investigate numerically the probability that a bootstrap sample of size n = 100 contains the jth observation. Here j = 4. We repeatedly create bootstrap samples, and each time we record whether or not the fourth observation is contained in the bootstrap sample.
n = 10000                 # number of simulated bootstrap samples
store = rep(NA, n)
for (i in 1:n) {
  # Draw a bootstrap sample of size 100 with replacement and record
  # whether observation 4 appears in it
  store[i] = sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
## [1] 0.633
Comment on the results obtained.
The mean(store) call yields the proportion of the simulated bootstrap samples that contain the 4th observation.
As expected, this proportion is approximately equal to 1 - (1 - 1/100)^100, which in turn is approximately equal to 1 - exp(-1) ≈ 0.632.
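For reference, the theoretical quantities can be computed directly and compared with mean(store):

1 - (1 - 1/100)^100   # exact probability for n = 100
1 - exp(-1)           # limiting value as n grows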
We will now perform cross-validation on a simulated data set.
8(a) Generate a simulated data set as follows:
set.seed(1)
# y = rnorm(100) # y is replaced before we use it
n = 100
x = rnorm(n)
y = x - 2 * (x^2) + rnorm(n)
In this data set, what is n and what is p? Write out the model used to generate the data in equation form.
Here n = 100 (the number of observations) and p = 2 (the predictors are \(X\) and \(X^2\)). The model used to generate the data is

\[ Y = X - 2X^2 + \epsilon, \qquad \epsilon \sim N(0, 1). \]
8(b) Create a scatterplot of X against Y. Comment on what you find.
plot(x, y, type = "p")   # scatterplot of the simulated data
As we might expect from the data-generating model, the negative coefficient on the quadratic term produces a downward-opening (inverted) parabola, so the points trace an upside-down bowl with noise around it.
8(c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:
8(c)i. \(Y\) = \(\beta_0\) + \(\beta_1\) * \(X\) + \(\epsilon\)
8(c)ii. \(Y\) = \(\beta_0\) + \(\beta_1\) * \(X\) + \(\beta_2\) * \(X^2\) + \(\epsilon\)
8(c)iii. \(Y\) = \(\beta_0\) + \(\beta_1\) * \(X\) + \(\beta_2\) * \(X^2\) + \(\beta_3\) * \(X^3\) + \(\epsilon\)
8(c)iv. \(Y\) = \(\beta_0\) + \(\beta_1\) * \(X\) + \(\beta_2\) * \(X^2\) + \(\beta_3\) * \(X^3\) + \(\beta_4\) * \(X^4\) + \(\epsilon\)
Note you may find it helpful to use the data.frame() function to create a single data set containing both X and Y.
seed = sample.int(2147483647, size = 1)
seed
## [1] 1414928891
set.seed(seed)
library(boot)
Simulation = data.frame(y = y, x = x)
model1.linear = glm(y ~ poly(x,1), data = Simulation, family = gaussian)
cv.glm(Simulation, model1.linear)$delta[1]
## [1] 7.288162
model1.quadratic = glm(y ~ poly(x,2), data = Simulation, family = gaussian)
cv.glm(Simulation, model1.quadratic)$delta[1]
## [1] 0.9374236
model1.cubic = glm(y ~ poly(x,3), data = Simulation, family = gaussian)
cv.glm(Simulation, model1.cubic)$delta[1]
## [1] 0.9566218
model1.quartic = glm(y ~ poly(x,4), data = Simulation, family = gaussian)
cv.glm(Simulation, model1.quartic)$delta[1]
## [1] 0.9539049
8(d) Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?
seed = sample.int(2147483647, size = 1)
seed
## [1] 2044045886
set.seed(seed)
model2.linear = glm(y ~ poly(x,1), data = Simulation, family = gaussian)
cv.glm(Simulation, model2.linear)$delta[1]
## [1] 7.288162
model2.quadratic = glm(y ~ poly(x,2), data = Simulation, family = gaussian)
cv.glm(Simulation, model2.quadratic)$delta[1]
## [1] 0.9374236
model2.cubic = glm(y ~ poly(x,3), data = Simulation, family = gaussian)
cv.glm(Simulation, model2.cubic)$delta[1]
## [1] 0.9566218
model2.quartic = glm(y ~ poly(x,4), data = Simulation, family = gaussian)
cv.glm(Simulation, model2.quartic)$delta[1]
## [1] 0.9539049
Yes; the results are identical to those in (c). Leave-one-out cross-validation uses every one of the n possible train/validation splits in which a single observation is held out for validation and the remaining n - 1 observations are used for training, so there is no randomness in the procedure and the random seed has no effect.
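To make this explicit, here is a minimal sketch of LOOCV for the quadratic model written out as a loop (it reuses the Simulation data frame from part (c); loocv.errors and fit are just illustrative names). Every observation is held out exactly once, so no random sampling is involved and the answer cannot depend on the seed:

# Manual LOOCV for the quadratic fit: each observation is held out exactly once
loocv.errors = rep(NA, nrow(Simulation))
for (i in 1:nrow(Simulation)) {
  fit = glm(y ~ poly(x, 2), data = Simulation[-i, ], family = gaussian)
  loocv.errors[i] = (Simulation$y[i] - predict(fit, newdata = Simulation[i, ]))^2
}
mean(loocv.errors)   # should match cv.glm(Simulation, model1.quadratic)$delta[1]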
8(e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.
The quadratic model has the smallest LOOCV error. Yes; this is what we expected, because the quadratic model matches the true data-generating model, and since held-out data are used for validation, the additional flexibility of the cubic and quartic models does not reduce the estimated test error.
8(f) Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
summary(model1.quadratic)
##
## Call:
## glm(formula = y ~ poly(x, 2), family = gaussian, data = Simulation)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9650 -0.6254 -0.1288 0.5803 2.2700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5500 0.0958 -16.18 < 2e-16 ***
## poly(x, 2)1 6.1888 0.9580 6.46 4.18e-09 ***
## poly(x, 2)2 -23.9483 0.9580 -25.00 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9178258)
##
## Null deviance: 700.852 on 99 degrees of freedom
## Residual deviance: 89.029 on 97 degrees of freedom
## AIC: 280.17
##
## Number of Fisher Scoring iterations: 2
mean(Simulation$y)
## [1] -1.550023
confint(model1.linear)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -2.059640 -1.040405
## poly(x, 1) 1.092648 11.285003
confint(model1.quadratic)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.737793 -1.362252
## poly(x, 2)1 4.311117 8.066534
## poly(x, 2)2 -25.826014 -22.070596
confint(model1.cubic)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.738695 -1.361350
## poly(x, 3)1 4.302102 8.075549
## poly(x, 3)2 -25.835029 -22.061581
## poly(x, 3)3 -1.622618 2.150829
confint(model1.quartic)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.7379932 -1.362052
## poly(x, 4)1 4.3091193 8.068532
## poly(x, 4)2 -25.8280112 -22.068599
## poly(x, 4)3 -1.6156006 2.143812
## poly(x, 4)4 -0.6226113 3.136801
As we would expect, the confidence intervals for the coefficients of the linear and quadratic terms exclude zero (they are statistically significant), while the confidence intervals for the coefficients of the cubic and quartic terms include zero (they are not statistically significant). This agrees with the cross-validation results, which also selected the quadratic model. The confidence interval for the intercept also excludes zero, reflecting the nonzero mean of the response: the population mean of the response is -2, because the mean of a squared standard normal variable is 1.
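The value -2 follows from taking expectations in the data-generating equation:

\[
E[Y] = E[X] - 2\,E[X^2] + E[\epsilon] = 0 - 2 \cdot 1 + 0 = -2,
\]

since \(X \sim N(0,1)\) implies \(E[X^2] = \operatorname{Var}(X) = 1\). Because the columns produced by poly() are orthogonal to the intercept, the fitted intercept equals the sample mean of y, -1.55, the finite-sample counterpart of this population value.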
set.seed(2^17 - 1)
options(width = 100)
start.time = Sys.time()
namelist = c("fixed_acidity", "volatile_acidity", "citric_acid",
"residual_sugar", "chlorides", "free_sulfur_dioxide",
"total_sulfur_dioxide", "density", "pH",
"sulphates", "alcohol")
trn_X = read.csv("C:/Data/wine/trn_X.csv", header = F)
trn_y = scan("C:/Data/wine/trn_y.txt")
tst_X = read.csv("C:/Data/wine/tst_X.csv", header = F)
trn_X = trn_X[-c(1780,3619),] # removing an outlier and a
trn_y = trn_y[-c(1780,3619)] # high-leverage observation
p = ncol(trn_X)
augmented_trn_X = trn_X
augmented_tst_X = tst_X
# Append the square of each original predictor
for (i in 1:p) {
  augmented_trn_X = cbind(augmented_trn_X, trn_X[,i] * trn_X[,i])
  augmented_tst_X = cbind(augmented_tst_X, tst_X[,i] * tst_X[,i])
  namelist = c(namelist, paste(namelist[i], "_2", sep = ""))
}
# Append pairwise interaction terms for every pair of original predictors
for (i in 1:(p-1)) {
  for (j in (i+1):p) {
    augmented_trn_X = cbind(augmented_trn_X, trn_X[,i] * trn_X[,j])
    augmented_tst_X = cbind(augmented_tst_X, tst_X[,i] * tst_X[,j])
    namelist = c(namelist, paste(namelist[i], ".", namelist[j], sep = ""))
  }
}
colnames(augmented_trn_X) = namelist
colnames(augmented_tst_X) = namelist
max.p = ncol(augmented_trn_X)
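# Sanity check (added for clarity, not part of the original script): with the
# 11 original predictors we expect 11 squared terms plus choose(11, 2) = 55
# interactions, for 11 + 11 + 55 = 77 columns in the augmented design
max.p == 11 + 11 + choose(11, 2)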
library(leaps)
# Backward stepwise selection over the augmented predictor set, up to max.p variables
subsets = regsubsets(augmented_trn_X, trn_y, nvmax = max.p, method = "backward")
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: ggplot2
results = array(0, c(max.p, 3))
colnames(results) = list("Variable Count", "RMSE", "RMSE SD")
best.loss = Inf
best.subset = NA
best.model = NA
for (i in 1:max.p) {
  # Predictors chosen by backward stepwise selection at subset size i
  selected = which(summary(subsets)$outmat[i,] == "*")
  # caret needs one seed per resample plus a final seed: 5 folds x 30 repeats + 1 = 151
  seeds = vector(mode = "list", length = 151)
  for (seed in 1:length(seeds)) {
    seeds[[seed]] = sample.int(1000, 1)
  }
  # Estimate test RMSE by 5-fold cross-validation repeated 30 times
  trControl = trainControl("repeatedcv", number = 5, repeats = 30, seeds = seeds)
  model = train(as.matrix(augmented_trn_X[,selected]), trn_y, method = "glm", trControl = trControl)
  results[i,1] = i
  results[i,2] = model$results[,"RMSE"]
  results[i,3] = model$results[,"RMSESD"]
  # Keep the subset with the lowest cross-validated RMSE
  if (results[i,2] < best.loss) {
    best.loss = results[i,2]
    best.subset = selected
    best.model = model
  }
}
results[order(results[,2]),]
## Variable Count RMSE RMSE SD
## [1,] 41 0.7085485 0.01871094
## [2,] 36 0.7092587 0.01949993
## [3,] 34 0.7093678 0.01824183
## [4,] 33 0.7094185 0.01857453
## [5,] 40 0.7094698 0.01781984
## [6,] 35 0.7097984 0.01873566
## [7,] 30 0.7098309 0.01940009
## [8,] 31 0.7101589 0.01666739
## [9,] 28 0.7102652 0.01862940
## [10,] 32 0.7104037 0.01907901
## [11,] 38 0.7104527 0.01905178
## [12,] 37 0.7104535 0.01948504
## [13,] 42 0.7108457 0.02281870
## [14,] 27 0.7110959 0.01815007
## [15,] 29 0.7111491 0.02064502
## [16,] 26 0.7113752 0.01857074
## [17,] 25 0.7114544 0.01884820
## [18,] 24 0.7119206 0.01580491
## [19,] 22 0.7122586 0.01966434
## [20,] 59 0.7125025 0.01640043
## [21,] 23 0.7125048 0.01794292
## [22,] 60 0.7126336 0.01778928
## [23,] 39 0.7131046 0.02285295
## [24,] 61 0.7131908 0.01865447
## [25,] 21 0.7133921 0.01866756
## [26,] 18 0.7135498 0.01864134
## [27,] 58 0.7138980 0.02168585
## [28,] 67 0.7141555 0.01903082
## [29,] 17 0.7141823 0.01851187
## [30,] 65 0.7142260 0.02231889
## [31,] 20 0.7142270 0.01721641
## [32,] 68 0.7142679 0.01931216
## [33,] 51 0.7142817 0.02142288
## [34,] 48 0.7143165 0.02361784
## [35,] 47 0.7143901 0.02421665
## [36,] 62 0.7145024 0.02325654
## [37,] 19 0.7146481 0.01856070
## [38,] 64 0.7146602 0.01749679
## [39,] 46 0.7148674 0.02278700
## [40,] 16 0.7149832 0.01728402
## [41,] 63 0.7150604 0.02019288
## [42,] 57 0.7151491 0.02779784
## [43,] 45 0.7151819 0.02681959
## [44,] 52 0.7151943 0.02353478
## [45,] 66 0.7155000 0.01984962
## [46,] 56 0.7157937 0.02381194
## [47,] 15 0.7158139 0.01815106
## [48,] 53 0.7161532 0.02548414
## [49,] 70 0.7161900 0.02086754
## [50,] 69 0.7162934 0.02257229
## [51,] 54 0.7164297 0.02502509
## [52,] 74 0.7168621 0.01932120
## [53,] 43 0.7169187 0.02938766
## [54,] 55 0.7170318 0.02431918
## [55,] 73 0.7173242 0.02077185
## [56,] 71 0.7174705 0.02051052
## [57,] 14 0.7174737 0.01922648
## [58,] 75 0.7177734 0.02095710
## [59,] 44 0.7179741 0.02572717
## [60,] 50 0.7181618 0.02597910
## [61,] 72 0.7182445 0.02120690
## [62,] 13 0.7212470 0.01764907
## [63,] 49 0.7213231 0.03786333
## [64,] 76 0.7214024 0.03062883
## [65,] 77 0.7225585 0.02832052
## [66,] 12 0.7241923 0.01607927
## [67,] 11 0.7268640 0.01931085
## [68,] 10 0.7298327 0.01559887
## [69,] 9 0.7312996 0.01837394
## [70,] 8 0.7382329 0.01852615
## [71,] 7 0.7394196 0.01958222
## [72,] 6 0.7428632 0.01765997
## [73,] 5 0.7498079 0.01789484
## [74,] 4 0.7586798 0.01893734
## [75,] 3 0.7603470 0.01915802
## [76,] 2 0.7650833 0.01808504
## [77,] 1 0.8647357 0.01701741
predictions = predict(best.model, newdata = as.matrix(augmented_tst_X[,best.subset]))
output = data.frame(Index = 1:length(predictions), Prediction = predictions)
write.csv(output, "C:/Data/wine/predictions.csv", quote=F, row.names = F)
Sys.time() - start.time
## Time difference of 10.90899 mins
summary(best.model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9924 -0.4829 -0.0186 0.4355 3.0043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.159e+04 1.357e+03 8.542 < 2e-16 ***
## volatile_acidity -1.550e+01 2.655e+00 -5.839 5.68e-09 ***
## citric_acid 8.558e+01 3.667e+01 2.334 0.019651 *
## chlorides -1.146e+03 4.747e+02 -2.414 0.015824 *
## free_sulfur_dioxide -3.417e+00 9.062e-01 -3.770 0.000165 ***
## density -2.440e+04 2.788e+03 -8.751 < 2e-16 ***
## pH 4.696e+02 5.413e+01 8.676 < 2e-16 ***
## sulphates 1.593e+02 6.769e+01 2.354 0.018628 *
## alcohol -1.308e+01 6.422e+00 -2.036 0.041812 *
## fixed_acidity_2 -3.143e-02 9.920e-03 -3.168 0.001544 **
## citric_acid_2 -1.009e+00 3.077e-01 -3.278 0.001054 **
## residual_sugar_2 -4.042e-03 6.379e-04 -6.336 2.63e-10 ***
## chlorides_2 1.520e+01 7.452e+00 2.040 0.041465 *
## free_sulfur_dioxide_2 -1.263e-04 3.444e-05 -3.668 0.000248 ***
## density_2 1.285e+04 1.437e+03 8.941 < 2e-16 ***
## pH_2 1.771e+00 4.313e-01 4.107 4.09e-05 ***
## alcohol_2 2.491e-02 1.361e-02 1.830 0.067325 .
## fixed_acidity.density -1.120e+00 4.778e-01 -2.344 0.019129 *
## fixed_acidity.pH 5.056e-01 1.230e-01 4.110 4.03e-05 ***
## fixed_acidity.sulphates 2.591e-01 1.404e-01 1.846 0.065013 .
## volatile_acidity.pH 2.818e+00 8.180e-01 3.445 0.000578 ***
## volatile_acidity.alcohol 4.719e-01 9.370e-02 5.036 4.97e-07 ***
## citric_acid.density -8.903e+01 3.633e+01 -2.450 0.014312 *
## citric_acid.pH 1.179e+00 6.775e-01 1.740 0.081938 .
## residual_sugar.chlorides -9.352e-01 2.418e-01 -3.867 0.000112 ***
## residual_sugar.free_sulfur_dioxide -1.271e-03 3.747e-04 -3.394 0.000697 ***
## residual_sugar.pH 7.402e-02 6.053e-03 12.230 < 2e-16 ***
## chlorides.density 1.194e+03 4.778e+02 2.499 0.012485 *
## chlorides.pH -1.012e+01 4.543e+00 -2.228 0.025956 *
## chlorides.sulphates -1.742e+01 5.667e+00 -3.074 0.002124 **
## free_sulfur_dioxide.total_sulfur_dioxide -9.279e-05 2.177e-05 -4.262 2.07e-05 ***
## free_sulfur_dioxide.density 3.396e+00 9.035e-01 3.759 0.000173 ***
## free_sulfur_dioxide.sulphates 2.943e-02 7.069e-03 4.163 3.21e-05 ***
## free_sulfur_dioxide.alcohol 6.166e-03 1.313e-03 4.697 2.73e-06 ***
## total_sulfur_dioxide.density 9.763e-03 1.999e-03 4.884 1.08e-06 ***
## total_sulfur_dioxide.sulphates -1.480e-02 3.607e-03 -4.103 4.16e-05 ***
## density.pH -4.853e+02 5.434e+01 -8.931 < 2e-16 ***
## density.sulphates -1.631e+02 6.776e+01 -2.407 0.016139 *
## density.alcohol 1.355e+01 6.272e+00 2.160 0.030832 *
## pH.sulphates 2.459e+00 7.608e-01 3.233 0.001237 **
## pH.alcohol -3.189e-01 1.240e-01 -2.573 0.010119 *
## sulphates.alcohol -4.219e-01 1.418e-01 -2.976 0.002937 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.4936388)
##
## Null deviance: 3033.9 on 3915 degrees of freedom
## Residual deviance: 1912.4 on 3874 degrees of freedom
## AIC: 8392.4
##
## Number of Fisher Scoring iterations: 2