We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations.
2(a) What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.
The probability that the jth observation is chosen as the first bootstrap observation is 1/n, so the probability that the first bootstrap observation is not the jth observation is 1 - 1/n.
2(b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?
Because sampling is done with replacement, the second draw is again uniform over all n observations: the probability that the jth observation is chosen as the second bootstrap observation is 1/n, so the probability that the second bootstrap observation is not the jth observation is 1 - 1/n.
2(c) Argue that the probability that the jth observation is not in the bootstrap sample is (1 - 1/n)^n.
As we are sampling with replacement to generate the bootstrap sample, the n draws are made independently of one another, and by parts (a) and (b) each individual draw fails to select the jth observation with probability 1 - 1/n. The probability that the jth observation is absent from the entire bootstrap sample is therefore the product of these n individual probabilities.
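In equation form,

\[
\Pr(\text{observation } j \text{ is not in the bootstrap sample})
= \prod_{i=1}^{n} \Pr(\text{observation } j \text{ is not the } i\text{th draw})
= \left(1 - \frac{1}{n}\right)^{n}.
\]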
2(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?
For n = 5, 1 - (1 - 1/n)^n = 1 - (1 - 1/5)^5 = 0.67232
2(e) When n = 100, what is the probability that the jth observation is in the bootstrap sample?
For n = 100, 1 - (1 - 1/n)^n = 1 - (1 - 1/100)^100 = 0.6339677
2(f) When n = 10,000, what is the probability that the jth observation is in the bootstrap sample?
For n = 10000, 1 - (1 - 1/n)^n = 1 - (1 - 1/10000)^10000 = 0.632139
2(g) Create a plot that displays, for each integer value of n from 1 to 100,000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe.
n = 1:100000
# Probability that observation j appears in a bootstrap sample of size n,
# plotted against n on a log-scaled horizontal axis
plot(n, 1 - (1 - 1/n)^n, type = "l", log = "x")
abline(h = 1 - exp(-1), lty = "dotted")   # limiting value 1 - exp(-1)
The limit of (1 - 1/n)^n as n goes to infinity is exp(-1); so as n gets larger we see 1 - (1 - 1/n)^n approach 1 - exp(-1), which is approximately equal to 0.632.
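As a quick numerical check (a small sketch, not part of the original output; n.check is just a throwaway name), the probability can be tabulated for a few sample sizes alongside the limiting value:

n.check = c(5, 100, 10000, 1e6)
data.frame(n = n.check, probability = 1 - (1 - 1/n.check)^n.check, limit = 1 - exp(-1))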
2(h) We will now investigate numerically the probability that a bootstrap sample of size n = 100 contains the jth observation. Here j = 4. We repeatedly create bootstrap samples, and each time we record whether or not the fourth observation is contained in the bootstrap sample.
n = 10000                 # number of simulated bootstrap samples
store = rep(NA, n)
for (i in 1:n) {
  # Draw a bootstrap sample of size 100 with replacement and record
  # whether observation 4 appears in it
  store[i] = sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
## [1] 0.633
Comment on the results obtained.
The mean(store) call yields the proportion of the simulated bootstrap samples that contain the 4th observation.
As expected, this proportion is approximately equal to 1 - (1 - 1/100)^100, which in turn is approximately equal to 1 - exp(-1) ≈ 0.632.
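For reference, the theoretical quantities can be computed directly and compared with mean(store):

1 - (1 - 1/100)^100   # exact probability for n = 100
1 - exp(-1)           # limiting value as n grows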
We will now perform cross-validation on a simulated data set.
8(a) Generate a simulated data set as follows:
set.seed(1)
# y = rnorm(100) # y is replaced before we use it
n = 100
x = rnorm(n)
y = x - 2 * (x^2) + rnorm(n)
In this data set, what is n and what is p? Write out the model used to generate the data in equation form.
Here n = 100 (the number of observations) and p = 2 (the predictors are \(X\) and \(X^2\)). The model used to generate the data is

\[ Y = X - 2X^2 + \epsilon, \qquad \epsilon \sim N(0, 1). \]
8(b) Create a scatterplot of X against Y. Comment on what you find.
plot(x, y, type = "p")   # scatterplot of the simulated data
As we might expect from the data-generating model, the negative coefficient on the quadratic term produces a downward-opening (inverted) parabola, so the points trace an upside-down bowl with noise around it.
8(c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:
8(c)i. \(Y\) = \(\beta_0\) + \(\beta_1\) * \(X\) + \(\epsilon\)
8(c)ii. \(Y\) = \(\beta_0\) + \(\beta_1\) * \(X\) + \(\beta_2\) * \(X^2\) + \(\epsilon\)
8(c)iii. \(Y\) = \(\beta_0\) + \(\beta_1\) * \(X\) + \(\beta_2\) * \(X^2\) + \(\beta_3\) * \(X^3\) + \(\epsilon\)
8(c)iv. \(Y\) = \(\beta_0\) + \(\beta_1\) * \(X\) + \(\beta_2\) * \(X^2\) + \(\beta_3\) * \(X^3\) + \(\beta_4\) * \(X^4\) + \(\epsilon\)
Note you may find it helpful to use the data.frame() function to create a single data set containing both X and Y.
seed = sample.int(2147483647, size = 1)
seed
## [1] 1414928891
set.seed(seed)
library(boot)
Simulation = data.frame(y = y, x = x)
model1.linear = glm(y ~ poly(x,1), data = Simulation, family = gaussian)
cv.glm(Simulation, model1.linear)$delta[1]
## [1] 7.288162
model1.quadratic = glm(y ~ poly(x,2), data = Simulation, family = gaussian)
cv.glm(Simulation, model1.quadratic)$delta[1]
## [1] 0.9374236
model1.cubic = glm(y ~ poly(x,3), data = Simulation, family = gaussian)
cv.glm(Simulation, model1.cubic)$delta[1]
## [1] 0.9566218
model1.quartic = glm(y ~ poly(x,4), data = Simulation, family = gaussian)
cv.glm(Simulation, model1.quartic)$delta[1]
## [1] 0.9539049
8(d) Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?
seed = sample.int(2147483647, size = 1)
seed
## [1] 2044045886
set.seed(seed)
model2.linear = glm(y ~ poly(x,1), data = Simulation, family = gaussian)
cv.glm(Simulation, model2.linear)$delta[1]
## [1] 7.288162
model2.quadratic = glm(y ~ poly(x,2), data = Simulation, family = gaussian)
cv.glm(Simulation, model2.quadratic)$delta[1]
## [1] 0.9374236
model2.cubic = glm(y ~ poly(x,3), data = Simulation, family = gaussian)
cv.glm(Simulation, model2.cubic)$delta[1]
## [1] 0.9566218
model2.quartic = glm(y ~ poly(x,4), data = Simulation, family = gaussian)
cv.glm(Simulation, model2.quartic)$delta[1]
## [1] 0.9539049
Yes; the results are identical to those in (c). Leave-one-out cross-validation uses every one of the n possible train/validation splits in which a single observation is held out for validation and the remaining n - 1 observations are used for training, so there is no randomness in the procedure and the random seed has no effect.
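To make this explicit, here is a minimal sketch of LOOCV for the quadratic model written out as a loop (it reuses the Simulation data frame from part (c); loocv.errors and fit are just illustrative names). Every observation is held out exactly once, so no random sampling is involved and the answer cannot depend on the seed:

# Manual LOOCV for the quadratic fit: each observation is held out exactly once
loocv.errors = rep(NA, nrow(Simulation))
for (i in 1:nrow(Simulation)) {
  fit = glm(y ~ poly(x, 2), data = Simulation[-i, ], family = gaussian)
  loocv.errors[i] = (Simulation$y[i] - predict(fit, newdata = Simulation[i, ]))^2
}
mean(loocv.errors)   # should match cv.glm(Simulation, model1.quadratic)$delta[1]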
8(e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.
The quadratic model has the smallest LOOCV error. Yes; this is what we expected, because the quadratic model matches the true data-generating model, and since held-out data are used for validation, the additional flexibility of the cubic and quartic models does not reduce the estimated test error.
8(f) Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
summary(model1.quadratic)
##
## Call:
## glm(formula = y ~ poly(x, 2), family = gaussian, data = Simulation)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9650 -0.6254 -0.1288 0.5803 2.2700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5500 0.0958 -16.18 < 2e-16 ***
## poly(x, 2)1 6.1888 0.9580 6.46 4.18e-09 ***
## poly(x, 2)2 -23.9483 0.9580 -25.00 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9178258)
##
## Null deviance: 700.852 on 99 degrees of freedom
## Residual deviance: 89.029 on 97 degrees of freedom
## AIC: 280.17
##
## Number of Fisher Scoring iterations: 2
mean(Simulation$y)
## [1] -1.550023
confint(model1.linear)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -2.059640 -1.040405
## poly(x, 1) 1.092648 11.285003
confint(model1.quadratic)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.737793 -1.362252
## poly(x, 2)1 4.311117 8.066534
## poly(x, 2)2 -25.826014 -22.070596
confint(model1.cubic)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.738695 -1.361350
## poly(x, 3)1 4.302102 8.075549
## poly(x, 3)2 -25.835029 -22.061581
## poly(x, 3)3 -1.622618 2.150829
confint(model1.quartic)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.7379932 -1.362052
## poly(x, 4)1 4.3091193 8.068532
## poly(x, 4)2 -25.8280112 -22.068599
## poly(x, 4)3 -1.6156006 2.143812
## poly(x, 4)4 -0.6226113 3.136801
As we would expect, the confidence intervals for the coefficients of the linear and quadratic terms exclude zero (they are statistically significant), while the confidence intervals for the coefficients of the cubic and quartic terms include zero (they are not statistically significant). This agrees with the cross-validation results, which also selected the quadratic model. The confidence interval for the intercept also excludes zero, reflecting the nonzero mean of the response: the population mean of the response is -2, because the mean of a squared standard normal variable is 1.
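The value -2 follows from taking expectations in the data-generating equation:

\[
E[Y] = E[X] - 2\,E[X^2] + E[\epsilon] = 0 - 2 \cdot 1 + 0 = -2,
\]

since \(X \sim N(0,1)\) implies \(E[X^2] = \operatorname{Var}(X) = 1\). Because the columns produced by poly() are orthogonal to the intercept, the fitted intercept equals the sample mean of y, -1.55, the finite-sample counterpart of this population value.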
set.seed(2^17 - 1)
options(width = 100)
start.time = Sys.time()
namelist = c("fixed_acidity", "volatile_acidity", "citric_acid",
"residual_sugar", "chlorides", "free_sulfur_dioxide",
"total_sulfur_dioxide", "density", "pH",
"sulphates", "alcohol")
trn_X = read.csv("C:/Data/wine/trn_X.csv", header = F)
trn_y = scan("C:/Data/wine/trn_y.txt")
tst_X = read.csv("C:/Data/wine/tst_X.csv", header = F)
trn_X = trn_X[-c(1780,3619),] # removing an outlier and a
trn_y = trn_y[-c(1780,3619)] # high-leverage observation
p = ncol(trn_X)
augmented_trn_X = trn_X
augmented_tst_X = tst_X
# Append the square of each original predictor
for (i in 1:p) {
  augmented_trn_X = cbind(augmented_trn_X, trn_X[,i] * trn_X[,i])
  augmented_tst_X = cbind(augmented_tst_X, tst_X[,i] * tst_X[,i])
  namelist = c(namelist, paste(namelist[i], "_2", sep = ""))
}
# Append pairwise interaction terms for every pair of original predictors
for (i in 1:(p-1)) {
  for (j in (i+1):p) {
    augmented_trn_X = cbind(augmented_trn_X, trn_X[,i] * trn_X[,j])
    augmented_tst_X = cbind(augmented_tst_X, tst_X[,i] * tst_X[,j])
    namelist = c(namelist, paste(namelist[i], ".", namelist[j], sep = ""))
  }
}
colnames(augmented_trn_X) = namelist
colnames(augmented_tst_X) = namelist
max.p = ncol(augmented_trn_X)
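# Sanity check (added for clarity, not part of the original script): with the
# 11 original predictors we expect 11 squared terms plus choose(11, 2) = 55
# interactions, for 11 + 11 + 55 = 77 columns in the augmented design
max.p == 11 + 11 + choose(11, 2)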
library(leaps)
# Backward stepwise selection over the augmented predictor set, up to max.p variables
subsets = regsubsets(augmented_trn_X, trn_y, nvmax = max.p, method = "backward")
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: ggplot2
results = array(0, c(max.p, 3))
colnames(results) = list("Variable Count", "RMSE", "RMSE SD")
best.loss = Inf
best.subset = NA
best.model = NA
for (i in 1:max.p) {
  # Predictors chosen by backward stepwise selection at subset size i
  selected = which(summary(subsets)$outmat[i,] == "*")
  # caret needs one seed per resample plus a final seed: 5 folds x 30 repeats + 1 = 151
  seeds = vector(mode = "list", length = 151)
  for (seed in 1:length(seeds)) {
    seeds[[seed]] = sample.int(1000, 1)
  }
  # Estimate test RMSE by 5-fold cross-validation repeated 30 times
  trControl = trainControl("repeatedcv", number = 5, repeats = 30, seeds = seeds)
  model = train(as.matrix(augmented_trn_X[,selected]), trn_y, method = "glm", trControl = trControl)
  results[i,1] = i
  results[i,2] = model$results[,"RMSE"]
  results[i,3] = model$results[,"RMSESD"]
  # Keep the subset with the lowest cross-validated RMSE
  if (results[i,2] < best.loss) {
    best.loss = results[i,2]
    best.subset = selected
    best.model = model
  }
}
results[order(results[,2]),]
## Variable Count RMSE RMSE SD
## [1,] 41 0.7085485 0.01871094
## [2,] 36 0.7092587 0.01949993
## [3,] 34 0.7093678 0.01824183
## [4,] 33 0.7094185 0.01857453
## [5,] 40 0.7094698 0.01781984
## [6,] 35 0.7097984 0.01873566
## [7,] 30 0.7098309 0.01940009
## [8,] 31 0.7101589 0.01666739
## [9,] 28 0.7102652 0.01862940
## [10,] 32 0.7104037 0.01907901
## [11,] 38 0.7104527 0.01905178
## [12,] 37 0.7104535 0.01948504
## [13,] 42 0.7108457 0.02281870
## [14,] 27 0.7110959 0.01815007
## [15,] 29 0.7111491 0.02064502
## [16,] 26 0.7113752 0.01857074
## [17,] 25 0.7114544 0.01884820
## [18,] 24 0.7119206 0.01580491
## [19,] 22 0.7122586 0.01966434
## [20,] 59 0.7125025 0.01640043
## [21,] 23 0.7125048 0.01794292
## [22,] 60 0.7126336 0.01778928
## [23,] 39 0.7131046 0.02285295
## [24,] 61 0.7131908 0.01865447
## [25,] 21 0.7133921 0.01866756
## [26,] 18 0.7135498 0.01864134
## [27,] 58 0.7138980 0.02168585
## [28,] 67 0.7141555 0.01903082
## [29,] 17 0.7141823 0.01851187
## [30,] 65 0.7142260 0.02231889
## [31,] 20 0.7142270 0.01721641
## [32,] 68 0.7142679 0.01931216
## [33,] 51 0.7142817 0.02142288
## [34,] 48 0.7143165 0.02361784
## [35,] 47 0.7143901 0.02421665
## [36,] 62 0.7145024 0.02325654
## [37,] 19 0.7146481 0.01856070
## [38,] 64 0.7146602 0.01749679
## [39,] 46 0.7148674 0.02278700
## [40,] 16 0.7149832 0.01728402
## [41,] 63 0.7150604 0.02019288
## [42,] 57 0.7151491 0.02779784
## [43,] 45 0.7151819 0.02681959
## [44,] 52 0.7151943 0.02353478
## [45,] 66 0.7155000 0.01984962
## [46,] 56 0.7157937 0.02381194
## [47,] 15 0.7158139 0.01815106
## [48,] 53 0.7161532 0.02548414
## [49,] 70 0.7161900 0.02086754
## [50,] 69 0.7162934 0.02257229
## [51,] 54 0.7164297 0.02502509
## [52,] 74 0.7168621 0.01932120
## [53,] 43 0.7169187 0.02938766
## [54,] 55 0.7170318 0.02431918
## [55,] 73 0.7173242 0.02077185
## [56,] 71 0.7174705 0.02051052
## [57,] 14 0.7174737 0.01922648
## [58,] 75 0.7177734 0.02095710
## [59,] 44 0.7179741 0.02572717
## [60,] 50 0.7181618 0.02597910
## [61,] 72 0.7182445 0.02120690
## [62,] 13 0.7212470 0.01764907
## [63,] 49 0.7213231 0.03786333
## [64,] 76 0.7214024 0.03062883
## [65,] 77 0.7225585 0.02832052
## [66,] 12 0.7241923 0.01607927
## [67,] 11 0.7268640 0.01931085
## [68,] 10 0.7298327 0.01559887
## [69,] 9 0.7312996 0.01837394
## [70,] 8 0.7382329 0.01852615
## [71,] 7 0.7394196 0.01958222
## [72,] 6 0.7428632 0.01765997
## [73,] 5 0.7498079 0.01789484
## [74,] 4 0.7586798 0.01893734
## [75,] 3 0.7603470 0.01915802
## [76,] 2 0.7650833 0.01808504
## [77,] 1 0.8647357 0.01701741
predictions = predict(best.model, newdata = as.matrix(augmented_tst_X[,best.subset]))
output = data.frame(Index = 1:length(predictions), Prediction = predictions)
write.csv(output, "C:/Data/wine/predictions.csv", quote=F, row.names = F)
Sys.time() - start.time
## Time difference of 10.90899 mins
summary(best.model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9924 -0.4829 -0.0186 0.4355 3.0043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.159e+04 1.357e+03 8.542 < 2e-16 ***
## volatile_acidity -1.550e+01 2.655e+00 -5.839 5.68e-09 ***
## citric_acid 8.558e+01 3.667e+01 2.334 0.019651 *
## chlorides -1.146e+03 4.747e+02 -2.414 0.015824 *
## free_sulfur_dioxide -3.417e+00 9.062e-01 -3.770 0.000165 ***
## density -2.440e+04 2.788e+03 -8.751 < 2e-16 ***
## pH 4.696e+02 5.413e+01 8.676 < 2e-16 ***
## sulphates 1.593e+02 6.769e+01 2.354 0.018628 *
## alcohol -1.308e+01 6.422e+00 -2.036 0.041812 *
## fixed_acidity_2 -3.143e-02 9.920e-03 -3.168 0.001544 **
## citric_acid_2 -1.009e+00 3.077e-01 -3.278 0.001054 **
## residual_sugar_2 -4.042e-03 6.379e-04 -6.336 2.63e-10 ***
## chlorides_2 1.520e+01 7.452e+00 2.040 0.041465 *
## free_sulfur_dioxide_2 -1.263e-04 3.444e-05 -3.668 0.000248 ***
## density_2 1.285e+04 1.437e+03 8.941 < 2e-16 ***
## pH_2 1.771e+00 4.313e-01 4.107 4.09e-05 ***
## alcohol_2 2.491e-02 1.361e-02 1.830 0.067325 .
## fixed_acidity.density -1.120e+00 4.778e-01 -2.344 0.019129 *
## fixed_acidity.pH 5.056e-01 1.230e-01 4.110 4.03e-05 ***
## fixed_acidity.sulphates 2.591e-01 1.404e-01 1.846 0.065013 .
## volatile_acidity.pH 2.818e+00 8.180e-01 3.445 0.000578 ***
## volatile_acidity.alcohol 4.719e-01 9.370e-02 5.036 4.97e-07 ***
## citric_acid.density -8.903e+01 3.633e+01 -2.450 0.014312 *
## citric_acid.pH 1.179e+00 6.775e-01 1.740 0.081938 .
## residual_sugar.chlorides -9.352e-01 2.418e-01 -3.867 0.000112 ***
## residual_sugar.free_sulfur_dioxide -1.271e-03 3.747e-04 -3.394 0.000697 ***
## residual_sugar.pH 7.402e-02 6.053e-03 12.230 < 2e-16 ***
## chlorides.density 1.194e+03 4.778e+02 2.499 0.012485 *
## chlorides.pH -1.012e+01 4.543e+00 -2.228 0.025956 *
## chlorides.sulphates -1.742e+01 5.667e+00 -3.074 0.002124 **
## free_sulfur_dioxide.total_sulfur_dioxide -9.279e-05 2.177e-05 -4.262 2.07e-05 ***
## free_sulfur_dioxide.density 3.396e+00 9.035e-01 3.759 0.000173 ***
## free_sulfur_dioxide.sulphates 2.943e-02 7.069e-03 4.163 3.21e-05 ***
## free_sulfur_dioxide.alcohol 6.166e-03 1.313e-03 4.697 2.73e-06 ***
## total_sulfur_dioxide.density 9.763e-03 1.999e-03 4.884 1.08e-06 ***
## total_sulfur_dioxide.sulphates -1.480e-02 3.607e-03 -4.103 4.16e-05 ***
## density.pH -4.853e+02 5.434e+01 -8.931 < 2e-16 ***
## density.sulphates -1.631e+02 6.776e+01 -2.407 0.016139 *
## density.alcohol 1.355e+01 6.272e+00 2.160 0.030832 *
## pH.sulphates 2.459e+00 7.608e-01 3.233 0.001237 **
## pH.alcohol -3.189e-01 1.240e-01 -2.573 0.010119 *
## sulphates.alcohol -4.219e-01 1.418e-01 -2.976 0.002937 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.4936388)
##
## Null deviance: 3033.9 on 3915 degrees of freedom
## Residual deviance: 1912.4 on 3874 degrees of freedom
## AIC: 8392.4
##
## Number of Fisher Scoring iterations: 2