Chapter 3, Question 3

Suppose we have a data set with five predictors, \(X_1\) = GPA, \(X_2\) = IQ, \(X_3\) = Gender (1 for Female and 0 for Male), \(X_4\) = Interaction between GPA and IQ, and \(X_5\) = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\hat{\beta}_0\) = 50, \(\hat{\beta}_1\) = 20, \(\hat{\beta}_2\) = 0.07, \(\hat{\beta}_3\) = 35, \(\hat{\beta}_4\) = 0.01, \(\hat{\beta}_5\) = -10.

Salary \(\approx\) 50 + 20 * GPA + 0.07 * IQ + 35 * Gender + 0.01 * GPA * IQ - 10 * GPA * Gender

For females (Gender = 1), the model adds 35 - 10 * GPA to the predicted salary; this adjustment is positive when GPA is below 3.5 and negative when GPA is above 3.5, as sketched below.
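
As a quick check (a minimal sketch using only the coefficients above; the variable names gpa and gender_gap are just for illustration), we can plot the female-minus-male difference in predicted salary against GPA and see that it crosses zero at GPA = 3.5:

gpa = seq(0, 4, by = 0.1)
gender_gap = 35 - 10 * gpa    # contribution of the Gender = 1 terms
plot(gpa, gender_gap, type = "l", xlab = "GPA", ylab = "female minus male salary (thousands of dollars)")
abline(h = 0, lty = "dotted")    # the gap changes sign at GPA = 3.5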

3(a) Which answer is correct, and why?

3(a)i. For a fixed value of IQ and GPA, males earn more on average than females.

False; females earn more when GPA is less than 3.5, so the statement does not hold for every GPA.

3(a)ii. For a fixed value of IQ and GPA, females earn more on average than males.

False; males earn more when GPA is greater than 3.5, so the statement does not hold for every GPA.

3(a)iii. For a fixed value of IQ and GPA, males earn more than females provided that the GPA is high enough.

True; when GPA exceeds 3.5, the female adjustment 35 - 10 * GPA becomes negative, so males earn more on average.

3(a)iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.

False; for high enough GPA (above 3.5) it is males, not females, who earn more on average.

3(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

50 + 20 * 4 + 0.07 * 110 + 35 * 1 + 0.01 * 4 * 110 - 10 * 4 * 1 = 137.1

Since salary is measured in thousands of dollars, the predicted starting salary is about $137,100.
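
The same arithmetic can be checked in R (a small sketch; beta_hat and x_new just hold the values given in the problem, in the order \(\beta_0, \dots, \beta_5\)):

beta_hat = c(50, 20, 0.07, 35, 0.01, -10)        # estimated coefficients from the problem
x_new = c(1, 4.0, 110, 1, 4.0 * 110, 4.0 * 1)    # intercept, GPA, IQ, Gender, GPA*IQ, GPA*Gender
sum(beta_hat * x_new)                            # 137.1 (thousands of dollars)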

3(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

False.

We are not given the standard error (or p-value) for the GPA/IQ interaction coefficient, so we cannot tell whether the interaction is statistically significant; the size of the coefficient alone is not evidence either way.

There are also large differences in the scales of the GPA, IQ, and Salary variables (e.g. Salary is measured in thousands of dollars, and the GPA * IQ product is in the hundreds), so a numerically small coefficient can still correspond to a meaningful effect, as illustrated below.
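
To make this concrete, here is a simulated illustration (not part of the exercise; sim_x and sim_y are invented for this example): a coefficient can be numerically small yet highly significant when its standard error is smaller still, which is typical when the predictor is on a large scale.

sim_x = rnorm(1000, mean = 330, sd = 100)           # a predictor on a large scale, like GPA * IQ
sim_y = 5 + 0.01 * sim_x + rnorm(1000, sd = 0.5)    # true slope is only 0.01
summary(lm(sim_y ~ sim_x))$coefficients             # the tiny slope is still highly significant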

Chapter 3, Question 13

In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

set.seed(1)

13(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0, 1) distribution. This represents a feature, X.

x = rnorm(100, mean = 0, sd = sqrt(1))    # standard deviation = sqrt(variance)

13(b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25) distribution; i.e. a normal distribution with mean zero and variance 0.25.

eps = rnorm(100, mean = 0, sd = sqrt(0.25))

13(c) Using x and eps, generate a vector y according to the model Y = -1 + 0.5 * X + epsilon. What is the length of the vector y? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?

y = -1 + 0.5 * x + eps
length(y)
## [1] 100

The vector y has length 100. The true (population) coefficients are \(\beta_0 = -1\) and \(\beta_1 = 0.5\); these carry no hats because they are parameters of the data-generating model, not estimates.

13(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

plot(x, y, type = "p")

There appears to be a linear relationship, with a positive correlation between x and y.

13(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{\beta}_0\) and \(\hat{\beta}_1\) compare to \(\beta_0\) and \(\beta_1\)?

model1 = lm(y ~ x)
coef(model1)
## (Intercept)           x 
##  -1.0188463   0.4994698
summary(model1)$adj.r.squared
## [1] 0.4619164

The value of \(\hat{\beta}_0\) is close to \(\beta_0\) and \(\hat{\beta}_1\) is close to \(\beta_1\), though the presence of noise (eps, the irreducible error) prevents a perfect fit.

13(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.

plot(x, y, type = "p")
abline(model1, col = "red", lty = "dotted")
abline(a = -1, b = 0.5, col = "black", lty = "solid")
legend("bottomright", legend = c("truth", "estimate"), col = c("black", "red"), lty = c("solid", "dotted"))

13(g) Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit?

model2 = lm(y ~ poly(x, 2))
summary(model2)$coefficients
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -0.9644604 0.04790161 -20.134193 2.046895e-36
## poly(x, 2)1  4.4637471 0.47901614   9.318573 3.972802e-15
## poly(x, 2)2 -0.6720309 0.47901614  -1.402940 1.638275e-01
summary(model2)$adj.r.squared
## [1] 0.4671806

No; the \(\hat{\beta}_2\) coefficient (the quadratic term) is not statistically significant: its p-value (Pr(>|t|) = 0.164) is greater than 0.05, a common significance threshold.

Notice that the adjusted R-squared is slightly larger, which might seem to suggest the quadratic model is better. In practice, however, you should not evaluate a supervised learning model on its training data: the more complex model will generally look better there simply because it is better able to fit noise in the training data (i.e. to overfit). A fairer comparison uses held-out data, as sketched below.
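
For instance (a rough sketch, not part of the exercise; the 80/20 split below is arbitrary), one could hold out some observations and compare test-set mean squared error for the linear and quadratic fits:

test_idx = 1:20                                          # arbitrary held-out observations
fit_lin = lm(y ~ x, subset = setdiff(1:100, test_idx))
fit_quad = lm(y ~ poly(x, 2), subset = setdiff(1:100, test_idx))
mean((y[test_idx] - predict(fit_lin, data.frame(x = x[test_idx])))^2)     # test MSE, linear
mean((y[test_idx] - predict(fit_quad, data.frame(x = x[test_idx])))^2)    # test MSE, quadratic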

13(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term epsilon in (b). Describe your results.

eps3 = rnorm(100, mean = 0, sd = sqrt(0.1 * 0.25))
y3 = -1 + 0.5 * x + eps3
model3 = lm(y3 ~ x)
coef(model3)
## (Intercept)           x 
##  -0.9956726   0.5033468
summary(model3)$adj.r.squared
## [1] 0.8831566
plot(x, y3, type = "p")
abline(model3, col = "red", lty = "dotted")
abline(a = -1, b = 0.5, col = "black", lty = "solid")
legend("bottomright", legend = c("truth", "estimate"), col = c("black", "red"), lty = c("solid", "dotted"))

The estimated coefficients are closer to the true values and the fit is much tighter (adjusted R-squared of about 0.88 versus 0.46 on the original data), because the noise has been decreased.

13(i) Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term epsilon in (b). Describe your results.

eps4 = rnorm(100, mean = 0, sd = sqrt(10 * 0.25))
y4 = -1 + 0.5 * x + eps4
model4 = lm(y4 ~ x)
coef(model4)
## (Intercept)           x 
##  -0.9088230   0.4119526
summary(model4)$adj.r.squared
## [1] 0.04321329
plot(x, y4, type = "p")
abline(model4, col = "red", lty = "dotted")
abline(a = -1, b = 0.5, col = "black", lty = "solid")
legend("bottomright", legend = c("truth", "estimate"), col = c("black", "red"), lty = c("solid", "dotted"))

The estimated coefficients are further from the true values and the fit is much looser (adjusted R-squared of about 0.04), because the noise has been increased.

13(j) What are the confidence intervals for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.

confint(model3)    # less noisy
##                  2.5 %     97.5 %
## (Intercept) -1.0285258 -0.9628195
## x            0.4668557  0.5398379
confint(model1)    # original
##                  2.5 %     97.5 %
## (Intercept) -1.1150804 -0.9226122
## x            0.3925794  0.6063602
confint(model4)    # more noisy
##                   2.5 %     97.5 %
## (Intercept) -1.22347808 -0.5941680
## x            0.06245475  0.7614505

Noisier data produces wider confidence intervals for the coefficients.
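
Computing the widths of the slope intervals from the output above makes this concrete:

diff(confint(model3)["x", ])    # less noisy: about 0.07
diff(confint(model1)["x", ])    # original: about 0.21
diff(confint(model4)["x", ])    # more noisy: about 0.70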

Kaggle: https://inclass.kaggle.com/c/ml210-boston

set.seed(2^17 - 1)
start.time = Sys.time()

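# predictor names for the 13 Boston columns (assumed to match the column order of trn_X)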
namelist = c("CRIM", "ZN", "INDUS", "CHAS", "NOX",
             "RM", "AGE", "DIS", "RAD", "TAX",
             "PTRATIO", "B", "LSTAT")

trn_X = read.table("C:/Data/boston/trn_X.tsv", header = F)
trn_y = scan("C:/Data/boston/trn_y.txt")
tst_X = read.table("C:/Data/boston/tst_X.tsv", header = F)
p = ncol(trn_X)

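# augment the feature matrices with squared terms and all pairwise interaction terms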
augmented_trn_X = trn_X
augmented_tst_X = tst_X
for (i in 1:p) {
    if (namelist[i] != "CHAS") {    # indicator squared equals indicator
        augmented_trn_X = cbind(augmented_trn_X, trn_X[,i] * trn_X[,i])
        augmented_tst_X = cbind(augmented_tst_X, tst_X[,i] * tst_X[,i])
        namelist = c(namelist, paste(namelist[i], "_2", sep = ""))
    }
}
for (i in 1:(p-1)) {
    for (j in (i+1):p) {
        augmented_trn_X = cbind(augmented_trn_X, trn_X[,i] * trn_X[,j])
        augmented_tst_X = cbind(augmented_tst_X, tst_X[,i] * tst_X[,j])
        namelist = c(namelist, paste(namelist[i], "_", namelist[j], sep = ""))
    }
}
colnames(augmented_trn_X) = namelist
colnames(augmented_tst_X) = namelist

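# backward stepwise subset selection over the augmented features, considering up to 4 * p variables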
max.p = 4 * p
library(leaps)
subsets = regsubsets(augmented_trn_X, trn_y, nvmax = max.p, method = "backward")

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
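# for each subset size, estimate out-of-sample RMSE with repeated 5-fold cross-validation
# and keep the subset with the lowest estimated RMSE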
results = array(0, c(max.p, 3))
colnames(results) = list("Variable Count", "RMSE", "RMSE SD")
best.loss = Inf
best.subset = NA
for (i in 1:max.p) {
    selected = which(summary(subsets)$outmat[i,] == "*")
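    # 5 folds x 30 repeats = 150 resamples; supply one seed per resample plus one for the final model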
    seeds = vector(mode = "list", length = 151)
    for (seed in 1:length(seeds)) {
        seeds[[seed]] = sample.int(1000, 1)
    }
    trControl = trainControl("repeatedcv", number = 5, repeats = 30, seeds = seeds)
    model = train(as.matrix(augmented_trn_X[,selected]), trn_y, method = "glm", trControl = trControl)
    results[i,1] = i
    results[i,2] = model$results[,"RMSE"]
    results[i,3] = model$results[,"RMSESD"]
    if (results[i,2] < best.loss) {
        best.loss = results[i,2]
        best.subset = selected
    }
}
results[order(results[,2]),]
##       Variable Count     RMSE   RMSE SD
##  [1,]             44 3.041720 0.4336434
##  [2,]             48 3.085106 0.4546096
##  [3,]             43 3.097128 0.4564614
##  [4,]             46 3.100123 0.4705534
##  [5,]             41 3.106580 0.4402035
##  [6,]             40 3.107019 0.3859190
##  [7,]             49 3.123978 0.4838739
##  [8,]             45 3.127216 0.5028376
##  [9,]             42 3.135962 0.4484389
## [10,]             47 3.141370 0.4841320
## [11,]             39 3.146863 0.4443865
## [12,]             52 3.147151 0.5062934
## [13,]             50 3.158979 0.5399707
## [14,]             51 3.168408 0.4666219
## [15,]             37 3.169480 0.4415860
## [16,]             38 3.172303 0.4324888
## [17,]             33 3.183201 0.4429902
## [18,]             36 3.184305 0.4249970
## [19,]             32 3.209081 0.4689968
## [20,]             34 3.209453 0.3988872
## [21,]             35 3.211357 0.4509342
## [22,]             29 3.224136 0.4463275
## [23,]             28 3.234315 0.4615202
## [24,]             30 3.234722 0.4155489
## [25,]             31 3.242175 0.4057723
## [26,]             26 3.267324 0.5262039
## [27,]             24 3.280403 0.5110030
## [28,]             27 3.294351 0.4795884
## [29,]             25 3.310169 0.5055381
## [30,]             23 3.339239 0.4993059
## [31,]             22 3.371967 0.5228056
## [32,]             21 3.427450 0.5169409
## [33,]             20 3.501622 0.5602336
## [34,]             19 3.553209 0.5150714
## [35,]             18 3.574377 0.5430110
## [36,]             17 3.633872 0.5706598
## [37,]             16 3.682194 0.4740322
## [38,]             15 3.699163 0.5405543
## [39,]             14 3.735485 0.5331289
## [40,]             13 3.794203 0.5769693
## [41,]             12 3.879990 0.5655191
## [42,]             11 3.967029 0.6132986
## [43,]             10 3.967771 0.6084316
## [44,]              9 3.987313 0.5096325
## [45,]              8 4.105121 0.5403185
## [46,]              7 4.262022 0.6244565
## [47,]              6 4.590331 0.6147658
## [48,]              5 4.802424 0.6517435
## [49,]              4 4.916712 0.7576198
## [50,]              3 4.994474 0.7139213
## [51,]              2 5.367693 0.6666138
## [52,]              1 6.329975 0.6048502
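# refit ordinary least squares on the best subset found by cross-validation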
trn = data.frame(y = trn_y, augmented_trn_X[,best.subset])
tst = data.frame(augmented_tst_X[,best.subset])
model = lm(y ~ ., data = trn)

# diagnostic plots: to check for ...
# non-linearity of the response-predictor relationship
# non-Gaussian distribution of residuals
# non-constant variance of error terms
# high-leverage points
par(mfrow = c(2, 2))
plot(model)

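# predict on the test set and write the Kaggle submission file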
predictions = predict(model, newdata = tst)
output = data.frame(Id = 1:length(predictions), Prediction = predictions)
write.csv(output, "C:/Data/boston/predictions.csv", quote=F, row.names = F)
summary(model)
## 
## Call:
## lm(formula = y ~ ., data = trn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9448 -1.6434  0.0487  1.3429 18.6804 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.757e+02  2.676e+01  -6.567 1.81e-10 ***
## INDUS         -1.759e+00  6.929e-01  -2.539 0.011541 *  
## CHAS           4.333e+01  6.245e+00   6.938 1.86e-11 ***
## NOX            5.174e+01  2.214e+01   2.337 0.020005 *  
## RM             2.941e+01  2.427e+00  12.118  < 2e-16 ***
## AGE            7.736e-01  1.326e-01   5.834 1.21e-08 ***
## DIS           -8.091e+00  1.963e+00  -4.121 4.68e-05 ***
## PTRATIO        7.184e+00  1.493e+00   4.811 2.21e-06 ***
## B              6.080e-02  1.444e-02   4.211 3.22e-05 ***
## INDUS_2        4.150e-02  8.091e-03   5.130 4.76e-07 ***
## DIS_2          4.735e-01  7.841e-02   6.039 3.88e-09 ***
## RAD_2         -1.504e-01  3.267e-02  -4.603 5.78e-06 ***
## TAX_2         -9.915e-05  2.472e-05  -4.011 7.36e-05 ***
## LSTAT_2        1.801e-02  4.194e-03   4.293 2.27e-05 ***
## CRIM_CHAS      2.103e+00  3.389e-01   6.205 1.51e-09 ***
## CRIM_RM        1.483e-01  3.488e-02   4.251 2.72e-05 ***
## CRIM_TAX      -2.126e-03  4.057e-04  -5.241 2.73e-07 ***
## CRIM_LSTAT     1.815e-02  4.233e-03   4.288 2.32e-05 ***
## ZN_NOX        -9.394e-01  2.729e-01  -3.442 0.000645 ***
## ZN_DIS        -1.510e-02  6.336e-03  -2.383 0.017669 *  
## ZN_TAX         4.437e-04  1.342e-04   3.307 0.001039 ** 
## ZN_B           9.334e-04  3.289e-04   2.838 0.004797 ** 
## INDUS_RM       2.267e-01  7.814e-02   2.901 0.003945 ** 
## INDUS_TAX      1.435e-03  5.245e-04   2.736 0.006535 ** 
## INDUS_PTRATIO -5.472e-02  2.601e-02  -2.104 0.036082 *  
## CHAS_NOX      -3.148e+01  5.170e+00  -6.089 2.92e-09 ***
## CHAS_RM       -4.055e+00  7.308e-01  -5.549 5.59e-08 ***
## NOX_DIS        1.518e+01  2.510e+00   6.046 3.73e-09 ***
## NOX_PTRATIO   -6.495e+00  1.284e+00  -5.059 6.74e-07 ***
## NOX_LSTAT      1.018e+00  3.417e-01   2.980 0.003078 ** 
## RM_AGE        -6.775e-02  1.466e-02  -4.622 5.31e-06 ***
## RM_RAD        -1.862e-01  4.099e-02  -4.543 7.58e-06 ***
## RM_TAX        -9.877e-03  3.112e-03  -3.174 0.001636 ** 
## RM_PTRATIO    -7.938e-01  1.343e-01  -5.909 8.00e-09 ***
## RM_LSTAT      -1.607e-01  3.557e-02  -4.517 8.52e-06 ***
## AGE_RAD        1.706e-02  3.088e-03   5.526 6.32e-08 ***
## AGE_TAX       -6.175e-04  1.613e-04  -3.829 0.000152 ***
## AGE_B         -6.326e-04  1.609e-04  -3.933 0.000101 ***
## AGE_LSTAT     -7.985e-03  2.068e-03  -3.862 0.000134 ***
## DIS_TAX       -5.790e-03  1.889e-03  -3.065 0.002338 ** 
## DIS_PTRATIO   -2.251e-01  7.551e-02  -2.981 0.003068 ** 
## DIS_LSTAT      1.178e-01  2.381e-02   4.946 1.17e-06 ***
## RAD_TAX        7.929e-03  1.627e-03   4.873 1.65e-06 ***
## RAD_LSTAT     -4.097e-02  4.748e-03  -8.628  < 2e-16 ***
## TAX_PTRATIO    6.730e-03  1.478e-03   4.555 7.20e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.738 on 359 degrees of freedom
## Multiple R-squared:  0.9231, Adjusted R-squared:  0.9136 
## F-statistic:  97.9 on 44 and 359 DF,  p-value: < 2.2e-16
Sys.time() - start.time
## Time difference of 1.598514 mins