Suppose we have a data set with five predictors, \(X_1\) = GPA, \(X_2\) = IQ, \(X_3\) = Gender (1 for Female and 0 for Male), \(X_4\) = Interaction between GPA and IQ, and \(X_5\) = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\hat{\beta}_0\) = 50, \(\hat{\beta}_1\) = 20, \(\hat{\beta}_2\) = 0.07, \(\hat{\beta}_3\) = 35, \(\hat{\beta}_4\) = 0.01, \(\hat{\beta}_5\) = -10.
Salary \(\approx\) 50 + 20 * GPA + 0.07 * IQ + 35 * Gender + 0.01 * GPA * IQ - 10 * GPA * Gender
For females, the model adds 35 - 10 * GPA to the predicted salary; this adjustment is positive when GPA < 3.5 and negative when GPA > 3.5.
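A minimal R sketch of that gender adjustment (the function name female_minus_male is ours, for illustration only):
# Difference between the predicted female and male salary at the same GPA and IQ:
# beta_3 + beta_5 * GPA = 35 - 10 * GPA
female_minus_male = function(gpa) 35 - 10 * gpa
female_minus_male(c(3.0, 3.5, 4.0)) # 5, 0, -5: females earn more below GPA 3.5, males above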
3(a) Which answer is correct, and why?
3(a)i. For a fixed value of IQ and GPA, males earn more on average than females.
False; because females earn more when GPA is less than 3.5.
3(a)ii. For a fixed value of IQ and GPA, females earn more on average than males.
False; because males earn more when GPA is greater than 3.5.
3(a)iii. For a fixed value of IQ and GPA, males earn more than females provided that the GPA is high enough.
True; once GPA exceeds 3.5, the female adjustment 35 - 10 * GPA becomes negative, so males earn more on average.
3(a)iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.
False; because males earn more when GPA is greater than 3.5.
3(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.
50 + 20 * 4 + 0.07 * 110 + 35 * 1 + 0.01 * 4 * 110 - 10 * 4 * 1 = 137.1
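The same computation in R, with the coefficients hard-coded from the exercise:
# Predicted starting salary for a female with GPA = 4.0 and IQ = 110
50 + 20 * 4.0 + 0.07 * 110 + 35 * 1 + 0.01 * 4.0 * 110 - 10 * 4.0 * 1 # 137.1, i.e. $137,100 (salary is in thousands)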
3(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
False.
We are not given a standard error (or p-value) for the GPA/IQ interaction coefficient, so we cannot tell whether the interaction is statistically significant.
Moreover, the magnitude of a coefficient depends on the scales of the variables involved: GPA, IQ, and Salary (measured in thousands of dollars) are on very different scales, so a numerically small coefficient can still correspond to a meaningful effect.
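For instance, using the hypothetical coefficients above, the interaction term still contributes several thousand dollars at typical predictor values:
# Contribution of the GPA x IQ interaction term at GPA = 4.0, IQ = 110
0.01 * 4.0 * 110 # 4.4, i.e. about $4,400 of predicted salary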
In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
set.seed(1)
13(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0, 1) distribution. This represents a feature, X.
x = rnorm(100, mean = 0, sd = sqrt(1)) # standard deviation = sqrt(variance)
13(b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25) distribution; i.e. a normal distribution with mean zero and variance 0.25.
eps = rnorm(100, mean = 0, sd = sqrt(0.25))
13(c) Using x and eps, generate a vector y according to the model Y = -1 + 0.5 * X + epsilon. What is the length of the vector y? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?
y = -1 + 0.5 * x + eps
length(y)
## [1] 100
The vector y has length 100, and the true (population) coefficients are \(\beta_0\) = -1 and \(\beta_1\) = 0.5 (no hats, since these are parameters rather than estimates).
13(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
plot(x, y, type = "p")
There appears to be a linear relationship, with a positive correlation between x and y.
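The sample correlation (using the x and y generated above) quantifies this:
cor(x, y) # positive, consistent with the upward trend in the scatterplot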
13(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{\beta}_0\) and \(\hat{\beta}_1\) compare to \(\beta_0\) and \(\beta_1\)?
model1 = lm(y ~ x)
coef(model1)
## (Intercept) x
## -1.0188463 0.4994698
summary(model1)$adj.r.squared
## [1] 0.4619164
The value of \(\hat{\beta}_0\) is close to \(\beta_0\) and \(\hat{\beta}_1\) is close to \(\beta_1\), though the presence of noise (eps, the irreducible error) prevents a perfect fit.
13(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.
plot(x, y, type = "p")
abline(model1, col = "red", lty = "dotted")
abline(a = -1, b = 0.5, col = "black", lty = "solid")
legend("bottomright", legend = c("truth", "estimate"), col = c("black", "red"), lty = c("solid", "dotted"))
13(g) Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit?
model2 = lm(y ~ poly(x, 2))
summary(model2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9644604 0.04790161 -20.134193 2.046895e-36
## poly(x, 2)1 4.4637471 0.47901614 9.318573 3.972802e-15
## poly(x, 2)2 -0.6720309 0.47901614 -1.402940 1.638275e-01
summary(model2)$adj.r.squared
## [1] 0.4671806
No; the \(\hat{\beta}_2\) coefficient (on the quadratic term) is not statistically significant: its p-value is 0.164, which is greater than 0.05, a common rejection threshold.
Notice that the adjusted R-squared is slightly larger, which might suggest the quadratic model is better. Both figures are computed on the training data, however, and a more complex model will generally look better there because it is better able to fit the noise in the training data, i.e. to overfit. Training-set fit alone should not be used to evaluate a supervised learning model.
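A more direct check of the quadratic term (a sketch, assuming model1 and model2 as fitted above) is the nested-model F-test; because only a single term is added, its p-value equals the t-test p-value reported above.
# Compare the nested fits: y ~ x versus y ~ x + x^2
anova(model1, model2)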
13(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term epsilon in (b). Describe your results.
eps3 = rnorm(100, mean = 0, sd = sqrt(0.1 * 0.25)) # variance reduced to 0.025
y3 = -1 + 0.5 * x + eps3
model3 = lm(y3 ~ x)
coef(model3)
## (Intercept) x
## -0.9956726 0.5033468
summary(model3)$adj.r.squared
## [1] 0.8831566
plot(x, y3, type = "p")
abline(model3, col = "red", lty = "dotted")
abline(a = -1, b = 0.5, col = "black", lty = "solid")
legend("bottomright", legend = c("truth", "estimate"), col = c("black", "red"), lty = c("solid", "dotted"))
The fitted line is closer to the population line and the fit is much tighter (adjusted R-squared rises from about 0.46 to 0.88), because the noise has been decreased.
13(i) Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term epsilon in (b). Describe your results.
eps4 = rnorm(100, mean = 0, sd = sqrt(10 * 0.25)) # variance increased to 2.5
y4 = -1 + 0.5 * x + eps4
model4 = lm(y4 ~ x)
coef(model4)
## (Intercept) x
## -0.9088230 0.4119526
summary(model4)$adj.r.squared
## [1] 0.04321329
plot(x, y4, type = "p")
abline(model4, col = "red", lty = "dotted")
abline(a = -1, b = 0.5, col = "black", lty = "solid")
legend("bottomright", legend = c("truth", "estimate"), col = c("black", "red"), lty = c("solid", "dotted"))
The fitted coefficients are further from the true values and the fit is much looser (adjusted R-squared drops to about 0.04), because the noise has been increased.
13(j) What are the confidence intervals for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
confint(model3) # less noisy
## 2.5 % 97.5 %
## (Intercept) -1.0285258 -0.9628195
## x 0.4668557 0.5398379
confint(model1) # original
## 2.5 % 97.5 %
## (Intercept) -1.1150804 -0.9226122
## x 0.3925794 0.6063602
confint(model4) # more noisy
## 2.5 % 97.5 %
## (Intercept) -1.22347808 -0.5941680
## x 0.06245475 0.7614505
Noisier data produces wider confidence intervals for the coefficients: the less noisy fit gives the narrowest intervals and the noisiest fit gives the widest.
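To make the comparison concrete, a small sketch (assuming model1, model3, and model4 as fitted above) that computes the width of each interval:
# Width (upper bound minus lower bound) of each 95% confidence interval
sapply(list(less.noisy = model3, original = model1, more.noisy = model4),
       function(m) apply(confint(m), 1, diff))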
set.seed(2^17 - 1)
start.time = Sys.time()
# Boston housing data: 13 predictor names, training/test feature tables, training response
namelist = c("CRIM", "ZN", "INDUS", "CHAS", "NOX",
             "RM", "AGE", "DIS", "RAD", "TAX",
             "PTRATIO", "B", "LSTAT")
trn_X = read.table("C:/Data/boston/trn_X.tsv", header = F)
trn_y = scan("C:/Data/boston/trn_y.txt")
tst_X = read.table("C:/Data/boston/tst_X.tsv", header = F)
p = ncol(trn_X)
# Start from the original predictors, then append squared terms and pairwise interactions
augmented_trn_X = trn_X
augmented_tst_X = tst_X
for (i in 1:p) {
  if (namelist[i] != "CHAS") { # an indicator squared equals the indicator itself
    augmented_trn_X = cbind(augmented_trn_X, trn_X[,i] * trn_X[,i])
    augmented_tst_X = cbind(augmented_tst_X, tst_X[,i] * tst_X[,i])
    namelist = c(namelist, paste(namelist[i], "_2", sep = ""))
  }
}
# Append all pairwise interaction terms between the original predictors
for (i in 1:(p-1)) {
  for (j in (i+1):p) {
    augmented_trn_X = cbind(augmented_trn_X, trn_X[,i] * trn_X[,j])
    augmented_tst_X = cbind(augmented_tst_X, tst_X[,i] * tst_X[,j])
    namelist = c(namelist, paste(namelist[i], "_", namelist[j], sep = ""))
  }
}
colnames(augmented_trn_X) = namelist
colnames(augmented_tst_X) = namelist
max.p = 4 * p # consider models with at most 4 * 13 = 52 of the candidate features
library(leaps)
# Backward stepwise selection: one candidate model of each size from 1 to max.p
subsets = regsubsets(augmented_trn_X, trn_y, nvmax = max.p, method = "backward")
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# For each subset size, estimate out-of-sample RMSE by repeated cross-validation
results = array(0, c(max.p, 3))
colnames(results) = list("Variable Count", "RMSE", "RMSE SD")
best.loss = Inf
best.subset = NA
for (i in 1:max.p) {
  selected = which(summary(subsets)$outmat[i,] == "*") # variables in the size-i model
  # caret expects one seed per resample (5 folds x 30 repeats) plus one final seed
  seeds = vector(mode = "list", length = 151)
  for (seed in 1:length(seeds)) {
    seeds[[seed]] = sample.int(1000, 1)
  }
  trControl = trainControl("repeatedcv", number = 5, repeats = 30, seeds = seeds)
  # Estimate out-of-sample RMSE for the selected variables via repeated 5-fold CV
  model = train(as.matrix(augmented_trn_X[,selected]), trn_y, method = "glm", trControl = trControl)
  results[i,1] = i
  results[i,2] = model$results[,"RMSE"]
  results[i,3] = model$results[,"RMSESD"]
  if (results[i,2] < best.loss) {
    best.loss = results[i,2]
    best.subset = selected
  }
}
results[order(results[,2]),] # subset sizes sorted by cross-validated RMSE, best first
## Variable Count RMSE RMSE SD
## [1,] 44 3.041720 0.4336434
## [2,] 48 3.085106 0.4546096
## [3,] 43 3.097128 0.4564614
## [4,] 46 3.100123 0.4705534
## [5,] 41 3.106580 0.4402035
## [6,] 40 3.107019 0.3859190
## [7,] 49 3.123978 0.4838739
## [8,] 45 3.127216 0.5028376
## [9,] 42 3.135962 0.4484389
## [10,] 47 3.141370 0.4841320
## [11,] 39 3.146863 0.4443865
## [12,] 52 3.147151 0.5062934
## [13,] 50 3.158979 0.5399707
## [14,] 51 3.168408 0.4666219
## [15,] 37 3.169480 0.4415860
## [16,] 38 3.172303 0.4324888
## [17,] 33 3.183201 0.4429902
## [18,] 36 3.184305 0.4249970
## [19,] 32 3.209081 0.4689968
## [20,] 34 3.209453 0.3988872
## [21,] 35 3.211357 0.4509342
## [22,] 29 3.224136 0.4463275
## [23,] 28 3.234315 0.4615202
## [24,] 30 3.234722 0.4155489
## [25,] 31 3.242175 0.4057723
## [26,] 26 3.267324 0.5262039
## [27,] 24 3.280403 0.5110030
## [28,] 27 3.294351 0.4795884
## [29,] 25 3.310169 0.5055381
## [30,] 23 3.339239 0.4993059
## [31,] 22 3.371967 0.5228056
## [32,] 21 3.427450 0.5169409
## [33,] 20 3.501622 0.5602336
## [34,] 19 3.553209 0.5150714
## [35,] 18 3.574377 0.5430110
## [36,] 17 3.633872 0.5706598
## [37,] 16 3.682194 0.4740322
## [38,] 15 3.699163 0.5405543
## [39,] 14 3.735485 0.5331289
## [40,] 13 3.794203 0.5769693
## [41,] 12 3.879990 0.5655191
## [42,] 11 3.967029 0.6132986
## [43,] 10 3.967771 0.6084316
## [44,] 9 3.987313 0.5096325
## [45,] 8 4.105121 0.5403185
## [46,] 7 4.262022 0.6244565
## [47,] 6 4.590331 0.6147658
## [48,] 5 4.802424 0.6517435
## [49,] 4 4.916712 0.7576198
## [50,] 3 4.994474 0.7139213
## [51,] 2 5.367693 0.6666138
## [52,] 1 6.329975 0.6048502
# Refit the best subset (lowest CV RMSE) on the full training set
trn = data.frame(y = trn_y, augmented_trn_X[,best.subset])
tst = data.frame(augmented_tst_X[,best.subset])
model = lm(y ~ ., data = trn)
# Diagnostic plots, to check for:
#  - non-linearity of the response-predictor relationship
#  - non-Gaussian distribution of the residuals
#  - non-constant variance of the error terms
#  - high-leverage points
par(mfrow = c(2, 2))
plot(model)
# Predict on the test set and write the predictions out as a CSV
predictions = predict(model, newdata = tst)
output = data.frame(Id = 1:length(predictions), Prediction = predictions)
write.csv(output, "C:/Data/boston/predictions.csv", quote=F, row.names = F)
summary(model)
##
## Call:
## lm(formula = y ~ ., data = trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9448 -1.6434 0.0487 1.3429 18.6804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.757e+02 2.676e+01 -6.567 1.81e-10 ***
## INDUS -1.759e+00 6.929e-01 -2.539 0.011541 *
## CHAS 4.333e+01 6.245e+00 6.938 1.86e-11 ***
## NOX 5.174e+01 2.214e+01 2.337 0.020005 *
## RM 2.941e+01 2.427e+00 12.118 < 2e-16 ***
## AGE 7.736e-01 1.326e-01 5.834 1.21e-08 ***
## DIS -8.091e+00 1.963e+00 -4.121 4.68e-05 ***
## PTRATIO 7.184e+00 1.493e+00 4.811 2.21e-06 ***
## B 6.080e-02 1.444e-02 4.211 3.22e-05 ***
## INDUS_2 4.150e-02 8.091e-03 5.130 4.76e-07 ***
## DIS_2 4.735e-01 7.841e-02 6.039 3.88e-09 ***
## RAD_2 -1.504e-01 3.267e-02 -4.603 5.78e-06 ***
## TAX_2 -9.915e-05 2.472e-05 -4.011 7.36e-05 ***
## LSTAT_2 1.801e-02 4.194e-03 4.293 2.27e-05 ***
## CRIM_CHAS 2.103e+00 3.389e-01 6.205 1.51e-09 ***
## CRIM_RM 1.483e-01 3.488e-02 4.251 2.72e-05 ***
## CRIM_TAX -2.126e-03 4.057e-04 -5.241 2.73e-07 ***
## CRIM_LSTAT 1.815e-02 4.233e-03 4.288 2.32e-05 ***
## ZN_NOX -9.394e-01 2.729e-01 -3.442 0.000645 ***
## ZN_DIS -1.510e-02 6.336e-03 -2.383 0.017669 *
## ZN_TAX 4.437e-04 1.342e-04 3.307 0.001039 **
## ZN_B 9.334e-04 3.289e-04 2.838 0.004797 **
## INDUS_RM 2.267e-01 7.814e-02 2.901 0.003945 **
## INDUS_TAX 1.435e-03 5.245e-04 2.736 0.006535 **
## INDUS_PTRATIO -5.472e-02 2.601e-02 -2.104 0.036082 *
## CHAS_NOX -3.148e+01 5.170e+00 -6.089 2.92e-09 ***
## CHAS_RM -4.055e+00 7.308e-01 -5.549 5.59e-08 ***
## NOX_DIS 1.518e+01 2.510e+00 6.046 3.73e-09 ***
## NOX_PTRATIO -6.495e+00 1.284e+00 -5.059 6.74e-07 ***
## NOX_LSTAT 1.018e+00 3.417e-01 2.980 0.003078 **
## RM_AGE -6.775e-02 1.466e-02 -4.622 5.31e-06 ***
## RM_RAD -1.862e-01 4.099e-02 -4.543 7.58e-06 ***
## RM_TAX -9.877e-03 3.112e-03 -3.174 0.001636 **
## RM_PTRATIO -7.938e-01 1.343e-01 -5.909 8.00e-09 ***
## RM_LSTAT -1.607e-01 3.557e-02 -4.517 8.52e-06 ***
## AGE_RAD 1.706e-02 3.088e-03 5.526 6.32e-08 ***
## AGE_TAX -6.175e-04 1.613e-04 -3.829 0.000152 ***
## AGE_B -6.326e-04 1.609e-04 -3.933 0.000101 ***
## AGE_LSTAT -7.985e-03 2.068e-03 -3.862 0.000134 ***
## DIS_TAX -5.790e-03 1.889e-03 -3.065 0.002338 **
## DIS_PTRATIO -2.251e-01 7.551e-02 -2.981 0.003068 **
## DIS_LSTAT 1.178e-01 2.381e-02 4.946 1.17e-06 ***
## RAD_TAX 7.929e-03 1.627e-03 4.873 1.65e-06 ***
## RAD_LSTAT -4.097e-02 4.748e-03 -8.628 < 2e-16 ***
## TAX_PTRATIO 6.730e-03 1.478e-03 4.555 7.20e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.738 on 359 degrees of freedom
## Multiple R-squared: 0.9231, Adjusted R-squared: 0.9136
## F-statistic: 97.9 on 44 and 359 DF, p-value: < 2.2e-16
Sys.time() - start.time
## Time difference of 1.598514 mins