Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on X, last year’s percent profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was mean(X | Dividend = yes) = 10, while the mean for those that didn’t was mean(X | Dividend = no) = 0. In addition, the variance of X for these two sets of companies was \(\hat{\sigma}^2\) = 36. Finally, 80% of companies issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percent profit was X = 4 last year.
Hint: Recall that the density function for a normal random variable is
\[f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2}\right)\]
Hint about the hint: the dnorm() function computes the same f(x) defined above, though you could define your own density function as follows:
# normal density f(x) with mean mu and standard deviation sigma
density = function(x, mu, sigma) {
  return(
    (1 / sqrt(2 * pi * (sigma^2))) *
      exp(-(1/2) * ((x - mu)^2) / (sigma^2))
  )
}
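Bayes’ theorem then gives the posterior probability of a dividend for a company with X = 4:
\[P(\text{Yes} \mid X = 4) = \frac{\pi_{\text{Yes}} \, f_{\text{Yes}}(4)}{\pi_{\text{Yes}} \, f_{\text{Yes}}(4) + \pi_{\text{No}} \, f_{\text{No}}(4)}\]
where \(\pi_{\text{Yes}} = 0.8\) and \(\pi_{\text{No}} = 0.2\) are the prior class proportions and \(f_k\) is the normal density for class \(k\). In R: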
prior = 0.8                                   # P(Dividend = yes)
likelihood = dnorm(4, mean = 10, sd = 6)      # f(X = 4 | Dividend = yes) = 0.0403285
evidence = prior * likelihood +
  (1 - prior) * dnorm(4, mean = 0, sd = 6)    # sum of prior * likelihood over both classes = 0.042911
posterior = prior * likelihood / evidence     # P(Dividend = yes | X = 4) = 0.7518525
So the predicted probability that the company issues a dividend this year is about 75.2%.
This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
10(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
library(ISLR)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
plot(Weekly$Year, Weekly$Volume, type = "p")
plot(Weekly$Year, Weekly$Today, type = "p")
abline(h = 0, lty = "dotted")
boxplot(Lag1 ~ Direction, data = Weekly, ylim = c(-6, 6), ylab = "Lag1")
abline(h = 0, lty = "dotted")
boxplot(Lag2 ~ Direction, data = Weekly, ylim = c(-6, 6), ylab = "Lag2")
abline(h = 0, lty = "dotted")
boxplot(Volume ~ Direction, data = Weekly, ylab = "Volume")
abline(h = 1, lty = "dotted")
boxplot(Today ~ Direction, data = Weekly, ylim = c(-6, 6), ylab = "Today")
abline(h = 0, lty = "dotted")
Volume has increased over time. The dispersion of Today varies over time. The distribution of Lag1 differs for the two values of Direction, as does the distribution of Lag2.
Today should not be used to predict Direction, as Direction is defined by the value of Today.
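As a quick numerical cross-check of these observations, a correlation matrix over the quantitative columns (a sketch; the sapply(..., is.numeric) index simply drops the Direction factor) shows that Volume is strongly correlated with Year, while the correlations between Today and the lag variables are all close to zero:
round(cor(Weekly[sapply(Weekly, is.numeric)]), 2)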
10(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
model.lr1 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(model.lr1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Only the coefficient for the Lag2 predictor is statistically significant at the 0.05 level, and even its p-value (0.0296) is larger than 0.01.
10(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
Objecting to answering this question is perfectly acceptable, because it implies that we should use the training data for evaluation. [Reminder: we want to avoid using the training data for evaluation, since more complex models will appear to perform better simply because they are more likely to overfit the training data.]
If you do answer this question …
probabilities.lr1 = predict(model.lr1, newdata = Weekly, type = "response")
# levels(Weekly$Direction) is c("Down", "Up"), so a probability above 0.5 maps to "Up"
class.lr1 = levels(Weekly$Direction)[as.integer(probabilities.lr1 > 0.5) + 1]
confusion.lr1 = table(Weekly$Direction, class.lr1, dnn = c("Actual", "Prediction"))
accuracy.lr1 = mean(Weekly$Direction == class.lr1)
confusion.lr1
## Prediction
## Actual Down Up
## Down 54 430
## Up 48 557
accuracy.lr1
## [1] 0.5610652
binom.test(sum(diag(confusion.lr1)), sum(confusion.lr1))
##
## Exact binomial test
##
## data: sum(diag(confusion.lr1)) and sum(confusion.lr1)
## number of successes = 611, number of trials = 1089, p-value =
## 6.212e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.5310002 0.5907994
## sample estimates:
## probability of success
## 0.5610652
mean(Weekly$Direction == "Up")
## [1] 0.5555556
Unfortunately, we cannot conclude that the logistic regression model outperforms the naive model that always predicts Direction == “Up”: the naive model’s accuracy, 0.5556, falls inside the 95% confidence interval (0.531, 0.591) for the logistic regression model’s accuracy.
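To compare against that baseline directly, the null proportion of the binomial test can be set to the naive model’s accuracy (a sketch; the resulting one-sided p-value comes out well above 0.05):
binom.test(sum(diag(confusion.lr1)), sum(confusion.lr1),
           p = mean(Weekly$Direction == "Up"), alternative = "greater")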
If we consider Direction == “Up” to be the positive class, we see far more false positives (430) than false negatives (48). We could consider raising the classification threshold above 0.5, as sketched below.
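For example, with a hypothetical cutoff of 0.55 (illustrative, not tuned; any value above 0.5 shifts predictions toward “Down” and so trades false positives for false negatives):
class.55 = levels(Weekly$Direction)[as.integer(probabilities.lr1 > 0.55) + 1]
table(Weekly$Direction, class.55, dnn = c("Actual", "Prediction"))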
10(d) Now fit the regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
trn = Weekly[Weekly$Year <= 2008,]  # training period: 1990 through 2008
tst = Weekly[Weekly$Year > 2008,]   # held-out period: 2009 and 2010
model.lr2 = glm(Direction ~ Lag2, data = trn, family = binomial)
probabilities.lr2 = predict(model.lr2, newdata = tst, type = "response")
class.lr2 = levels(Weekly$Direction)[as.integer(probabilities.lr2 > 0.5) + 1]
confusion.lr2 = table(tst$Direction, class.lr2, dnn = c("Actual", "Prediction"))
accuracy.lr2 = mean(tst$Direction == class.lr2)
confusion.lr2
## Prediction
## Actual Down Up
## Down 9 34
## Up 5 56
accuracy.lr2
## [1] 0.625
binom.test(sum(diag(confusion.lr2)), sum(confusion.lr2))
##
## Exact binomial test
##
## data: sum(diag(confusion.lr2)) and sum(confusion.lr2)
## number of successes = 65, number of trials = 104, p-value =
## 0.01384
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.5246597 0.7180252
## sample estimates:
## probability of success
## 0.625
mean(tst$Direction == "Up")
## [1] 0.5865385
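As in (c), the test accuracy can be compared directly against the naive model that always predicts “Up” (a sketch; with only 104 test weeks, the one-sided p-value is again well above 0.05):
binom.test(sum(diag(confusion.lr2)), sum(confusion.lr2),
           p = mean(tst$Direction == "Up"), alternative = "greater")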
10(e) Repeat (d) using LDA.
library(MASS)
model.lda = lda(Direction ~ Lag2, data = trn)
predictions.lda = predict(model.lda, newdata = tst)
confusion.lda = table(tst$Direction, predictions.lda$class)
accuracy.lda = mean(tst$Direction == predictions.lda$class)
confusion.lda
##
## Down Up
## Down 9 34
## Up 5 56
accuracy.lda
## [1] 0.625
10(f) Repeat (d) using QDA.
library(MASS)
model.qda = qda(Direction ~ Lag2, data = trn)
predictions.qda = predict(model.qda, newdata = tst)
confusion.qda = table(tst$Direction, predictions.qda$class)
accuracy.qda = mean(tst$Direction == predictions.qda$class)
confusion.qda
##
## Down Up
## Down 0 43
## Up 0 61
accuracy.qda
## [1] 0.5865385
10(g) Repeat (d) using KNN.
library(class)
# knn() expects matrix arguments; classification ties are broken at random
predictions.knn = knn(as.matrix(trn$Lag2), as.matrix(tst$Lag2), trn$Direction, k = 1)
confusion.knn = table(tst$Direction, predictions.knn)
accuracy.knn = mean(tst$Direction == predictions.knn)
confusion.knn
## predictions.knn
## Down Up
## Down 21 22
## Up 29 32
accuracy.knn
## [1] 0.5096154
10(h) Which of these methods appears to provide the best results on this data?
The logistic regression and Linear Discriminant Analysis (LDA) models have the best performance in terms of accuracy (0.625 on the held-out data). Note that QDA achieves 0.587 only by predicting “Up” for every test observation, which is exactly the naive baseline.
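Collecting the held-out accuracies computed above into one table:
data.frame(method = c("logistic regression", "LDA", "QDA", "KNN (k = 1)"),
           accuracy = c(accuracy.lr2, accuracy.lda, accuracy.qda, accuracy.knn))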
10(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of these methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the hold out data. Note that you should also experiment with values for K in the KNN classifier.
I would be surprised if it’s possible to produce a model that has a statistically significant improvement in accuracy.
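As a starting point for the KNN experiments, one could scan a few values of K on the same train/test split (a minimal sketch; the seed and the grid of K values are arbitrary choices, and knn() breaks ties at random):
set.seed(1)
for (k in c(1, 3, 5, 10, 25, 50)) {
  pred.k = knn(as.matrix(trn$Lag2), as.matrix(tst$Lag2), trn$Direction, k = k)
  cat("k =", k, "accuracy =", round(mean(tst$Direction == pred.k), 4), "\n")
}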
set.seed(2^17 - 1)
trn_X = read.csv("C:/Data/cancer/trn_X.csv", header = F)[,c(6,8,15,21,22,24,29,30)]
trn_y = scan("C:/Data/cancer/trn_y.txt")
tst_X = read.csv("C:/Data/cancer/tst_X.csv", header = F)[,c(6,8,15,21,22,24,29,30)]
trn = data.frame(y = trn_y, trn_X)
tst = data.frame(tst_X)
model = glm(y ~ ., data = trn, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predictions = predict(model, newdata = tst, type = "response")
output = data.frame(Id = 1:length(predictions), Prediction = predictions)
write.csv(output, "C:/Data/cancer/predictions.csv", quote=F, row.names = F)
summary(model)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7183 -0.0778 -0.0167 0.0010 3.4903
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.14951 11.43675 -1.849 0.064421 .
## V6 -62.05059 23.80724 -2.606 0.009151 **
## V8 162.32615 42.23655 3.843 0.000121 ***
## V15 81.64537 158.35770 0.516 0.606151
## V21 -1.04831 1.32559 -0.791 0.429043
## V22 0.26327 0.05740 4.587 4.5e-06 ***
## V24 0.02122 0.01341 1.583 0.113447
## V29 14.79877 7.04030 2.102 0.035553 *
## V30 89.18657 37.72242 2.364 0.018065 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 607.283 on 455 degrees of freedom
## Residual deviance: 72.237 on 447 degrees of freedom
## AIC: 90.237
##
## Number of Fisher Scoring iterations: 9
In case you’re wondering how those 8 features were selected [a preview of upcoming material] …
start.time = Sys.time()
namelist = c("mean_radius", "mean_texture", "mean_perimeter", "mean_area",
             "mean_smoothness", "mean_compactness", "mean_concavity", "mean_concave_points",
             "mean_symmetry", "mean_fractal_dimension",
             "stderr_radius", "stderr_texture", "stderr_perimeter", "stderr_area",
             "stderr_smoothness", "stderr_compactness", "stderr_concavity", "stderr_concave_points",
             "stderr_symmetry", "stderr_fractal_dimension",
             "largest_radius", "largest_texture", "largest_perimeter", "largest_area",
             "largest_smoothness", "largest_compactness", "largest_concavity", "largest_concave_points",
             "largest_symmetry", "largest_fractal_dimension")
trn_X = read.csv("C:/Data/cancer/trn_X.csv", header = F)
colnames(trn_X) = namelist
trn_y = factor(scan("C:/Data/cancer/trn_y.txt"), levels = c(0, 1), labels = c("no", "yes"))
trn = data.frame(y = trn_y, trn_X)
tst_X = read.csv("C:/Data/cancer/tst_X.csv", header = F)
colnames(tst_X) = namelist
tst = data.frame(tst_X)
library(leaps)
p = ncol(trn_X)
# regsubsets() expects a numeric response; screening against the 0/1 outcome
# ranks subsets identically to any linear recoding of the two classes
subsets = regsubsets(trn_X, as.integer(trn_y) - 1, nvmax = p)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
CustomLogLossFunction = function(data, lev = NULL, model = NULL) {
  actuals = as.integer(data$obs == "yes")
  predictions = data$yes
  epsilon = 1e-15
  # clamp predictions to [epsilon, 1 - epsilon] to avoid log(0) = -Inf
  bounded.predictions = pmin(pmax(predictions, epsilon), 1 - epsilon)
  logLoss = -mean(log(actuals * bounded.predictions + (1 - actuals) * (1 - bounded.predictions)))
  return(c(CustomLogLoss = logLoss))
}
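For reference, this summary function computes the average negative log-likelihood (log loss) of the predicted class probabilities:
\[\text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log\left(1 - \hat{p}_i\right)\right]\]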
results = array(0, c(p, 3))
colnames(results) = c("Variable Count", "Log Loss", "Log Loss SD")
best.logloss = Inf
best.subset = NA
# for each subset size, estimate the out-of-sample log loss with repeated 5-fold CV
for (i in 1:p) {
  # variables flagged "*" by regsubsets at subset size i
  selected = which(summary(subsets)$outmat[i,] == "*")
  # one seed per resample (5 folds x 30 repeats) plus one final seed, for reproducibility
  seeds = vector(mode = "list", length = 151)
  for (seed in 1:length(seeds)) {
    seeds[[seed]] = sample.int(1000, 1)
  }
  trControl = trainControl("repeatedcv", number = 5, repeats = 30,
                           summaryFunction = CustomLogLossFunction,
                           classProbs = T, seeds = seeds)
  model = train(as.matrix(trn_X[,selected]), trn_y, method = "glm",
                metric = "CustomLogLoss", maximize = F,
                trControl = trControl, control = list(maxit = 50))
  results[i,1] = i
  results[i,2] = model$results[,"CustomLogLoss"]
  results[i,3] = model$results[,"CustomLogLossSD"]
  if (results[i,2] < best.logloss) {
    best.logloss = results[i,2]
    best.subset = selected
  }
}
results[order(results[,2]),]
## Variable Count Log Loss Log Loss SD
## [1,] 8 0.09956697 0.05652417
## [2,] 6 0.10019421 0.05802291
## [3,] 4 0.10603973 0.06307199
## [4,] 3 0.10682561 0.06132861
## [5,] 5 0.10935121 0.07878080
## [6,] 7 0.11300095 0.16185626
## [7,] 10 0.11427725 0.06064494
## [8,] 9 0.13041304 0.22990441
## [9,] 2 0.14042774 0.05005728
## [10,] 11 0.15105519 0.26536407
## [11,] 12 0.15430151 0.20960599
## [12,] 13 0.17529756 0.29155750
## [13,] 14 0.19218631 0.34617482
## [14,] 1 0.23333676 0.05860504
## [15,] 15 0.26746099 0.51565267
## [16,] 17 0.29434639 0.51080264
## [17,] 16 0.33913809 0.65688437
## [18,] 18 0.72337627 0.99407079
## [19,] 19 0.87624554 1.01778728
## [20,] 21 1.17962178 1.05925492
## [21,] 20 1.18024549 1.09187311
## [22,] 22 1.63263363 1.02327504
## [23,] 23 1.66013401 0.86871414
## [24,] 24 1.73887119 0.84824845
## [25,] 25 1.79736086 0.81205799
## [26,] 26 1.87432031 0.73067635
## [27,] 30 1.97164855 0.86482424
## [28,] 28 2.01375593 0.83110833
## [29,] 29 2.01811639 0.77823154
## [30,] 27 2.07865231 1.32505322
paste(best.subset, collapse = ",")
## [1] "6,8,15,21,22,24,29,30"
trn = data.frame(y = trn_y, trn_X[,best.subset])
tst = data.frame(tst_X[,best.subset])
model = glm(y ~ ., data = trn, family = binomial)
summary(model)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7183 -0.0778 -0.0167 0.0010 3.4903
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.14951 11.43675 -1.849 0.064421 .
## mean_compactness -62.05059 23.80724 -2.606 0.009151 **
## mean_concave_points 162.32615 42.23655 3.843 0.000121 ***
## stderr_smoothness 81.64537 158.35770 0.516 0.606151
## largest_radius -1.04831 1.32559 -0.791 0.429043
## largest_texture 0.26327 0.05740 4.587 4.5e-06 ***
## largest_area 0.02122 0.01341 1.583 0.113447
## largest_symmetry 14.79877 7.04030 2.102 0.035553 *
## largest_fractal_dimension 89.18657 37.72242 2.364 0.018065 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 607.283 on 455 degrees of freedom
## Residual deviance: 72.237 on 447 degrees of freedom
## AIC: 90.237
##
## Number of Fisher Scoring iterations: 9
predictions = predict(model, newdata = tst, type = "response")
output = data.frame(Id = 1:length(predictions), Prediction = predictions)
write.csv(output, "C:/Data/cancer/predictions.csv", quote=F, row.names = F)
Sys.time() - start.time
## Time difference of 1.574609 mins