Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on X, last year’s percent profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was mean(X | Dividend = yes) = 10, while the mean for those that didn’t was mean(X | Dividend = no) = 0. In addition, the variance of X for these two sets of companies was \(\hat{\sigma}^2\) = 36. Finally, 80% of companies issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percent profit was X = 4 last year.
Hint: Recall that the density function for a normal random variable is
\[f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2}\right)\]
Hint about the hint: the dnorm() function computes the same f(x) defined above, though you could define your own density function as follows:
# normal density f(x) with mean mu and standard deviation sigma
density = function(x, mu, sigma) {
  return(
    (1 / sqrt(2 * pi * (sigma^2))) *
      exp(-(1/2) * ((x - mu)^2) / (sigma^2))
  )
}
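Bayes’ theorem then gives the posterior probability of a dividend for a company with X = 4:
\[P(\text{Yes} \mid X = 4) = \frac{\pi_{\text{Yes}} \, f_{\text{Yes}}(4)}{\pi_{\text{Yes}} \, f_{\text{Yes}}(4) + \pi_{\text{No}} \, f_{\text{No}}(4)}\]
where \(\pi_{\text{Yes}} = 0.8\) and \(\pi_{\text{No}} = 0.2\) are the prior class proportions and \(f_k\) is the normal density for class \(k\). In R: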
prior = 0.8                                   # P(Dividend = yes)
likelihood = dnorm(4, mean = 10, sd = 6)      # f(X = 4 | Dividend = yes) = 0.0403285
evidence = prior * likelihood +
  (1 - prior) * dnorm(4, mean = 0, sd = 6)    # sum of prior * likelihood over both classes = 0.042911
posterior = prior * likelihood / evidence     # P(Dividend = yes | X = 4) = 0.7518525
So the predicted probability that the company issues a dividend this year is about 75.2%.
This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
10(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
library(ISLR)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
plot(Weekly$Year, Weekly$Volume, type = "p")
plot(Weekly$Year, Weekly$Today, type = "p")
abline(h = 0, lty = "dotted")
boxplot(Lag1 ~ Direction, data = Weekly, ylim = c(-6, 6), ylab = "Lag1")
abline(h = 0, lty = "dotted")
boxplot(Lag2 ~ Direction, data = Weekly, ylim = c(-6, 6), ylab = "Lag2")
abline(h = 0, lty = "dotted")
boxplot(Volume ~ Direction, data = Weekly, ylab = "Volume")
abline(h = 1, lty = "dotted")
boxplot(Today ~ Direction, data = Weekly, ylim = c(-6, 6), ylab = "Today")
abline(h = 0, lty = "dotted")
Volume has increased over time. The dispersion of Today varies over time. The distribution of Lag1 differs for the two values of Direction, as does the distribution of Lag2.
Today should not be used to predict Direction, as Direction is defined by the value of Today.
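As a quick numerical cross-check of these observations, a correlation matrix over the quantitative columns (a sketch; the sapply(..., is.numeric) index simply drops the Direction factor) shows that Volume is strongly correlated with Year, while the correlations between Today and the lag variables are all close to zero:
round(cor(Weekly[sapply(Weekly, is.numeric)]), 2)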
10(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
model.lr1 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(model.lr1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Only the coefficient for the Lag2 predictor is statistically significant at the 0.05 level, and even its p-value (0.0296) is larger than 0.01.
10(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
Objecting to answering this question is perfectly acceptable, because it implies that we should use the training data for evaluation. [Reminder: we want to avoid using the training data for evaluation, since more complex models will appear to perform better simply because they are more likely to overfit the training data.]
If you do answer this question …
probabilities.lr1 = predict(model.lr1, newdata = Weekly, type = "response")
# levels(Weekly$Direction) is c("Down", "Up"), so a probability above 0.5 maps to "Up"
class.lr1 = levels(Weekly$Direction)[as.integer(probabilities.lr1 > 0.5) + 1]
confusion.lr1 = table(Weekly$Direction, class.lr1, dnn = c("Actual", "Prediction"))
accuracy.lr1 = mean(Weekly$Direction == class.lr1)
confusion.lr1
## Prediction
## Actual Down Up
## Down 54 430
## Up 48 557
accuracy.lr1
## [1] 0.5610652
binom.test(sum(diag(confusion.lr1)), sum(confusion.lr1))
##
## Exact binomial test
##
## data: sum(diag(confusion.lr1)) and sum(confusion.lr1)
## number of successes = 611, number of trials = 1089, p-value =
## 6.212e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.5310002 0.5907994
## sample estimates:
## probability of success
## 0.5610652
mean(Weekly$Direction == "Up")
## [1] 0.5555556
Unfortunately, we cannot conclude that the logistic regression model outperforms the naive model that always predicts Direction == “Up”: the naive model’s accuracy, 0.5556, falls inside the 95% confidence interval (0.531, 0.591) for the logistic regression model’s accuracy.
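To compare against that baseline directly, the null proportion of the binomial test can be set to the naive model’s accuracy (a sketch; the resulting one-sided p-value comes out well above 0.05):
binom.test(sum(diag(confusion.lr1)), sum(confusion.lr1),
           p = mean(Weekly$Direction == "Up"), alternative = "greater")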
If we consider Direction == “Up” to be the positive class, we see far more false positives (430) than false negatives (48). We could consider raising the classification threshold above 0.5, as sketched below.
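For example, with a hypothetical cutoff of 0.55 (illustrative, not tuned; any value above 0.5 shifts predictions toward “Down” and so trades false positives for false negatives):
class.55 = levels(Weekly$Direction)[as.integer(probabilities.lr1 > 0.55) + 1]
table(Weekly$Direction, class.55, dnn = c("Actual", "Prediction"))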
10(d) Now fit the regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
trn = Weekly[Weekly$Year <= 2008,]  # training period: 1990 through 2008
tst = Weekly[Weekly$Year > 2008,]   # held-out period: 2009 and 2010
model.lr2 = glm(Direction ~ Lag2, data = trn, family = binomial)
probabilities.lr2 = predict(model.lr2, newdata = tst, type = "response")
class.lr2 = levels(Weekly$Direction)[as.integer(probabilities.lr2 > 0.5) + 1]
confusion.lr2 = table(tst$Direction, class.lr2, dnn = c("Actual", "Prediction"))
accuracy.lr2 = mean(tst$Direction == class.lr2)
confusion.lr2
## Prediction
## Actual Down Up
## Down 9 34
## Up 5 56
accuracy.lr2
## [1] 0.625
binom.test(sum(diag(confusion.lr2)), sum(confusion.lr2))
##
## Exact binomial test
##
## data: sum(diag(confusion.lr2)) and sum(confusion.lr2)
## number of successes = 65, number of trials = 104, p-value =
## 0.01384
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.5246597 0.7180252
## sample estimates:
## probability of success
## 0.625
mean(tst$Direction == "Up")
## [1] 0.5865385
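As in (c), the test accuracy can be compared directly against the naive model that always predicts “Up” (a sketch; with only 104 test weeks, the one-sided p-value is again well above 0.05):
binom.test(sum(diag(confusion.lr2)), sum(confusion.lr2),
           p = mean(tst$Direction == "Up"), alternative = "greater")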
10(e) Repeat (d) using LDA.
library(MASS)
model.lda = lda(Direction ~ Lag2, data = trn)
predictions.lda = predict(model.lda, newdata = tst)
confusion.lda = table(tst$Direction, predictions.lda$class)
accuracy.lda = mean(tst$Direction == predictions.lda$class)
confusion.lda
##
## Down Up
## Down 9 34
## Up 5 56
accuracy.lda
## [1] 0.625
10(f) Repeat (d) using QDA.
library(MASS)
model.qda = qda(Direction ~ Lag2, data = trn)
predictions.qda = predict(model.qda, newdata = tst)
confusion.qda = table(tst$Direction, predictions.qda$class)
accuracy.qda = mean(tst$Direction == predictions.qda$class)
confusion.qda
##
## Down Up
## Down 0 43
## Up 0 61
accuracy.qda
## [1] 0.5865385
10(g) Repeat (d) using KNN.
library(class)
# knn() expects matrix arguments; classification ties are broken at random
predictions.knn = knn(as.matrix(trn$Lag2), as.matrix(tst$Lag2), trn$Direction, k = 1)
confusion.knn = table(tst$Direction, predictions.knn)
accuracy.knn = mean(tst$Direction == predictions.knn)
confusion.knn
## predictions.knn
## Down Up
## Down 21 22
## Up 29 32
accuracy.knn
## [1] 0.5096154
10(h) Which of these methods appears to provide the best results on this data?
The logistic regression and Linear Discriminant Analysis (LDA) models have the best performance in terms of accuracy (0.625 on the held-out data). Note that QDA achieves 0.587 only by predicting “Up” for every test observation, which is exactly the naive baseline.
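Collecting the held-out accuracies computed above into one table:
data.frame(method = c("logistic regression", "LDA", "QDA", "KNN (k = 1)"),
           accuracy = c(accuracy.lr2, accuracy.lda, accuracy.qda, accuracy.knn))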
10(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of these methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the hold out data. Note that you should also experiment with values for K in the KNN classifier.
I would be surprised if it’s possible to produce a model that has a statistically significant improvement in accuracy.
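As a starting point for the KNN experiments, one could scan a few values of K on the same train/test split (a minimal sketch; the seed and the grid of K values are arbitrary choices, and knn() breaks ties at random):
set.seed(1)
for (k in c(1, 3, 5, 10, 25, 50)) {
  pred.k = knn(as.matrix(trn$Lag2), as.matrix(tst$Lag2), trn$Direction, k = k)
  cat("k =", k, "accuracy =", round(mean(tst$Direction == pred.k), 4), "\n")
}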
set.seed(2^17 - 1)
trn_X = read.csv("C:/Data/cancer/trn_X.csv", header = F)[,c(6,8,15,21,22,24,29,30)]
trn_y = scan("C:/Data/cancer/trn_y.txt")
tst_X = read.csv("C:/Data/cancer/tst_X.csv", header = F)[,c(6,8,15,21,22,24,29,30)]
trn = data.frame(y = trn_y, trn_X)
tst = data.frame(tst_X)
model = glm(y ~ ., data = trn, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predictions = predict(model, newdata = tst, type = "response")
output = data.frame(Id = 1:length(predictions), Prediction = predictions)
write.csv(output, "C:/Data/cancer/predictions.csv", quote=F, row.names = F)
summary(model)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7183 -0.0778 -0.0167 0.0010 3.4903
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.14951 11.43675 -1.849 0.064421 .
## V6 -62.05059 23.80724 -2.606 0.009151 **
## V8 162.32615 42.23655 3.843 0.000121 ***
## V15 81.64537 158.35770 0.516 0.606151
## V21 -1.04831 1.32559 -0.791 0.429043
## V22 0.26327 0.05740 4.587 4.5e-06 ***
## V24 0.02122 0.01341 1.583 0.113447
## V29 14.79877 7.04030 2.102 0.035553 *
## V30 89.18657 37.72242 2.364 0.018065 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 607.283 on 455 degrees of freedom
## Residual deviance: 72.237 on 447 degrees of freedom
## AIC: 90.237
##
## Number of Fisher Scoring iterations: 9
In case you’re wondering how those 8 features were selected [a preview of upcoming material] …
start.time = Sys.time()
namelist = c("mean_radius", "mean_texture", "mean_perimeter", "mean_area",
             "mean_smoothness", "mean_compactness", "mean_concavity", "mean_concave_points",
             "mean_symmetry", "mean_fractal_dimension",
             "stderr_radius", "stderr_texture", "stderr_perimeter", "stderr_area",
             "stderr_smoothness", "stderr_compactness", "stderr_concavity", "stderr_concave_points",
             "stderr_symmetry", "stderr_fractal_dimension",
             "largest_radius", "largest_texture", "largest_perimeter", "largest_area",
             "largest_smoothness", "largest_compactness", "largest_concavity", "largest_concave_points",
             "largest_symmetry", "largest_fractal_dimension")
trn_X = read.csv("C:/Data/cancer/trn_X.csv", header = F)
colnames(trn_X) = namelist
trn_y = factor(scan("C:/Data/cancer/trn_y.txt"), levels = c(0, 1), labels = c("no", "yes"))
trn = data.frame(y = trn_y, trn_X)
tst_X = read.csv("C:/Data/cancer/tst_X.csv", header = F)
colnames(tst_X) = namelist
tst = data.frame(tst_X)
library(leaps)
p = ncol(trn_X)
# regsubsets() expects a numeric response; screening against the 0/1 outcome
# ranks subsets identically to any linear recoding of the two classes
subsets = regsubsets(trn_X, as.integer(trn_y) - 1, nvmax = p)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
CustomLogLossFunction = function(data, lev = NULL, model = NULL) {
  actuals = as.integer(data$obs == "yes")
  predictions = data$yes
  epsilon = 1e-15
  # clamp predictions to [epsilon, 1 - epsilon] to avoid log(0) = -Inf
  bounded.predictions = pmin(pmax(predictions, epsilon), 1 - epsilon)
  logLoss = -mean(log(actuals * bounded.predictions + (1 - actuals) * (1 - bounded.predictions)))
  return(c(CustomLogLoss = logLoss))
}
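For reference, this summary function computes the average negative log-likelihood (log loss) of the predicted class probabilities:
\[\text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log\left(1 - \hat{p}_i\right)\right]\]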
results = array(0, c(p, 3))
colnames(results) = c("Variable Count", "Log Loss", "Log Loss SD")
best.logloss = Inf
best.subset = NA
# for each subset size, estimate the out-of-sample log loss with repeated 5-fold CV
for (i in 1:p) {
  # variables flagged "*" by regsubsets at subset size i
  selected = which(summary(subsets)$outmat[i,] == "*")
  # one seed per resample (5 folds x 30 repeats) plus one final seed, for reproducibility
  seeds = vector(mode = "list", length = 151)
  for (seed in 1:length(seeds)) {
    seeds[[seed]] = sample.int(1000, 1)
  }
  trControl = trainControl("repeatedcv", number = 5, repeats = 30,
                           summaryFunction = CustomLogLossFunction,
                           classProbs = T, seeds = seeds)
  model = train(as.matrix(trn_X[,selected]), trn_y, method = "glm",
                metric = "CustomLogLoss", maximize = F,
                trControl = trControl, control = list(maxit = 50))
  results[i,1] = i
  results[i,2] = model$results[,"CustomLogLoss"]
  results[i,3] = model$results[,"CustomLogLossSD"]
  if (results[i,2] < best.logloss) {
    best.logloss = results[i,2]
    best.subset = selected
  }
}
results[order(results[,2]),]
## Variable Count Log Loss Log Loss SD
## [1,] 8 0.09956697 0.05652417
## [2,] 6 0.10019421 0.05802291
## [3,] 4 0.10603973 0.06307199
## [4,] 3 0.10682561 0.06132861
## [5,] 5 0.10935121 0.07878080
## [6,] 7 0.11300095 0.16185626
## [7,] 10 0.11427725 0.06064494
## [8,] 9 0.13041304 0.22990441
## [9,] 2 0.14042774 0.05005728
## [10,] 11 0.15105519 0.26536407
## [11,] 12 0.15430151 0.20960599
## [12,] 13 0.17529756 0.29155750
## [13,] 14 0.19218631 0.34617482
## [14,] 1 0.23333676 0.05860504
## [15,] 15 0.26746099 0.51565267
## [16,] 17 0.29434639 0.51080264
## [17,] 16 0.33913809 0.65688437
## [18,] 18 0.72337627 0.99407079
## [19,] 19 0.87624554 1.01778728
## [20,] 21 1.17962178 1.05925492
## [21,] 20 1.18024549 1.09187311
## [22,] 22 1.63263363 1.02327504
## [23,] 23 1.66013401 0.86871414
## [24,] 24 1.73887119 0.84824845
## [25,] 25 1.79736086 0.81205799
## [26,] 26 1.87432031 0.73067635
## [27,] 30 1.97164855 0.86482424
## [28,] 28 2.01375593 0.83110833
## [29,] 29 2.01811639 0.77823154
## [30,] 27 2.07865231 1.32505322
paste(best.subset, collapse = ",")
## [1] "6,8,15,21,22,24,29,30"
trn = data.frame(y = trn_y, trn_X[,best.subset])
tst = data.frame(tst_X[,best.subset])
model = glm(y ~ ., data = trn, family = binomial)
summary(model)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7183 -0.0778 -0.0167 0.0010 3.4903
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.14951 11.43675 -1.849 0.064421 .
## mean_compactness -62.05059 23.80724 -2.606 0.009151 **
## mean_concave_points 162.32615 42.23655 3.843 0.000121 ***
## stderr_smoothness 81.64537 158.35770 0.516 0.606151
## largest_radius -1.04831 1.32559 -0.791 0.429043
## largest_texture 0.26327 0.05740 4.587 4.5e-06 ***
## largest_area 0.02122 0.01341 1.583 0.113447
## largest_symmetry 14.79877 7.04030 2.102 0.035553 *
## largest_fractal_dimension 89.18657 37.72242 2.364 0.018065 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 607.283 on 455 degrees of freedom
## Residual deviance: 72.237 on 447 degrees of freedom
## AIC: 90.237
##
## Number of Fisher Scoring iterations: 9
predictions = predict(model, newdata = tst, type = "response")
output = data.frame(Id = 1:length(predictions), Prediction = predictions)
write.csv(output, "C:/Data/cancer/predictions.csv", quote=F, row.names = F)
Sys.time() - start.time
## Time difference of 1.574609 mins