Homework #1

Chapter 2, Question 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

2(a): We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

regression (CEO salary)
inference (variable importance)
n = 500 observations (firms)
p = 3 predictors: profit, number of employees, industry

2(b): We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

classification (success or failure)
prediction
n = 20 observations (similar products)
p = 13 predictors: price, marketing budget, competition price, ten other variables

2(c): We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.

regression (% change in the US dollar)
prediction
n = 52 observations (weekly changes)
p = 3 predictors: % change for US, British, and German markets

Chapter 2, Question 9

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

9(a): Which of the predictors are quantitative, and which are qualitative?
Note: The origin number encodes a geographic region; e.g. 1 = American.

library(ISLR)
Auto = na.omit(Auto)
# ?Auto

variable	type
mpg	quantitative
cylinders	quantitative
displacement	quantitative
horsepower	quantitative
weight	quantitative
acceleration	quantitative
year	quantitative
origin	qualitative
name	qualitative
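The table above can be cross-checked in R. The following is a minimal sketch, assuming the ISLR package is installed; the copy named Auto2 is purely illustrative, and the labels for codes 2 and 3 are taken from the ?Auto help page rather than from the exercise text. It shows how each column is stored and recodes origin as a factor so that R treats it as qualitative.

```r
library(ISLR)

# Work on a copy (hypothetical name) so the original Auto stays numeric
Auto2 = na.omit(Auto)

# origin is stored as a number, so R would otherwise treat it as quantitative
Auto2$origin = factor(Auto2$origin, levels = 1:3,
                      labels = c("American", "European", "Japanese"))

sapply(Auto2, class)   # storage class of every column
table(Auto2$origin)    # number of cars per region
```

Recoding matters mainly for later modeling: a numeric origin would be fitted as a single slope, while a factor gets one level per region.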

9(b): What is the range of each quantitative predictor? You can answer this using the range() function.

apply(Auto[,1:7], 2, range)

##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

9(c): What is the mean and standard deviation of each quantitative predictor?

options(width = 95)
apply(Auto[,1:7], 2, mean)

##          mpg    cylinders displacement   horsepower       weight acceleration         year 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327    75.979592

options(width = 95)
apply(Auto[,1:7], 2, sd)

##          mpg    cylinders displacement   horsepower       weight acceleration         year 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864     3.683737

9(d): Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset that remains?

apply(Auto[-c(10:85),1:7], 2, range)

##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82

options(width = 95)
apply(Auto[-c(10:85),1:7], 2, mean)

##          mpg    cylinders displacement   horsepower       weight acceleration         year 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899    77.145570

apply(Auto[-c(10:85),1:7], 2, sd)

##          mpg    cylinders displacement   horsepower       weight acceleration         year 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721     3.106217
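The three statistics in 9(b) through 9(d) can also be gathered in a single pass, which makes the full data set and the subset easier to compare. This is only a sketch: the helper name summarize and the variables full and rest are hypothetical, and sapply() is used in place of apply() because it works column by column without first coercing the data frame to a matrix.

```r
library(ISLR)
Auto = na.omit(Auto)

# Hypothetical helper: range, mean, and standard deviation of one numeric vector
summarize = function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))

full = sapply(Auto[, 1:7], summarize)            # all observations
rest = sapply(Auto[-(10:85), 1:7], summarize)    # observations 10 through 85 removed

# Change in each statistic caused by dropping the 76 rows
round(rest - full, 2)
```

Each result is a 4 x 7 matrix with one column per quantitative variable, so the difference lines up statistic by statistic.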

9(e): Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

We see both linear and non-linear relationships. For example, year and mpg appear to have a linear relationship, while horsepower and mpg appear to have a non-linear relationship.

pairs(Auto[,1:7])

options(width = 80)
round(cor(Auto[,1:7]), 2)

##                mpg cylinders displacement horsepower weight acceleration  year
## mpg           1.00     -0.78        -0.81      -0.78  -0.83         0.42  0.58
## cylinders    -0.78      1.00         0.95       0.84   0.90        -0.50 -0.35
## displacement -0.81      0.95         1.00       0.90   0.93        -0.54 -0.37
## horsepower   -0.78      0.84         0.90       1.00   0.86        -0.69 -0.42
## weight       -0.83      0.90         0.93       0.86   1.00        -0.42 -0.31
## acceleration  0.42     -0.50        -0.54      -0.69  -0.42         1.00  0.29
## year          0.58     -0.35        -0.37      -0.42  -0.31         0.29  1.00

boxplot(mpg ~ origin, data = Auto)
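One possible way to back up the claim that horsepower and mpg are related non-linearly is to compare a straight-line fit with a quadratic fit. The sketch below assumes the ISLR package is installed; the object names fit.linear and fit.quad are illustrative, and the comparison is a rough check rather than a formal diagnosis of the functional form.

```r
library(ISLR)
Auto = na.omit(Auto)

fit.linear = lm(mpg ~ horsepower, data = Auto)
fit.quad   = lm(mpg ~ poly(horsepower, 2), data = Auto)

# A clear gain in R-squared from the squared term points to curvature
c(linear    = summary(fit.linear)$r.squared,
  quadratic = summary(fit.quad)$r.squared)

# F-test of whether the squared term improves on the straight line
anova(fit.linear, fit.quad)
```

Repeating the comparison with year in place of horsepower gives a sense of how much less curvature there is in that relationship.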

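# an alternative scatterplot matrix: smoothed scatterplots below the diagonal,
# histograms on the diagonal, and absolute correlations above it

# diagonal panel: histogram of one variable, rescaled to a maximum height of 1
panel.hist = function(x, ...) {
    usr = par("usr")
    on.exit(par(usr = usr))  # restore the user coordinates on exit
    par(usr = c(usr[1:2], 0, 1.5))
    h = hist(x, plot = FALSE)
    breaks = h$breaks
    nB = length(breaks)
    y = h$counts
    y = y / max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}

# upper panel: absolute correlation, with text size proportional to its magnitude
panel.cor = function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    usr = par("usr")
    on.exit(par(usr = usr))  # restore the user coordinates on exit
    par(usr = c(0, 1, 0, 1))
    r = abs(cor(x, y))
    txt = format(c(r, 0.123456789), digits = digits)[1]
    txt = paste0(prefix, txt)
    if (missing(cex.cor)) {
        cex.cor = 0.8 / strwidth(txt)
    }
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(Auto[,1:7], lower.panel = panel.smooth, diag.panel = panel.hist, upper.panel = panel.cor)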

9(f): Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Yes; we see variables with both positive and negative relationships to the mpg outcome. For example, year and mpg appear to have a positive relationship (as the year increases, the mpg also tends to increase), while horsepower and mpg appear to have a negative relationship (as the horsepower increases, the mpg tends to decrease).
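The direction and strength of each relationship with mpg can be read off the correlation matrix directly. The sketch below, assuming the ISLR package is installed, ranks the other six quantitative variables by absolute correlation with mpg; the variable name r is illustrative. Correlation measures only linear association, so it understates curved relationships such as the one with horsepower, and it leaves out the qualitative variables origin and name.

```r
library(ISLR)
Auto = na.omit(Auto)

# Correlation of mpg with each of the other six quantitative variables
r = cor(Auto[, 1:7])["mpg", -1]

# Strongest linear association first; the sign gives the direction
r[order(abs(r), decreasing = TRUE)]
```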

Kaggle: https://inclass.kaggle.com/c/ml210-mnist

set.seed(2^17 - 1)
start.time = Sys.time()

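# training pixels (one image per row), training labels, and test pixels
trn_X = read.csv("C:/Data/mnist/trn_X.csv", header = FALSE)
trn_y = scan("C:/Data/mnist/trn_y.txt")
tst_X = read.csv("C:/Data/mnist/tst_X.csv", header = FALSE)

# display one randomly chosen training digit, titled with its label
rotate = function(X) t(apply(X, 2, rev))  # rotate 90 degrees clockwise so image() draws the digit upright
windows(height = 3, width = 3)            # graphics device (Windows only)
i = sample.int(nrow(trn_X), size = 1)
image(rotate(matrix(as.numeric(trn_X[i,]), nrow = 28, byrow = TRUE)),
      col = gray.colors(256, 0, 1),
      main = trn_y[i], axes = FALSE)

# classify each test image by majority vote among its 3 nearest training images
library(FNN)
predictions = knn(trn_X, tst_X, factor(trn_y), k = 3)
output = data.frame(Id = seq_along(predictions), Prediction = predictions)
write.csv(output, "C:/Data/mnist/predictions.csv", quote = FALSE, row.names = FALSE)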
Sys.time() - start.time

## Time difference of 7.723298 mins
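The script above fixes k = 3 without showing how that value compares with alternatives. One common way to compare values of k before submitting is a hold-out split of the training data, sketched below under several assumptions: the built-in iris data stands in for trn_X and trn_y so the example runs without the MNIST files, class::knn (shipped with R) is substituted for FNN::knn, and the 80/20 split and the candidate values 1, 3, and 5 are arbitrary choices.

```r
library(class)   # ships with R; stands in for FNN in this sketch

set.seed(1)

# iris is only a small stand-in for trn_X (features) and trn_y (labels)
X = iris[, 1:4]
y = iris$Species

# Hold out 20% of the rows; the same split is reused for every k
idx = sample.int(nrow(X), size = floor(0.8 * nrow(X)))

ks = c(1, 3, 5)
accuracy = sapply(ks, function(k) {
    pred = knn(X[idx, ], X[-idx, ], y[idx], k = k)
    # compare as character so differing factor level sets cannot cause an error
    mean(as.character(pred) == as.character(y[-idx]))
})
names(accuracy) = paste0("k=", ks)
accuracy
```

With the MNIST data, X and y would be replaced by trn_X and factor(trn_y); given the run time reported above, a random subsample of the training rows may be needed to keep the comparison quick.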

ddebarr@uw.edu

January 19, 2017
