### Part 2: Auto dataset revisited

We also used the auto dataset two weeks ago in __lab 6__. We used it with LDA and QDA. Both methods in R provide a CV argument that will compute a LOOCV estimate for us. If we want to compute a *k*-fold cross validation estimate when *k* is not equal to the number of instances, we have to either write our own code or find another library to use. Here we will write our own code! Write a function that accepts a dataframe, a model-building function (either lda or qda), and a value for K and returns an error estimate and its variance for *k*-fold cross validation. Use this function to generate values for the same kind of table you made in part 1. Compare these values to using the training set and a validation set to estimate the error rates, too. Finally, include a paragraph summarizing and explaining the results just as you did in part 1.
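Before writing the real thing, here is a minimal sketch of the kind of function being asked for. The name `cv.estimate` and the `label` argument are my own choices, not from the lab; it assumes `model.fn` is `lda` or `qda` from MASS (whose predictions live in `$class`) and reports the mean and variance of the per-fold error rates.

```r
# Sketch of a k-fold CV helper for lda/qda-style models (hypothetical names).
cv.estimate <- function(df, model.fn, K, label = "class") {
  n <- nrow(df)
  folds <- sample(rep(1:K, length.out = n))    # random, roughly equal folds
  fold.errs <- sapply(1:K, function(k) {
    train <- df[folds != k, ]
    test  <- df[folds == k, ]
    fit   <- model.fn(reformulate(".", response = label), data = train)
    preds <- predict(fit, test)$class          # lda/qda predictions are in $class
    mean(preds != test[[label]])               # this fold's error rate
  })
  list(error = mean(fold.errs), variance = var(fold.errs))
}
```

A call might look like `cv.estimate(Auto, MASS::lda, K = 10, label = "mpg01")`. Whether the variance of the fold error rates is the right variance estimate to report is a design choice worth stating in the write-up.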

Note: the Lab 6 solution being referred to is attached below; that is where the Auto dataset was analyzed.

lab6_sln

Lab 6: Classification Part 1

Create an R markdown document for these tasks and hand it in along with your knitted PDF for this assignment. Create numbered sections corresponding to each part of the assignment below. Within each section, describe your work in paragraphs with complete sentences, good grammar, etc. Be sure to address all questions and provide all requested tables or visualizations in your document.

Section 1: Statlog (heart) dataset

We’ll use the statlog (heart) dataset for this section. It’s a binary classification problem predicting the presence or absence of heart disease from 13 medical measurements. Go through the usual dance with UCI data to read it in from a URL and provide names for the features if the datafile does not include them as a header. The dataset has some nominal features that are encoded as numbers in the dataset from UCI, so be sure to re-code those as factors in R before building a model. We have done this before in lab 3, so re-use your code for that or my solution code if need be.

1. In your markdown document, include a tabular summary of the dataset and the correlation matrix for the numerical features. For any highly correlated pairs of features, include some individual scatterplots. It may be helpful to view the scatterplot matrix, but with so many features in the dataset, don’t include it in your markdown document; it would be too hard to see or read.

```r
suppressMessages(suppressWarnings(library(tidyverse)))
library(knitr)
```

```r
heart <- read.table(
  'https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat',
  header = F)

cnames <- c('age', 'sex', 'cpt', 'rbp', 'serum', 'fbs', 'rer', 'mhr', 'eia',
            'oldpeak', 'slope', 'nmaj', 'thal', 'class')
names(heart) <- cnames

heart.recoded <- heart %>%
  mutate(sex = recode_factor(sex, '0' = 'male', '1' = 'female'),
         cpt = recode_factor(cpt, '1' = 'typical angina', '2' = 'atypical angina',
                             '3' = 'non-anginal pain', '4' = 'asymptomatic'),
         fbs = recode_factor(fbs, '0' = 'TRUE', '1' = 'FALSE'),
         rer = recode_factor(rer, '0' = 'normal', '1' = 'ST-T abnormal',
                             '2' = 'left ventricular hypertrophy'),
         eia = recode_factor(eia, '1' = 'yes', '0' = 'no'),
         slope = recode_factor(slope, '1' = 'up', '2' = 'flat', '3' = 'down'),
         thal = recode_factor(thal, '3' = 'normal', '6' = 'fixed defect',
                              '7' = 'reversable defect'),
         class = recode_factor(class, '1' = 'absence', '2' = 'presence'))

# necessary for glm to use it as a class feature
heart.recoded$class = as.factor(heart.recoded$class)
```

[1] https://archive.ics.uci.edu/ml/datasets/statlog+(heart)

2. Build a logistic regression model and evaluate its overall accuracy. Is every feature a significant predictor in this model? Which features significantly increase the likelihood of heart disease and which significantly decrease that likelihood?

```r
heart.fit <- glm(class~., data=heart.recoded, family='binomial')
summary(heart.fit)
```

```
## 
## Call:
## glm(formula = class ~ ., family = "binomial", data = heart.recoded)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6431  -0.4754  -0.1465   0.3342   2.8100  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -6.373279   3.126623  -2.038  0.04151 *  
## age                             -0.016007   0.026394  -0.606  0.54421    
## sexfemale                        1.763012   0.580761   3.036  0.00240 ** 
## cptatypical angina               1.388830   0.892942   1.555  0.11986    
## cptnon-anginal pain              0.552734   0.747487   0.739  0.45963    
## cptasymptomatic                  2.386128   0.756538   3.154  0.00161 ** 
## rbp                              0.026004   0.012080   2.153  0.03135 *  
## serum                            0.006621   0.004228   1.566  0.11732    
## fbsFALSE                        -0.370040   0.626396  -0.591  0.55469    
## rerST-T abnormal                 0.647580   3.185422   0.203  0.83890    
## rerleft ventricular hypertrophy  0.633593   0.412073   1.538  0.12415    
## mhr                             -0.019337   0.011486  -1.683  0.09228 .  
## eiano                           -0.596869   0.460540  -1.296  0.19497    
## oldpeak                          0.449245   0.244631   1.836  0.06630 .  
## slopeflat                        0.949841   0.500425   1.898  0.05769 .  
## slopedown                        0.122787   1.041666   0.118  0.90617    
## nmaj                             1.199839   0.280947   4.271 1.95e-05 ***
## thalfixed defect                -0.146197   0.845517  -0.173  0.86272    
## thalreversable defect            1.431791   0.449873   3.183  0.00146 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 370.96  on 269  degrees of freedom
## Residual deviance: 168.90  on 251  degrees of freedom
## AIC: 206.9
## 
## Number of Fisher Scoring iterations: 6
```

```r
contrasts(heart.recoded$class)
```

```
##          presence
## absence         0
## presence        1
```

```r
predicted.probs <- predict(heart.fit, heart.recoded, type="response")
predicted.classes <- transmute(data.frame(predicted.probs),
                               class = ifelse(predicted.probs > 0.5, 'presence', 'absence'))

full.acc <- mean(predicted.classes$class == heart.recoded$class)
print(full.acc)
```

```
## [1] 0.8777778
```

3. If there are insignificant features, build a model using a reduced feature set and compare the overall accuracy with that of the full model.

```r
heart.reduced.fit <- glm(class~sex+cpt+rbp+nmaj, data=heart.recoded, family='binomial')
summary(heart.reduced.fit)
```

```
## 
## Call:
## glm(formula = class ~ sex + cpt + rbp + nmaj, family = "binomial", 
##     data = heart.recoded)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6236  -0.6095  -0.2216   0.6369   2.1803  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -7.335887   1.641526  -4.469 7.86e-06 ***
## sexfemale            1.823877   0.407172   4.479 7.49e-06 ***
## cptatypical angina   0.184704   0.760532   0.243 0.808113    
## cptnon-anginal pain  0.257661   0.671085   0.384 0.701018    
## cptasymptomatic      2.475208   0.646182   3.831 0.000128 ***
## rbp                  0.028195   0.009601   2.937 0.003318 ** 
## nmaj                 1.133589   0.210552   5.384 7.29e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 370.96  on 269  degrees of freedom
## Residual deviance: 227.15  on 263  degrees of freedom
## AIC: 241.15
## 
## Number of Fisher Scoring iterations: 5
```

```r
predicted.probs <- predict(heart.reduced.fit, heart.recoded, type="response")
predicted.classes <- transmute(data.frame(predicted.probs),
                               class = ifelse(predicted.probs > 0.5, 'presence', 'absence'))

reduced.acc <- mean(predicted.classes$class == heart.recoded$class)

kable(data.frame(model=c("full model accuracy", "reduced model accuracy"),
                 accuracy=c(full.acc, reduced.acc)))
```

| model                  |  accuracy |
|:-----------------------|----------:|
| full model accuracy    | 0.8777778 |
| reduced model accuracy | 0.7925926 |

4. Comment on the nominal features in the dataset. Are all levels significant predictors of heart disease? Which ones increase the likelihood of heart disease and which ones decrease the likelihood?

The two classes are absence and presence. The last one alphabetically gets coded as 1 by R, so positive coefficients increase the likelihood of heart disease and negative ones decrease it. To make this easier to interpret, I include the contrasts for each factor below. Being female increases the likelihood because of its positive coefficient. All the cpt coefficients are positive, so any chest pain type with a 1 in the contrasts increases the likelihood; only typical angina (all zeroes) does not. For thal, fixed defect is negative while reversible defect is positive, so having a reversible defect increases the likelihood. Finally, for slope, both coefficients (flat and down) are positive, so either increases the likelihood.

```r
contrasts(heart.recoded$sex)
```

```
##        female
## male        0
## female      1
```

```r
contrasts(heart.recoded$cpt)
```

```
##                  atypical angina non-anginal pain asymptomatic
## typical angina                 0                0            0
## atypical angina                1                0            0
## non-anginal pain               0                1            0
## asymptomatic                   0                0            1
```

```r
contrasts(heart.recoded$thal)
```

```
##                   fixed defect reversable defect
## normal                       0                 0
## fixed defect                 1                 0
## reversable defect            0                 1
```

```r
contrasts(heart.recoded$slope)
```

```
##      flat down
## up      0    0
## flat    1    0
## down    0    1
```
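One way to make these signed coefficients more concrete: exponentiating a log-odds coefficient gives an odds ratio. As a hedged illustration (the two estimates below are copied from the full-model summary earlier, not recomputed):

```r
# Coefficients from summary(heart.fit) are on the log-odds scale; exp() turns
# them into odds ratios. Values are copied from the summary output above.
sex.coef  <- 1.763012   # sexfemale estimate
nmaj.coef <- 1.199839   # nmaj estimate
exp(c(sexfemale = sex.coef, nmaj = nmaj.coef))   # ~5.83 and ~3.32
```

So, holding the other features fixed, each positive coefficient multiplies the odds of heart disease by a factor greater than one.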

Section 2: Auto dataset

This one is straight out of the textbook. It is problem 11 from Chapter 4. It is copy/pasted below:

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

```r
library(GGally)
```

```
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## 
## The following object is masked from 'package:dplyr':
## 
##     nasa
```

```r
library(ISLR)
data(Auto)
```

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables. Kimmer’s comment: or just use dplyr!

```r
mpg.med <- median(Auto$mpg)
Auto <- Auto %>%
  mutate(mpg01 = ifelse(mpg > mpg.med, 1, 0)) %>%
  select(-mpg)
#Auto$mpg01 = as.numeric(Auto$mpg > mpg.med) # also works!
```

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

```r
Auto <- Auto %>% select(-name)
summary(Auto)
```

```
##    cylinders      displacement     horsepower        weight      acceleration  
##  Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613   Min.   : 8.00  
##  1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78  
##  Median :4.000   Median :151.0   Median : 93.5   Median :2804   Median :15.50  
##  Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978   Mean   :15.54  
##  3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02  
##  Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140   Max.   :24.80  
##       year           origin          mpg01    
##  Min.   :70.00   Min.   :1.000   Min.   :0.0  
##  1st Qu.:73.00   1st Qu.:1.000   1st Qu.:0.0  
##  Median :76.00   Median :1.000   Median :0.5  
##  Mean   :75.98   Mean   :1.577   Mean   :0.5  
##  3rd Qu.:79.00   3rd Qu.:2.000   3rd Qu.:1.0  
##  Max.   :82.00   Max.   :3.000   Max.   :1.0  
```

```r
ggpairs(Auto, aes(color=as.factor(mpg01)))
```

```
## Warning in cor(x, y): the standard deviation is zero
```

(The warning above is repeated 14 times in the knitted output.)

[Figure: ggpairs scatterplot matrix of the Auto features, colored by mpg01. The printed overall correlations with mpg01 are cylinders −0.759, displacement −0.753, horsepower −0.667, weight −0.758, acceleration 0.347, year 0.430, and origin 0.514. Among the predictors, cylinders, displacement, horsepower, and weight are strongly inter-correlated (e.g. 0.951 for cylinders–displacement, 0.933 for displacement–weight, 0.897 for displacement–horsepower).]

Note that the name of the car is not relevant to prediction, so I've dropped it. With ggpairs, I can look at the density plots colored by mpg01 and get essentially the same information as boxplots. I am looking for separation by class, and I see it in cylinders, displacement, horsepower, and weight. Displacement and horsepower appear correlated in their scatterplot, so I could reduce this by one feature if I chose to. I choose to! To keep it simple, I'll work with cylinders, displacement, and weight.
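This kind of pairwise check can also be done numerically with `cor()`. A self-contained sketch using base R's `mtcars` as a stand-in dataset (its `disp` and `hp` columns play the roles of displacement and horsepower), so the snippet runs without ISLR installed:

```r
# Checking a single suspected collinear pair directly with cor().
# mtcars is a stand-in here, not the Auto data from the lab.
cor(mtcars$disp, mtcars$hp)   # ~0.79: strongly correlated, so one could be dropped
```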

(c) Split the data into a training set and a test set.

```r
auto.dfs = list()
Auto.new <- mutate(Auto, id=row_number())
auto.dfs$train <- sample_frac(Auto.new, 0.67)
auto.dfs$test <- anti_join(Auto.new, auto.dfs$train, by = 'id')

auto.dfs$train <- dplyr::select(auto.dfs$train, -id)
auto.dfs$test <- dplyr::select(auto.dfs$test, -id)
```


(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

```r
library(MASS)
```

```
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
```

```r
select <- dplyr::select
```

```r
lda.fit <- lda(mpg01~cylinders+displacement+weight, data=auto.dfs$train)
lda.predictions <- predict(lda.fit, auto.dfs$test)
mean(auto.dfs$test$mpg01 == lda.predictions$class)
```

```
## [1] 0.9147287
```

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

```r
qda.fit <- qda(mpg01~., data=auto.dfs$train)
qda.predictions <- predict(qda.fit, auto.dfs$test)
mean(auto.dfs$test$mpg01 == qda.predictions$class)
```

```
## [1] 0.8992248
```

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

```r
glm.fit <- glm(mpg01~., data=auto.dfs$train, family='binomial')
glm.probs <- predict(glm.fit, auto.dfs$test, type="response")
glm.classes <- transmute(data.frame(predicted.probs=glm.probs),
                         mpg01 = ifelse(predicted.probs > 0.5, 1, 0))
mean(auto.dfs$test$mpg01 == glm.classes$mpg01)
```

```
## [1] 0.8837209
```

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

```r
library(class)

kmax <- 25 # arbitrary!
train.X <- auto.dfs$train %>% select(-mpg01)
test.X <- auto.dfs$test %>% select(-mpg01)
knn.errs = rep(NA, kmax)
for (k in 1:kmax) {
  knn.predictions <- knn(train.X, test.X, auto.dfs$train$mpg01, k=k)
  knn.errs[k] = 1 - mean(knn.predictions==auto.dfs$test$mpg01)
}
plot(knn.errs)
```


[Figure: scatterplot of knn.errs against index k = 1..25; test error rates range from roughly 0.10 to 0.18.]

The error is lowest for k = 4, but not by a wide margin, and very likely not significantly lower than for nearby values of k.
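For reference, the winning k can be read off the error vector programmatically with `which.min`. A tiny sketch with made-up error values standing in for the lab's actual `knn.errs`:

```r
# Illustrative only: a hand-made error vector, not the lab's knn.errs.
errs <- c(0.16, 0.14, 0.12, 0.10, 0.11, 0.13)
which.min(errs)   # returns 4: the index (i.e. k) with the smallest error
```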

