Practical Machine Learning Overview

Jeffrey Leek
Johns Hopkins Bloomberg School of Public Health

Practical Machine Learning Content

  • Prediction study design
  • Types of errors
  • Cross validation
  • The caret package
  • Plotting for prediction
  • Preprocessing
  • Predicting with regression
  • Predicting with trees
  • Boosting
  • Bagging
  • Model blending
  • Forecasting

Basic terms

In general, positive = identified and negative = rejected. Therefore:

  • True positive = correctly identified
  • False positive = incorrectly identified
  • True negative = correctly rejected
  • False negative = incorrectly rejected

Medical testing example:

  • True positive = Sick people correctly diagnosed as sick
  • False positive = Healthy people incorrectly identified as sick
  • True negative = Healthy people correctly identified as healthy
  • False negative = Sick people incorrectly identified as healthy

http://en.wikipedia.org/wiki/Sensitivity_and_specificity
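
These four counts give the two standard accuracy measures: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch in R with made-up counts (the numbers are purely illustrative):

# Hypothetical 2x2 counts from a diagnostic test, for illustration only
TP <- 90   # sick people correctly diagnosed as sick
FN <- 10   # sick people incorrectly identified as healthy
TN <- 85   # healthy people correctly identified as healthy
FP <- 15   # healthy people incorrectly identified as sick

sensitivity <- TP / (TP + FN)   # fraction of sick people caught: 0.9
specificity <- TN / (TN + FP)   # fraction of healthy people cleared: 0.85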

Correlated predictors

library(caret)
library(kernlab)
data(spam)

# split the spam data: 75% training, 25% testing
# (the split is random, so the output below will vary without a seed)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]

# absolute correlations between all predictors (column 58 is the
# outcome, type); zero the diagonal so each variable's correlation
# with itself is ignored
M <- abs(cor(training[, -58]))
diag(M) <- 0
which(M > 0.8, arr.ind = TRUE)
##        row col
## num415  34  32
## direct  40  32
## num857  32  34
## num857  32  40
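
Highly correlated predictors carry largely redundant information, so one common move is to combine them. A minimal sketch using principal components on the two correlated columns flagged above (num857 and num415, columns 32 and 34 of the training set built earlier):

# combine the two correlated frequency variables into principal components
smallSpam <- training[, c(32, 34)]   # num857 and num415
prComp <- prcomp(smallSpam)
prComp$rotation                      # loadings: PC1 captures the shared signal
plot(prComp$x[, 1], prComp$x[, 2])   # most variation lies along PC1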

Basic idea behind boosting

  1. Start with a set of classifiers \(h_1,\ldots,h_k\)
    • Examples: all possible trees, all possible regression models, all possible cutoffs.
  2. Create a classifier that combines the classification functions: \(f(x) = \operatorname{sgn}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)\).
    • Goal is to minimize error (on the training set)
    • Iterative: select one \(h\) at each step
    • Calculate weights based on errors
    • Upweight missed classifications and select the next \(h\) (see the sketch after this list)
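
A toy AdaBoost sketch to make the weighting concrete. It uses decision stumps ("all possible cutoffs") on a single numeric predictor x, with labels y coded as +1/-1. The function and variable names are illustrative, not from caret or the boosting packages used later in the course:

adaboost_stumps <- function(x, y, n_rounds = 20) {
  n <- length(y)
  w <- rep(1 / n, n)                    # start with uniform weights
  alphas <- numeric(n_rounds)
  stumps <- vector("list", n_rounds)
  for (t in seq_len(n_rounds)) {
    # pick the cutoff (and direction) with the smallest weighted error
    best <- list(err = Inf)
    for (cut in unique(x)) {
      for (dir in c(1, -1)) {
        pred <- ifelse(x > cut, dir, -dir)
        err <- sum(w[pred != y])
        if (err < best$err) {
          best <- list(err = err, cut = cut, dir = dir, pred = pred)
        }
      }
    }
    # low weighted error -> large vote alpha_t for this stump
    alphas[t] <- 0.5 * log((1 - best$err) / max(best$err, 1e-10))
    stumps[[t]] <- best[c("cut", "dir")]
    # upweight the points this stump missed, downweight the rest
    w <- w * exp(-alphas[t] * y * best$pred)
    w <- w / sum(w)
  }
  list(alphas = alphas, stumps = stumps)
}

# f(x) = sgn(sum_t alpha_t h_t(x))
adaboost_predict <- function(fit, x) {
  total <- rep(0, length(x))
  for (t in seq_along(fit$alphas)) {
    s <- fit$stumps[[t]]
    total <- total + fit$alphas[t] * ifelse(x > s$cut, s$dir, -s$dir)
  }
  sign(total)
}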

Adaboost on Wikipedia

http://en.wikipedia.org/wiki/AdaBoost

http://webee.technion.ac.il/people/rmeir/BoostingTutorial.pdf