class: center, middle, inverse, title-slide

# Machine Learning
## JHU Data Science
### www.jtleek.com/advdatasci

---
class: inverse, middle, center

# Prediction

---
class: inverse

## Other names for prediction

.huge[
* Prediction -> statisticians
* Statistical learning -> statisticians
* Machine learning -> computer scientists
* Forecasting -> atmospheric scientists/bankers
* Artificial intelligence -> the popular press
]

---
class: inverse
background-image: url(../imgs/ml/selfcar.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/whetlab.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/self_drive.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/blackbox.png)
background-size: 60%
background-position: center

.footnote[https://twitter.com/notajf/status/795717253505413122]

---
class: inverse

## Key idea: know why x predicts y

.huge[
>The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
<br><br>-John Tukey
]

---
class: inverse
background-image: url(../imgs/ml/nobel.png)
background-size: 70%
background-position: center

.footnote[http://www.nejm.org/doi/full/10.1056/NEJMon1211064]

---
class: inverse
background-image: url(../imgs/ml/prob.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/study_design.png)
background-size: 40%
background-position: center

# Study design

---
class: inverse
background-image: url(../imgs/ml/overfitting.png)
background-size: 80%
background-position: center

# Overfitting

.footnote[http://blog.fliptop.com/blog/2015/03/02/bias-variance-and-overfitting-machine-learning-overview/]

---
class: inverse
background-image: url(../imgs/ml/training_data.png)
background-size: 65%
background-position: center

# Training data matters a lot

.footnote[http://www.google.org/flutrends/]

---
class: inverse
background-image: url(../imgs/ml/failed.png)
background-size: 80%
background-position: center

# Changes in population -> failed predictions

.footnote[http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf]

---
class: inverse
background-image: url(../imgs/ml/machine_bias.png)
background-size: 80%
background-position: center

# Prediction != unbiased

.footnote[https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing]

---
class: inverse
background-image: url(../imgs/ml/science.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/science_retract.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/prediction_vs_assoc.png)
background-size: 80%
background-position: center

# Prediction vs. association

.footnote[Slide courtesy Ingo Ruczinski]

---
class: inverse
background-image: url(../imgs/ml/sens_spec.png)
background-size: 80%
background-position: center

.footnote[http://en.wikipedia.org/wiki/Sensitivity_and_specificity]

---
class: inverse

## Most common error measures

.super[
1. Mean squared error (or root mean squared error) - Continuous data, sensitive to outliers
2. Median absolute deviation - Continuous data, often more robust
3. Sensitivity (recall) - If you want few missed positives
4. Specificity - If you want few negatives called positives
5. Accuracy - Weights false positives/negatives equally
6. Concordance - One example is kappa
7. Predictive value of a positive (precision) - When you are screening and prevalence is low
]

---
class: inverse
background-image: url(../imgs/ml/roc.png)
background-size: 50%
background-position: center

# ROC

---
class: inverse
background-image: url(../imgs/ml/precision_recall.png)
background-size: 70%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/recall_curve.png)
background-size: 45%
background-position: center

# Precision/Recall

---
class: inverse
background-image: url(../imgs/ml/measure_good.png)
background-size: 50%
background-position: center

# Measuring "good"

---
class: inverse
background-image: url(../imgs/ml/sens_spec2.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/sens_spec3.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/sens_spec4.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/ppv.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/sens_spec5.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/sens_spec6.png)
background-size: 80%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/depends.png)
background-size: 75%
background-position: center

# Depends on your problem

---
class: inverse
background-image: url(../imgs/ml/prostate.png)
background-size: 65%
background-position: center

# Depends on your problem

---
class: inverse
background-image: url(../imgs/ml/mamma.png)
background-size: 90%
background-position: center

---
class: inverse, middle, center

.sixer[
89% sensitivity<br>
42% specificity<br>
65% accuracy<br>
]

---
class: inverse
background-image: url(../imgs/ml/tsp.png)
background-size: 130%
background-position: center

---
class: inverse
background-image: url(../imgs/ml/spam.png)
background-size: 50%
background-position: bottom

# Steps in the process

question -> input data -> features -> algorithm -> parameters -> evaluation

---
class: inverse
background-image: url(../imgs/ml/time_series.png)
background-size: 80%
background-position: bottom

# Time series data

---
class: inverse

## What is different?

.super[
* Data are dependent over time
* Specific pattern types
  * Trends - long-term increase or decrease
  * Seasonal patterns - patterns related to time of week, month, year, etc.
  * Cycles - patterns that rise and fall periodically
* Subsampling into training/test is more complicated
* Similar issues arise in spatial data
  * Dependency between nearby observations
  * Location-specific effects
* Typically the goal is to predict one or more observations into the future.
* All standard prediction methods can be used (with caution!)
]

---
class: inverse
background-image: url(../imgs/ml/spurious.png)
background-size: 70%
background-position: center

# Beware spurious correlations

.footnote[http://www.google.com/trends/correlate,<br> http://www.newscientist.com/blogs/onepercent/2011/05/google-correlate-passes-our-we.html]

---
class: inverse
background-image: url(../imgs/ml/geog.png)
background-size: 45%
background-position: bottom

# Also common in geographic analysis

.footnote[http://xkcd.com/1138/]

---
class: inverse
background-image: url(../imgs/ml/test_sets.png)
background-size: 80%
background-position: center

# Careful with picking your test sets

.footnote[http://waldronlab.org/wp-content/uploads/2014/11/Waldron-et-al.-2014-Comparative-meta-analysis-of-prognostic-gene-signatures-for-late-stage-ovarian-cancer.pdf]

---
class: inverse
background-image: url(../imgs/ml/long_con.png)
background-size: 45%
background-position: center

# An internet scam that is relevant to you

.footnote[https://medium.com/message/how-to-always-be-right-on-the-internet-delete-your-mistakes-519a595da2f5#.lutyyhyex]
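
---
class: inverse

## Sketch: error measures from a confusion matrix

A minimal sketch of how the classification error measures listed earlier (sensitivity, specificity, accuracy, predictive value of a positive) could be computed from confusion-matrix counts. The function names `error_measures` and `screening_counts`, the 90%/90% test, and both prevalence values are hypothetical choices made for illustration only; they are not taken from any study shown in these slides.

```python
def error_measures(tp, fp, fn, tn):
    """Common classification error measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),  # recall: share of true positives caught
        "specificity": tn / (tn + fp),  # share of true negatives called negative
        "accuracy": (tp + tn) / total,  # weights both error types equally
        "ppv": tp / (tp + fp),          # precision: share of positive calls that are right
    }


def screening_counts(n, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts for a hypothetical screening test."""
    positives = n * prevalence
    negatives = n - positives
    tp = sensitivity * positives
    fn = positives - tp
    tn = specificity * negatives
    fp = negatives - tn
    return tp, fp, fn, tn


# Assumed numbers: the same hypothetical 90%/90% test at two prevalences.
common = error_measures(*screening_counts(10_000, 0.50, 0.90, 0.90))
rare = error_measures(*screening_counts(10_000, 0.01, 0.90, 0.90))

print("PPV at 50% prevalence:", round(common["ppv"], 3))
print("PPV at 1% prevalence: ", round(rare["ppv"], 3))
```

Under these assumed inputs, sensitivity and specificity are identical in both settings while the predictive value of a positive falls sharply at low prevalence, which is why that measure matters when screening.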