Advanced Data Science

About the course

Course Description

Provides an intensive introduction to applied statistics and data analysis. Trains students to become data scientists capable of both applied data analysis and critical evaluation of the next generation of statistical methods. Since both data analysis and methods development require substantial hands-on experience, focuses on hands-on data analysis.

Course objectives

Upon successfully completing this course, students will be able to:

  1. Formulate quantitative models to address scientific questions
  2. Obtain, clean, transform, and process raw data into usable formats
  3. Organize and perform a complete data analysis, from exploration, to analysis, to synthesis, to communication
  4. Apply a range of statistical methods for inference and prediction
  5. Build data science products that can be used by a broad audience

Grading

I believe the purpose of graduate education is to train you to be able to think for yourself and initiate and complete your own projects. I am super excited to talk to you about ideas, work out solutions with you, and help you to figure out statistical methods and/or data analysis. I don’t think that graduate school grades are important for this purpose. This means that I don’t care very much about graduate student grades.

That being said, I have to give you a grade, so the grades will be:

  1. A - Excellent - 90%+
  2. B - Passing - 80%+
  3. C - Needs improvement - 70%+

If you are getting a grade below a C it is because you basically aren’t trying/working. I rarely give such grades out.

Adv Data Sci I

The percentages will be assigned in the following way:

  • Pre-class swirl module - 5%
  • Completing swirl modules - 45%
  • Preparedness/attendance at labs - 25%
  • Final project - 25%

You get the points for the swirl modules as long as you complete them before class starts (no exceptions without prior approval). Half of the lab points are for attending the labs and half are for having the current version of your code up to date. The data analysis assignment will be graded on a 1-5 scale in each category described below, and the categories will be weighted by the percentages listed there.
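As a rough illustration of how the Adv Data Sci I weights combine, the sketch below computes a final percentage in R. The weights come from the list above; the component scores are made-up values for a hypothetical student, and the variable names are only illustrative.

```r
# Weights for Adv Data Sci I, taken from the list above (they sum to 1).
weights <- c(
  pre_class_swirl = 0.05,
  swirl_modules   = 0.45,
  labs            = 0.25,
  final_project   = 0.25
)

# ASSUMPTION: hypothetical fraction of points earned in each component.
scores <- c(
  pre_class_swirl = 1.00,
  swirl_modules   = 0.90,
  labs            = 0.80,
  final_project   = 0.85
)

# Weighted average, expressed as a percentage.
final_pct <- 100 * sum(weights * scores[names(weights)])
final_pct
```

With these made-up scores the total comes out to 86.75%, which would be a B on the scale above.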

Adv Data Sci II

The percentages will be assigned in the following way:

  • Completing swirl modules - 20%
  • Preparedness/attendance at labs - 50%
  • Final project - 50%

You get the points for the swirl modules as long as you complete them before class starts (no exceptions without prior approval). Half of the lab points are for attending the labs and half are for having the current version of your code up to date. The data analysis project will be graded on a 1-5 scale as described below.

Assignments

Data analysis assignment

(For more on my project philosophy see: http://bit.ly/wQT5uI)

Each student will be required to perform a data analysis project during the course of the class. Students will have the entire term to perform the data analysis. The project assignments will consist of a scientific description of the problem. Students are responsible for all stages of each data analysis from obtaining the data to the final report. At the conclusion of each analysis each student must turn in:

  • A write-up of their data analysis in a synthesized format, with numbered figures and references. (You may also include supplementary material for detailed additional calculations/analyses)
  • A reproducible Rmd file that produces all of the numbers, figures and results in your write-up.

All documents should be submitted electronically. The grades will be broken down according to the following characterization of your data analysis.

  1. Did you answer the scientific question? (30%)
  2. Did you use appropriate statistical methods? (40%)
  3. Was your write-up simple, clear, and precise? (20%)
  4. Was your code reproducible? (10%)
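One way to read this breakdown is sketched below in R. The syllabus does not say how a 1-5 category score maps to a percentage, so the sketch assumes a score s earns s/5 of that category's weight; the function name `project_pct` and the example scores are hypothetical.

```r
# Rubric weights from the list above.
rubric <- c(question = 0.30, methods = 0.40, writeup = 0.20, reproducible = 0.10)

# Convert 1-5 category scores into a percentage.
# ASSUMPTION: a score s earns s / 5 of that category's weight; the actual
# mapping used for grading is not specified in the syllabus.
project_pct <- function(scores, weights = rubric) {
  stopifnot(setequal(names(scores), names(weights)),
            all(scores >= 1 & scores <= 5))
  100 * sum(weights * scores[names(weights)] / 5)
}

# Hypothetical scores for one write-up.
project_pct(c(question = 4, methods = 5, writeup = 3, reproducible = 5))
```

Under this assumed mapping the example scores give 86%, and a 5 in every category gives 100%.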

Keep in mind that this is a data science class. In some cases standard methodology will be sufficient to answer the question of interest; in others you will need to go beyond the course. In general, the goal is to answer the question and provide an estimate of uncertainty. You may speak to your fellow students about specific statistical questions related to the projects, but the overall idea, analysis, and write-up should be your own individual work. You should cite any help you get from fellow students/TAs in your report in standard citation format.

Data analysis project options

You are required to pick one of the data analysis options below and perform that analysis over the course of the class.

Option 1

Develop a prediction algorithm for identifying and classifying users that are trolling or being mean on Twitter. If you want an idea of what I’m talking about just look at the responses to any famous person’s tweets.

Option 2

Analyze the traffic fatality data to identify any geographic, time varying, or other characteristics that are associated with traffic fatalities: https://www.transportation.gov/fastlane/2015-traffic-fatalities-data-has-just-been-released-call-action-download-and-analyze.

Option 3

Develop a model for predicting life expectancy in Baltimore down to single block resolution with estimates of uncertainty. You may need to develop an approach for “downsampling” since the outcome data you’ll be able to find is likely aggregated at the neighborhood level (http://health.baltimorecity.gov/node/231).

Option 4

Develop a statistical model for inferring the variables you need to calculate the Gail score (http://www.cancer.gov/bcrisktool/) for a woman based on her Facebook profile. Develop a model for the Gail score prediction from Facebook and its uncertainty. You should include estimates of uncertainty in the predicted score due to your inferred variables.

Option 5

Potentially fun but super hard project: develop an algorithm for a self-driving car using the training data: http://research.comma.ai/. Build a model for predicting at every moment what direction the car should be going, whether it should be signalling, and what speed it should be going. You might consider starting with a small subsample of the (big) training set.

Option 6

You may petition to do your own analysis. You must submit your petition by the 3rd day of class (Wednesday, September 14). We will let you know before the 4th day of class whether you have approval. The minimum requirements for the project include:

  • You must be obtaining your own raw data
  • You must be doing your own data processing
  • The data must be available to be made public by end of class
  • You must specify your own question you are asking from the data
  • You need to provide reasonable justification you can answer that question with your data.

If you are looking for ideas consider these resources:

Data analysis reviews

To keep you on track, starting in the 2nd week you will bring your current write-up (in .Rmd format or, later on, in PDF format) to the course lab. The labs will be run by John and Jeff. You will take turns projecting your write-up and getting detailed feedback from the instructors and the other students. You will receive credit for being prepared each week to present your data analysis even if you aren’t selected on that day.

Lab attendance and participation are mandatory. This is where you will learn how to write up and perform a data analysis. It is also the best way to get your “hands dirty” with the projects people are working on.

The times for the reviews each week will be:

  • Elizabeth: Fridays 10:30-11:30
  • John: Tuesdays 10-11
  • Jeff: Tuesdays 3-4

Data product assignments

More details are coming soon, but grades will be assigned according to the following breakdown:

  1. Did you solve the problem? (30%)
  2. Did you use appropriate statistical methods? (40%)
  3. Was your product usable? (20%)
  4. Was your product well documented? (10%)

swirl modules

We will use swirl courses to introduce concrete tools and packages. Each swirl course is required and should be completed before class on the day it is assigned. You get 100% if you complete the module before class and 0% otherwise.

You can install swirl with the command:

# install.packages('swirl')
# Install the dev branch of muschellij2/swirl from GitHub (needs devtools)
install.packages("devtools")
devtools::install_github("muschellij2/swirl", ref = "dev")

Set up your Rprofile

In order to track your progress in swirl, you need to create a directory. For example, ~/swirl_classes (or whatever directory you want). Then you will need to set up your Rprofile to find it. You need to create a file called .Rprofile in your home directory if it doesn’t exist. The code below tells you where your home directory is. Then it checks to see if you have a .Rprofile file and if not, creates one and tells you the path to that file.

# Find your home directory
homedir = path.expand("~")
cat(paste0("Your home directory is \n ", homedir))

# Create .Rprofile in the home directory if it does not exist yet
rp = file.path(homedir, ".Rprofile")
has_rp = file.exists(rp)
if (!has_rp) {
  file.create(rp)
}
cat(paste0("Your .Rprofile is located at: ", rp))
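The progress directory mentioned above can also be created from R rather than by hand. The sketch below is one possible way to do it; `ensure_dir` is a hypothetical helper written for this example, not part of swirl, and ~/swirl_classes is just the example path from the text.

```r
# Create a directory if it is not there yet and return its expanded path.
# `ensure_dir` is a hypothetical helper, not a swirl function.
ensure_dir <- function(path) {
  path <- path.expand(path)
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

# Example call (uncomment to run; uses the example directory from the text):
# ensure_dir("~/swirl_classes")
```

Calling it a second time is harmless, since the directory is only created when it is missing.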

Open the .Rprofile file with a text editor, add the following options() call, and save the file.

options(
  swirl_data_dir = "~/swirl_classes",
  swirl_user = "my_jhed_id"
)

Change the path to the directory where you want the course progress to be stored (somewhere it won’t be deleted!) and replace my_jhed_id with your JHED ID.

To check that this works, start a new R session and run these commands, which should print your progress directory and your JHED ID.

getOption("swirl_data_dir")
getOption("swirl_user")
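If you would rather have a single yes/no summary, a small check along these lines may help. It is only a sketch: `check_swirl_setup` is a hypothetical helper, and it assumes nothing beyond the two option names shown above.

```r
# Hypothetical helper: report whether the two options are set and whether
# the progress directory actually exists on disk.
check_swirl_setup <- function() {
  data_dir <- getOption("swirl_data_dir")
  user     <- getOption("swirl_user")
  c(
    data_dir_set    = !is.null(data_dir),
    user_set        = !is.null(user),
    data_dir_exists = !is.null(data_dir) && dir.exists(path.expand(data_dir))
  )
}

check_swirl_setup()
```

All three entries should be TRUE once the .Rprofile is set up and the directory has been created.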

Install swirl courses

You can install the pre-course swirl module with the commands:

library(swirl)
install_course_github("seankross", "AdvDataSci_Part_1")

Then type swirl() to get started.

To install the latest course swirl modules please use:

library(swirl)
install_course_github("jtleek", "advdatasci_swirl")

Course Materials

Date | Lecture | Notes | Assignment
September 7 | Welcome | Slides | Assignment 1
September 12 | Organizing | Slides | 01_01_loading_data
September 14 | Structure of a data analysis + getting data | Slides |
September 19 | Getting data from the web | Slides |
September 23 | Data knick knacks and strings | Slides |
September 26 | Manipulating data | Slides |
September 28 | Merging and databases | Slides |
October 3 | Regular expressions | Slides | googlesheets, Grouping_and_Chaining_with_dplyr
October 5 | Tidy text | Slides | Dates_and_Times_with_lubridate
October 10 | Topic models, EDA | Slides (topic models), Slides (EDA) | redo previous except for loading_data :(
October 17 | EDA | Slides (EDA) |
October 10 | Topic models, EDA | Slides (EDA), Slides (expository) | Manipulating data with dplyr
October 19 | Expository graphs | Slides (expository) |
October 26 | Unsupervised analysis | Slides |
- | - | Adv Data Sci II | -
October 31 | Dimension reduction | Slides |
Nov 2 | Shiny | Slides (part1), Slides (part2) |
Nov 7 | Data Products | Slides (data products), Slides (Dimension reduction) |
Nov 9 | Modeling | Slides |
Nov 14 | Prediction | Slides |
Nov 16 | Prediction algorithms | Slides |
Nov 21 | Prediction algorithms | Slides |
Nov 28 | Blending - deep learning | Slides |
Dec 5 | Simulating Stuff | Slides |
Dec 7 | More simulating - multiple testing | Slides | stamps example
Dec 12 | Multiple testing/big data | Slides |
Dec 14 | Multiple testing/big data | Slides |
Dec 19 | App demos | Slides |

Miscellaneous

Feel free to submit typos/errors/etc via the github repository associated with the class: https://github.com/jtleek/advdatasci16

This web page is modified from Andrew Jaffe’s Summer 2015 R course, which also has great material if you want to learn R.

This page was last updated on 2016-12-19 18:17:02 Eastern Time.