Advanced Data Science

About the course

Course Description

Provides an intensive introduction to applied statistics and data analysis. Trains students to become data scientists capable of both applied data analysis and critical evaluation of the next generation next generation of statistical methods. Since both data analysis and methods development require substantial hands-on experience, focuses on hands-on data analysis.

Course objectives

Upon successfully completing this course, students will be able to:

Formulate quantitative models to address scientific questions
Obtain, clean, transform, and process raw data into usable formats
Organize and perform a complete data analysis, from exploration, to analysis, to synthesis, to communication
Apply a range of statistical methods for inference and prediction
Build data science products that can be used by a broad audience

Instructors

TAs

Detian Deng

Class Website

https://jtleek.com/advdatasci16

Resources/Books

Grading

I believe the purpose of graduate education is to train you to be able to think for yourself and initiate and complete your own projects. I am super excited to talk to you about ideas, work out solutions with you, and help you to figure out statistical methods and/or data analysis. I don’t think that graduate school grades are important for this purpose. This means that I don’t care very much about graduate student grades.

That being said, I have to give you a grade so they will be:

A - Excellent - 90%+
B - Passing - 80%+
C - Needs improvement - 70%+

If you are getting a grade below a C it is because you basically aren’t trying/working. I rarely give them out.

Adv Data Sci I

The percentages will be assigned in the following way:

Pre-class swirl module - 5%
Completing swirl modules - 45%
Preparedness/attendance at labs - 25%
Final project - 25%

You get the points for the swirl modules as long as you complete them before class starts (no exceptions without prior approval). You get 50% of the points for attendance at labs and 50% for having your current version of the code up-to-date. The data analysis assignment will be graded on a 1-5 scale for each category described below and the percentages assigned as described below.

Adv Data Sci II

The percentages will be assigned in the following way:

Completing swirl modules - 20%
Preparedness/attendance at labs - 50%
Final project - 50%

You get the points for the swirl modules as long as you complete them before class starts (no exceptions without prior approval). You get 50% of the points for attendance at labs and 50% for having your current version of the code up-to-date. The data analysis project will be graded on a 1-5 scale as described below.

Assignments

Data analysis assignment

(For more on my project philosophy see: http://bit.ly/wQT5uI)

Each student will be required to perform a data analysis project during the course of the class. Students will have the entire term to perform the data analysis. The project assignments will consist of a scientific description of the problem. Students are responsible for all stages of each data analysis from obtaining the data to the final report. At the conclusion of each analysis each student must turn in:

A write-up of their data analysis in a synthesized format, with numbered figures and references. (You may also include supplementary material for detailed additional calculations/analyses)
A reproducible Rmd file that produces all of the numbers, figures and results in your write-up.

All documents should be submitted electronically. The grades will be broken down according to the following characterization of your data analysis.

Did you answer the scientific question? (30%)
Did you use appropriate statistical methods? (40%)
Was your write-up simple, clear, and precise? (20%)
Was your code reproducible? (10%)

Keep in mind that this is a data science class. In some cases standard methodology will be sufficient to answer the question of interest, in some cases you will need to go beyond the course, and in general the goal is to answer the question and provide an estimate of uncertainty. You may speak to your fellow students about specific statistical questions related to the projects, but the overall idea, analysis, and write-up should be your own individual work. You should cite any help you get from fellow students/TAs in your report in standard citation format.

Data analysis project options

You are required to pick one of the data analysis options below and perform that analysis over the course of the class.

Option 1

Develop a prediction algorithm for identifying and classifying users that are trolling or being mean on Twitter. If you want an idea of what I’m talking about just look at the responses to any famous person’s tweets.

Option 2

Analyze the traffic fatality data to identify any geographic, time varying, or other characteristics that are associated with traffic fatalities: https://www.transportation.gov/fastlane/2015-traffic-fatalities-data-has-just-been-released-call-action-download-and-analyze.

Option 3

Develop a model for predicting life expectancy in Baltimore down to single block resolution with estimates of uncertainty. You may need to develop an approach for “downsampling” since the outcome data you’ll be able to find is likely aggregated at the neighborhood level (http://health.baltimorecity.gov/node/231).

Option 4

Develop a statistical model for inferring the variables you need to calculate the Gail score (http://www.cancer.gov/bcrisktool/) for a woman based on her Facebook profile. Develop a model for the Gail score prediction from Facebook and its uncertainty. You should include estimates of uncertainty in the predicted score due to your inferred variables.

Option 5

Potentially fun but super hard project. develop an algorithm for self-driving car using the training data: http://research.comma.ai/. Build a model for predicting at every moment what direction the car should be going, whether it should be signalling, and what speed it should be going. You might consider starting with a small subsample of the (big) training set.

Option 6

You may petition to do your own analysis. You must submit your petition by the 3rd day of class (Wednesday, September 14). We will let you know before the 4th day of class whether you have approval. The minimum requirements for the project include:

You must be obtaining your own raw data
You must be doing your own data processing
The data must be available to be made public by end of class
You must specify your own question you are asking from the data
You need to provide reasonable justification you can answer that question with your data.

If you are looking for ideas consider these resources:

Data analysis reviews

To keep you on track, starting in the 2nd week you will bring your current writeup (in .Rmd format or later on in pdf format) to the course lab. The labs will be run by John and Jeff. You will take turns projecting your labs and getting detailed feedback from the instructors and the other students. You will receive credit for being prepared each week to present your data analysis even if you aren’t selected on that day.

Lab attendance and participation are mandatory. This is where you will learn how to write up and perform a data analysis. It is also the best way to get “hands dirty” with the projects people are working on.

The times for the reviews each week will be:

Elizabeth: Friday 10:30 - 11:30
John: Tuesdays 10-11
Jeff: Tuesdays 3-4

Data product assignments

More coming soon but the grades will be assigned on the scale of:

Did you solve the problem? (30%)
Did you use appropriate statistical methods? (40%)
Was your product usable? (20%)
Was your product well documented? (10%)

swirl modules

For introducing tools and packages we will be using swirl courses to teach about concrete tools or packages. Each swirl course is required and should be completed before class on the day that it is assigned. You get 100% if you complete the module before class and 0% if you have not completed the module.

You can install swirl with the command:

# install.packages('swirl')
install.packages("devtools")
devtools::install_github("muschellij2/swirl", ref = "dev")

Set up your Rprofile

In order to track your progress in swirl, you need to create a directory. For example, ~/swirl_classes (or whatever directory you want). Then you will need to set up your Rprofile to find it. You need to create a file called .Rprofile in your home directory if it doesn’t exist. The code below tells you where your home directory is. Then it checks to see if you have a .Rprofile file and if not, creates one and tells you the path to that file.

homedir = path.expand("~")
cat(paste0("Your home directory is \n ",homedir))


rp = file.path(homedir,".Rprofile")
has_rp = file.exists(rp)
if(!has_rp){
  file.create(rp)
}
cat(paste0("Your r-profile is located: ",rp))

Open the .Rprofile file with a text editor and add these two lines, then save it.

options(
  swirl_data_dir = "~/swirl_classes",
  swirl_user = "my_jhed_id"
)

where you change the path to where you want the course progress to be stored (somewhere it won’t be deleted!) and your JHED information.

To check to see if this works start R, then run these commands which should print your progress directory and your JHED.

getOption("swirl_data_dir")
getOption("swirl_user")

Install swirl courses

You can install the pre-course swirl module with the commands:

library(swirl)
install_course_github("seankross", "AdvDataSci_Part_1")

Then type swirl() to get started.

To install the latest course swirl modules please use:

library(swirl)
install_course_github("jtleek", "advdatasci_swirl")

Course Materials

Date	Lecture	Notes	Assignment
September 7	Welcome	Slides	Assignment 1
September 12	Organizing	Slides	01_01_loading_data
September 14	Structure of a data analysis + getting data	Slides
September 19	Getting data from the web	Slides
September 23	Data knick knacks and strings	Slides
September 26	Manipulating data	Slides
September 28	Merging and databases	Slides
October 3	Regular expressions	Slides	googlesheets,Grouping_and_Chaining_with_dplyr
October 5	Tidy text	Slides	Dates_and_Times_with_lubridate
October 10	Topic models, EDA	Slides (topic models):::Slides (EDA)	redo previous except for loading_data :(
October 17	EDA	Slides (EDA)
October 10	Topic models, EDA	Slides (EDA):::Slides (expository)	Manipulating data with dplyr
October 19	Expository graphs	Slides (expository)
October 26	Unsupervised analysis	Slides
-	-	Adv Data Sci II	-
October 31	Dimension reduction	Slides
Nov 2	Shiny	Slides (part1):::Slides (part2)
Nov 7	Data Products	Slides (data products)::: Slides (Dimension reduction)
Nov 9	Modeling	Slides
Nov 14	Prediction	Slides
Nov 16	Prediction algorithms	Slides
Nov 21	Prediction algorithms	Slides
Nov 28	Blending - deep learning	Slides
Dec 5	Simulating Stuff	Slides
Dec 7	More simulating - multiple testing	Slides	stamps example
Dec 12	Multiple testing/big data	Slides
Dec 14	Multiple testing/big data	Slides
Dec 19	App demos	Slides

Miscellaneous

Feel free to submit typos/errors/etc via the github repository associated with the class: https://github.com/jtleek/advdatasci16

This web-page is modified from Andrew Jaffe’s Summer 2015 R course, which also has great material if you want to learn R.

This page was last updated on 2016-12-19 18:17:02 Eastern Time.