Provides an intensive introduction to applied statistics and data analysis. Trains students to become data scientists capable of both applied data analysis and critical evaluation of the next generation of statistical methods. Since both data analysis and methods development require substantial hands-on experience, the course focuses on hands-on data analysis.
Upon successfully completing this course, students will be able to:
I believe the purpose of graduate education is to train you to think for yourself and to initiate and complete your own projects. I am super excited to talk to you about ideas, work out solutions with you, and help you figure out statistical methods and/or data analysis. I don’t think graduate school grades are important for this purpose, which means I don’t care very much about graduate student grades.
That being said, I do have to assign grades, so they will be:
A grade below a C means you essentially aren’t trying or doing the work; I rarely give them out.
The percentages will be assigned in the following way:
You get the points for the swirl modules as long as you complete them before class starts (no exceptions without prior approval). You get 50% of the lab points for attendance and 50% for having the current version of your code up to date. The data analysis project will be graded on a 1-5 scale for each category described below, with percentages assigned as described below.
(For more on my project philosophy see: http://bit.ly/wQT5uI)
Each student will be required to complete a data analysis project during the class and will have the entire term to perform the analysis. The project assignments consist of a scientific description of the problem. Students are responsible for all stages of the data analysis, from obtaining the data to the final report. At the conclusion of the analysis, each student must turn in:
All documents should be submitted electronically. The grades will be broken down according to the following characterization of your data analysis.
Keep in mind that this is a data science class. In some cases standard methodology will be sufficient to answer the question of interest, in some cases you will need to go beyond the course, and in general the goal is to answer the question and provide an estimate of uncertainty. You may speak to your fellow students about specific statistical questions related to the projects, but the overall idea, analysis, and write-up should be your own individual work. You should cite any help you get from fellow students/TAs in your report in standard citation format.
You are required to pick one of the data analysis options below and perform that analysis over the course of the class.
Develop a prediction algorithm for identifying and classifying users that are trolling or being mean on Twitter. If you want an idea of what I’m talking about just look at the responses to any famous person’s tweets.
Analyze the traffic fatality data to identify any geographic, time varying, or other characteristics that are associated with traffic fatalities: https://www.transportation.gov/fastlane/2015-traffic-fatalities-data-has-just-been-released-call-action-download-and-analyze.
Develop a model for predicting life expectancy in Baltimore down to single block resolution with estimates of uncertainty. You may need to develop an approach for “downsampling” since the outcome data you’ll be able to find is likely aggregated at the neighborhood level (http://health.baltimorecity.gov/node/231).
Develop a statistical model for inferring the variables you need to calculate the Gail score (http://www.cancer.gov/bcrisktool/) for a woman based on her Facebook profile, and a model for predicting the Gail score from Facebook. You should include estimates of uncertainty in the predicted score due to your inferred variables.
A potentially fun but very hard project: develop an algorithm for a self-driving car using the training data at http://research.comma.ai/. Build a model that predicts, at every moment, what direction the car should be going, whether it should be signalling, and what speed it should travel. You might consider starting with a small subsample of the (big) training set.
You may petition to do your own analysis. You must submit your petition by the 3rd day of class (Wednesday, September 14). We will let you know before the 4th day of class whether you have approval. The minimum requirements for the project include:
If you are looking for ideas consider these resources:
To keep you on track, starting in the 2nd week you will bring your current write-up (in .Rmd format, or later in pdf format) to the course lab. The labs will be run by John and Jeff. You will take turns projecting your write-ups and getting detailed feedback from the instructors and the other students. You will receive credit for being prepared to present your data analysis each week, even if you aren’t selected on that day.
Lab attendance and participation are mandatory. This is where you will learn how to perform and write up a data analysis. It is also the best way to get your “hands dirty” with the projects people are working on.
The times for the reviews each week will be:
More details coming soon, but grades will be assigned on the following scale:
We will use swirl courses to introduce concrete tools and packages. Each swirl course is required and should be completed before class on the day it is assigned. You get 100% if you complete the module before class and 0% if you have not.
You can install swirl with the command:
# install.packages('swirl')
install.packages("devtools")
devtools::install_github("muschellij2/swirl", ref = "dev")
In order to track your progress in swirl, you need to create a directory, for example `~/swirl_classes` (or whatever directory you want). Then you will need to set up your Rprofile to find it: you need a file called `.Rprofile` in your home directory. The code below tells you where your home directory is, then checks whether you have a `.Rprofile` file and, if not, creates one and tells you the path to that file.
# find your home directory
homedir = path.expand("~")
cat(paste0("Your home directory is \n ", homedir))

# check for an existing .Rprofile and create one if needed
rp = file.path(homedir, ".Rprofile")
has_rp = file.exists(rp)
if (!has_rp) {
  file.create(rp)
}
cat(paste0("Your .Rprofile is located at: ", rp))
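The progress directory itself also needs to exist before swirl can write to it. A one-line sketch, using the example path `~/swirl_classes` from above (swap in whatever directory you chose):

```r
# create the example progress directory if it doesn't already exist
dir.create(path.expand("~/swirl_classes"), showWarnings = FALSE, recursive = TRUE)
```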
Open the `.Rprofile` file with a text editor, add these two lines, and save it.
options(
swirl_data_dir = "~/swirl_classes",
swirl_user = "my_jhed_id"
)
where you change the path to where you want the course progress to be stored (somewhere it won’t be deleted!) and your JHED information.
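If you prefer not to edit the file by hand, you can append the two option lines from within R. This is a sketch, demonstrated on a temporary file so it is safe to run as-is; point `rp` at `file.path(path.expand("~"), ".Rprofile")` to modify your real profile.

```r
# Append the swirl options to an .Rprofile from within R.
# rp points at a temp file here for safety; use your real ~/.Rprofile instead.
rp <- file.path(tempdir(), ".Rprofile")
opts <- c(
  "options(",
  '  swirl_data_dir = "~/swirl_classes",',  # change to your chosen directory
  '  swirl_user = "my_jhed_id"',            # change to your JHED ID
  ")"
)
con <- file(rp, open = "a")  # append mode preserves anything already in the file
writeLines(opts, con)
close(con)
cat(readLines(rp), sep = "\n")  # confirm the lines were written
```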
To check that this works, start R and run these commands, which should print your progress directory and your JHED:
getOption("swirl_data_dir")
getOption("swirl_user")
You can install the pre-course swirl module with the commands:
library(swirl)
install_course_github("seankross", "AdvDataSci_Part_1")
Then type `swirl()` to get started.
To install the latest course swirl modules please use:
library(swirl)
install_course_github("jtleek", "advdatasci_swirl")
Date | Lecture | Notes | Assignment |
---|---|---|---|
September 7 | Welcome | Slides | Assignment 1 |
September 12 | Organizing | Slides | 01_01_loading_data |
September 14 | Structure of a data analysis + getting data | Slides | |
September 19 | Getting data from the web | Slides | |
September 23 | Data knick knacks and strings | Slides | |
September 26 | Manipulating data | Slides | |
September 28 | Merging and databases | Slides | |
October 3 | Regular expressions | Slides | googlesheets,Grouping_and_Chaining_with_dplyr |
October 5 | Tidy text | Slides | Dates_and_Times_with_lubridate |
October 10 | Topic models, EDA | Slides (topic models):::Slides (EDA) | redo previous except for loading_data :( |
October 12 | Topic models, EDA | Slides (EDA):::Slides (expository) | Manipulating data with dplyr |
October 17 | EDA | Slides (EDA) | |
October 19 | Expository graphs | Slides (expository) | |
October 26 | Unsupervised analysis | Slides | |
- | - | Adv Data Sci II | - |
October 31 | Dimension reduction | Slides | |
Nov 2 | Shiny | Slides (part1):::Slides (part2) | |
Nov 7 | Data Products | Slides (data products)::: Slides (Dimension reduction) | |
Nov 9 | Modeling | Slides | |
Nov 14 | Prediction | Slides | |
Nov 16 | Prediction algorithms | Slides | |
Nov 21 | Prediction algorithms | Slides | |
Nov 28 | Blending - deep learning | Slides | |
Dec 5 | Simulating Stuff | Slides | |
Dec 7 | More simulating - multiple testing | Slides | stamps example |
Dec 12 | Multiple testing/big data | Slides | |
Dec 14 | Multiple testing/big data | Slides | |
Dec 19 | App demos | Slides | |
Feel free to submit typos/errors/etc via the github repository associated with the class: https://github.com/jtleek/advdatasci16
This web-page is modified from Andrew Jaffe’s Summer 2015 R course, which also has great material if you want to learn R.
This page was last updated on 2016-12-19 18:17:02 Eastern Time.