I believe the purpose of graduate education is to train you to be able to think for yourself and initiate and complete your own projects. I am super excited to talk to you about ideas, work out solutions with you, and help you to figure out statistical methods and/or data analysis. I don’t think that graduate school grades are important for this purpose. This means that I don’t care very much about graduate student grades.

That being said, I have to give you grades, so they will be:

  1. A - Excellent - 90%+
  2. B - Passing - 80%+
  3. C - Needs improvement - 70%+

If you are getting a grade below a C it is because you basically aren’t trying/working. I rarely give them out.
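As a quick illustration of the cutoffs above, here is a minimal sketch in Python (chosen only for illustration). The function name `letter_grade` and the "below C" label are hypothetical; the syllabus does not name the grades below a C.

```python
def letter_grade(percent: float) -> str:
    """Map a course percentage to a letter grade using the stated cutoffs."""
    if percent >= 90:
        return "A"  # Excellent
    if percent >= 80:
        return "B"  # Passing
    if percent >= 70:
        return "C"  # Needs improvement
    # The syllabus does not name grades below a C; this label is a placeholder.
    return "below C"
```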

Relative weights

The percentages will be assigned in the following way:

You get the points for the DataCamp modules as long as you complete them before class starts (no exceptions without prior approval). For the labs, you get 50% of the points for attendance and 50% for having the current version of your code up to date. The data analysis assignment will be graded on a 1-5 scale in each of the categories described below, weighted by the percentages listed there.


Data analysis assignment

(For more on my project philosophy see:

Each student will be required to perform a data analysis project during the course of the class. Students will have the entire term to perform the data analysis. The project assignments will consist of a scientific description of the problem. Students are responsible for all stages of each data analysis from obtaining the data to the final report. At the conclusion of each analysis each student must turn in:

  • A write-up of their data analysis in a synthesized format, with numbered figures and references. (You may also include supplementary material for detailed additional calculations/analyses.)
  • A reproducible Rmd file that produces all of the numbers, figures and results in your write-up.

All documents should be submitted electronically. The grades will be broken down according to the following characterization of your data analysis.

  1. Did you answer the scientific question? (30%)
  2. Did you use appropriate statistical methods? (40%)
  3. Was your write-up simple, clear, and precise? (20%)
  4. Was your code reproducible? (10%)
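To make the arithmetic concrete, here is a minimal sketch (in Python, purely for illustration) of how the four category scores might combine into a project percentage. The weights come from the list above. The conversion of a 1-5 score to a fraction as score / 5 is an assumption, since the syllabus does not spell out that mapping, and the names `WEIGHTS` and `project_percent` are hypothetical.

```python
from typing import Mapping

# Category weights from the list above (they sum to 1.0).
WEIGHTS = {
    "answered_question": 0.30,
    "appropriate_methods": 0.40,
    "clear_writeup": 0.20,
    "reproducible_code": 0.10,
}


def project_percent(scores: Mapping[str, int]) -> float:
    """Combine 1-5 category scores into a weighted percentage.

    Assumption: each 1-5 score is converted to a fraction as score / 5.
    """
    total = 0.0
    for category, weight in WEIGHTS.items():
        score = scores[category]
        if not 1 <= score <= 5:
            raise ValueError(f"{category} score must be between 1 and 5")
        total += weight * score / 5
    return 100 * total


# Example: full marks on the question and reproducibility, 4/5 elsewhere.
example = {
    "answered_question": 5,
    "appropriate_methods": 4,
    "clear_writeup": 4,
    "reproducible_code": 5,
}
print(round(project_percent(example), 1))  # 88.0
```

Under this assumed mapping, the statistical methods category moves the result the most, since it carries the largest weight.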

Keep in mind that this is a data science class. In some cases standard methodology will be sufficient to answer the question of interest, in some cases you will need to go beyond the course, and in general the goal is to answer the question and provide an estimate of uncertainty. You may speak to your fellow students about specific statistical questions related to the projects, but the overall idea, analysis, and write-up should be your own individual work. You should cite any help you get from fellow students/TAs in your report in standard citation format.

Data analysis project options (Term 1)

You are required to pick one of the data analysis options below and perform that analysis over the course of the class.

Option 1

A major disaster is currently underway with Hurricane Harvey. Use social media data (Twitter, Instagram, Reddit, etc.) to identify the times and areas that are hardest hit by the hurricane.

Option 2

We teach thousands of students data analysis in online classes on the Coursera platform. Each of these classes includes a final project. Collect the write-ups people use for the Getting and Cleaning Data project and summarize the main sources of variation in how people complete the project.

Option 3

Perform an analysis of “data scientist” jobs listed on job boards and on the employment pages of major companies. What are the most common skills that employers look for? What are the most unusual skills that employers look for? What types of companies employ the most data scientists, and where are they?

Option 4

Perform an analysis of the statistical analyses in all published PLoS papers. What are the most common techniques? How do they vary by field? Are there any trends over the last 10-15 years?

Option 5

Compete for the Zillow prize and write up your results:

Option 6

You may petition to do your own analysis. You must submit your petition by the 3rd day of class (Wednesday, September 6). We will let you know before the 4th day of class whether you have approval. The minimum requirements for the project include:

  • You must be obtaining your own raw data
  • You must be doing your own data processing
  • The data must be available to be made public by end of class
  • You must specify your own question you are asking from the data
  • You need to provide reasonable justification you can answer that question with your data.

If you are looking for ideas consider these resources:

Data analysis project options (Term 2)

These projects are designed to build a data product. The goal of building a data product is to create a data analytic app that can be used by someone without R programming experience. When I polled the class, some people wanted to work in large groups and some in small groups. I have therefore created a list of potential projects (see below) that can be completed by groups of varying sizes. We will spend class on Nov 6 organizing into groups and getting set up.

Maryland lunch project

In this project I would like you to collect all of the lunch and breakfast menus of each school district in Maryland for one month (for elementary schools). I would like you to estimate the nutritional value and variety in the lunches. I would also like you to correlate these with income in the school district, education in the school district, and any other relevant demographics. You should make a Shiny app that can be used to explore the relationships between nutritional value, geography, and demographics.

Potential gender imbalances and the ASA

For this project, I’d like you to characterize potential imbalances in the representation of gender and of underrepresented minorities in the American Statistical Association (ASA). Collect information from the public website (or, with the help of the instructor, from the ASA). Then create an app that shows breakdowns of potential imbalances overall and over time. Compare these for various types of talks (e.g. invited vs. contributed) and for topics (e.g. genomics vs. causal inference).

Academic salary app

For this project I’d like you to scrape as much information as possible on salaries paid by public institutions to faculty, administrators and staff. Perform an analysis to associate job characteristics with salary. Create an app to allow people to explore where salaries are allocated with respect to geography and type of job.

Scholar profile app

Create an app where an individual can input their name and it pulls information for them on their citations (e.g. through Google Scholar), their NIH/NSF grants (e.g. through NIH RePORTER), and the classes they have taught (e.g. from myJHU). Then create a set of visualizations and metrics that show both cumulative values and the changes over just the past year.

MVA wait time app

The MVA of Maryland posts wait times every 5 minutes while the locations are open: Scrape wait times for several weeks to gather data on the distribution of wait times. Build an app that takes a user’s location and a specific time/date and lists which MVA location is best for them (you will need to define “best” by a combination of distance and wait time).
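One possible way to combine distance and wait time is to convert both to minutes and add them. The sketch below (in Python, purely for illustration) is only one such definition, not a required one. Every name in it is hypothetical, and it assumes straight-line (haversine) distance, a constant travel speed of 40 km/h, and an `expected_wait_min` value that you would estimate from the scraped data.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def best_location(user_lat, user_lon, locations, speed_kmh=40.0):
    """Return the location with the smallest travel time plus expected wait.

    Assumptions: straight-line travel at a constant speed, and each location
    is a dict with "lat", "lon", and "expected_wait_min" keys.
    """
    def total_minutes(loc):
        distance = haversine_km(user_lat, user_lon, loc["lat"], loc["lon"])
        travel_min = distance / speed_kmh * 60
        return travel_min + loc["expected_wait_min"]

    return min(locations, key=total_minutes)


# Placeholder locations with made-up coordinates and waits, for illustration only.
locations = [
    {"name": "A", "lat": 39.0, "lon": -76.6, "expected_wait_min": 60},
    {"name": "B", "lat": 39.0, "lon": -76.5, "expected_wait_min": 10},
]
print(best_location(39.0, -76.6, locations)["name"])  # B
```

In this made-up example the nearer location loses because its longer wait outweighs the extra travel time to the other one. A real app might instead use driving times from a routing service or weight the two components differently.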

Maryland poison control app

The Maryland department of poison control makes data available for calls to the poison control center by county over time in PDF format: Collect these data and then create an analysis that shows trends in poison control calls (reasons, people, etc.) over time. Create an app that allows people to explore the data for poison control calls in Maryland.

Supreme court topics app

Scrape data on decisions by the United States Supreme Court. Create a dynamic topic model, based on this text, that shows the types of topics being considered by the court and lets users see the topics on which each justice most frequently writes the decision.

Other Option

You may petition to do your own app and analysis. You must submit your petition by next week with an analysis plan and a reason. The minimum requirements for the project include:

  • You must have an analysis plan
  • You must describe a potential design of the app
  • The data must be available to be made public by end of class
  • You must specify your own question you are asking from the data
  • You need to provide reasonable justification of the utility of the app

Data analysis reviews

To keep you on track, starting in the 2nd week you will bring your current write-up (in .Rmd format or, later on, in PDF format) to the course lab. The labs will be run by John and Stephen. You will take turns projecting your write-up and getting detailed feedback from the instructors and the other students. You will receive credit for being prepared each week to present your data analysis even if you aren’t selected on that day.

Lab attendance and participation are mandatory. This is where you will learn how to write up and perform a data analysis. It is also the best way to get your “hands dirty” with the projects people are working on.

The times for the reviews each week will be:

  • Stephen: TBD
  • John: TBD

DataCamp modules

Please see individual lectures for assigned DataCamp modules and due dates. You must sign up with John to get on the DataCamp module list. Please visit the Advanced Data Science link for the assignments.