I believe the purpose of graduate education is to train you to be able to think for yourself and initiate and complete your own projects. I am super excited to talk to you about ideas, work out solutions with you, and help you to figure out statistical methods and/or data analysis. I don’t think that graduate school grades are important for this purpose. This means that I don’t care very much about graduate student grades.
That being said, I have to give you a grade, so the grades will be:
If you are getting a grade below a C, it is because you basically aren’t trying/working. I rarely give such grades out.
The percentages will be assigned in the following way:
You get the points for the DataCamp modules as long as you complete them before class starts (no exceptions without prior approval). For labs, you get 50% of the points for attendance and 50% for having your current version of the code up to date. The data analysis assignment will be graded on a 1-5 scale for each category described below, with the percentages assigned as described below.
(For more on my project philosophy see: http://bit.ly/wQT5uI)
Each student will be required to perform a data analysis project during the course of the class. Students will have the entire term to perform the data analysis. The project assignments will consist of a scientific description of the problem. Students are responsible for all stages of each data analysis from obtaining the data to the final report. At the conclusion of each analysis each student must turn in:
All documents should be submitted electronically. The grades will be broken down according to the following characterization of your data analysis.
Keep in mind that this is a data science class. In some cases standard methodology will be sufficient to answer the question of interest, in some cases you will need to go beyond the course, and in general the goal is to answer the question and provide an estimate of uncertainty. You may speak to your fellow students about specific statistical questions related to the projects, but the overall idea, analysis, and write-up should be your own individual work. You should cite any help you get from fellow students/TAs in your report in standard citation format.
You are required to pick one of the data analysis options below and perform that analysis over the course of the class.
A major disaster is currently underway with Hurricane Harvey. Use social media data (Twitter, Instagram, Reddit, etc.) to identify the times and areas hit hardest by the hurricane.
We teach thousands of students data analysis in online classes on the Coursera platform. Each of these classes includes a final project. Collect the write-ups people use for the Getting and Cleaning Data project and summarize the main sources of variation in how people complete the project.
Perform an analysis of “data scientist” jobs listed on job boards and on the employment pages of major companies. What are the most common skills that employers look for? What are the most unique skills that employers look for? What types of companies employ the most data scientists?
Perform an analysis of the statistical analyses in all published PLoS papers. What are the most common techniques? How do they vary by field? Are there any trends over the last 10-15 years?
Compete for the Zillow prize and write up your results: https://www.zillow.com/promo/Zillow-prize/
You may petition to do your own analysis. You must submit your petition by the 3rd day of class (Wednesday, September 6). We will let you know before the 4th day of class whether you have approval. The minimum requirements for the project include:
If you are looking for ideas consider these resources:
These projects are designed to build a data product. The goal of building a data product is to create a data analytic app that can be used by someone without R programming experience. A poll of the class showed that some people want to work in large groups and some in small groups. I have therefore created a list of potential projects (see below) that can be completed with groups of varying sizes. We will spend class on Nov 6 organizing into groups and getting set up.
In this project I would like you to collect all of the lunch and breakfast menus of each school district in Maryland for one month (for elementary schools). I would like you to estimate the nutritional value and variety in the lunches. I would also like you to correlate these with income in the school district, education in the school district, and any other relevant demographics. You should make a Shiny app that can be used to explore the relationships between nutritional value, geography, and demographics.
For this project, I’d like you to characterize potential gender and underrepresented-minority imbalances in the American Statistical Association (ASA): http://www.amstat.org/. Collect information from the public website (or with the help of the instructor, from the ASA). Then create an app that shows breakdowns of potential imbalances overall and over time. Compare these for various types of talks (e.g. invited vs contributed) and for topics (e.g. genomics versus causal inference).
For this project I’d like you to scrape as much information as possible on salaries paid by public institutions to faculty, administrators and staff. Perform an analysis to associate job characteristics with salary. Create an app to allow people to explore where salaries are allocated with respect to geography and type of job.
Create an app where an individual can input their name and it pulls information for them on their citations (e.g. through Google Scholar), their NIH/NSF grants (e.g. through NIH RePORTER) and the classes they have taught (e.g. from myJHU). Then create a set of visualizations and metrics that show both cumulative values and improvements over just the past year.
The MVA of Maryland posts wait times every 5 minutes while the locations are open: http://www.mva.maryland.gov/sebin/customerwaittimes/. Scrape wait times for several weeks to gather data on the distribution of wait times. Build an app that takes a user’s location and a specific time/date and lists which MVA location is best for them (you will need to define “best” by a combination of distance and wait time).
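One possible way to define “best” is to convert distance into an estimated travel time and add the expected wait, then pick the location with the smallest total. The sketch below is only an illustration of that idea, written in Python for brevity (the same logic carries over directly to R). The function names, the assumed average driving speed, the straight-line distance, and the example coordinates and wait times are all hypothetical and are not part of the assignment.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (straight-line) distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km is the mean Earth radius


def best_location(user, locations, km_per_hour=40.0):
    """Return the location with the smallest estimated total time in minutes.

    Total time = travel time (distance / assumed speed) + expected wait.
    km_per_hour is an assumed average driving speed, not a measured value.
    """
    def total_minutes(loc):
        dist = haversine_km(user[0], user[1], loc["lat"], loc["lon"])
        return dist / km_per_hour * 60 + loc["wait_min"]

    return min(locations, key=total_minutes)


# Made-up example data: two hypothetical locations with expected waits
# (in practice these would be estimated from the scraped wait times).
locations = [
    {"name": "A", "lat": 39.30, "lon": -76.60, "wait_min": 45.0},
    {"name": "B", "lat": 39.40, "lon": -76.60, "wait_min": 10.0},
]
user = (39.30, -76.60)  # hypothetical user position (lat, lon)

print(best_location(user, locations)["name"])
```

Other definitions are equally defensible, for example using driving time from a routing service instead of straight-line distance, or using an upper quantile of the wait-time distribution rather than its mean.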
The Maryland department of poison control makes data available for calls to the poison control center by county over time in PDF format: http://www.mdpoison.com/factsandreports/. Collect these data and then create an analysis that shows trends in poison control calls (reasons, people, etc.) over time. Create an app that allows people to explore the data for poison control calls in Maryland.
Scrape data on decisions by the United States Supreme Court. Create a dynamic topic model that shows the types of topics being considered by the court on the basis of this text, and that lets users see the topics on which each justice most frequently writes the decision.
You may petition to do your own app and analysis. You must submit your petition by next week with an analysis plan and a reason. The minimum requirements for the project include:
To keep you on track, starting in the 2nd week you will bring your current writeup (in .Rmd format or, later on, in PDF format) to the course lab. The labs will be run by John and Stephen. You will take turns projecting your labs and getting detailed feedback from the instructors and the other students. You will receive credit for being prepared each week to present your data analysis even if you aren’t selected on that day.
Lab attendance and participation are mandatory. This is where you will learn how to write up and perform a data analysis. It is also the best way to get your “hands dirty” with the projects people are working on.
The times for the reviews each week will be:
Please see individual lectures for assigned DataCamp modules and due dates. You must sign up with John to get on the DataCamp module list. Please visit the Advanced Data Science link for the assignments.