# Types of Data Science Questions

Jeffrey Leek
Johns Hopkins Bloomberg School of Public Health

## Types of Data Science Questions

In approximate order of difficulty

• Descriptive
• Exploratory
• Inferential
• Predictive
• Causal
• Mechanistic

## Descriptive analysis

Goal: Describe a set of data

• The first kind of data analysis performed
• Commonly applied to census data
• The description and interpretation are different steps
• Descriptions usually cannot be generalized without additional statistical modeling
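As a minimal sketch of a descriptive analysis, the snippet below summarizes a small data set without trying to generalize beyond it. The household sizes are hypothetical values invented for illustration, not real census figures.

```python
import statistics

# Hypothetical household sizes from a complete count of one small town
# (illustrative numbers only, not real census data).
household_sizes = [1, 2, 2, 3, 4, 4, 4, 5, 6]

# Describe the data; no claim is made about any other town.
summary = {
    "count": len(household_sizes),
    "mean": statistics.mean(household_sizes),
    "median": statistics.median(household_sizes),
    "mode": statistics.mode(household_sizes),
    "min": min(household_sizes),
    "max": max(household_sizes),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Interpreting these numbers (for example, why the typical household has four people) is a separate step from computing them.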

## Exploratory analysis

Goal: Find relationships you didn't know about

• Exploratory models are good for discovering new connections
• They are also useful for defining future studies
• Exploratory analyses are usually not the final say
• Exploratory analyses alone should not be used for generalizing/predicting
• Correlation does not imply causation
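One common exploratory step is to scan pairwise correlations for relationships worth a follow-up study. The sketch below assumes three hypothetical monthly series (the variable names and values are made up) and computes Pearson's r by hand so that it runs on any recent Python.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly observations (illustrative values only).
data = {
    "ice_cream_sales": [20, 25, 32, 41, 50, 58],
    "sunburn_cases": [3, 4, 6, 8, 11, 12],
    "umbrella_sales": [30, 27, 21, 15, 9, 6],
}

# Correlation for every pair of variables.
pairs = {
    (a, b): pearson(data[a], data[b])
    for a, b in combinations(data, 2)
}

# List the strongest relationships first.
for (a, b), r in sorted(pairs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

A strong correlation between ice cream sales and sunburn cases would be a lead to investigate, not evidence that one causes the other; a shared driver such as sunny weather could explain both.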

## Inferential analysis

Goal: Use a relatively small sample of data to say something about a bigger population

• Inference is commonly the goal of statistical models
• Inference depends heavily on both the population and the sampling scheme
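To make the sample-to-population step concrete, here is a rough sketch that estimates a population mean from a simple random sample and attaches a 95% confidence interval. The population is simulated (heights with an assumed mean of 170 and standard deviation of 10), and the interval uses a normal approximation, which is an assumption that is reasonable for a sample this large but not for very small ones.

```python
import random
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


rng = random.Random(42)

# Simulated population (hypothetical parameters chosen for illustration).
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Simple random sample: every member is equally likely to be drawn.
sample = rng.sample(population, 200)

low, high = mean_confidence_interval(sample)
print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"95% CI for population mean: ({low:.2f}, {high:.2f})")
print(f"true population mean: {statistics.mean(population):.2f}")
```

The interval is only trustworthy because the sample was drawn at random from the population of interest. If the sampling scheme favored, say, taller individuals, the same formula would produce a confident but biased answer.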

## Predictive analysis

Goal: To use the data on some objects to predict values for another object

• If $X$ predicts $Y$ it does not mean that $X$ causes $Y$
• Accurate prediction depends heavily on measuring the right variables
• Although some prediction models are better than others, more data combined with a simple model often works really well
• Prediction is very hard, especially about the future
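The idea of using data on some objects to predict a value for another object can be sketched with a simple model: an ordinary least-squares line fitted to known cases and then applied to a new one. The floor areas and prices below are hypothetical, and `fit_line` and `predict` are helper names introduced here for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predicted value for a new object with measurement x."""
    slope, intercept = model
    return intercept + slope * x


# Hypothetical training data: floor area (m^2) and sale price (thousands).
areas = [50, 60, 80, 100, 120]
prices = [150, 175, 240, 290, 355]

model = fit_line(areas, prices)

# Predict the price of a home that was not in the training data.
new_area = 90
print(f"predicted price for {new_area} m^2: {predict(model, new_area):.1f}")
```

The fitted slope says that larger homes tend to sell for more in this data; it does not say that adding floor area to a given home would raise its price by that amount.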

## Causal analysis

Goal: To find out what happens to one variable when you make another variable change.

• Usually randomized studies are required to identify causation
• There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
• Causal relationships are usually identified as average effects, but may not apply to every individual
• Causal models are usually the "gold standard" for data analysis
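A small simulation can illustrate why randomization matters. In the sketch below, everything is assumed for illustration: the true average effect is set to 2.0, and a hidden confounder raises both the outcome and, in the observational setting, the chance of being treated. The effect is estimated as a simple difference in group means.

```python
import random


def difference_in_means(treated, outcomes):
    """Average outcome of treated units minus average outcome of controls."""
    t = [y for d, y in zip(treated, outcomes) if d]
    c = [y for d, y in zip(treated, outcomes) if not d]
    return sum(t) / len(t) - sum(c) / len(c)


def simulate(randomized, n=20_000, true_effect=2.0, seed=0):
    """Estimate the treatment effect under randomized or self-selected treatment."""
    rng = random.Random(seed)
    treated, outcomes = [], []
    for _ in range(n):
        confounder = rng.gauss(0, 1)  # hidden variable that affects the outcome
        if randomized:
            # Coin flip: treatment is independent of the confounder.
            d = rng.random() < 0.5
        else:
            # Self-selection: high-confounder units are treated more often.
            d = confounder + rng.gauss(0, 1) > 0
        y = true_effect * d + 3.0 * confounder + rng.gauss(0, 1)
        treated.append(d)
        outcomes.append(y)
    return difference_in_means(treated, outcomes)


randomized_estimate = simulate(randomized=True)
observational_estimate = simulate(randomized=False)

print(f"true average effect:    2.00")
print(f"randomized estimate:    {randomized_estimate:.2f}")
print(f"observational estimate: {observational_estimate:.2f}")
```

Under these assumptions the randomized estimate should land close to the true effect, while the observational estimate is inflated by the confounder. Both numbers are averages over the whole group and say nothing certain about any single individual.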