The components of tidy data

Jeffrey Leek, Assistant Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health

The four things you should have

The strange binary file your measurement machine spits out
The unformatted Excel file with 10 worksheets the company you contracted with sent you
The complicated JSON data you got from scraping the Twitter API
The hand-entered numbers you collected looking through a microscope

You know the raw data is in the right format if you

Each variable you measure should be in one column
Each different observation of that variable should be in a different row
There should be one table for each "kind" of variable
If you have multiple tables, they should include a column in the table that allows them to be linked

Some other important tips

Information about the variables (including units!) in the data set not contained in the tidy data
Information about the summary choices you made
Information about the experimental study design you used

Some other important tips

A common format for this document is a Word/text file.
There should be a section called "Study design" that has a thorough description of how you collected the data.
There must be a section called "Code book" that describes each variable and its units.

In some cases it will not be possible to script every step. In that case you should provide instructions like:

Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3
Step 2 - run the software separately for each sample
Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set