Dependencies

This document depends on the following packages:

  library(devtools)
  library(Biobase)

To install these packages you can use the code (or if you are compiling the document, remove the eval=FALSE from the chunk.)

install.packages(c("devtools"))
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biobase"))

General principles

Load some data

We will use this expression set to look at how we use plots and tables to check for different characteristics

con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
pdata=pData(bm)
edata=as.data.frame(exprs(bm))
fdata = fData(bm)
ls()
## [1] "bm"           "bodymap.eset" "con"          "edata"       
## [5] "fdata"        "pdata"        "tropical"

Skewed distributions

We would like continuous data to be nice and summetric like a normal distribution for two reasons:(1) plots are easier to see this way and (2) most statistical methods are designed to work better for non-skewed data.

hist(rnorm(1000),col=2)

Realistically though most genomic data is skewed

hist(edata[,1],col=2,breaks=100)

One way to address this skew is to use a transformation. One common transformation is the log transform.

hist(log(edata[,1]),col=2,breaks=100)

One thing to be careful with is that values of zero become -inf when you apply the log transform because the log of zero isn’t defined.

min(log(edata))
## [1] -Inf

You can remove this problem by adding a small number to all the counts before taking the log. This doesn’t change the overall distribution much but resolves the zero problem

min(log(edata[,1] + 1))
## [1] 0
hist(log(edata[,1] + 1),breaks=100,col=2)

A common choice is to use the log2 transform. The reason is that data on the log2 scale can be expressed as fold changes. If log2(x) - log2(y) = z then there is a z fold change difference between x and y.

hist(log2(edata[,1] + 1),breaks=100,col=2)

Let’s zoom in on the values greater than zero (we’ll come back to the zero values in a minute)

hist(log2(edata[,1] + 1),breaks=100,col=2,xlim=c(1,15),ylim=c(0,400))