This document depends on the following packages:

```
library(devtools)
library(Biobase)
```

To install these packages you can use the code (or if you are compiling the document, remove the `eval=FALSE`

from the chunk.)

```
install.packages(c("devtools"))
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biobase"))
```

- Make sure data is on the right scale for
- Visualization
- Statistical modeling

We will use this expression set to look at how we use plots and tables to check for different characteristics

```
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
pdata=pData(bm)
edata=as.data.frame(exprs(bm))
fdata = fData(bm)
ls()
```

```
## [1] "bm" "bodymap.eset" "con" "edata"
## [5] "fdata" "pdata" "tropical"
```

We would like continuous data to be nice and summetric like a normal distribution for two reasons:(1) plots are easier to see this way and (2) most statistical methods are designed to work better for non-skewed data.

`hist(rnorm(1000),col=2)`

Realistically though most genomic data is skewed

`hist(edata[,1],col=2,breaks=100)`

One way to address this skew is to use a transformation. One common transformation is the `log`

transform.

`hist(log(edata[,1]),col=2,breaks=100)`

One thing to be careful with is that values of zero become `-inf`

when you apply the log transform because the log of zero isn’t defined.

`min(log(edata))`

`## [1] -Inf`

You can remove this problem by adding a small number to all the counts before taking the log. This doesn’t change the overall distribution much but resolves the zero problem

`min(log(edata[,1] + 1))`

`## [1] 0`

`hist(log(edata[,1] + 1),breaks=100,col=2)`

A common choice is to use the `log2`

transform. The reason is that data on the `log2`

scale can be expressed as fold changes. If `log2(x) - log2(y) = z`

then there is a `z`

fold change difference between `x`

and `y`

.

`hist(log2(edata[,1] + 1),breaks=100,col=2)`

Let’s zoom in on the values greater than zero (we’ll come back to the zero values in a minute)

`hist(log2(edata[,1] + 1),breaks=100,col=2,xlim=c(1,15),ylim=c(0,400))`