Dependencies

This document depends on the following packages:

  library(devtools)
  library(Biobase)
  library(preprocessCore)

To install these packages you can use the code (or if you are compiling the document, remove the eval=FALSE from the chunk.)

install.packages(c("devtools"))
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biobase","preprocessCore"))

General principles

Load some data

We will use this expression set that combines two studies Transcriptome genetics using second generation sequencing in a Caucasian population. and Understanding mechanisms underlying human gene expression variation with RNA sequencing.. These studies are different populations but we counted the same genes for both. Then we’ll explore the differences.

con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)
ls()
## [1] "con"           "edata"         "fdata"         "montpick.eset"
## [5] "mp"            "pdata"         "tropical"

Show distributions for log2 counts for several samples

Here we show density plots for the first 20 samples

edata = log2(edata + 1)
edata = edata[rowMeans(edata) > 3, ]
colramp = colorRampPalette(c(3,"white",2))(20)
plot(density(edata[,1]),col=colramp[1],lwd=3,ylim=c(0,.30))
for(i in 2:20){lines(density(edata[,i]),lwd=3,col=colramp[i])}

Quantile normalization

Now we perform quantile normalization to make the distributions the same across samples. Note that near the tail the distributions aren’t perfectly the same, but for the most part the distributions land right on top of each other.

norm_edata = normalize.quantiles(as.matrix(edata))
plot(density(norm_edata[,1]),col=colramp[1],lwd=3,ylim=c(0,.20))
for(i in 2:20){lines(density(norm_edata[,i]),lwd=3,col=colramp[i])}

Matching distributions leaves variability

Normalization removes bulk differences due to technology. But there still may be differences you don’t want after normalization. The only way to figure this out is to check. For example if we plot the quantile normalized data with the first

plot(norm_edata[1,],col=as.numeric(pdata$study))