Dependencies

This document depends on the following packages:

  library(devtools)
  library(Biobase)

To install these packages you can use the code (or if you are compiling the document, remove the eval=FALSE from the chunk.)

install.packages(c("devtools"))
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biobase"))

General principles

Load some data

We will use this expression set that combines two studies Transcriptome genetics using second generation sequencing in a Caucasian population. and Understanding mechanisms underlying human gene expression variation with RNA sequencing.. These studies are different populations but we counted the same genes for both. Then we’ll explore the differences.

con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)
ls()
## [1] "con"           "edata"         "fdata"         "montpick.eset"
## [5] "mp"            "pdata"         "tropical"

Calculate the singular vectors

Here we calculate the singular vectors:

edata = edata[rowMeans(edata) > 100, ]
edata = log2(edata + 1)
edata_centered = edata - rowMeans(edata)
svd1 = svd(edata_centered)
names(svd1)
## [1] "d" "u" "v"

Look at the percent variance explained

The percent of variance explained is given by \(\frac{d_{ii}}{\sum_{j}d_{jj}^2}\)

plot(svd1$d,ylab="Singular value",col=2)

plot(svd1$d^2/sum(svd1$d^2),ylab="Percent Variance Explained",col=2)

Plot top two principal components

par(mfrow=c(1,2))
plot(svd1$v[,1],col=2,ylab="1st PC")
plot(svd1$v[,2],col=2,ylab="2nd PC")

Plot PC1 vs. PC2

A very common plot is to plot PC1 versus PC2 to see if you can see any “clusters” or “groups”.

plot(svd1$v[,1],svd1$v[,2],col=2,ylab="2nd PC",xlab="1st PC")

One thing you can do is color them by different variables to see if clusters stand out.

plot(svd1$v[,1],svd1$v[,2],ylab="2nd PC",
     xlab="1st PC",col=as.numeric(pdata$study))