processgeuvadis.Rmd

Load packages

You will need the RSkittleBrewer package for this vignette to run. Installation instructions are available here:

https://github.com/alyssafrazee/RskittleBrewer

You will also need the “for-bioc” branch of the Ballgown package available for install from here:

https://github.com/alyssafrazee/ballgown/tree/alpha

And the processed GEUVADIS fpkm data from here:

https://github.com/alyssafrazee/ballgown_code

You will also need R version 3.1.0 or greater and Bioconductor 3.0 or greater. The zebrafishRNASeq package might need to be installed from source. These analyses are based on the devel version of sva (version 3.11.2 or greater).

library(zebrafishRNASeq)
library(RSkittleBrewer)
library(genefilter)
library(Biobase)
library(ballgown)
library(RUVSeq)
library(edgeR)
library(sva)
library(ffpe)
library(RColorBrewer)
library(corrplot)
library(limma)
trop = RSkittleBrewer('tropical')

Load the data

You will need to download the GEUVADIS ballgown object from this site: https://github.com/alyssafrazee/ballgown_code

load("fpkm.rda")
pd = ballgown::pData(fpkm)
pd$dirname = as.character(pd$dirname)
ss = function(x, pattern, slot=1,...) sapply(strsplit(x,pattern,...), "[", slot)
pd$IndividualID = ss(pd$dirname, "_", 1)
tfpkm = expr(fpkm)$trans

Subset to non-duplicates

You will need the GEUVADIS quality control information and population information available from these sites: https://www.dropbox.com/s/rg63qtuws2liz9r/GD667.QCstats.masterfile.txt, https://www.dropbox.com/s/woacfjxql7gxhnt/pop_data_withuniqueid.txt.

pheno = read.delim("GD667.QCstats.masterfile.txt", as.is=TRUE)
m = read.delim("pop_data_withuniqueid.txt",as.is=TRUE)
pd$SampleID = m$sample_id[match(pd$dirname, m$folder_id)]
pd$UseThisDup = pheno$UseThisDuplicate[match(pd$SampleID, rownames(pheno))]
pd$batch = pheno$RNAExtractionBatch[match(pd$SampleID, rownames(pheno))]
pd$lab = pheno$SeqLabNumber[match(pd$SampleID, rownames(pheno))]
pd$libprepdate = pheno$LibraryPrepDate[match(pd$SampleID, rownames(pheno))]
## drop duplicates for this

pd = pd[pd$UseThisDup == 1,]

## subset the transcript expression data to match pheno data

colnames(tfpkm) = ss(colnames(tfpkm),"\\.",2)
tfpkm = tfpkm[,pd$dirname]

## Remove low expression transcripts
rowmfpkm = rowMeans(log(tfpkm+1)) 
keepIndex=which(rowmfpkm > 1)
tfpkm = tfpkm[keepIndex,]
tfpkm = as.matrix(tfpkm)

Run the different analyses and compare to lab

mod1 = model.matrix(~ pd$population)
mod0 = cbind(mod1[,1])

## Estimate batch with svaseq (unsupervised)
batch_unsup_sva = svaseq(tfpkm,mod1,mod0,n.sv=1)$sv

## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5

## Estimate batch with pca
ldat0 = log(tfpkm + 1)
batch_pca = svd(ldat0 - rowMeans(ldat0))$v[,1]

## Estimate batch with ruv (residuals)
## this procedure follows the RUVSeq vignette
## http://www.bioconductor.org/packages/devel/bioc/vignettes/RUVSeq/inst/doc/RUVSeq.pdf

x <- as.factor(pd$population)
design <- model.matrix(~x)
y <- DGEList(counts=tfpkm, group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
res <- residuals(fit, type="deviance")
seqUQ <- betweenLaneNormalization(tfpkm, which="upper")
controls = rep(TRUE,dim(tfpkm)[1])
batch_ruv_res = RUVr(seqUQ,controls,k=1,res)$W

## Estimate batch with ruv empirical controls
## this procedure follows the RUVSeq vignette
## http://www.bioconductor.org/packages/devel/bioc/vignettes/RUVSeq/inst/doc/RUVSeq.pdf

y <- DGEList(counts=tfpkm, group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)

controls = rank(lrt$table$LR) <= 400
batch_ruv_emp <- RUVg(tfpkm, controls, k=1)$W

Correlation between estimates

batchEstimates = cbind(batch_unsup_sva,batch_pca,batch_ruv_res,batch_ruv_emp)
colnames(batchEstimates) = c("usva","pca","ruvres","ruvemp")

corr = abs(cor(batchEstimates))
cols = colorRampPalette(c(trop[2],"white",trop[1]))
corrplot(corr,method="ellipse",type="lower",col=cols(100),tl.pos="d")

plot of chunk corr

Boxplot the results by lab

boxplot(batch_unsup_sva ~ pd$lab,ylab="Unsupervised SVA",col=trop[1])

plot of chunk plotResults

boxplot(batch_pca ~ pd$lab,ylab="Unsupervised PCA",col=trop[2])

plot of chunk plotResults

boxplot(batch_ruv_res ~ pd$lab,ylab="Unsupervised RUV residuals",col=trop[3])

plot of chunk plotResults

boxplot(batch_ruv_emp ~ pd$lab,ylab="Unsupervised RUV control probes",col=trop[4])

plot of chunk plotResults

ANOVA between estimates and lab

anova(lm(batch_unsup_sva ~ pd$lab))

## Analysis of Variance Table
## 
## Response: batch_unsup_sva
##            Df Sum Sq Mean Sq F value Pr(>F)    
## pd$lab      6  0.864  0.1439     482 <2e-16 ***
## Residuals 457  0.136  0.0003                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(batch_pca ~ pd$lab))

## Analysis of Variance Table
## 
## Response: batch_pca
##            Df Sum Sq Mean Sq F value Pr(>F)    
## pd$lab      6  0.857  0.1428     456 <2e-16 ***
## Residuals 457  0.143  0.0003                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(batch_ruv_res~ pd$lab))

## Analysis of Variance Table
## 
## Response: batch_ruv_res
##            Df Sum Sq Mean Sq F value Pr(>F)    
## pd$lab      6  0.583  0.0971     106 <2e-16 ***
## Residuals 457  0.417  0.0009                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(batch_ruv_emp~ pd$lab))

## Analysis of Variance Table
## 
## Response: batch_ruv_emp
##            Df Sum Sq Mean Sq F value Pr(>F)    
## pd$lab      6  0.712  0.1187     189 <2e-16 ***
## Residuals 457  0.288  0.0006                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Differential expression results

dge <- DGEList(counts=tfpkm)
dge <- calcNormFactors(dge)
catplots = tstats = vector("list",6)
adj = c("+ pd$lab","+ batch_pca", "+ batch_unsup_sva",
        "+ batch_ruv_res", "+ batch_ruv_emp", "")

for(i in 1:6){
  design = model.matrix(as.formula(paste0("~ pd$population",adj[i])))
  v <- voom(dge,design,plot=FALSE)
  fit <- lmFit(v,design)
  fit <- eBayes(fit)
  tstats[[i]] = abs(fit$t[,2])
  names(tstats[[i]]) = 1:length(tstats[[i]])
  catplots[[i]] = CATplot(-rank(tstats[[i]]),-rank(tstats[[1]]),maxrank=3000,make.plot=F)
  cat(i)
}

## 123456

plot(catplots[[2]],ylim=c(0,1),col=trop[1],lwd=3,type="l",ylab="Concordance Between True rank and rank with different methods",xlab="Rank")
lines(catplots[[3]],col=trop[2],lwd=3)
lines(catplots[[4]],col=trop[3],lwd=3,lty=2)
lines(catplots[[5]],col=trop[3],lwd=3,lty=1)
lines(catplots[[6]],col=trop[4],lwd=3,lty=2)


legend(200,0.5,legend=c("PCA", "Unsup. svaseq","RUV Res.", "RUV Emp.", "No adjustment"),col=trop[c(1,2,3,3,4)],lty=c(1,1,2,1,2),lwd=3)

plot of chunk comparede

Session Information

sessionInfo()

## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] zebrafishRNASeq_0.99.2   sva_3.11.3              
##  [3] RUVSeq_0.99.2            RSkittleBrewer_1.1      
##  [5] RColorBrewer_1.0-5       mgcv_1.7-29             
##  [7] nlme_3.1-117             genefilter_1.47.5       
##  [9] ffpe_1.9.0               TTR_0.22-0              
## [11] xts_0.9-7                zoo_1.7-11              
## [13] edgeR_3.7.2              limma_3.21.6            
## [15] EDASeq_1.99.1            ShortRead_1.23.11       
## [17] GenomicAlignments_1.1.11 Rsamtools_1.17.27       
## [19] GenomicRanges_1.17.17    GenomeInfoDb_1.1.6      
## [21] devtools_1.5             corrplot_0.73           
## [23] Biostrings_2.33.10       XVector_0.5.6           
## [25] IRanges_1.99.15          S4Vectors_0.0.8         
## [27] BiocParallel_0.7.2       Biobase_2.25.0          
## [29] BiocGenerics_0.11.2      ballgown_0.99.0         
## [31] knitr_1.6               
## 
## loaded via a namespace (and not attached):
##  [1] affy_1.43.2               affyio_1.33.0            
##  [3] annotate_1.43.4           AnnotationDbi_1.27.7     
##  [5] aroma.light_2.1.0         base64_1.1               
##  [7] BatchJobs_1.2             BBmisc_1.6               
##  [9] beanplot_1.1              BiocInstaller_1.15.5     
## [11] biomaRt_2.21.0            biovizBase_1.13.7        
## [13] bitops_1.0-6              brew_1.0-6               
## [15] BSgenome_1.33.8           bumphunter_1.5.3         
## [17] cluster_1.15.2            codetools_0.2-8          
## [19] colorspace_1.2-4          cummeRbund_2.7.2         
## [21] DBI_0.2-7                 DESeq_1.17.0             
## [23] dichromat_2.0-0           digest_0.6.4             
## [25] doRNG_1.6                 evaluate_0.5.5           
## [27] fail_1.2                  foreach_1.4.2            
## [29] formatR_0.10              Formula_1.1-1            
## [31] geneplotter_1.43.0        GenomicFeatures_1.17.9   
## [33] grid_3.1.0                Gviz_1.9.9               
## [35] Hmisc_3.14-4              htmltools_0.2.4          
## [37] httr_0.3                  hwriter_1.3              
## [39] illuminaio_0.7.0          iterators_1.0.7          
## [41] KernSmooth_2.23-12        lattice_0.20-29          
## [43] latticeExtra_0.6-26       locfit_1.5-9.1           
## [45] lumi_2.17.0               MASS_7.3-33              
## [47] Matrix_1.1-3              matrixStats_0.8.14       
## [49] mclust_4.3                memoise_0.2.1            
## [51] methylumi_2.11.1          minfi_1.11.7             
## [53] multtest_2.21.0           munsell_0.4.2            
## [55] nleqslv_2.1.1             nor1mix_1.1-4            
## [57] pkgmaker_0.22             plyr_1.8.1               
## [59] preprocessCore_1.27.0     R.methodsS3_1.6.1        
## [61] R.oo_1.18.0               Rcpp_0.11.1              
## [63] RCurl_1.95-4.1            registry_0.2             
## [65] reshape_0.8.5             rmarkdown_0.2.46         
## [67] rngtools_1.2.4            RSQLite_0.11.4           
## [69] rtracklayer_1.25.9        scales_0.2.4             
## [71] sendmailR_1.1-2           sfsmisc_1.0-25           
## [73] siggenes_1.39.0           splines_3.1.0            
## [75] stats4_3.1.0              stringr_0.6.2            
## [77] survival_2.37-7           tools_3.1.0              
## [79] VariantAnnotation_1.11.12 whisker_0.3-2            
## [81] XML_3.98-1.1              xtable_1.7-3             
## [83] yaml_2.1.11               zlibbioc_1.11.1

processgeuvadis.Rmd

Jeff Leek

Load packages

Load the data

Subset to non-duplicates

Run the different analyses and compare to lab

Correlation between estimates

Boxplot the results by lab

ANOVA between estimates and lab

Differential expression results

Session Information