Code and data for
Tackling the widespread and critical impact of batch effects in high-throughput data
The R scripts included below require various Bioconductor and CRAN packages. Having a full installation of Bioconductor will make this much easier. Also CRAN packages corpcor and RColorBrewer.
Example exploratory analysis
- This example uses raw data and annotation in this archive [188 MB]
- You need the myplclust function in this script.
- Then you can run the example in this script.
Affymetrix Expression Data from Spielman et al.
- We used processed data as described in the paper "Akey, J.M., Biswas, S., Leek, J.T. & Storey, J.D. On the design and analysis of gene expression studies in human populations. Nat Genet 39, 807-8; author reply 808-9 (2007).". All the data files and scripts needed are in this archive
Affymetrix Expression Data from Dyrskjot et al.
- The raw data and annotation is in this archive [188 MB]
- Use the bladder.R script to preprocess data. This will produce an rda file needed in the next step.
- Run the figure1.R script to produce Figure 1.
- Run the bladder-analysis.R script to obtain results in table.
Mass Spec Data
- This document explains the analysis performed here in detail.
- This Sweave file was used to produce the pdf above.
- You can download the rather large R object that it builds.
- Finally, to produce results seen Table 1 you run this script. You will need these csv files which are produced by the Sweave document above.
Copy Number Data
- You can obtain GWAS data from dbGap... but it is not easy. We can't post it here.
- The HapMap data is available from HapMap.
- This R package contains the code necessary to produce results in Table 1. You will find two vignettes in inst/doc: snpAnalysis.Rnw and bip.Rnw. snpAnalysis.Rnw is the analysis of the HapMap phase 3 data. bip.Rnw is the analysis of the european ancestry bipolar data.
TCGA Affymetrix Expression Data
- Download raw data.
- Use the join-affy.R script to preprocess data. You will have to edit script so that correct path for raw data is used. This will produce an rda file needed in the next step.
- Run the tcgaaffy.R script to obtain results in table.
TCGA Agilent Expression Data
- Download raw data.
- Use the join-agilent.R script to preprocess data. You will have to edit script so that correct path for raw data is used. This will produce an rda file needed in the next step.
- Run the tcgaagilent.R script to obtain results in table.
TCGA Illumina Methylation Data
- Download raw data.
- Use the join-meth.R script to preprocess data. You will have to edit script so that correct path for raw data is used. This will produce an rda file needed in the next step.
- Run the tcgamethylation.R script to obtain resultsin table.
Sec-gen sequencing data
- This archive contains the processed data, annotation, and scripts to produce Figure 2 and table results for sequencing.
- To processed data is coverage and was computed from the raw data available from the amazon cluster. This was a computationally intensive job and we ran it on the amazon cluster. The archive above has manageable tables produced from these computations. Specific instructions on how we computed coverage follow:
The table files are constructed by obtaining the .bam files for all of the relevant HapMap individuals, parsing the alignment coordinates for all aligned reads using SAMtools and perl, placing each read into a non-overlapping 1k genomic bin according to the "leftmost" position covered with respect to the forward reference strand, then tallying count totals for each bin and combining the totals into a table. In the table each successive row is a successive bin and each column is a distinct HapMap sample.
The Myrna command used (the paths are specific to our directory structure -- other users must insert their own):
perl myrna.pl \
-input s3n://myrna-emr/example/1kgenome/batch_effect/preproc \
-output s3n://ec2-jeff-tmp/example/1kgenome/batch_effect2/output \
-count s3n://ec2-jeff-tmp/example/1kgenome/batch_effect2/output/count \
-ditch-alignments \
-nodes $NODES \
-wait-on-fail \
-sam-passthrough \
-sam-lab-rg \
-influence 1 \
-from5prime \
-bin 1000 \
-bypass-pvals \
$*
The relevant output is in the path specified with the -count option.
Change of correlation structure
Finally, this python code used to produce the supplemental plot related to genes that change relative rank across batch. These are used to produce Figure 3.