This protocol was contributed by L. Collado-Torres as is available at jtleek.com/protocols/.
This short protocol shows how you can get a gene-level count matrix from BigWig files. We'll work with example data from derfinderData
(Collado-Torres, Jaffe, and Leek, 2015) and generate the count matrix using some custom code.
First, lets locate a set of BigWig files from the amygdala.
## Load data
library('derfinderData')
## Locate bigWig files
files <- dir(system.file('extdata', 'AMY', package = 'derfinderData'), full.names = TRUE)
names(files) <- gsub('\\.bw', '', dir(system.file('extdata', 'AMY', package = 'derfinderData')))
head(files)
## HSB113
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB113.bw"
## HSB123
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB123.bw"
## HSB126
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB126.bw"
## HSB130
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB130.bw"
## HSB135
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB135.bw"
## HSB136
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB136.bw"
Next, lets define our set of genes.
library('TxDb.Hsapiens.UCSC.hg19.knownGene')
## example data only has coverage for chr21, otherwise use all
txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, 'chr21')
## Subset set of genes if you are not using all of them
ex <- exonsBy(txdb, by = 'gene')
We can see that it's a GRangesList.
class(ex)
## [1] "GRangesList"
## attr(,"package")
## [1] "GenomicRanges"
ex
## GRangesList object of length 296:
## $100129027
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## |
## [1] chr21 [47247755, 47251155] - | 260437
## [2] chr21 [47252522, 47252572] - | 260438
## [3] chr21 [47256200, 47256333] - | 260439
##
## $100131902
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## [1] chr21 [31661463, 31661832] - | 259463
##
## $100132288
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## [1] chr21 [9907189, 9907492] - | 259089
## [2] chr21 [9907189, 9908432] - | 259090
## [3] chr21 [9909047, 9909277] - | 259091
## [4] chr21 [9915250, 9916547] - | 259092
## [5] chr21 [9966322, 9966380] - | 259094
## [6] chr21 [9968516, 9968593] - | 259095
##
## ...
## <293 more elements>
## -------
## seqinfo: 1 sequence from hg19 genome
We can use a similar approach to bigwig_DEanalysis but it's a little bit more complicated because we are dealing with a GRangesList instead of a GRanges object.
First, we can create an exon level count matrix using rtracklayer
(Lawrence, Gentleman, and Carey, 2009).
## Import data and create exon count matrix
library('rtracklayer')
bw <- BigWigFileList(files)
exons <- unlist(ex)
counts_exons <- matrix(NA, nrow = length(exons), ncol = length(bw))
colnames(counts_exons) <- names(bw)
for(i in seq_len(length(bw))) {
## For more than one chr, you'll need to change this a bit
coverage <- import(bw[[i]], as = 'RleList')$chr21
counts_exons[, i] <- sum(Views(coverage, ranges(exons)))
## For more than one chr, you can use this code per chr or...
## something like this.
# coverage <- import(bw[[i]], as = 'RleList')
# for(chr in unique(seqnames(exons))) {
# counts_exons[seqnames(exons) == chr, i] <- sum(Views(coverage[[chr]], ranges(exons[seqnames(exons) == chr])))
# }
}
## Divide by read length and round to integer numbers
counts_exons <- round(counts_exons / 76, 0)
## Lets explore the results
dim(counts_exons)
## [1] 2822 12
head(counts_exons[, 1:6])
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0
## [5,] 0 1 3 1 3 1
## [6,] 0 3 7 5 6 3
We can then summary the exon count matrix into a gene count matrix.
## Create gene count matrix
counts_genes <- matrix(NA, nrow = length(ex), ncol = length(bw))
colnames(counts_genes) <- names(bw)
for(i in seq_len(length(bw))) {
counts_genes[, i] <- sapply(split(counts_exons[, i], names(exons)), sum)
}
## Add gene names
rownames(counts_genes) <- names(ex)
## Explore the results
dim(counts_genes)
## [1] 296 12
head(counts_genes[, 1:6])
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136
## 100129027 0 0 0 0 0 0
## 100131902 0 0 0 0 0 0
## 100132288 0 8 16 9 21 5
## 100133286 5 11 16 27 7 21
## 100151643 0 0 0 0 0 0
## 100288287 0 0 0 0 0 0
If you are interested in normalizing the gene counts to those from a given library size, say 40 million, you can use code like this to do so. You will need the total number of mapped reads. If you don't have that number, you can either calculate the sum of the coverage by sample and divide it by the read length or use the gene counts sum (by sample).
## You might also want to normalize to a given library size, say 40 million
# here is a proxy for total mapped reads, although you might
# have that data somewhere else
totalMapped <- colSums(counts_genes)
norm_counts <- round(counts_genes / totalMapped * 4e7, 0)
head(norm_counts[, 1:6])
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136
## 100129027 0 0 0 0 0 0
## 100131902 0 0 0 0 0 0
## 100132288 0 37480 84623 47443 98384 26445
## 100133286 24021 61324 87217 129714 39024 114473
## 100151643 0 0 0 0 0 0
## 100288287 0 0 0 0 0 0
Other approaches for creating a gene-level count matrix from BigWig files could involve using GenomicAlignments::summarizeOverlaps()
(Lawrence, Huber, Pagès, Aboyoun, et al., 2013) although the score column created by rtracklayer::import.bw()
gets ignored and you'll have to deal with it.
You could also use genomic state objects from derfinder
(Collado-Torres, Frazee, Love, Irizarry, et al., 2015). However, you have to deal with the complexity of exon overlaps and some missing ids as shown below.
library('derfinder')
## Check makeGenomicState()
## Exons from chr21 (hg19) included in the package
genomicState$fullGenome[ genomicState$fullGenome$theRegion == 'exon' ]
## GRanges object with 2658 ranges and 4 metadata columns:
## seqnames ranges strand | theRegion
## |
## 1 chr21 [9690071, 9690100] + | exon
## 2 chr21 [9692178, 9692207] + | exon
## 3 chr21 [9711935, 9712038] + | exon
## 5 chr21 [9746121, 9746151] + | exon
## 7 chr21 [9748830, 9748899] + | exon
## ... ... ... ... ... ...
## 4985 chr21 [47740937, 47744424] - | exon
## 4986 chr21 [48018531, 48019416] - | exon
## 4988 chr21 [48020195, 48020288] - | exon
## 4990 chr21 [48022191, 48022329] - | exon
## 4992 chr21 [48024926, 48025035] - | exon
## tx_id tx_name
##
## 1 72512 uc002zkg.1
## 2 72513 uc021wgt.1
## 3 72514 uc011abu.2
## 5 72514 uc011abu.2
## 7 72514 uc011abu.2
## ... ... ...
## 4985 73460,73461,73462,... uc002zjf.3,uc002zjg.1,uc010gqj.2,...
## 4986 73464,73465 uc002zju.1,uc002zjv.1
## 4988 73465 uc002zjv.1
## 4990 73464,73465 uc002zju.1,uc002zjv.1
## 4992 73464,73465 uc002zju.1,uc002zjv.1
## gene
##
## 1
## 2
## 3
## 5
## 7
## ... ...
## 4985 194
## 4986 227
## 4988 227
## 4990 227
## 4992 227
## -------
## seqinfo: 1 sequence from hg19 genome
## Load hg19 full file
load("/Users/lcollado/Dropbox/JHSPH/Code/derSoftware/GenomicState.Hsapiens.UCSC.hg19.knownGene.rda")
## Some are NA, some have more than 1 gene id
GenomicState.Hsapiens.UCSC.hg19.knownGene$codingGenome[ GenomicState.Hsapiens.UCSC.hg19.knownGene$codingGenome$theRegion == 'exon' ]$gene
## CharacterList of length 217253
## [[1]] 100287102
## [[2]] 100287102
## [[3]] 100287102
## [[4]] 79501
## [[5]]
## [[6]]
## [[7]] 100133331 100132062
## [[8]] 729759
## [[9]]
## [[10]]
## ...
## <217243 more elements>
## Complicated case
GenomicState.Hsapiens.UCSC.hg19.knownGene$fullGenome[ GenomicState.Hsapiens.UCSC.hg19.knownGene$fullGenome$theRegion == 'exon' ][10]
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | theRegion tx_id
## |
## 15 chr1 [324439, 328581] + | exon 9,7,8,...
## tx_name gene
##
## 15 uc021oeh.1,uc009vjk.2,uc001aau.3,... 100133331,100132062,NA
## symbol
##
## 15 LOC100133331,LOC100132062,NA
## -------
## seqinfo: 24 sequences from hg19 genome
Citations made with knitcitations
(Boettiger, 2015).
[1] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.6. 2015. URL: http://CRAN.R-project.org/package=knitcitations.
[2] L. Collado-Torres, A. C. Frazee, M. I. Love, R. A. Irizarry, et al. “derfinder: Software for annotation-agnostic RNA-seq differential expression analysis”. In: bioRxiv (2015). DOI: 10.1101/015370. URL: http://www.biorxiv.org/content/early/2015/02/19/015370.abstract.
[3] L. Collado-Torres, A. Jaffe and J. Leek. derfinderData: Processed BigWigs from BrainSpan for examples. R package version 0.103.1. 2015. URL: https://github.com/lcolladotor/derfinderData.
[4] M. Lawrence, R. Gentleman and V. Carey. “rtracklayer: an R package for interfacing with genome browsers”. In: Bioinformatics 25 (2009), pp. 1841-1842. DOI: 10.1093/bioinformatics/btp328. URL: http://bioinformatics.oxfordjournals.org/content/25/14/1841.abstract.
[5] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (8 2013). DOI: 10.1371/journal.pcbi.1003118. URL: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118}.
R
session information
## setting value
## version R version 3.2.0 Patched (2015-05-18 r68382)
## system x86_64, darwin10.8.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/New_York
## package * version date source
## acepack 1.3-3.3 2014-11-24 CRAN (R 3.2.0)
## AnnotationDbi * 1.31.17 2015-06-18 Bioconductor
## bibtex 0.4.0 2014-12-31 CRAN (R 3.2.0)
## Biobase * 2.29.1 2015-05-03 Bioconductor
## BiocGenerics * 0.15.3 2015-06-30 Bioconductor
## BiocParallel 1.3.34 2015-07-10 Bioconductor
## biomaRt 2.25.1 2015-04-21 Bioconductor
## Biostrings 2.37.2 2015-05-07 Bioconductor
## bitops 1.0-6 2013-08-17 CRAN (R 3.2.0)
## bumphunter 1.9.1 2015-05-05 Bioconductor
## cluster 2.0.2 2015-06-19 CRAN (R 3.2.1)
## codetools 0.2-11 2015-03-10 CRAN (R 3.2.0)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
## curl 0.9.1 2015-07-04 CRAN (R 3.2.1)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
## derfinder * 1.3.3 2015-06-15 Bioconductor
## derfinderData * 0.103.1 2015-06-16 Bioconductor
## derfinderHelper 1.3.1 2015-06-20 Bioconductor
## devtools * 1.8.0 2015-05-09 CRAN (R 3.2.0)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
## doRNG 1.6 2014-03-07 CRAN (R 3.2.0)
## evaluate 0.7 2015-04-21 CRAN (R 3.2.0)
## foreach 1.4.2 2014-04-11 CRAN (R 3.2.0)
## foreign 0.8-65 2015-07-02 CRAN (R 3.2.1)
## formatR 1.2 2015-04-21 CRAN (R 3.2.0)
## Formula 1.2-1 2015-04-07 CRAN (R 3.2.0)
## futile.logger 1.4.1 2015-04-20 CRAN (R 3.2.0)
## futile.options 1.0.0 2010-04-06 CRAN (R 3.2.0)
## GenomeInfoDb * 1.5.8 2015-06-17 Bioconductor
## GenomicAlignments 1.5.11 2015-06-30 Bioconductor
## GenomicFeatures * 1.21.13 2015-06-09 Bioconductor
## GenomicFiles 1.5.4 2015-07-03 Bioconductor
## GenomicRanges * 1.21.16 2015-06-30 Bioconductor
## ggplot2 1.0.1.9000 2015-06-12 Github (hadley/ggplot2@3b6a126)
## git2r 0.10.1.9000 2015-07-14 Github (ropensci/git2r@94831de)
## gridExtra 0.9.1 2012-08-09 CRAN (R 3.2.0)
## gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
## Hmisc 3.16-0 2015-04-30 CRAN (R 3.2.0)
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
## httr 1.0.0 2015-06-25 CRAN (R 3.2.1)
## IRanges * 2.3.14 2015-07-03 Bioconductor
## iterators 1.0.7 2014-04-11 CRAN (R 3.2.0)
## knitcitations * 1.0.6 2015-05-26 CRAN (R 3.2.0)
## knitr 1.10.5 2015-05-06 CRAN (R 3.2.0)
## knitrBootstrap 1.0.0 2015-06-09 Github (jimhester/knitrBootstrap@76c41f0)
## lambda.r 1.1.7 2015-03-20 CRAN (R 3.2.0)
## lattice 0.20-31 2015-03-30 CRAN (R 3.2.0)
## latticeExtra 0.6-26 2013-08-15 CRAN (R 3.2.0)
## locfit 1.5-9.1 2013-04-20 CRAN (R 3.2.0)
## lubridate 1.3.3 2013-12-31 CRAN (R 3.2.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
## markdown 0.7.7 2015-04-22 CRAN (R 3.2.0)
## MASS 7.3-42 2015-07-02 CRAN (R 3.2.1)
## Matrix 1.2-2 2015-07-08 CRAN (R 3.2.1)
## matrixStats 0.14.2 2015-06-24 CRAN (R 3.2.1)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
## munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
## nnet 7.3-10 2015-06-29 CRAN (R 3.2.1)
## pkgmaker 0.22 2014-05-14 CRAN (R 3.2.0)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
## proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
## qvalue 2.1.0 2015-04-17 Bioconductor
## R6 2.1.0 2015-07-04 CRAN (R 3.2.1)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.2.0)
## Rcpp 0.11.6 2015-05-01 CRAN (R 3.2.0)
## RCurl 1.95-4.7 2015-06-30 CRAN (R 3.2.1)
## RefManageR 0.8.63 2015-06-09 CRAN (R 3.2.0)
## registry 0.3 2015-07-08 CRAN (R 3.2.1)
## reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0)
## RJSONIO 1.3-0 2014-07-28 CRAN (R 3.2.0)
## rmarkdown * 0.7 2015-06-13 CRAN (R 3.2.0)
## rngtools 1.2.4 2014-03-06 CRAN (R 3.2.0)
## rpart 4.1-10 2015-06-29 CRAN (R 3.2.1)
## Rsamtools 1.21.13 2015-07-09 Bioconductor
## RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
## rtracklayer * 1.29.12 2015-07-03 Bioconductor
## rversions 1.0.2 2015-07-13 CRAN (R 3.2.1)
## S4Vectors * 0.7.10 2015-07-03 Bioconductor
## scales 0.2.5 2015-06-12 CRAN (R 3.2.0)
## stringi 0.5-5 2015-06-29 CRAN (R 3.2.1)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## SummarizedExperiment 0.3.2 2015-07-03 Bioconductor
## survival 2.38-3 2015-07-02 CRAN (R 3.2.1)
## TxDb.Hsapiens.UCSC.hg19.knownGene * 3.1.3 2015-05-19 Bioconductor
## XML 3.98-1.3 2015-06-30 CRAN (R 3.2.0)
## xml2 0.1.1 2015-06-02 CRAN (R 3.2.0)
## xtable 1.7-4 2014-09-12 CRAN (R 3.2.0)
## XVector 0.9.1 2015-04-30 Bioconductor
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)
## zlibbioc 1.15.0 2015-04-17 Bioconductor