Reputation: 21
I'm working on RNA seq data and trying to plot average coverage profiles by genotype, similarly to what is done here
RNA seq Coverage per genotype (Source: pickrell et al, Nature, 2010)
To do this plot, I have bigwig files from 100 individuals, that contain coverage information from RNA-seq data (in a specific region) and that I read in R, as GenomicRanges objects.
this gives me GRanges objects such as those obtained in the following toy example :
gr1=GRanges(seqname=1,range=IRanges(start=c(1,5,10,15,30,55), end=c(4,9,14,29,39,60)))
gr1$cov=c(3,1,8,6,2,10)
gr2=GRanges(seqname=1,range=IRanges(start=c(3,20,24), end=c(7,23,26)))
gr2$cov=c(3,5,3)
start=unique(sort(c(ranges(gr1)@start,ranges(gr2)@start)))
gr1
GRanges object with 6 ranges and 1 metadata column:
seqnames ranges strand | cov
<Rle> <IRanges> <Rle> | <numeric>
1 [ 1, 4] * | 3
1 [ 5, 9] * | 1
1 [10, 14] * | 8
1 [15, 29] * | 6
1 [30, 39] * | 2
1 [55, 60] * | 10
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr2
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | cov
<Rle> <IRanges> <Rle> | <numeric>
1 [ 3, 7] * | 3
1 [20, 23] * | 5
1 [24, 26] * | 3
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The problem is that I have these per individual (gr1 and gr2 would be 2 different individuals), and I'd like to combine them to create a genomic ranges object that gives me the total coverage at each position across individuals, 1 and 2 that would look as follows :
gr3
GRanges object with 6 ranges and 1 metadata column:
seqnames ranges strand | cov
<Rle> <IRanges> <Rle> | <numeric>
1 [ 1, 2] * | 3
1 [ 3, 4] * | 6 (=3+3)
1 [ 5, 7] * | 4 (=1+3)
1 [ 8, 9] * | 1
1 [10, 14] * | 8
1 [15, 19] * | 6
1 [20, 23] * | 11 (=6+5)
1 [24, 26] * | 9 (=6+3)
1 [27, 29] * | 6
1 [30, 39] * | 2
1 [55, 60] * | 10
Does anyone know a simple way to do this ? or am I doomed ?
Thanks for your answers.
PS: my data isn't stranded, but if you have it for stranded data it's even better.
PPS: Ideally, I would also like to be able to to a multiplication, or apply any function with two arguments x and y, instead of simply adding the coverage.
Upvotes: 2
Views: 1645
Reputation: 511
It's been almost a year, but here's my answer for future reference.
Whenever I don't find a function to do a task such as this one directly, I simply expand the GRanges
objects to single-bp resolution. This allows me to perform any required operation on the metadata columns treating them as simple data.frame
columns, since the IRanges
are now matched between the two Granges
objects.
In the specific case of this question, the following works.
### Sort seqlevels
# (not necessary here, but in real world examples,
# with multiple sequences, you will want to do this)
gr1 <- sort(GenomeInfoDb::sortSeqlevels(gr1))
gr2 <- sort(GenomeInfoDb::sortSeqlevels(gr2))
### Add seqlengths
# (this corresponds to the actual sequence lengths;
# here we use the highest position between the two objects: 60)
seqlengths(gr1) <- 60
### Make 1-bp tiles covering the genome
# (using either one of gr1 and gr2 as a reference)
bins <- GenomicRanges::tileGenome(GenomeInfoDb::seqlengths(gr1),
tilewidth=1,
cut.last.tile.in.chrom=TRUE)
### Get coverage signal as Rle object
gr1_cov <- coverage(gr1, weight="cov")
gr2_cov <- coverage(gr2, weight="cov")
### Get average coverage in each bin
# (since the bins are 1-bp wide, this just keeps the original coverage value)
gr1_bins <- GenomicRanges::binnedAverage(bins, gr1_cov, "binned_cov")
gr2_bins <- GenomicRanges::binnedAverage(bins, gr2_cov, "binned_cov")
### Make final object:
# We can now sum the values in the metadata columns
# Addressing the PPS, you could do any other operation or apply a function
gr3 <- gr1_bins
gr3$binned_cov <- gr1_bins$binned_cov + gr2_bins$binned_cov
This yields the final GRanges
object at single-bp resolution.
> gr3
GRanges object with 60 ranges and 1 metadata column:
seqnames ranges strand | binned_cov
<Rle> <IRanges> <Rle> | <numeric>
[1] 1 [1, 1] * | 3
[2] 1 [2, 2] * | 3
[3] 1 [3, 3] * | 6
[4] 1 [4, 4] * | 6
[5] 1 [5, 5] * | 4
... ... ... ... . ...
[56] 1 [56, 56] * | 10
[57] 1 [57, 57] * | 10
[58] 1 [58, 58] * | 10
[59] 1 [59, 59] * | 10
[60] 1 [60, 60] * | 10
-------
seqinfo: 1 sequence from an unspecified genome
To compress it and get the exact gr3
in the question, we can do the following.
### Compress back to variable-width IRanges (by cov)
gr3_Rle <- coverage(gr3, weight='binned_cov')
gr3 <- as(gr3_Rle, "GRanges")
### Drop 0-score rows
gr3 <- gr3[gr3$score > 0]
### Rename metadata column
names(mcols(gr3)) <- 'cov'
> gr3
GRanges object with 11 ranges and 1 metadata column:
seqnames ranges strand | cov
<Rle> <IRanges> <Rle> | <numeric>
[1] 1 [ 1, 2] * | 3
[2] 1 [ 3, 4] * | 6
[3] 1 [ 5, 7] * | 4
[4] 1 [ 8, 9] * | 1
[5] 1 [10, 14] * | 8
[6] 1 [15, 19] * | 6
[7] 1 [20, 23] * | 11
[8] 1 [24, 26] * | 9
[9] 1 [27, 29] * | 6
[10] 1 [30, 39] * | 2
[11] 1 [55, 60] * | 10
-------
seqinfo: 1 sequence from an unspecified genome
Upvotes: 5