DaffyDoc
DaffyDoc

Reputation: 21

GenomicRanges add coverage

I'm working on RNA seq data and trying to plot average coverage profiles by genotype, similarly to what is done here

RNA seq Coverage per genotype (Source: pickrell et al, Nature, 2010)

enter image description here

To do this plot, I have bigwig files from 100 individuals, that contain coverage information from RNA-seq data (in a specific region) and that I read in R, as GenomicRanges objects.

this gives me GRanges objects such as those obtained in the following toy example :

gr1=GRanges(seqname=1,range=IRanges(start=c(1,5,10,15,30,55), end=c(4,9,14,29,39,60)))

gr1$cov=c(3,1,8,6,2,10)

gr2=GRanges(seqname=1,range=IRanges(start=c(3,20,24), end=c(7,23,26)))

gr2$cov=c(3,5,3)

start=unique(sort(c(ranges(gr1)@start,ranges(gr2)@start)))

gr1

GRanges object with 6 ranges and 1 metadata column:
seqnames    ranges strand |       cov
   <Rle> <IRanges>  <Rle> | <numeric>
       1  [ 1,  4]      * |         3
       1  [ 5,  9]      * |         1
       1  [10, 14]      * |         8
       1  [15, 29]      * |         6
       1  [30, 39]      * |         2
       1  [55, 60]      * |        10 
        -------
 seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr2

GRanges object with 3 ranges and 1 metadata column:
seqnames    ranges strand |       cov
   <Rle> <IRanges>  <Rle> | <numeric>
       1  [ 3,  7]      * |         3
       1  [20, 23]      * |         5
       1  [24, 26]      * |         3
       -------
 seqinfo: 1 sequence from an unspecified genome; no seqlengths

The problem is that I have these per individual (gr1 and gr2 would be 2 different individuals), and I'd like to combine them to create a genomic ranges object that gives me the total coverage at each position across individuals, 1 and 2 that would look as follows :

gr3

GRanges object with 6 ranges and 1 metadata column:
seqnames    ranges strand |       cov
   <Rle> <IRanges>  <Rle> | <numeric>
       1  [ 1,  2]      * |         3
       1  [ 3,  4]      * |         6 (=3+3)
       1  [ 5,  7]      * |         4 (=1+3)
       1  [ 8,  9]      * |         1
       1  [10, 14]      * |         8
       1  [15, 19]      * |         6
       1  [20, 23]      * |         11 (=6+5)
       1  [24, 26]      * |         9 (=6+3)
       1  [27, 29]      * |         6
       1  [30, 39]      * |         2
       1  [55, 60]      * |        10 

Does anyone know a simple way to do this ? or am I doomed ?

Thanks for your answers.

PS: my data isn't stranded, but if you have it for stranded data it's even better.

PPS: Ideally, I would also like to be able to to a multiplication, or apply any function with two arguments x and y, instead of simply adding the coverage.

Upvotes: 2

Views: 1645

Answers (1)

Luis
Luis

Reputation: 511

It's been almost a year, but here's my answer for future reference.

Whenever I don't find a function to do a task such as this one directly, I simply expand the GRanges objects to single-bp resolution. This allows me to perform any required operation on the metadata columns treating them as simple data.frame columns, since the IRanges are now matched between the two Granges objects.

In the specific case of this question, the following works.

### Sort seqlevels
# (not necessary here, but in real world examples,
# with multiple sequences, you will want to do this)
gr1 <- sort(GenomeInfoDb::sortSeqlevels(gr1))
gr2 <- sort(GenomeInfoDb::sortSeqlevels(gr2))

### Add seqlengths
# (this corresponds to the actual sequence lengths;
# here we use the highest position between the two objects: 60)
seqlengths(gr1) <- 60

### Make 1-bp tiles covering the genome
# (using either one of gr1 and gr2 as a reference)
bins <- GenomicRanges::tileGenome(GenomeInfoDb::seqlengths(gr1),
                                  tilewidth=1,
                                  cut.last.tile.in.chrom=TRUE)

### Get coverage signal as Rle object
gr1_cov <- coverage(gr1, weight="cov")
gr2_cov <- coverage(gr2, weight="cov")

### Get average coverage in each bin
# (since the bins are 1-bp wide, this just keeps the original coverage value)
gr1_bins <- GenomicRanges::binnedAverage(bins, gr1_cov, "binned_cov")
gr2_bins <- GenomicRanges::binnedAverage(bins, gr2_cov, "binned_cov")

### Make final object:
# We can now sum the values in the metadata columns
# Addressing the PPS, you could do any other operation or apply a function
gr3 <- gr1_bins
gr3$binned_cov <- gr1_bins$binned_cov + gr2_bins$binned_cov

This yields the final GRanges object at single-bp resolution.

> gr3

GRanges object with 60 ranges and 1 metadata column:
     seqnames    ranges strand | binned_cov
        <Rle> <IRanges>  <Rle> |  <numeric>
 [1]        1    [1, 1]      * |          3
 [2]        1    [2, 2]      * |          3
 [3]        1    [3, 3]      * |          6
 [4]        1    [4, 4]      * |          6
 [5]        1    [5, 5]      * |          4
 ...      ...       ...    ... .        ...
[56]        1  [56, 56]      * |         10
[57]        1  [57, 57]      * |         10
[58]        1  [58, 58]      * |         10
[59]        1  [59, 59]      * |         10
[60]        1  [60, 60]      * |         10
-------
seqinfo: 1 sequence from an unspecified genome

To compress it and get the exact gr3 in the question, we can do the following.

### Compress back to variable-width IRanges (by cov)
gr3_Rle <- coverage(gr3, weight='binned_cov')
gr3 <- as(gr3_Rle, "GRanges")

### Drop 0-score rows
gr3 <- gr3[gr3$score > 0]

### Rename metadata column
names(mcols(gr3)) <- 'cov'

> gr3

GRanges object with 11 ranges and 1 metadata column:
       seqnames    ranges strand |       cov
          <Rle> <IRanges>  <Rle> | <numeric>
   [1]        1  [ 1,  2]      * |         3
   [2]        1  [ 3,  4]      * |         6
   [3]        1  [ 5,  7]      * |         4
   [4]        1  [ 8,  9]      * |         1
   [5]        1  [10, 14]      * |         8
   [6]        1  [15, 19]      * |         6
   [7]        1  [20, 23]      * |        11
   [8]        1  [24, 26]      * |         9
   [9]        1  [27, 29]      * |         6
  [10]        1  [30, 39]      * |         2
  [11]        1  [55, 60]      * |        10
  -------
  seqinfo: 1 sequence from an unspecified genome

Upvotes: 5

Related Questions