user2120870
user2120870

Reputation: 869

Concatenate individual genomic intervals into populational regions

I would like to concatenate individual genomic intervals into common regions.

My input:

dfin <- "chr start end sample type
        1   10    20   NE1    loss
        1   5     15   NE2    gain
        1   25    30   NE1    gain
        2   40    50   NE1    loss
        2   40    60   NE2    loss
        3   20    30   NE1    gain"
dfin <- read.table(text=dfin, header=T)

My expected output:

dfout <- "chr start end samples type
        1   5     20   NE1-NE2  both
        1   25    30   NE1      gain
        2   40    60   NE1-NE2  loss
        3   20    30   NE1      gain"
dfout <- read.table(text=dfout, header=T)

The intervals in dfin will never overlap in the same animal, just between animals (columns sample and samples, respectively). The column type have two factors (lossand gain) in dfin and is expected to have three factors in dfout (loss, gain and both, which occur when the concatenated region in dfout was based on both loss and gain).

Any idea to deal with that?

*Updated for @David Arenburg

Upvotes: 3

Views: 153

Answers (2)

alexis_laz
alexis_laz

Reputation: 13122

(expanding on the comment) You could use the "IRanges" package:

library(IRanges)

#build an appropriate object
dat = RangedData(space = dfin$chr, 
                 IRanges(dfin$start, dfin$end), 
                 sample = dfin$sample, 
                 type = dfin$type)
dat
#concatenate overlaps with an extra step of saving the concatenation mappings
ans = RangedData(reduce(ranges(dat), with.revmap = TRUE))
ans

Could not figure out how to avoid reduce losing columns of "RangedData" object, but having the mappings saved we could do something like (there might be a more appropriate -according to "IRanges"- way to extract mappings, but I was not able to find it):

tmp = elementMetadata(ranges(ans)@unlistData)$revmap@partitioning
maps = rep(seq_along(start(tmp)), width(tmp))
maps
#[1] 1 1 2 3 3 4

Having the mappings of interval concatenation, we could aggregate "sample" and "type" to get the final form. E.g.:

tapply(dfin$sample, maps, function(X) paste(unique(X), collapse = "-"))
#        1         2         3         4 
#"NE1-NE2"     "NE1" "NE1-NE2"     "NE1"

Upvotes: 1

David Arenburg
David Arenburg

Reputation: 92292

Here's an attempt to group the intervals using data.table::foverlaps and then calculate all the rest

library(data.table)
setkey(setDT(dfin), chr, start, end)
res <- foverlaps(dfin, dfin, which = TRUE)[, toString(xid), by = yid
                                           ][, indx := .GRP, by = V1]$indx
dfin[, .(
          chr = chr[1L],
          start = min(start), 
          end = max(end), 
          samples = paste(unique(sample), collapse = "-"),
          type = if(uniqueN(type) > 1L) "both" else as.character(type[1L])
         ),
       by = res]

#    res chr start end samples type
# 1:   1   1     5  20 NE2-NE1 both
# 2:   2   1    25  30     NE1 gain
# 3:   3   2    40  60 NE1-NE2 loss
# 4:   4   3    20  30     NE1 gain

Upvotes: 3

Related Questions