Reputation: 869
I would like to concatenate individual genomic intervals into common regions.
My input:
dfin <- "chr start end sample type
1 10 20 NE1 loss
1 5 15 NE2 gain
1 25 30 NE1 gain
2 40 50 NE1 loss
2 40 60 NE2 loss
3 20 30 NE1 gain"
dfin <- read.table(text=dfin, header=T)
My expected output:
dfout <- "chr start end samples type
1 5 20 NE1-NE2 both
1 25 30 NE1 gain
2 40 60 NE1-NE2 loss
3 20 30 NE1 gain"
dfout <- read.table(text=dfout, header=T)
The intervals in dfin
will never overlap in the same animal, just between animals (columns sample
and samples
, respectively). The column type
have two factors (loss
and gain
) in dfin
and is expected to have three factors in dfout
(loss
, gain
and both
, which occur when the concatenated region in dfout
was based on both loss
and gain
).
Any idea to deal with that?
*Updated for @David Arenburg
Upvotes: 3
Views: 153
Reputation: 13122
(expanding on the comment) You could use the "IRanges" package:
library(IRanges)
#build an appropriate object
dat = RangedData(space = dfin$chr,
IRanges(dfin$start, dfin$end),
sample = dfin$sample,
type = dfin$type)
dat
#concatenate overlaps with an extra step of saving the concatenation mappings
ans = RangedData(reduce(ranges(dat), with.revmap = TRUE))
ans
Could not figure out how to avoid reduce
losing columns of "RangedData" object, but having the mappings saved we could do something like (there might be a more appropriate -according to "IRanges"- way to extract mappings, but I was not able to find it):
tmp = elementMetadata(ranges(ans)@unlistData)$revmap@partitioning
maps = rep(seq_along(start(tmp)), width(tmp))
maps
#[1] 1 1 2 3 3 4
Having the mappings of interval concatenation, we could aggregate "sample" and "type" to get the final form. E.g.:
tapply(dfin$sample, maps, function(X) paste(unique(X), collapse = "-"))
# 1 2 3 4
#"NE1-NE2" "NE1" "NE1-NE2" "NE1"
Upvotes: 1
Reputation: 92292
Here's an attempt to group the intervals using data.table::foverlaps
and then calculate all the rest
library(data.table)
setkey(setDT(dfin), chr, start, end)
res <- foverlaps(dfin, dfin, which = TRUE)[, toString(xid), by = yid
][, indx := .GRP, by = V1]$indx
dfin[, .(
chr = chr[1L],
start = min(start),
end = max(end),
samples = paste(unique(sample), collapse = "-"),
type = if(uniqueN(type) > 1L) "both" else as.character(type[1L])
),
by = res]
# res chr start end samples type
# 1: 1 1 5 20 NE2-NE1 both
# 2: 2 1 25 30 NE1 gain
# 3: 3 2 40 60 NE1-NE2 loss
# 4: 4 3 20 30 NE1 gain
Upvotes: 3