Reputation: 2643
How do I get a histogram-like summary of interval data in R?
My MWE data has four intervals.
interval range
Int1 2-7
Int2 10-14
Int3 12-18
Int4 25-28
I want a histogram-like function that splits a range into fixed-size bins and counts how many of the intervals Int1-Int4 overlap each bin. The function output should look like this:
bin count which
[0-4] 1 Int1
[5-9] 1 Int1
[10-14] 2 Int2 and Int3
[15-19] 1 Int3
[20-24] 0 None
[25-29] 1 Int4
Here the range is [minfloor(Int1, Int2, Int3, Int4), maxceil(Int1, Int2, Int3, Int4)) = [0,30), and there are six bins of size 5.
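As a rough illustration of how that range and the six bins could be derived, here is a minimal base R sketch. The variable names (starts, ends, size, bin_start, bin_end) are made up for this example, and it assumes integer endpoints with closed bins such as [0-4].

```r
# Interval endpoints from the example above
starts <- c(2, 10, 12, 25)
ends   <- c(7, 14, 18, 28)
size   <- 5  # bin width

# Round the overall range outwards to multiples of the bin size
lo <- floor(min(starts) / size) * size
hi <- ceiling(max(ends) / size) * size

# Closed integer bins: [0-4], [5-9], ..., [25-29]
bin_start <- seq(lo, hi - size, by = size)
bin_end   <- bin_start + size - 1
```

With these inputs, lo and hi come out as 0 and 30, giving six bins. If the largest endpoint fell exactly on a multiple of the bin size, the half-open range would need one more bin, which this sketch does not handle.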
I would greatly appreciate any pointers to R packages or functions that implement the functionality I want.
Update:
So far, I have a solution using the IRanges package, which relies on a fast data structure called NCList that is reportedly faster than interval search trees.
> library(IRanges)
> subject <- IRanges(c(2,10,12,25), c(7,14,18,28))
> query <- IRanges(c(0,5,10,15,20,25), c(4,9,14,19,24,29))
> countOverlaps(query, subject)
[1] 1 1 2 1 0 1
But I still cannot get which ranges overlap each bin. I will update if I figure it out.
Upvotes: 0
Views: 140
Reputation: 118779
Using IRanges, you should use findOverlaps or mergeByOverlaps instead of countOverlaps. By default, though, it doesn't return queries that have no matches.
I'll leave that to you. Instead, I'll show an alternative method using foverlaps() from the data.table package:
require(data.table)

# Intervals to be counted
subject <- data.table(interval = paste("int", 1:4, sep=""),
                      start = c(2,10,12,25),
                      end = c(7,14,18,28))
# Fixed-size bins
query <- data.table(start = c(0,5,10,15,20,25),
                    end = c(4,9,14,19,24,29))

# foverlaps() needs the second table keyed on its interval columns
setkey(subject, start, end)
ans = foverlaps(query, subject, type="any")

# Bins without any overlap have NA in the subject columns
ans[, .(count = sum(!is.na(start)),
        which = paste(interval, collapse=", ")),
    by = .(i.start, i.end)]
#    i.start i.end count      which
# 1:       0     4     1       int1
# 2:       5     9     1       int1
# 3:      10    14     2 int2, int3
# 4:      15    19     1       int3
# 5:      20    24     0         NA
# 6:      25    29     1       int4
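For the IRanges route mentioned at the top of this answer, one possible sketch follows. It assumes the IRanges package (Bioconductor) is installed. Grouping the hits with split() on a factor whose levels cover every bin is just one way to keep bins with no matches, not necessarily the idiomatic one, and the names by_bin and result are made up for this example.

```r
library(IRanges)

subject <- IRanges(start = c(2, 10, 12, 25), end = c(7, 14, 18, 28),
                   names = paste0("Int", 1:4))
query   <- IRanges(start = c(0, 5, 10, 15, 20, 25),
                   end   = c(4, 9, 14, 19, 24, 29))

# One row per (bin, interval) overlap; bins with no overlap are absent here
hits <- findOverlaps(query, subject)

# Group subject indices by bin; the explicit factor levels keep empty bins
by_bin <- split(subjectHits(hits),
                factor(queryHits(hits), levels = seq_along(query)))

result <- data.frame(
  bin   = paste0("[", start(query), "-", end(query), "]"),
  count = lengths(by_bin),
  which = vapply(by_bin, function(i) {
    if (length(i) == 0) "None" else paste(names(subject)[i], collapse = ", ")
  }, character(1)),
  row.names = NULL
)
result
```

The count column should agree with what countOverlaps(query, subject) returns, while the which column lists the overlapping interval names for each bin.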
Upvotes: 1