Histogram-like summary for interval data

Question

How do I get a histogram-like summary of interval data in R?

My MWE data has four intervals.

interval  range
Int1      2-7
Int2      10-14
Int3      12-18
Int4      25-28

I want a histogram-like function that splits a range into fixed-size bins and, for each bin, counts how many of the intervals Int1-Int4 overlap it and reports which ones. The function output should look like this:

bin     count  which
[0-4]   1      Int1
[5-9]   1      Int1
[10-14] 2      Int2 and Int3
[15-19] 1      Int3
[20-24] 0      None
[25-29] 1      Int4

Here the range is [minfloor(Int1, Int2, Int3, Int4), maxceil(Int1, Int2, Int3, Int4)) = [0,30) and there are six bins of size 5.
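The bin boundaries above can be derived from the data instead of being typed by hand. Here is a minimal base-R sketch, assuming integer endpoints and closed bins; the names `bin_size`, `lo`, `hi` and `bins` are illustrative and do not come from any package:

```r
# Interval endpoints from the example table
starts <- c(2, 10, 12, 25)
ends   <- c(7, 14, 18, 28)
bin_size <- 5  # assumed fixed bin width

# Round the overall minimum down and the maximum up to a multiple of bin_size
lo <- floor(min(starts) / bin_size) * bin_size
hi <- ceiling(max(ends) / bin_size) * bin_size

# Closed integer bins [lo, lo + 4], [lo + 5, lo + 9], ... up to hi - 1
bin_start <- seq(lo, hi - bin_size, by = bin_size)
bins <- data.frame(start = bin_start, end = bin_start + bin_size - 1)
bins$label <- sprintf("[%d-%d]", as.integer(bins$start), as.integer(bins$end))
bins
```

One caveat: if the largest endpoint falls exactly on a multiple of `bin_size`, `ceiling()` leaves it just outside the last bin, so that case would need one extra bin.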

I would greatly appreciate any pointers to R packages or functions that implement the functionality I want.

Update:

So far, I have a solution using the IRanges package, which relies on a fast data structure called NCList; users report it to be faster than interval search trees.

> library(IRanges)
> subject <- IRanges(c(2,10,12,25), c(7,14,18,28))
> query <- IRanges(c(0,5,10,15,20,25), c(4,9,14,19,24,29))
> countOverlaps(query, subject)
[1] 1 1 2 1 0 1

But I still can't work out which ranges overlap each bin. I will update if I get it working.

Arun · Accepted Answer

With IRanges, use findOverlaps or mergeByOverlaps instead of countOverlaps. By default, though, they don't return the bins that have no matches.

I'll leave that to you. Instead, I'll show an alternative method using foverlaps() from the data.table package:

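library(data.table)

# Intervals to summarise
subject <- data.table(interval = paste0("int", 1:4),
                      start = c(2, 10, 12, 25),
                      end = c(7, 14, 18, 28))
# Fixed-size bins
query <- data.table(start = c(0, 5, 10, 15, 20, 25),
                    end = c(4, 9, 14, 19, 24, 29))

# foverlaps() requires the second table to be keyed on its start/end columns
setkey(subject, start, end)
# One row per (bin, overlapping interval); bins with no overlap get an NA row
ans <- foverlaps(query, subject, type = "any")
# Count the non-NA matches per bin and list the matching intervals
ans[, .(count = sum(!is.na(start)),
        which = paste(interval, collapse = ", ")),
    by = .(i.start, i.end)]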

#    i.start i.end count      which
# 1:       0     4     1       int1
# 2:       5     9     1       int1
# 3:      10    14     2 int2, int3
# 4:      15    19     1       int3
# 5:      20    24     0         NA
# 6:      25    29     1       int4
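
For completeness, the findOverlaps route mentioned at the top of this answer might look roughly like the sketch below. It assumes IRanges is installed from Bioconductor; `by_bin` and `res` are illustrative names, and the "None" / " and " formatting is only there to mimic the output requested in the question.

```r
library(IRanges)

subject <- IRanges(c(2, 10, 12, 25), c(7, 14, 18, 28))
names(subject) <- paste0("Int", 1:4)
query <- IRanges(c(0, 5, 10, 15, 20, 25), c(4, 9, 14, 19, 24, 29))

# Hits object: one (query index, subject index) pair per overlap
hits <- findOverlaps(query, subject)

# Group subject indices by bin; the factor levels keep bins with no hits
by_bin <- split(subjectHits(hits),
                factor(queryHits(hits), levels = seq_along(query)))

# Label for one bin: joined interval names, or "None" when nothing overlaps
label_bin <- function(i) {
  if (length(i) == 0) "None" else paste(names(subject)[i], collapse = " and ")
}

res <- data.frame(
  bin   = sprintf("[%d-%d]", start(query), end(query)),
  count = lengths(by_bin),
  which = vapply(by_bin, label_bin, character(1)),
  row.names = NULL
)
res
```

Splitting on a factor with one level per bin is what brings the empty bin back: findOverlaps itself only reports pairs that overlap.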
