user3091668
user3091668

Reputation: 2310

Create a matrix to show overlaps among multiple GRanges

I am trying to find a way to efficiently extract a matrix showing '0' or '1' when comparing different GRange objects. In my example:

df <- data.frame(chr = c("chr1", "chr10"), start = c(1,4), end=c(2, 4))
gr.1 <- makeGRangesFromDataFrame(df)

df <- data.frame(chr = c("chr1", "chr10"), start = c(2,3), end=c(2, 4))
gr.2 <- makeGRangesFromDataFrame(df)

df <- data.frame(chr = c("chr1"), start = c(1), end=c(1))
gr.3 <- makeGRangesFromDataFrame(df)

I tried findOverlaps to evaluate the overlaps among these regions but apparently it can't deal with more than two GRanges:

> GenomicRanges::findOverlaps(gr.1, gr.2, gr.3)
> Error in IRanges:::NCList_find_overlaps_in_groups(ranges(query),
> q_space,  :    'maxgap' must be a single integer

Moreover, my required output would be something like this example data-frame:

out <- "gr.1 gr.2 gr.3
chr1-1 1  0  1
chr1-2 1  1  0
chr10-3 0 1  0
chr10-4 1 1  0"

out <- read.table(text=out, header=TRUE)

Any idea to wisely export it?

Upvotes: 2

Views: 1302

Answers (1)

Maurits Evers
Maurits Evers

Reputation: 50728

First off, here is a partial solution that shows only the overlapping regions between the first and any additional GRanges (this should generate results similar to those from bedtools intersect which allows one to "identify overlaps between a single query (-a) file and multiple database files (-b) at once"); this should be a good starting point for further refinement.

We can define a function that takes any number of GRanges and identifies overlapping ranges between the first GRanges and any additional GRanges using findOverlaps; the intersecting regions are then obtained from pintersect.

Please note that I make use of the common tidyverse syntax; while this is not strictly necessary (for every purrr::map/purrr::map2 function one can use their base R lapply/mapply equivalents), I prefer the tidyverse approach for code readability.

multiOverlap <- function(...) {
    require(GenomicRanges)
    require(tidyverse)

    # Store GRanges in list
    lst <- list(...)
    names(lst) <- paste0("gr", 1:length(lst))

    # Calculate mutual overlaps
    lst.matches <- map(lst[-1L], ~ findOverlaps(lst[[1L]], .x))

    # List of intersecting regions
    lst.gr <- map2(
        lst[-1L], lst.matches,
        ~pintersect(lst[[1]][queryHits(.y)], .x[subjectHits(.y)]))
    names(lst.gr) <- paste0("gr1-gr", 2:length(lst))

    # Convert GRanges to data.frame and reshape data
    map(lst.gr, ~.x %>%
        as.data.frame() %>%
        unite(locus, seqnames, start, sep = "-") %>%
        select(locus)) %>%
        bind_rows(.id = "id") %>%
        separate(id, into = c("grx", "gry")) %>%
        gather(gr, no, -locus) %>%
        transmute(
            locus,
            no,
            val = 1) %>%
            spread(no, val, fill = 0)
}

When we apply this function to the three sample GRanges we get the following result

multiOverlap(gr.1, gr.2, gr.3)
#    locus gr1 gr2 gr3
#1  chr1-1   1   0   1
#2  chr1-2   1   1   0
#3 chr10-4   1   1   0

Update

Another (fast) option might be to use data.table; especially when working with genomic data data.tables pass-by-reference properties, avoiding deep copies, makes it very attractive (and fast).

Here is a solution that exactly reproduces your expected output

# Load the library
library(data.table)

# Convert GRanges to data.table and row-bind entries
dt <- rbindlist(
    lapply(list(gr.1 = gr.1, gr.2 = gr.2, gr.3 = gr.3), as.data.table),
    idcol = "id")

# Remove width and strand
dt[, c("width", "strand") := NULL]

# Expand rows by range using start and end
dt <- dt[, .(pos = seq(start, end, by = 1L)), by = .(id, seqnames, grp = 1:nrow(dt))]

# Remove helper group label
dt[, grp := NULL]

# Unite seqnames and pos into one column
dt <- dt[, .(locus = do.call(paste, c(.SD, sep = "-")), id, pos), .SDcols = seqnames:pos]

# Add count variable
dt[, ct := 1]

# Convert from long to wide
dcast(dt, locus ~ id, value.var = "ct", fill = 0)
#     locus gr.1 gr.2 gr.3
#1:  chr1-1    1    0    1
#2:  chr1-2    1    1    0
#3: chr10-3    0    1    0
#4: chr10-4    1    1    0

And we're done:-) It's easy to wrap above lines in a convenience function if necessary.

Upvotes: 1

Related Questions