Reputation: 2310
I am trying to find a way to efficiently extract a matrix showing '0' or '1' when comparing different GRange
objects. In my example:
df <- data.frame(chr = c("chr1", "chr10"), start = c(1,4), end=c(2, 4))
gr.1 <- makeGRangesFromDataFrame(df)
df <- data.frame(chr = c("chr1", "chr10"), start = c(2,3), end=c(2, 4))
gr.2 <- makeGRangesFromDataFrame(df)
df <- data.frame(chr = c("chr1"), start = c(1), end=c(1))
gr.3 <- makeGRangesFromDataFrame(df)
I tried findOverlaps
to evaluate the overlaps among these regions but apparently it can't deal with more than two GRanges
:
> GenomicRanges::findOverlaps(gr.1, gr.2, gr.3)
> Error in IRanges:::NCList_find_overlaps_in_groups(ranges(query),
> q_space, : 'maxgap' must be a single integer
Moreover, my required output would be something like this example data-frame:
out <- "gr.1 gr.2 gr.3
chr1-1 1 0 1
chr1-2 1 1 0
chr10-3 0 1 0
chr10-4 1 1 0"
out <- read.table(text=out, header=TRUE)
Any idea to wisely export it?
Upvotes: 2
Views: 1302
Reputation: 50728
First off, here is a partial solution that shows only the overlapping regions between the first and any additional GRanges
(this should generate results similar to those from bedtools intersect
which allows one to "identify overlaps between a single query (-a) file and multiple database files (-b) at once"); this should be a good starting point for further refinement.
We can define a function that takes any number of GRanges
and identifies overlapping ranges between the first GRanges
and any additional GRanges
using findOverlaps
; the intersecting regions are then obtained from pintersect
.
Please note that I make use of the common tidyverse
syntax; while this is not strictly necessary (for every purrr::map
/purrr::map2
function one can use their base R lapply
/mapply
equivalents), I prefer the tidyverse
approach for code readability.
multiOverlap <- function(...) {
require(GenomicRanges)
require(tidyverse)
# Store GRanges in list
lst <- list(...)
names(lst) <- paste0("gr", 1:length(lst))
# Calculate mutual overlaps
lst.matches <- map(lst[-1L], ~ findOverlaps(lst[[1L]], .x))
# List of intersecting regions
lst.gr <- map2(
lst[-1L], lst.matches,
~pintersect(lst[[1]][queryHits(.y)], .x[subjectHits(.y)]))
names(lst.gr) <- paste0("gr1-gr", 2:length(lst))
# Convert GRanges to data.frame and reshape data
map(lst.gr, ~.x %>%
as.data.frame() %>%
unite(locus, seqnames, start, sep = "-") %>%
select(locus)) %>%
bind_rows(.id = "id") %>%
separate(id, into = c("grx", "gry")) %>%
gather(gr, no, -locus) %>%
transmute(
locus,
no,
val = 1) %>%
spread(no, val, fill = 0)
}
When we apply this function to the three sample GRanges
we get the following result
multiOverlap(gr.1, gr.2, gr.3)
# locus gr1 gr2 gr3
#1 chr1-1 1 0 1
#2 chr1-2 1 1 0
#3 chr10-4 1 1 0
Another (fast) option might be to use data.table
; especially when working with genomic data data.table
s pass-by-reference properties, avoiding deep copies, makes it very attractive (and fast).
Here is a solution that exactly reproduces your expected output
# Load the library
library(data.table)
# Convert GRanges to data.table and row-bind entries
dt <- rbindlist(
lapply(list(gr.1 = gr.1, gr.2 = gr.2, gr.3 = gr.3), as.data.table),
idcol = "id")
# Remove width and strand
dt[, c("width", "strand") := NULL]
# Expand rows by range using start and end
dt <- dt[, .(pos = seq(start, end, by = 1L)), by = .(id, seqnames, grp = 1:nrow(dt))]
# Remove helper group label
dt[, grp := NULL]
# Unite seqnames and pos into one column
dt <- dt[, .(locus = do.call(paste, c(.SD, sep = "-")), id, pos), .SDcols = seqnames:pos]
# Add count variable
dt[, ct := 1]
# Convert from long to wide
dcast(dt, locus ~ id, value.var = "ct", fill = 0)
# locus gr.1 gr.2 gr.3
#1: chr1-1 1 0 1
#2: chr1-2 1 1 0
#3: chr10-3 0 1 0
#4: chr10-4 1 1 0
And we're done:-) It's easy to wrap above lines in a convenience function if necessary.
Upvotes: 1