Reputation: 705
I have two data.tables that provide sequence coordinates across different chromosomes (categories). For example:
library(data.table)
dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
start = c(1, 50, 110, 150, 110),
end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
start = c(12, 60, 50),
end = c(20, 115, 80))
I need to create a third data.table that provides coordinates for sequences that contain all the integers in dt1 that do not overlap with any integers from the sequences in dt2. For example:
dt3 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
start = c(1, 50, 116, 150, 110),
end = c(11, 59, 121, 200, 200))
The data.tables I need to run this on are very large and therefore I need to maximise performance. I have tried using the foverlaps() function but to no avail. Any help would be greatly appreciated!
Upvotes: 1
Views: 163
Reputation: 42544
For the sake of completeness, there is a concise solution using the GenomicRanges
packages from Bioconductor:
library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2))
GRanges object with 5 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] 1 1-11 * [2] 1 50-59 * [3] 1 116-121 * [4] 1 150-200 * [5] X 110-200 * ------- seqinfo: 2 sequences from an unspecified genome; no seqlengths
In case the result is required to be of class data.table
:
library(data.table) # development version 1.14.3 used
library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2)) |>
as.data.table() |>
DT(, .(chromosome = seqnames, start, end))
chromosome start end <fctr> <int> <int> 1: 1 1 11 2: 1 50 59 3: 1 116 121 4: 1 150 200 5: X 110 200
As mentioned by Waldi, the GenomicRanges
package is not available from CRAN. Waldi has provided the link to the installation guide in the BiocManager
vignette. Here is the short version:
install.packages("BiocManager")
BiocManager::install("GenomicRanges")
Upvotes: 1
Reputation: 2419
You can start with something like this from foverlaps
setkey(dt2,chromosome,start,end)
ds = foverlaps(dt1,dt2, type="any")
ds[,.(chromosome,
start = fcase(is.na(start) | i.start <= start,i.start,
i.end >= end, end + 1),
end = fcase(is.na(end) | i.end >= end, i.end,
i.start <= start, start - 1)
)]
# chromosome start end
# <char> <num> <num>
#1: 1 1 11
#2: 1 50 59
#3: 1 116 121
#4: 1 150 200
#5: X 110 200
Upvotes: 4