Powege
Powege

Reputation: 705

How to remove overlapping sequences from two datatables?

I have two data.tables that provide sequence coordinates across different chromosomes (categories). For example:

library(data.table)
dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 110, 150, 110),
                  end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
                  start = c(12, 60, 50),
                  end = c(20, 115, 80))

I need to create a third data.table that provides coordinates for sequences that contain all the integers in dt1 that do not overlap with any integers from the sequences in dt2. For example:

dt3 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 116, 150, 110),
                  end = c(11, 59, 121, 200, 200))

The data.tables I need to run this on are very large and therefore I need to maximise performance. I have tried using the foverlaps() function but to no avail. Any help would be greatly appreciated!

Upvotes: 1

Views: 163

Answers (2)

Uwe
Uwe

Reputation: 42544

For the sake of completeness, there is a concise solution using the GenomicRanges packages from Bioconductor:

library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1      1-11      *
  [2]        1     50-59      *
  [3]        1   116-121      *
  [4]        1   150-200      *
  [5]        X   110-200      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

In case the result is required to be of class data.table:

library(data.table) # development version 1.14.3 used
library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2)) |> 
  as.data.table() |>
  DT(, .(chromosome = seqnames, start, end))
   chromosome start   end
       <fctr> <int> <int>
1:          1     1    11
2:          1    50    59
3:          1   116   121
4:          1   150   200
5:          X   110   200

As mentioned by Waldi, the GenomicRanges package is not available from CRAN. Waldi has provided the link to the installation guide in the BiocManager vignette. Here is the short version:

install.packages("BiocManager")
BiocManager::install("GenomicRanges")

Upvotes: 1

Peace Wang
Peace Wang

Reputation: 2419

You can start with something like this from foverlaps

setkey(dt2,chromosome,start,end)
ds = foverlaps(dt1,dt2,  type="any")
ds[,.(chromosome, 
      start = fcase(is.na(start) | i.start <= start,i.start,
                    i.end >= end, end + 1),
      end = fcase(is.na(end) | i.end >= end, i.end,
                  i.start <= start, start - 1)
      )]
#   chromosome start   end
#       <char> <num> <num>
#1:          1     1    11
#2:          1    50    59
#3:          1   116   121
#4:          1   150   200
#5:          X   110   200

Upvotes: 4

Related Questions