Vasily A
Vasily A

Reputation: 8646

Find all ranges outside of defined set of ranges

I am wondering what would be the best way to define all ranges which are not covered by given set of ranges. For example, if I have a set of genes with known coordinates:

dtGenes <- fread(
  "id,start,end
 1,1000,1300
 2,1200,1500
 3,1600,2600
 4,3000,4000
")

Let's say I know that total length of the chromosome (for simplicity, assume they are all on the same chromosome) is 10000. So, finally I expect to have the following list of intergenic regions:

"startR,endR
    0,1000
 1500,1600
 2600,3000
 4000,10000
"

can Bioconductor's IRange be useful here? or there is some other good way to solve this?

Upvotes: 4

Views: 454

Answers (2)

agstudy
agstudy

Reputation: 121608

You can use reduce from IRanges package

reduce first orders the ranges in x from left to right, then merges the overlapping or adjacent ones.

library(IRanges)
dat <- read.csv(text="id,start,end
 1,1000,1300
 2,1200,1500
 3,1600,2600
 4,3000,4000
")

ir <- IRanges(dat$start,dat$end)
rir <- reduce(ir)
IRanges of length 3
    start  end width
[1]  1000 1500   501
[2]  1600 2600  1001
[3]  3000 4000  1001

Upvotes: 1

Martin Morgan
Martin Morgan

Reputation: 46886

Using the Bioconductor package GenomicRanges, transform your original data to a GRanges

library(GenomicRanges)
gr <- with(dtGenes, GRanges("chr1", IRanges(start, end, names=id),
                            seqlengths=c(chr1=10000)))

Then find the gaps between your genes

gaps <- gaps(gr)

GRanges knows about strand. You didn't specify a strand in the GRanges constructor, so strand was assigned *. There are therefore 'gaps' on the +, -, and * strands, and you're only interested in those on the * strand

> gaps[strand(gaps) == "*"]
GRanges with 4 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 [   1,   999]      *
  [2]     chr1 [1501,  1599]      *
  [3]     chr1 [2601,  2999]      *
  [4]     chr1 [4001, 10000]      *
  ---
  seqlengths:
    chr1
   10000

Note the Bioconductor convention that chromosomes start at 1, and that the ranges are closed -- the start and end coordinates are included in the range. Use shift and narrow on gr to make your ranges consistent with Bioconductor conventions. GRanges operations are efficient on 10's of millions of ranges.

Upvotes: 4

Related Questions