Kryo
Kryo

Reputation: 933

How to compute percentage of overlap in R

I am trying to compute the percentage of overlap between two datasets with genomic coordinates, satisfying certain criteria.

seg2

ID   chrom loc.start   loc.end num.mark seg.mean
AB    1   3010000 173490000     8430   0.0039
AB    1 173510000 173590000        5  -17.738
AB    1 173610000 173830000       12    0.011
AB    1 173850000 173970000        6  -16.121
AB    2   3090000 181990000     8434    0.011
BB   12   3090000  68990000     2950   -0.2022
BB   12  69010000  87790000      889    0.0267
BB   12  88010000  98550000      507   -0.3337
BB   12  98570000 115090000      800    0.0586
BB   12 115110000 119350000      197   -0.2031
BB   12 119370000 119430000        4   -20.671

over

 chr     start   end    CNA      sample.ID
  1  68580000  68640000 loss    1-68580000-68640000
  3  15360000  16000000 loss    3-15360000-16000000
  4 122660000 123500000 gain   4-122660000-123500000
  7  48320000  48400000 loss    7-48320000-48400000
  12 115860000 115980000 loss  12-115860000-115980000
 12 113560000 114920000 gain   12-113560000-114920000

expected output

ID   chrom loc.start   loc.end num.mark seg.mean  lm(percentage of overlap)
AB    1   3010000 173490000     8430   0.0039         %
AB    1 173510000 173590000        5  -17.738     
AB    1 173610000 173830000       12    0.011     
AB    1 173850000 173970000        6  -16.121     
AB    2   3090000 181990000     8434    0.011     
BB   12   3090000  68990000     2950   -0.2022     
BB   12  69010000  87790000      889    0.0267
BB   12  88010000  98550000      507   -0.3337
BB   12  98570000 115090000      800    0.0586
BB   12 115110000 119350000      197   -0.2031
BB   12 119370000 119430000        4   -20.671

I tried this script, but it's not working.

for (i in 1:now(seg2)) { 
    seg2$lm <- if((seg2$chrom[i] == over$chr[i]) |
    (seg2$loc.start[i] <= over$start[i] & seg2$loc.end[i] >= over$end[i]) |
    (over$seg.mean[i] >= 0.459 & seg2$CNA[i] == "gain") |
    (over$seg.mean[i] <= -0.678 & seg2$CNA[i] == "loss"), 
    (over$end[i]-over$start[i])/(seg2$loc.end[i]-seg2$loc.start[i])*100)
    }

I am aware of the GenomicRanges package, but would be grateful for suggestions.

Upvotes: 2

Views: 1749

Answers (1)

Veerendra Gadekar
Veerendra Gadekar

Reputation: 4472

I would strongly suggest you to use GenomicFeatures to do this efficiently. If you are already aware of creating your own Granges objects then you need to do following two steps to get the length of overlap

# to find overlaps
overlappin.index = findOverlaps(object1, object2)

# to get the overlap length 
width(ranges(overlapping.index, ranges(object1),ranges(object2)))

Where, "object1" and "object2" are the GRanges objects with coordinates, and "overlappin.index" is the indexes of the objects which are in overlap. Once you have the length you can easily get the percentages.

Upvotes: 2

Related Questions