Reputation: 350
I have a great amount of intervals (around 5k to 10k). These elements have a start and end position; like (203, 405). The coordinates of the intervals are stored in a list.
I want to determine the coordinates and lengths of the overlapping parts between each pair of intervals. This can be done as follows:
# a small list for clarity, with length normally around 5000s
cList = ((20, 54), (25, 48), (67, 133), (90,152), (140,211), (190,230))
for i, c1 in enumerate(cList[:-1]): # a linear pairwise combination
for c2 in cList[i + 1:]:
left = max(c1[0], c2[0])
right = min(c1[1], c2[1])
overlap = right - left
if overlap > 0:
print "left: %s, right: %s, length: %s" % (left, right, overlap)
Resulting in:
left: 25, right: 48, length: 23
left: 90, right: 133, length: 43
left: 140, right: 152, length: 12
left: 190, right: 211, length: 21
As can be seen, it works... since this can take quite some time (20 seconds) my question is, how would I optimize this? I tried to cut off the second for loop when the start position of the second loop exceeds the first end position:
if c1[1] < c2[0]:
break
This drastically decreases the procedure time, but the resulting number of overlaps is about three times less than before, and therefor the result is certainly not valid. This is caused by elements which are much greater in length than preceding elements.
I'm sure there is some mathematical trick to solve this issue
Upvotes: 5
Views: 649
Reputation: 23322
Generally, this kind of problem becomes much easier if you sort the intervals by starting point.
Firstly, let us define a function, in order to make things clearer:
def overlap(r1, r2):
left = max(r1[0], r2[0])
right = min(r1[1], r2[1])
over = right - left
return (left, right, over) if over>0 else None
The algorithm you described can then be written as:
for i, c1 in enumerate(cList[:-1]):
for c2 in cList[i + 1:]:
o = overlap(c1, c2)
if o is not None:
print("left: %s, right: %s, length: %s" % o)
If you sort the elements, you can "short-circuit" as soon as you find a non-overlapping segment, since you know those further in the list will be even further "away":
l = sorted(cList)
for i, c1 in enumerate(l[:-1]):
for c2 in l[i + 1:]:
o= overlap(c1, c2)
if o is None:
break
print("left: %s, right: %s, length: %s" % o)
Of course, if your input is already sorted (as it seems), you can skip that step.
Note that, in general, instead of using the double for
, you can use the much clearer itertools.combinations
. It guarantees the same kind of ordering. Unfortunately, it isn't suitable for the optimized version of the algorithm, but yours could be written as
from itertools import combinations
for c1, c2 in combinations(cList, 2):
o= overlap(c1, c2)
if o is not None:
print("left: %s, right: %s, length: %s" % o)
Finally, if you ever want to perform fast generic operations on intervals, you may also want to consider using a Interval Tree data structure. There's a Python implementation on PyPI.
Upvotes: 6