Benjamin
Benjamin

Reputation: 350

Python: optimizing pairwise overlaps between intervals

I have a great amount of intervals (around 5k to 10k). These elements have a start and end position; like (203, 405). The coordinates of the intervals are stored in a list.

I want to determine the coordinates and lengths of the overlapping parts between each pair of intervals. This can be done as follows:

# a small list for clarity, with length normally around 5000s
cList = ((20, 54), (25, 48), (67, 133), (90,152), (140,211), (190,230)) 
for i, c1 in enumerate(cList[:-1]): # a linear pairwise combination
    for c2 in cList[i + 1:]:
        left =  max(c1[0], c2[0])
        right = min(c1[1], c2[1])
        overlap = right - left
        if overlap > 0:
            print "left: %s, right: %s, length: %s" % (left, right, overlap)

Resulting in:

left: 25, right: 48, length: 23
left: 90, right: 133, length: 43
left: 140, right: 152, length: 12
left: 190, right: 211, length: 21

As can be seen, it works... since this can take quite some time (20 seconds) my question is, how would I optimize this? I tried to cut off the second for loop when the start position of the second loop exceeds the first end position:

if c1[1] < c2[0]:
    break

This drastically decreases the procedure time, but the resulting number of overlaps is about three times less than before, and therefor the result is certainly not valid. This is caused by elements which are much greater in length than preceding elements.

I'm sure there is some mathematical trick to solve this issue

Upvotes: 5

Views: 649

Answers (1)

loopbackbee
loopbackbee

Reputation: 23322

Generally, this kind of problem becomes much easier if you sort the intervals by starting point.

Firstly, let us define a function, in order to make things clearer:

def overlap(r1, r2):
    left =  max(r1[0], r2[0])
    right = min(r1[1], r2[1])
    over = right - left
    return (left, right, over) if over>0 else None

The algorithm you described can then be written as:

for i, c1 in enumerate(cList[:-1]): 
    for c2 in cList[i + 1:]:
        o = overlap(c1, c2)
        if o is not None:
            print("left: %s, right: %s, length: %s" % o)

If you sort the elements, you can "short-circuit" as soon as you find a non-overlapping segment, since you know those further in the list will be even further "away":

l = sorted(cList)
for i, c1 in enumerate(l[:-1]): 
    for c2 in l[i + 1:]:
        o= overlap(c1, c2)
        if o is None:
            break
        print("left: %s, right: %s, length: %s" % o)

Of course, if your input is already sorted (as it seems), you can skip that step.

Note that, in general, instead of using the double for, you can use the much clearer itertools.combinations. It guarantees the same kind of ordering. Unfortunately, it isn't suitable for the optimized version of the algorithm, but yours could be written as

from itertools import combinations
for c1, c2 in combinations(cList, 2):
    o= overlap(c1, c2)
    if o is not None:
        print("left: %s, right: %s, length: %s" % o)

Finally, if you ever want to perform fast generic operations on intervals, you may also want to consider using a Interval Tree data structure. There's a Python implementation on PyPI.

Upvotes: 6

Related Questions