Reputation: 18538

Python - filter overlapping ranges

I have a map with {integer_key -> list[tuple]} as key/value pairs. The tuples contain (start,end) values that represent string indexes for a substring operation.

My goal is to remove overlapping regions and return a map where key/value pairs are {tuple -> integer_key}. Ranges mapped to lower integer_keys take precendence over higher ones.

Below is a runnable example of my current implementation (requires this ordereddict class):

from collections import OrderedDict

string_length = 20

idx_region_map = OrderedDict()
idx_region_map[0] = [(0,2), (7,10)]
idx_region_map[1] = [(4,5), (18,19)]
idx_region_map[2] = [(3,3), (5,6), (10,13)]
idx_region_map[3] = [(15,17), (19,20)]

# Which can be represented as follows:
#
#  |012345678901234567890|
# 0|ooo----oooo----------|
# 1|----oo------------oo-|
# 2|---o-oo---oooo-------|
# 3|---------------ooo-oo|
# ...

def filter_overlaps(string_length, idx_region_map):
    region_idx_map = {}
    occupied = [False for i in range(string_length)]
    for idx, regions in idx_region_map.items():
        for region in regions:
            start, end = region[0], region[1] + 1
            overlaps = any(occupied[start:end])

            if not overlaps:
                for i in range(start, end):
                    occupied[i] = True
                region_idx_map[region] = idx
    return region_idx_map


# Prints: {(3, 3): 2, (4, 5): 1, (18, 19): 1, (7, 10): 0, (0, 2): 0, (15, 17): 3}
print filter_overlaps(string_length, idx_region_map)

This seems to work well enough for my needs, but I'm curious to learn what alternative algorithms are there to solve this problem. For example, using different data structures or something more efficient than the above.

Upvotes: 3

Answers (2)

jfs

Reputation: 414305

Instead of occupied = [False]*string_length you could maintain a data structure that stores non-overlapping ranges seen so far and supports .overlaps(start, end) operation:

def overlaps(self, start, end):
    # invariant: all stored intervals do not overlap
    # perform binary search on non-overlapping intervals O(log n)
    # if overlaps; merge with stored intervals O(m)
    # else store (start, end) O(1)
    # return whether (start, end) overlaps with already seen intervals

Complexity is O(log n + m), where n - number of stored intervals, m - number of intervals that overlap with given range.

if string_length is not large your algorithm seems fine as is.

Upvotes: 0

user1168577

Reputation: 1973

You can use an Interval tree.

I don't understand Python but I think you are doing a brute force here.

Another way would be to sort based on start index; so for your e.g you get

0 3 4 5 7 10 15 18 19

Now go through each start index and check where its corresponding end index lies w.r.t following start indices through a binary search i.e. here we take 0, get its end index which is 2 and see where 2 lies. Because 2 lies immediately after 0 it does not overlap anything but let's say 0's end index was 17 then it would mean 0,17 overlaps all start indices until 15 which are 3,4,5,7,10,15. The complexity is nlogn.

Edit

What I just realized is that you are retaining 4,5 though 4,5 and 5,6 overlap and I guess because 4,5 integer key is 1 which is less than integer key of 5,6 which is 2. So I guess you always retain the lower integer key though it is overlapping.

If this is the case, the complexity would be O(n^2) because you can't blindly do a binary search. For e.g. if 4's end index was 10 then you will have to go through 5,7 and 10 to check if their integer key is less than 4's. If it is then 4 and its end index can be filtered otherwise retain 4.

Upvotes: 3

Python - filter overlapping ranges

Answers (2)

Related Questions