Reputation: 18538
I have a map with {integer_key -> list[tuple]}
as key/value pairs. The tuples contain (start,end)
values that represent string indexes for a substring operation.
My goal is to remove overlapping regions and return a map where key/value pairs are {tuple -> integer_key}
.
Ranges mapped to lower integer_keys
take precendence over higher ones.
Below is a runnable example of my current implementation (requires this ordereddict class):
from collections import OrderedDict
string_length = 20
idx_region_map = OrderedDict()
idx_region_map[0] = [(0,2), (7,10)]
idx_region_map[1] = [(4,5), (18,19)]
idx_region_map[2] = [(3,3), (5,6), (10,13)]
idx_region_map[3] = [(15,17), (19,20)]
# Which can be represented as follows:
#
# |012345678901234567890|
# 0|ooo----oooo----------|
# 1|----oo------------oo-|
# 2|---o-oo---oooo-------|
# 3|---------------ooo-oo|
# ...
def filter_overlaps(string_length, idx_region_map):
region_idx_map = {}
occupied = [False for i in range(string_length)]
for idx, regions in idx_region_map.items():
for region in regions:
start, end = region[0], region[1] + 1
overlaps = any(occupied[start:end])
if not overlaps:
for i in range(start, end):
occupied[i] = True
region_idx_map[region] = idx
return region_idx_map
# Prints: {(3, 3): 2, (4, 5): 1, (18, 19): 1, (7, 10): 0, (0, 2): 0, (15, 17): 3}
print filter_overlaps(string_length, idx_region_map)
This seems to work well enough for my needs, but I'm curious to learn what alternative algorithms are there to solve this problem. For example, using different data structures or something more efficient than the above.
Upvotes: 3
Views: 1743
Reputation: 414305
Instead of occupied = [False]*string_length
you could maintain a data structure that stores non-overlapping ranges seen so far and supports .overlaps(start, end)
operation:
def overlaps(self, start, end):
# invariant: all stored intervals do not overlap
# perform binary search on non-overlapping intervals O(log n)
# if overlaps; merge with stored intervals O(m)
# else store (start, end) O(1)
# return whether (start, end) overlaps with already seen intervals
Complexity is O(log n + m)
, where n
- number of stored intervals, m
- number of intervals that overlap with given range.
if string_length
is not large your algorithm seems fine as is.
Upvotes: 0
Reputation: 1973
You can use an Interval tree.
I don't understand Python but I think you are doing a brute force here.
Another way would be to sort based on start index; so for your e.g you get
0 3 4 5 7 10 15 18 19
Now go through each start index and check where its corresponding end index lies w.r.t following start indices through a binary search i.e. here we take 0, get its end index which is 2 and see where 2 lies. Because 2 lies immediately after 0 it does not overlap anything but let's say 0's end index was 17 then it would mean 0,17 overlaps all start indices until 15 which are 3,4,5,7,10,15. The complexity is nlogn.
Edit
What I just realized is that you are retaining 4,5 though 4,5 and 5,6 overlap and I guess because 4,5 integer key is 1 which is less than integer key of 5,6 which is 2. So I guess you always retain the lower integer key though it is overlapping.
If this is the case, the complexity would be O(n^2) because you can't blindly do a binary search. For e.g. if 4's end index was 10 then you will have to go through 5,7 and 10 to check if their integer key is less than 4's. If it is then 4 and its end index can be filtered otherwise retain 4.
Upvotes: 3