svural
svural

Reputation: 1031

finding a set of ranges that a number fall in

I have a 200k lines list of number ranges like start_position,stop position. The list includes all kinds of overlaps in addition to nonoverlapping ones.

the list looks like this

I need to find the ranges that a given number fall in. And will repeat it for 100k numbers. For example if 18 is the given number with the list above then the function should return [10,30] [15,25]

I am doing it in a overly complicated way using bisect, can anybody give a clue on how to do it in a faster way.

Thanks

Upvotes: 6

Views: 4398

Answers (6)

Brionius
Brionius

Reputation: 14098

Ok, here is a tree implementation:

import itertools

class treeNode:
    def __init__(self, segments):
        self.value = None
        self.overlapping = None
        self.below = None
        self.above = None
        if len(segments) > 0:
            self.value = sum(itertools.chain.from_iterable(segments))/(2*len(segments))
            self.overlapping = set()
            belowSegs = set()
            aboveSegs = set()
            for segment in segments:
                if segment[0] <= self.value:
                    if segment[1] < self.value:
                        belowSegs.add(segment)
                    else:
                        self.overlapping.add(segment)
                else:
                    aboveSegs.add(segment)
            self.below = treeNode(belowSegs)
            self.above = treeNode(aboveSegs)
        else:
            self.overlapping = None

    def getOverlaps(self, item):
        if self.overlapping is None:
            return set()
        if item < self.value:
            return {x for x in self.overlapping if x[0]<=item and item<=x[1]} | self.below.getOverlaps(item)
        elif item > self.value:
            return {x for x in self.overlapping if x[0]<=item and item<=x[1]} | self.above.getOverlaps(item)
        else:
            return self.overlapping

DEMO:

import random as r

maxInt = 100
numSegments = 200000
rng = range(maxInt-1)
lefts = [r.choice(rng) for x in range(0, numSegments)]
segments = [(x, r.choice(range(x+1, maxInt))) for x in lefts]  # Construct a list of 200,000 random segments from integers between 0 and 100

tree = treeNode(segments)  # Builds the tree structure based on a list of segments/ranges
def testReturnSegments():
    for item in range(0,100000):
        item = item % maxInt
        overlaps = tree.getOverlaps(item)

import cProfile
cProfile.run('testReturnSegments()')

This uses 200k segments, and find the overlapping segments for 100k numbers, and runs in 94 seconds on my 2.3 GHz Intel i5. Here's the cProfile output:

OUTPUT:

         1060004 function calls (580004 primitive calls) in 95.700 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   95.700   95.700 <string>:1(<module>)
580000/100000   11.716    0.000   93.908    0.001 stackoverflow.py:27(getOverlaps)
   239000   43.461    0.000   43.461    0.000 stackoverflow.py:31(<setcomp>)
   241000   38.731    0.000   38.731    0.000 stackoverflow.py:33(<setcomp>)
        1    1.788    1.788   95.700   95.700 stackoverflow.py:46(testReturnSegments)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.004    0.004    0.004    0.004 {range}

Upvotes: 2

zs2020
zs2020

Reputation: 54514

It seems a range coverage problem. Since you have a large amount queries to process, you need to give the result as quickly as possible. There is an algorithm relating to this problem, you can take a look at the Segment Tree.

The idea is to build a Segment Tree first based on the intervals given, and then for each query, you can do as fast as log(N), where N is the number of intervals.

To get all possible k intervals, first find the parent node covering all sub intervals with log(n), and then traverse its children to retrieve all k intervals. So the time complexity of retrieving all intervals for a given number is bounded by log(N) + O(k), where k << n.

This algorithm could be deteriorated to as slow as O(N) when all intervals contains the given number. In this case, since the size of output is N, there does not exist any better solution. Because the complexity of the algorithm must be at least at the same order or higher than the output size.

Hope it helps.

Upvotes: 10

bunnybare
bunnybare

Reputation: 181

The best way to solve this is for you to construct an Interval tree. (The range tree given by sza is static. Which means after you construct it you can't add/delete nodes. If that is ok with you then his answer will suffice.)

You can learn about interval tree here:

Youtube: https://www.youtube.com/watch?v=q0QOYtSsTg4

Wiki: https://en.wikipedia.org/wiki/Interval_tree

An interval tree is basically a binary tree based on the left value of the range tuple. ([left, right]) What's special about it is that each node in the tree has a value named max. The max value holds the biggest right value of the nodes that are below this node, including the node itself. In other words, the current node will have it's max value set to the biggest right value of the subtree which the current node is the root.

To expand this for your problem, we can add minimum value as well. Where the min value will hold the smallest left value of all the subtree where this node is the root.

Edited screencap from the video: (sorry for the quality)

enter image description here

That means when we query for a number you:

1) add current node to set of answers if number is inside range    
2) go left if number is less than max value of left child.    
3) go right if number is greater than min value of right child.

Upvotes: 4

Brionius
Brionius

Reputation: 14098

import random as r
rng = range(99)
lefts = [r.choice(rng) for x in range(0, 200000)]
segments = [(x, r.choice(range(x+1, 100))) for x in lefts]

leftList = sorted(segments, key=lambda x:x[0])
rightList = sorted(segments, key=lambda x:x[1])

def returnSegments(item, segments, leftList, rightList):
    for leftN in range(len(leftList)):
        if leftList[leftN][0] > item:
            break
    for rightN in range(len(rightList)):
        if rightList[rightN][1] > item:
            break
    return list(set(leftList[:leftN]) & set(rightList[rightN:]))

def testReturnSegments():
    for item in range(0,100):
        item = item % 100
        returnSegments(item, segments, leftList, rightList)

This code uses 200k segments, and 100 numbers. It ran on my 2.3 GHz i5 macbook in 9.3 seconds, meaning for a full 200k x 100k run, you'd need 2.5 hours. Probably not fast enough, but there may be ways to speed this approach up - I'll think about it.

Upvotes: 2

idfah
idfah

Reputation: 1438

How about,

  1. sort by first column O(n log n)

  2. binary search to find indices that are out of range O(log n)

  3. throw out values out of range

  4. sort by second column O(n log n)

  5. binary search to find indices that are out of range O(log n)

  6. throw out values out of range

  7. you are left with the values in range

This should be O(n log n)

You can sort rows and cols with np.sort and a binary search should only be a few lines of code.

If you have lots of queries, you can save the first sorted copy for subsequent calls but not the second. Depending on the number of queries, it may turn out to be better to do a linear search than to sort then search.

Upvotes: 0

shx2
shx2

Reputation: 64308

def get_containing_intervals(x):
    for start, end in intervals:
        if start <= x <= end:
            yield x

Using <= here is based on the assumption the intervals are inclusive (close-ended).

If you intend to call this function many times, you probably want to do some kind of preprocessing, e.g. what @sza suggests.

Upvotes: 0

Related Questions