Searching large structured numpy arrays quickly

Question

I have a structured numpy array of format

dataZero = [(1000, 1045),  # ('gid','lpid')
            (2345, 2500),
                 ...    ]

which has ~130,000 entries in it. I also have another structured array of format

dataSnap = [(1002,...,...,...),   # ('gid',...)
            (2400,...,...,...),
            (2490,...,...,...),
                 ...          ]

but this contains 2 million entries.

For every entry in dataZero, i, I want to find the entries in dataSnap that satisfy the condition dataZero['gid'][i] <= dataSnap['gid'] <= dataZero['lpid'][i]. This should allow for multiple entries to be returned. It can either return the indices of the entries that meet this condition, or the entries themselves - as long as I do this within a reasonable amount of time ~minutes.

In the above example, this would return, for example:

[some code] -> [[0],[1,2],...]   # if returning indices of dataSnap

[some code] -> [[(1002,...,...,...)],  # if returning the entries themselves
                [(2400,...,...,...),(2490,...,...,...)],
               ...]

I can't figure out how to do this quickly - a for loop takes ages. Might some conversion to a set {} be better? Any advice/solutions are appreciated.

Steven Rumbalski · Accepted Answer

Here is an algorithm using Python sorted lists (not arrays). You may need to adjust it to work with numpy.

def helper(dataZero, dataSnap): # assumes arguments are sorted ascending!
    ds_iter = iter(dataSnap)
    curr_snap = next(ds_iter)
    for lo, hi in dataZero:
        while curr_snap[0] < lo:
            curr_snap = next(ds_iter)
        while lo <= curr_snap[0] <= hi:
            yield curr_snap
            curr_snap = next(ds_iter) 

result = list(helper(dataZero, dataSnap))

I tested this with dataSnap containing 2 million randomly generated entries and dataZero containing 130k randomly chosen ranges from dataSnap. It took at most couple of seconds on my system.

Searching large structured numpy arrays quickly

Answers (1)

Related Questions