Carl M
Carl M

Reputation: 215

Searching large structured numpy arrays quickly

I have a structured numpy array of format

dataZero = [(1000, 1045),  # ('gid','lpid')
            (2345, 2500),
                 ...    ]

which has ~130,000 entries in it. I also have another structured array of format

dataSnap = [(1002,...,...,...),   # ('gid',...)
            (2400,...,...,...),
            (2490,...,...,...),
                 ...          ]

but this contains 2 million entries.

For every entry in dataZero, i, I want to find the entries in dataSnap that satisfy the condition dataZero['gid'][i] <= dataSnap['gid'] <= dataZero['lpid'][i]. This should allow for multiple entries to be returned. It can either return the indices of the entries that meet this condition, or the entries themselves - as long as I do this within a reasonable amount of time ~minutes.

In the above example, this would return, for example:

[some code] -> [[0],[1,2],...]   # if returning indices of dataSnap

[some code] -> [[(1002,...,...,...)],  # if returning the entries themselves
                [(2400,...,...,...),(2490,...,...,...)],
               ...]

I can't figure out how to do this quickly - a for loop takes ages. Might some conversion to a set {} be better? Any advice/solutions are appreciated.

Upvotes: 2

Views: 105

Answers (1)

Steven Rumbalski
Steven Rumbalski

Reputation: 45542

Here is an algorithm using Python sorted lists (not arrays). You may need to adjust it to work with numpy.

def helper(dataZero, dataSnap): # assumes arguments are sorted ascending!
    ds_iter = iter(dataSnap)
    curr_snap = next(ds_iter)
    for lo, hi in dataZero:
        while curr_snap[0] < lo:
            curr_snap = next(ds_iter)
        while lo <= curr_snap[0] <= hi:
            yield curr_snap
            curr_snap = next(ds_iter) 

result = list(helper(dataZero, dataSnap))

I tested this with dataSnap containing 2 million randomly generated entries and dataZero containing 130k randomly chosen ranges from dataSnap. It took at most couple of seconds on my system.

Upvotes: 2

Related Questions