Reputation: 215
I have a structured numpy array of format
dataZero = [(1000, 1045), # ('gid','lpid')
(2345, 2500),
... ]
which has ~130,000 entries in it. I also have another structured array of format
dataSnap = [(1002,...,...,...), # ('gid',...)
(2400,...,...,...),
(2490,...,...,...),
... ]
but this contains 2 million entries.
For every entry in dataZero
, i
, I want to find the entries in dataSnap
that satisfy the condition dataZero['gid'][i] <= dataSnap['gid'] <= dataZero['lpid'][i]
. This should allow for multiple entries to be returned. It can either return the indices of the entries that meet this condition, or the entries themselves - as long as I do this within a reasonable amount of time ~minutes.
In the above example, this would return, for example:
[some code] -> [[0],[1,2],...] # if returning indices of dataSnap
[some code] -> [[(1002,...,...,...)], # if returning the entries themselves
[(2400,...,...,...),(2490,...,...,...)],
...]
I can't figure out how to do this quickly - a for loop takes ages. Might some conversion to a set {}
be better? Any advice/solutions are appreciated.
Upvotes: 2
Views: 105
Reputation: 45542
Here is an algorithm using Python sorted lists (not arrays). You may need to adjust it to work with numpy
.
def helper(dataZero, dataSnap): # assumes arguments are sorted ascending!
ds_iter = iter(dataSnap)
curr_snap = next(ds_iter)
for lo, hi in dataZero:
while curr_snap[0] < lo:
curr_snap = next(ds_iter)
while lo <= curr_snap[0] <= hi:
yield curr_snap
curr_snap = next(ds_iter)
result = list(helper(dataZero, dataSnap))
I tested this with dataSnap
containing 2 million randomly generated entries and dataZero
containing 130k randomly chosen ranges from dataSnap
. It took at most couple of seconds on my system.
Upvotes: 2