Reputation: 2459
I have a rather large dataframe with upward of 100,000 records. I need to repeatedly perform a range search on a numeric field of this dataframe. I am able to perform this search using loc or query, but it is taking a lot of time. I think the reason is that right now it is a brute-force search. Is there some way to generate an index on this numeric field so that I can have a better-optimized range search on it?
sample dataframe -
field1 field2 field3
red car 1000000000
green truck 2000000000
yellow bus 3000000000
white bike 4000000000
black cycle 5000000000
search -
dataframe.query("field3 > 1000000000 & field3 < 5000000000")
I am using this query in a for loop which runs around 6000 times. My overall code is taking around 25 minutes. At 150 ms per call, calling this piece 6000 times takes 900 seconds, i.e. 15 minutes. If I can somehow create an index on this field, then I think the query time would be reduced drastically.
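For context, the loop is shaped roughly like this (the data and bounds below are made up for illustration; in my real code they come from elsewhere):
import numpy as np
import pandas as pd

# Made-up stand-in for my real data: 100k rows with a numeric field3.
rng = np.random.default_rng(0)
dataframe = pd.DataFrame({"field3": rng.integers(10**9, 5 * 10**9, size=100_000)})

# Made-up stand-in for the ~6000 (lo, hi) bounds I actually search for.
bounds = [(10**9, 5 * 10**9)] * 6_000

for lo, hi in bounds:
    result = dataframe.query("field3 > @lo & field3 < @hi")  # ~150 ms each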
Upvotes: 0
Views: 272
Reputation: 11602
One approach is to sort the numeric column, then use searchsorted + iloc. E.g.
df.iloc[df.field3.searchsorted(min_v, 'left'):df.field3.searchsorted(max_v, 'right')]
This seems to be about 8 times as fast as df.query on my machine with 100k records.
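A minimal self-contained sketch of this approach, using made-up data (100k rows) and bounds:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"field3": rng.integers(0, 10**10, size=100_000)})

# Sort once up front: searchsorted requires a sorted column, and
# ignore_index=True keeps positional iloc slices aligned with the sort order.
df = df.sort_values("field3", ignore_index=True)

min_v, max_v = 10**9, 5 * 10**9

# Two O(log n) binary searches replace the O(n) scan that query does.
lo = df.field3.searchsorted(min_v, "left")
hi = df.field3.searchsorted(max_v, "right")
subset = df.iloc[lo:hi]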
If you know the set of your queries ahead of time, you can do better. For example, say you have arrays mins and maxs with len(mins) == len(maxs) == 6000. You can pre-compute
min_ix = df.field3.searchsorted(mins, 'left')
max_ix = df.field3.searchsorted(maxs, 'right')
And iterate over the filtered frames via
for i, j in zip(min_ix, max_ix):
print(df.iloc[i:j])
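Putting it together, a sketch of the pre-computed variant (the mins and maxs arrays here are invented; substitute your real 6000 bounds):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"field3": rng.integers(0, 10**10, size=100_000)})
df = df.sort_values("field3", ignore_index=True)

# Invented bounds standing in for the real queries.
mins = rng.integers(0, 5 * 10**9, size=6_000)
maxs = mins + 10**9

# Vectorized: one searchsorted call computes all 6000 slice endpoints at once.
min_ix = df.field3.searchsorted(mins, "left")
max_ix = df.field3.searchsorted(maxs, "right")

for i, j in zip(min_ix, max_ix):
    chunk = df.iloc[i:j]  # process each filtered frame here
This way the 6000 full scans become 6000 cheap positional slices, and even the binary searches themselves are batched into two vectorized calls.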
Upvotes: 1