saurabh

Reputation: 2459

pandas dataframe optimized query

I have a rather large dataframe with upwards of 100,000 records. I need to repeatedly perform a range search on a numeric field of this dataframe. I am able to perform this search using loc or query, but it takes a lot of time. I think the reason is that it is currently a brute-force search. Is there some way to build an index on this numeric field so that I can run a better-optimized range search on it?

sample dataframe -

field1   field2   field3
red        car     1000000000
green      truck   2000000000
yellow     bus     3000000000
white      bike    4000000000
black      cycle   5000000000

search -

dataframe.query('field3 > 1000000000 & field3 < 5000000000')

I call this query in a for loop that runs around 6000 times, and my overall code takes around 25 minutes. At roughly 150 ms per call, 6000 calls come to about 900 seconds, i.e. 15 minutes. If I could somehow create an index on this field, I think the query time would drop drastically.

Upvotes: 0

Views: 272

Answers (1)

hilberts_drinking_problem

Reputation: 11602

One approach is to sort the dataframe by the numeric column (searchsorted requires sorted data), then use searchsorted + iloc. E.g.

df.iloc[df.field3.searchsorted(min_v, 'left'):df.field3.searchsorted(max_v, 'right')]

This seems to be about 8 times as fast as df.query on my machine with 100k records.
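Putting the idea together, here is a runnable sketch with made-up data (the `range_search` helper and the random `field3` values are mine, not from the question). Note that `side='left'` on the lower bound and `side='right'` on the upper bound give an inclusive `[min_v, max_v]` range; for the strict `>`/`<` comparisons in the original query you would swap the sides.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the 100k-row frame in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({"field3": rng.integers(0, 5_000_000_000, size=100_000)})

# searchsorted only works on sorted data, so sort once up front.
df = df.sort_values("field3", ignore_index=True)

def range_search(df, min_v, max_v):
    """Rows with min_v <= field3 <= max_v via two binary searches (O(log n))."""
    lo = df["field3"].searchsorted(min_v, side="left")
    hi = df["field3"].searchsorted(max_v, side="right")
    return df.iloc[lo:hi]

sub = range_search(df, 1_000_000_000, 2_000_000_000)
```

The one-time sort is amortized across all subsequent queries, which is why this pays off when the same frame is searched thousands of times.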

If you know the set of your queries ahead of time, you can do better. For example, say you have arrays mins and maxs with len(mins) == len(maxs) == 6000. You can pre-compute

min_ix = df.field3.searchsorted(mins, 'left')
max_ix = df.field3.searchsorted(maxs, 'right')

and then iterate over the filtered frames via

for i, j in zip(min_ix, max_ix):
    print(df.iloc[i:j])
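As a concrete end-to-end sketch (with small made-up arrays in place of the asker's 6000 query bounds), `searchsorted` accepts array-like input, so both index arrays come from a single vectorized call each:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"field3": rng.integers(0, 10_000, size=1_000)})
df = df.sort_values("field3", ignore_index=True)  # searchsorted needs sorted data

# Hypothetical query bounds; in the question there would be 6000 of them.
mins = np.array([100, 2_000, 7_500])
maxs = np.array([500, 3_000, 9_000])

# One vectorized binary search per bound array instead of one scan per query.
min_ix = df["field3"].searchsorted(mins, side="left")
max_ix = df["field3"].searchsorted(maxs, side="right")

results = [df.iloc[i:j] for i, j in zip(min_ix, max_ix)]
```

Each slice in `results` holds the rows with `field3` inside the corresponding inclusive `[min, max]` interval.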

Upvotes: 1

Related Questions