Hans
Hans

Reputation: 536

How to speed up df.query

df.query uses numexpr under the hood, but is much slower than pure numexpr

Let's say I have a big DataFrame:

from random import shuffle
import pandas as pd
l1=list(range(1000000))
l2=list(range(1000000))
shuffle(l1)
shuffle(l2)
df = pd.DataFrame([l1, l2]).T
df=df.sample(frac=1)
df=df.rename(columns={0: 'A', 1:'B'})

And I want to compare 2 columns:

%timeit (df.A == df.B) | (df.A / df.B < 1) |  (df.A * df.B > 3000)
10.8 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It takes 10.8 ms in this example

Now I import numexpr and do the same thing:

import numexpr
a = df.A.__array__()
b = df.B.__array__()
%timeit numexpr.evaluate('((a == b) | (a / b < 1) | (a * b > 3000))')
1.95 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numexpr is 6 times faster than Pandas

Now let's use df.loc:

%timeit df.loc[numexpr.evaluate('((a == b) | (a / b < 1) | (a * b > 3000))')]
20.5 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.loc[(df.A == df.B) | (df.A / df.B < 1) |  (df.A * df.B > 3000)]
27 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.query('((A == B) | (A / B < 1) | (A * B > 3000))')
32.5 ms ± 80.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

numexpr is still significantly faster than pure Pandas. But why is df.query so slow? It uses numexpr under the hood. Is there a way to fix that? Or any other way to use numexpr in pandas without doing a lot of tweaking

Upvotes: 2

Views: 214

Answers (0)

Related Questions