Reputation: 199
There are plenty of relevant examples of replicating a SQL "LIKE" condition in Pandas dataframe.query(). However, all of the examples I am finding essentially boil down to using column.str.contains(), which requires the additional engine='python'.
I am working with very large datasets, frequent queries, and complex WHERE conditions. The performance hit for using the python engine adds up quickly for my use case.
Is there a syntactic equivalent that utilizes the standard (and SUPER FAST) numexpr engine? I know that I can utilize other methods outside of .query(), but with complex queries they can become quite verbose.
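For context, the pattern I'm trying to avoid looks roughly like this (col1/col2 are stand-ins for my real columns):

# slow: str.contains() forces the python engine
df.query("col1.str.contains('abc') & (col2 > 0)", engine='python')

# the equivalent outside .query() works, but gets verbose
# once the WHERE condition has many clauses
df[df['col1'].str.contains('abc') & (df['col2'] > 0)]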
Upvotes: 2
Views: 652
Reputation: 30609
Let's set up some sample data first:
import pandas as pd
import numpy as np
import string
import numexpr as ne
n = 10_000_000
k = 5
# n random k-character lowercase strings: draw n*k single characters,
# then view the buffer as length-k strings
df = pd.DataFrame({'col1': np.random.choice(list(string.ascii_lowercase), size=n*k).view((np.str_, k)),
                   'col2': np.random.random(n)})
In order to make contains work with the numexpr engine you'll have to add .values. The query is still slow, however, and the reason is not the .values operation, as you can see from the comparison with and without .values for the python engine:
%timeit df.query("col1.str.contains('a').values", engine='numexpr')
#1 loop, best of 3: 3.48 s per loop
%timeit df.query("col1.str.contains('a')", engine='python')
#1 loop, best of 3: 3.55 s per loop
%timeit df.query("col1.str.contains('a').values", engine='python')
#1 loop, best of 3: 3.52 s per loop
For relatively simple queries, however, the performance benefit of numexpr is negligible (see also https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#performance-of-query). As we can see, .values adds just about 60 ms:
%timeit df.query("col2.values < .5", engine='numexpr')
1 loop, best of 3: 363 ms per loop
%timeit df.query("col2 < .5", engine='numexpr')
1 loop, best of 3: 298 ms per loop
%timeit df.query("col2 < .5", engine='python')
1 loop, best of 3: 299 ms per loop
Numexpr does support contains, but it's not implemented in pandas. An (admittedly not very elegant) workaround is to copy the values from pandas into arrays and then use numexpr.evaluate for boolean indexing:
%timeit df.query("col1.str.contains('a').values & (col2 < .5)", engine='numexpr')
#1 loop, best of 3: 3.53 s per loop
%timeit df.query("col1.str.contains('a') & (col2 < .5)", engine='python')
#1 loop, best of 3: 3.45 s per loop
%%timeit
# numexpr's contains() works on byte strings, so convert col1 accordingly
arr1 = df.col1.to_numpy(bytes)
arr2 = df.col2.to_numpy()
# evaluate the combined condition in numexpr and use the mask for indexing
df[ne.evaluate("contains(arr1, 'a') & (arr2 < .5)")]
#1 loop, best of 3: 1.18 s per loop
This is about 3 times faster than the pandas version, including the overhead of copying to arrays and boolean indexing. ne.evaluate("contains(arr1, 'a') & (arr2 < .5)") alone takes just 280 ms.
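If you need this pattern often, a small wrapper keeps the call sites compact. A minimal sketch (the name ne_query and the object-dtype check are my own convention, not a pandas or numexpr API):

def ne_query(df, expr):
    # Expose each column to numexpr as a plain NumPy array; string
    # (object-dtype) columns become byte strings, which contains() requires.
    arrays = {col: (df[col].to_numpy(bytes) if df[col].dtype == object
                    else df[col].to_numpy())
              for col in df.columns}
    return df[ne.evaluate(expr, local_dict=arrays)]

# usage, equivalent to the %%timeit cell above:
ne_query(df, "contains(col1, 'a') & (col2 < .5)")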
Upvotes: 2