JClark

Reputation: 199

Pandas dataframe.query to SQL 'LIKE' WITHOUT using engine='python'

There are plenty of examples of replicating a SQL "LIKE" condition in Pandas dataframe.query(). However, all of the examples I can find boil down to using column.str.contains(), which requires the additional engine='python'.

I am working with very large datasets, frequent queries, and complex WHERE conditions. The performance hit for using the python engine adds up quickly for my use case.

Is there a syntactic equivalent that utilizes the standard (and SUPER FAST) numexpr engine? I know that I can utilize other methods outside of .query(), but with complex queries they can become quite verbose.
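For concreteness, the pattern I'm referring to looks like this (toy data; my real frames are far larger):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['apple', 'banana', 'cherry'],
                   'col2': [0.1, 0.2, 0.9]})

# SQL: SELECT * FROM df WHERE col1 LIKE '%an%' AND col2 < 0.5
# .str.contains() forces the slower python engine
result = df.query("col1.str.contains('an') and col2 < .5", engine='python')
```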

Upvotes: 2

Views: 652

Answers (1)

Stef

Reputation: 30609

Let's set up some sample data first:

import pandas as pd
import numpy as np
import string
import numexpr as ne

n = 10_000_000
k = 5
df = pd.DataFrame({'col1': np.random.choice(list(string.ascii_lowercase), size=n*k).view(f'U{k}'),
                   'col2': np.random.random(n)})

To make contains work with the numexpr engine you'll have to append .values. The query is still slow, however, and the reason is not the .values operation, as you can see by comparing the python engine with and without .values:

%timeit df.query("col1.str.contains('a').values", engine='numexpr')
#1 loop, best of 3: 3.48 s per loop

%timeit df.query("col1.str.contains('a')", engine='python')
#1 loop, best of 3: 3.55 s per loop

%timeit df.query("col1.str.contains('a').values", engine='python')
#1 loop, best of 3: 3.52 s per loop

For relatively simple queries, however, the performance benefit of numexpr is negligible (see also https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#performance-of-query). As we can see, .values adds only about 60 ms:

%timeit df.query("col2.values < .5", engine='numexpr')
#1 loop, best of 3: 363 ms per loop

%timeit df.query("col2 < .5", engine='numexpr')
#1 loop, best of 3: 298 ms per loop

%timeit df.query("col2 < .5", engine='python')
#1 loop, best of 3: 299 ms per loop


Numexpr itself does support contains, but pandas does not expose it in .query(). An admittedly inelegant workaround is to copy the column values from pandas into NumPy arrays and then use numexpr.evaluate for boolean indexing:

%timeit df.query("col1.str.contains('a').values & (col2 < .5)", engine='numexpr')
#1 loop, best of 3: 3.53 s per loop

%timeit df.query("col1.str.contains('a') & (col2 < .5)", engine='python')
#1 loop, best of 3: 3.45 s per loop

%%timeit
arr1 = df.col1.to_numpy(bytes)
arr2 = df.col2.to_numpy()
df[ne.evaluate("contains(arr1, 'a') & (arr2 < .5)")]
#1 loop, best of 3: 1.18 s per loop

This is about 3 times faster than the pandas version, including the overhead for copying to arrays and boolean indexing. ne.evaluate("contains(arr1, 'a') & (arr2 < .5)") alone takes just 280 ms.
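If this comes up often, the workaround can be wrapped in a small helper. This is just a sketch: the name like() is mine, and it assumes strings that can be encoded as fixed-width byte strings, like the sample data (to_numpy(bytes) will fail on non-ASCII text):

```python
import numexpr as ne
import pandas as pd

def like(df, col, pattern):
    """numexpr-backed equivalent of SQL `col LIKE '%pattern%'`.

    Returns a boolean NumPy array suitable for boolean indexing."""
    arr = df[col].to_numpy(bytes)  # numexpr's contains() operates on byte strings
    pat = pattern.encode()         # so the needle must be bytes too
    return ne.evaluate("contains(arr, pat)")
```

Combined with other conditions outside .query(), e.g. `df[like(df, 'col1', 'a') & (df.col2 < .5).to_numpy()]`.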

Upvotes: 2
