Pandas: Efficient way to select rows from a dataframe using multiple criteria

Question

I am selecting/filtering a DataFrame using multiple criteria (comparisons against variables), like so:

results = df1[
    (df1.Year == Year) &
    (df1.headline == text) &
    (df1.price > price1) &
    (df1.price < price2) &
    (df1.promo > promo1) &
    (df1.promo < promo2)
]

While this approach works, it is very slow. Hence I wonder, is there any more efficient way of filtering/selecting rows based on multiple criteria using pandas?

Brad Solomon · Accepted Answer

Your current approach is pretty by-the-book as far as Pandas syntax goes, in my personal opinion.

One way to optimize, if you really need to do so, is to use the underlying NumPy arrays for generating the boolean masks. Generally speaking, Pandas may come with a bit of additional overhead in how it overloads operators versus NumPy. (With the tradeoff being arguably greater flexibility and intrinsically smooth handling of NaN data.)

price = df1.price.values
promo = df1.promo.values

# Note: boolean-mask indexing returns a copy of the matching rows, not a view of df1
results = df1.loc[
    (df1.Year.values == Year) &
    (df1.headline.values == text) &
    (price > price1) &
    (price < price2) &
    (promo > promo1) &
    (promo < promo2)
]
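
To make the comparison concrete, here is a minimal, self-contained sketch that builds both masks on made-up data and checks that they select the same rows. The sample DataFrame, the threshold values, and the use of `.to_numpy()` in place of `.values` are my assumptions for illustration; only the column names come from the question. Any speedup depends on your data and Pandas version, so time it on your own DataFrame before committing to this.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
rng = np.random.default_rng(0)
n = 1_000
df1 = pd.DataFrame({
    "Year": rng.integers(2015, 2020, size=n),
    "headline": rng.choice(["sale", "new", "clearance"], size=n),
    "price": rng.uniform(0, 100, size=n),
    "promo": rng.uniform(0, 1, size=n),
})

# Hypothetical filter values.
Year, text = 2017, "sale"
price1, price2 = 10.0, 90.0
promo1, promo2 = 0.1, 0.9

# Mask built from Series comparisons (the approach in the question).
mask_pd = (
    (df1.Year == Year) &
    (df1.headline == text) &
    (df1.price > price1) &
    (df1.price < price2) &
    (df1.promo > promo1) &
    (df1.promo < promo2)
)

# Mask built from the underlying NumPy arrays.
price = df1.price.to_numpy()
promo = df1.promo.to_numpy()
mask_np = (
    (df1.Year.to_numpy() == Year) &
    (df1.headline.to_numpy() == text) &
    (price > price1) &
    (price < price2) &
    (promo > promo1) &
    (promo < promo2)
)

results_pd = df1[mask_pd]
results_np = df1.loc[mask_np]
```

Both `results_pd` and `results_np` should contain the same rows; the difference is only in how the mask is computed. Keep in mind that comparisons on raw arrays skip index alignment, so this only makes sense when every mask comes from the same DataFrame.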

Secondly, check that you are already taking advantage of numexpr, an optional dependency that Pandas can use to speed up some numeric and boolean operations:

>>> import pandas as pd
>>> pd.get_option('compute.use_numexpr')  # use `pd.set_option()` if False
True
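
If you would rather do that check from a script, something along these lines should work. This is a sketch under one assumption of mine: it tests whether numexpr is importable before switching the option on, since the option has no effect when the package is missing. The variable names are just illustrative.

```python
import importlib.util

import pandas as pd

# numexpr is optional; find out whether it is installed without importing it.
numexpr_installed = importlib.util.find_spec("numexpr") is not None

# Only switch the option on when the package is actually available.
if numexpr_installed:
    pd.set_option("compute.use_numexpr", True)

use_numexpr = pd.get_option("compute.use_numexpr")
print(f"numexpr installed: {numexpr_installed}, option enabled: {use_numexpr}")
```

Even with the option enabled, Pandas only hands work to numexpr for sufficiently large operations, so small DataFrames are unlikely to see any change.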
