mgab
mgab

Reputation: 3994

Getting rows from a PANDAS dataframe that fulfill (a dictionary?) of requirements

I want to filter rows of a pandas DataFrame by specifying a variable set of column==value conditions.

Let's say we have a toy DataFrame like this one:

from itertools import product
from numpy.random import rand
df = pd.DataFrame([[i,j,k,rand()] for i,j,k,m in product(range(2), repeat=3)],
                    columns=['par1','par2','par3','val'])

where some of the rows would look something like:

   par1  par2  par3       val
0     0     0     0  0.464625
1     0     0     1  0.481147
2     0     1     0  0.817992
3     0     1     1  0.639930
4     1     0     0  0.035160
5     1     0     1  0.549517
6     1     1     0  0.172746
7     1     1     1  0.855064

I wonder what's the best way to select some rows by specifying some column==value conditions that don't need to include all columns, nor be always the same columns or even the same number of columns. I think a dict would be a fairly natural way to specify the conditions:

conditions = {'par1':1, 'par3':0}

In this case any value of df.par2 would do.

df.isin()

I'm aware of df.isin() with dict arguments together with all(1) as shown in the docs (last block of code of the section). The problem is that the values in columns that weren't pass a criteria in the df.isin() call give False, so the subsequent call to all(1) gives an empty DataFrame. (A way out would be to add all the missing columns with ALL possible values as criteria, but it doesn't sound like a nice solution)

df.query()

While writing the question I came up with this other attempt. This one looks quite better: automatically building the query from a conditions dict.

df.query(' & '.join(['({} == {})'.format(k,v)
                     for k,v in conditions.iteritems()]))

it works as expected...

   par1  par2  par3       val
4     1     0     0  0.035160
6     1     1     0  0.172746

Still, I'm not completely convinced, I wonder if is there a more natural/proper/clear way to do it... Pandas is so huge I always have the impression I'm missing the proper way to do things... :P

Upvotes: 0

Views: 735

Answers (1)

DSM
DSM

Reputation: 353379

You could make a Series of conditions and select only those columns:

>>> df[(df[list(conditions)] == pd.Series(conditions)).all(axis=1)]
   par1  par2  par3       val
4     1     0     0  0.937192
6     1     1     0  0.536029

This works because after we make the Series, it compares the way we need it to:

>>> df[list(conditions)]
   par1  par3
0     0     0
1     0     1
2     0     0
3     0     1
4     1     0
5     1     1
6     1     0
7     1     1
>>> df[list(conditions)] == pd.Series(conditions)
    par1   par3
0  False   True
1  False  False
2  False   True
3  False  False
4   True   True
5   True  False
6   True   True
7   True  False

Upvotes: 4

Related Questions