Reputation: 3994
I want to filter rows of a pandas DataFrame
by specifying a variable set of column==value
conditions.
Let's say we have a toy DataFrame
like this one:
from itertools import product
from numpy.random import rand
df = pd.DataFrame([[i,j,k,rand()] for i,j,k,m in product(range(2), repeat=3)],
columns=['par1','par2','par3','val'])
where some of the rows would look something like:
par1 par2 par3 val
0 0 0 0 0.464625
1 0 0 1 0.481147
2 0 1 0 0.817992
3 0 1 1 0.639930
4 1 0 0 0.035160
5 1 0 1 0.549517
6 1 1 0 0.172746
7 1 1 1 0.855064
I wonder what's the best way to select some rows by specifying some
column==value
conditions that don't need to include all columns, nor be always the same columns or even the same number of columns. I think a dict
would be a fairly natural way to specify the conditions:
conditions = {'par1':1, 'par3':0}
In this case any value of df.par2
would do.
df.isin()
I'm aware of df.isin()
with dict
arguments together with all(1)
as shown in the docs (last block of code of the section). The problem is that the values in columns that weren't pass a criteria in the df.isin()
call give False
, so the subsequent call to all(1)
gives an empty DataFrame. (A way out would be to add all the missing columns with ALL possible values as criteria, but it doesn't sound like a nice solution)
df.query()
While writing the question I came up with this other attempt. This one looks quite better: automatically building the query from a conditions dict
.
df.query(' & '.join(['({} == {})'.format(k,v)
for k,v in conditions.iteritems()]))
it works as expected...
par1 par2 par3 val
4 1 0 0 0.035160
6 1 1 0 0.172746
Still, I'm not completely convinced, I wonder if is there a more natural/proper/clear way to do it... Pandas is so huge I always have the impression I'm missing the proper way to do things... :P
Upvotes: 0
Views: 735
Reputation: 353379
You could make a Series of conditions
and select only those columns:
>>> df[(df[list(conditions)] == pd.Series(conditions)).all(axis=1)]
par1 par2 par3 val
4 1 0 0 0.937192
6 1 1 0 0.536029
This works because after we make the Series, it compares the way we need it to:
>>> df[list(conditions)]
par1 par3
0 0 0
1 0 1
2 0 0
3 0 1
4 1 0
5 1 1
6 1 0
7 1 1
>>> df[list(conditions)] == pd.Series(conditions)
par1 par3
0 False True
1 False False
2 False True
3 False False
4 True True
5 True False
6 True True
7 True False
Upvotes: 4