WNG

Reputation: 3805

Filtering dataframes in pandas: use a list of conditions

I have a pandas dataframe with two columns: 'col1' and 'col2'.

I can filter certain values of those two columns using:

df[(df["col1"] == 'foo') & (df["col2"] == 'bar')]

Is there any way I can filter both columns at once?

I naively tried to use the restriction of the dataframe to two columns, but my best guesses for the right-hand side of the equality don't work:

df[df[["col1","col2"]]==['foo','bar']]

yields this error:

ValueError: Invalid broadcasting comparison [['foo', 'bar']] with block values

I need to do this because both the names of the columns and the number of columns on which the condition is set will vary.

Upvotes: 12

Views: 10293

Answers (5)

anon01

Reputation: 11171

This is a pretty clean solution if you have identical join ops (& or |) for all filters:

cols = ['col1', 'col2']
conditions = ['foo', 'bar']

filtered_rows = True
for col, condition in zip(cols, conditions):
    # update filtered_rows with each filter condition
    current_filter = (df[col] == condition)
    filtered_rows &= current_filter

df = df[filtered_rows]
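If you need an | across the filters instead, a small sketch (my own, on a made-up DataFrame) just changes the accumulator's starting value to False:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
cols = ['col1', 'col2']
conditions = ['foo', 'spam']

# OR across the filters: start from False and accumulate with |=
filtered_rows = False
for col, condition in zip(cols, conditions):
    filtered_rows |= (df[col] == condition)

print(df[filtered_rows])  # rows 0 and 1 each match at least one filter
```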

Upvotes: 0

Michael Hoff

Reputation: 6318

I would like to point out an alternative to the accepted answer, as eval is not necessary for solving this problem.

import pandas as pd
from functools import reduce

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
cols = ['col1', 'col2']
values = ['foo', 'bar']
conditions = list(zip(cols, values))  # a list, so it supports len() and can be reused

# Loop version
def apply_conditions(df, conditions):
    assert len(conditions) > 0
    comps = [df[c] == v for c, v in conditions]
    result = comps[0]
    for comp in comps[1:]:
        result &= comp
    return result

# Equivalent version using reduce
def apply_conditions(df, conditions):
    assert len(conditions) > 0
    comps = [df[c] == v for c, v in conditions]
    return reduce(lambda c1, c2: c1 & c2, comps[1:], comps[0])

df[apply_conditions(df, conditions)]
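As a minor variant (my sketch, not part of the answer), operator.and_ can stand in for the lambda in the reduce call:

```python
import operator
from functools import reduce
import pandas as pd

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
conditions = [('col1', 'foo'), ('col2', 'bar')]

# reduce combines the boolean Series pairwise with elementwise &
mask = reduce(operator.and_, [df[c] == v for c, v in conditions])
print(df[mask])  # only row 0 satisfies both conditions
```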

Upvotes: 7

yerpderpington

Reputation: 1

Posting because I ran into a similar issue and found a solution that gets it done in one line, albeit a bit inefficiently:

cols, vals = ["col1", "col2"], ['foo', 'bar']
pd.concat([df.loc[df[cols[i]] == vals[i]] for i in range(len(cols))], join='inner')

This is effectively an & across the columns. To have an | across the columns, you can omit join='inner' and add a drop_duplicates() at the end.
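A sketch of that | variant (on a made-up DataFrame; note that drop_duplicates here assumes no two distinct rows share identical values):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
cols, vals = ['col1', 'col2'], ['foo', 'spam']

# Outer concat keeps rows matching either condition; a row matching
# several conditions appears several times, so drop the duplicates.
result = pd.concat([df.loc[df[cols[i]] == vals[i]]
                    for i in range(len(cols))]).drop_duplicates()
print(result)  # rows 0 and 1
```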

Upvotes: 0

cwallenpoole

Reputation: 82018

I know I'm late to the party on this one, but if you know that all of your conditions will use the same operator, then you can use functools.reduce. I have a CSV with something like 64 columns, and I have no desire whatsoever to copy and paste them. This is how I resolved it:

from functools import reduce

players = pd.read_csv('players.csv')

# I only want players who have any of the outfield stats over 0.
# That means they have to be an outfielder.
column_named_outfield = lambda x: x.startswith('outfield')

# If a column name starts with outfield, then it is an outfield stat. 
# So only include those columns
outfield_columns = filter(column_named_outfield, players.columns)

# Column must have a positive value
has_positive_value = lambda c:players[c] > 0
# We're looking to create a series of filters, so use "map"
list_of_positive_outfield_columns = map(has_positive_value, outfield_columns)

# Given two DF filters, this returns a third representing the "or" condition.
concat_or = lambda x, y: x | y
# Apply the filters through reduce to create a primary filter
is_outfielder_filter = reduce(concat_or, list_of_positive_outfield_columns)
outfielders = players[is_outfielder_filter]
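For a self-contained illustration of the same pipeline (with made-up data standing in for players.csv):

```python
import pandas as pd
from functools import reduce

# Hypothetical stand-in for players.csv
players = pd.DataFrame({
    'name': ['ann', 'bob', 'cid'],
    'outfield_catches': [2, 0, 0],
    'outfield_assists': [0, 1, 0],
})

# Same filter -> map -> reduce chain as above
outfield_columns = filter(lambda c: c.startswith('outfield'), players.columns)
positive_filters = map(lambda c: players[c] > 0, outfield_columns)
is_outfielder_filter = reduce(lambda x, y: x | y, positive_filters)
print(players[is_outfielder_filter])  # ann and bob have an outfield stat > 0
```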

Upvotes: 0

Alexander

Reputation: 109546

To the best of my knowledge, there is no built-in way in Pandas to do exactly what you want. However, although the following solution may not be the prettiest, you can zip a set of parallel lists as follows:

cols = ['col1', 'col2']
conditions = ['foo', 'bar']

df[eval(" & ".join(["(df['{0}'] == '{1}')".format(col, cond)
                    for col, cond in zip(cols, conditions)]))]

The string join results in the following:

>>> " & ".join(["(df['{0}'] == '{1}')".format(col, cond) 
    for col, cond in zip(cols, conditions)])

"(df['col1'] == 'foo') & (df['col2'] == 'bar')"

You then evaluate this string with eval, effectively running:

df[eval("(df['col1'] == 'foo') & (df['col2'] == 'bar')")]

For example:

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})

>>> df
  col1  col2
0  foo   bar
1  bar  spam
2  baz   ham

>>> df[eval(" & ".join(["(df['{0}'] == {1})".format(col, repr(cond)) 
            for col, cond in zip(cols, conditions)]))]
  col1 col2
0  foo  bar

Upvotes: 7
