Reputation: 9542
I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain a bunch of filtering (comparison operations) together that are specified at run-time by the user.
I'm currently using reindex() (as below), but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). I want to avoid this unnecessary copying, as it will be really inefficient when filtering a big Series or DataFrame.
I suspect apply(), map(), or something similar might be better. I'm pretty new to Pandas, though, so I'm still trying to wrap my head around everything.
I want to take a dictionary of the following form, apply each operation to a given Series object, and return a 'filtered' Series object.
relops = {'>=': [1], '<=': [1]}
I'll start with an example of what I have currently, filtering just a single Series object. Below is the function I'm currently using:
def apply_relops(series, relops):
    """
    Pass dictionary of relational operators to perform on given series object
    """
    for op, vals in relops.iteritems():
        op_func = ops[op]
        for val in vals:
            filtered = op_func(series, val)
            series = series.reindex(series[filtered])
    return series
The user provides a dictionary with the operations they want to perform:
>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
col1 col2
0 0 10
1 1 11
2 2 12
>>> from operator import le, ge
>>> ops = {'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1 1
2 2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1 1
Name: col1
Again, the 'problem' with my above approach is that I think there is a lot of possibly unnecessary copying of the data for the in-between steps.
Upvotes: 254
Views: 574525
Reputation: 1408
Chaining conditions creates long lines, which are discouraged by PEP8.
Using the .query method forces you to use strings, which is powerful but unpythonic and not very dynamic.
Once each of the filters is in place, one approach could be:
import numpy as np
import functools
def conjunction(*conditions):
    return functools.reduce(np.logical_and, conditions)
c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4
data_filtered = data[conjunction(c_1,c_2,c_3)]
np.logical_and operates element-wise on boolean arrays and is fast, but it takes only two input arrays, which is why functools.reduce is used to fold it over any number of conditions.
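Note that this still has some redundancies: every condition is evaluated over the whole of the original data, and there is no short-circuiting across conditions.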
Still, I expect this to be efficient enough for many applications, and it is very readable. You can also make a disjunction (where at least one of the conditions needs to be true) by using np.logical_or instead:
import numpy as np
import functools
def disjunction(*conditions):
    return functools.reduce(np.logical_or, conditions)
c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4
data_filtered = data[disjunction(c_1,c_2,c_3)]
Upvotes: 65
Reputation: 21709
Since the pandas 0.22 update, comparison methods are available, such as ge, gt, and between (all used below), and many more. These methods return a boolean Series that can be used as a mask. Let's see how we can use them:
# sample data
import pandas as pd
df = pd.DataFrame({'col1': [0, 1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14, 15]})
# get values from col1 greater than or equal to 1
df.loc[df['col1'].ge(1),'col1']
1 1
2 2
3 3
4 4
5 5
# where col1 values are between 0 and 2 (inclusive)
df.loc[df['col1'].between(0,2)]
col1 col2
0 0 10
1 1 11
2 2 12
# where col1 > 1
df.loc[df['col1'].gt(1)]
col1 col2
2 2 12
3 3 13
4 4 14
5 5 15
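Because these comparisons are ordinary methods, a method name supplied at run time can be looked up with getattr. Here is a minimal sketch, assuming the user's filters arrive as (column, method name, value) triples. The names `ALLOWED`, `filters`, and `apply_filters` are hypothetical and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14, 15]})

# Restrict user input to known comparison methods rather than any attribute.
ALLOWED = {'ge', 'gt', 'le', 'lt', 'eq', 'ne'}

def apply_filters(frame, filters):
    """Combine (column, method_name, value) filters into one mask, then index once."""
    mask = pd.Series(True, index=frame.index)
    for col, method_name, value in filters:
        if method_name not in ALLOWED:
            raise ValueError(f'unsupported comparison: {method_name}')
        # getattr(frame['col1'], 'ge')(1) is the same call as frame['col1'].ge(1)
        mask &= getattr(frame[col], method_name)(value)
    return frame[mask]

filters = [('col1', 'ge', 1), ('col2', 'lt', 14)]
result = apply_filters(df, filters)
```

Only boolean masks are built inside the loop. The frame itself is indexed a single time at the end.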
Upvotes: 15
Reputation: 4864
If you want to check any/all of multiple columns for a value, you can do:
df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').any(axis=1)]
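The "all" case follows the same pattern, with .all(axis=1) in place of .any(axis=1). A small sketch with made-up fixture data (the rows are purely illustrative, including the artificial last one):

```python
import pandas as pd

df = pd.DataFrame({
    'HomeTeam': ['Fulham', 'Arsenal', 'Everton', 'Fulham'],
    'AwayTeam': ['Chelsea', 'Fulham', 'Leeds', 'Fulham'],
})

# Element-wise comparison gives a boolean DataFrame with the same two columns.
is_fulham = df[['HomeTeam', 'AwayTeam']] == 'Fulham'

either_side = df[is_fulham.any(axis=1)]  # at least one of the columns matches
both_sides = df[is_fulham.all(axis=1)]   # every listed column matches
```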
Upvotes: 4
Reputation: 2091
We can also select rows based on values of a column that are not in a list or any other iterable. We create a boolean mask just like before, but now we negate it by placing ~ in front.
For example
values = [1, 0]
df[~df.col1.isin(values)]
Upvotes: 4
Reputation: 13963
Simplest of All Solutions:
Use:
filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]
Another example: to filter the DataFrame for rows belonging to February 2018, use the code below:
filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]
Upvotes: 49
Reputation: 375377
Pandas (and numpy) allow for boolean indexing, which will be much more efficient:
In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]:
1 1
2 2
Name: col1
In [12]: df[df['col1'] >= 1]
Out[12]:
col1 col2
1 1 11
2 2 12
In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
Out[13]:
col1 col2
1 1 11
If you want to write helper functions for this, consider something along these lines:
In [14]: def b(x, col, op, n):
             return op(x[col], n)
In [15]: def f(x, *b):
             return x[(np.logical_and(*b))]
In [16]: b1 = b(df, 'col1', ge, 1)
In [17]: b2 = b(df, 'col1', le, 1)
In [18]: f(df, b1, b2)
Out[18]:
col1 col2
1 1 11
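Note that np.logical_and accepts only two input arrays, so f above works for exactly two conditions. One way to handle any number of conditions, including the {'>=': [1], '<=': [1]} dictionary from the question, is to fold the masks with functools.reduce and index once at the end. This is a sketch rather than a definitive implementation; `OPS` and `build_mask` are hypothetical names:

```python
import functools
import operator

import numpy as np
import pandas as pd

OPS = {'>=': operator.ge, '<=': operator.le, '>': operator.gt, '<': operator.lt}

def build_mask(series, relops):
    """Return one boolean mask that is True where every requested comparison holds."""
    masks = [OPS[op](series, val) for op, vals in relops.items() for val in vals]
    # Start from an all-True mask so an empty relops dict filters nothing.
    all_true = pd.Series(True, index=series.index)
    return functools.reduce(np.logical_and, masks, all_true)

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
mask = build_mask(df['col1'], {'>=': [1], '<=': [1]})
filtered = df.loc[mask, 'col1']  # a single indexing step, no intermediate reindexing
```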
Update: pandas 0.13 has a query method for these kinds of use cases. Assuming column names are valid identifiers, the following works (and can be more efficient for large frames, as it uses numexpr behind the scenes):
In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
col1 col2
1 1 11
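For values known only at run time, query can reference local Python variables with the @ prefix, and the condition string itself can be assembled from user input. A sketch, assuming column names and operators are validated against a known set first; `conditions` and `allowed_ops` are hypothetical names for the run-time input:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})

# Values are read from local variables via the @ prefix.
lower, upper = 1, 1
by_variables = df.query('col1 >= @lower and col1 <= @upper')

# Hypothetical run-time input: (column, operator, value) triples.
conditions = [('col1', '>=', 1), ('col2', '<', 12)]
allowed_ops = {'>=', '<=', '>', '<', '==', '!='}
for col, op, _ in conditions:
    if op not in allowed_ops or col not in df.columns:
        raise ValueError(f'unsupported condition on {col!r} with {op!r}')

# Builds the string "col1 >= 1 and col2 < 12".
expr = ' and '.join(f'{col} {op} {val!r}' for col, op, val in conditions)
by_string = df.query(expr)
```

The validation step matters because the query string is evaluated as an expression, so unchecked user text should not be interpolated into it.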
Upvotes: 387
Reputation: 309
Why not do this?
def filt_spec(df, col, val, op):
    import operator
    ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt,
           'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
    return df[ops[op](df[col], val)]
pandas.DataFrame.filt_spec = filt_spec
Demo:
df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')
Result:
a b
1 2 4
2 3 3
3 4 2
4 5 1
You can see that the DataFrame has been filtered to the rows where a >= 2.
This is slightly faster (typing time, not performance) than operator chaining. You could of course put the import at the top of the file.
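Assigning to pandas.DataFrame changes the class for every piece of code in the process. If that is a concern, DataFrame.pipe gives a similar chained style without patching. A sketch that redefines filt_spec so the block runs on its own (the module-level `OPS` name is an assumption of this example):

```python
import operator

import pandas as pd

OPS = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt,
       'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}

def filt_spec(df, col, val, op):
    """Keep the rows of df where the comparison `df[col] <op> val` holds."""
    return df[OPS[op](df[col], val)]

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [5, 4, 3, 2, 1]})

# pipe passes the frame as the first argument, so calls chain left to right.
result = df.pipe(filt_spec, 'a', 2, 'ge').pipe(filt_spec, 'b', 2, 'ge')
```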
Upvotes: 5