Reputation: 1354
I have a pandas dataframe which I'd like to filter based on if certain conditions are met. I ran a loop and a .apply()
and used %%timeit
to test for speed. The dataset has around 45000 rows. The code snippet for loop is:
%%timeit
qualified_actions = []
for row in all_actions.index:
if all_actions.ix[row,'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row,'Upper']:
qualified_actions.append(True)
else:
qualified_actions.append(False)
1.44 s ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And for .apply()
is:
%%timeit
qualified_actions = all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1)
6.71 s ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I thought .apply()
is supposed to be much faster than looping through rows in pandas. Can someone explain why it's slower in this case?
Upvotes: 2
Views: 10706
Reputation: 862761
apply
uses loops under the hood, so if you need better performance the best and the fastest methods are vecorized alternatives.
No loops, only chain 2 conditions vectorized solution:
m1 = all_actions['Lower'] <= all_actions['Mid']
m2 = all_actions['Mid'] <= all_actions['Upper']
qualified_actions = m1 & m2
Thanks Jon Clements for another solution:
all_actions.Mid.between(all_actions.Lower, all_actions.Upper)
Timings:
np.random.seed(2017)
N = 45000
all_actions=pd.DataFrame(np.random.randint(50, size=(N,3)),columns=['Lower','Mid','Upper'])
#print (all_actions)
In [85]: %%timeit
...: qualified_actions = []
...: for row in all_actions.index:
...: if all_actions.ix[row,'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row,'Upper']:
...: qualified_actions.append(True)
...: else:
...: qualified_actions.append(False)
...:
...:
__main__:259: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
1 loop, best of 3: 579 ms per loop
In [86]: %%timeit
...: (all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1))
...:
1 loop, best of 3: 1.17 s per loop
In [87]: %%timeit
...: ((all_actions['Lower'] <= all_actions['Mid']) & (all_actions['Mid'] <= all_actions['Upper']))
...:
1000 loops, best of 3: 509 µs per loop
In [90]: %%timeit
...: (all_actions.Mid.between(all_actions.Lower, all_actions.Upper))
...:
1000 loops, best of 3: 520 µs per loop
Upvotes: 4