I am currently trying to learn more about Pandas, and was looking at the Boolean Indexing section in the documentation.
In the example given, it is noted that for a DataFrame
In[439]: df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
'c' : np.random.randn(7)})
In[440]: df2
Out[440]:
a b c
0 one x -0.858441
1 one y 0.643366
2 two y -0.862198
3 three x -0.408981
4 two y 1.137740
5 one x 0.829057
6 six x -1.251656
using the map method of a Series like
In [441]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [442]: df2[criterion]
Out[442]:
a b c
2 two y 0.041290
3 three x 0.361719
4 two y -0.238075
is faster than the equivalent list comprehension
In [443]: df2[[x.startswith('t') for x in df2['a']]]
Out[443]:
a b c
2 two y 0.041290
3 three x 0.361719
4 two y -0.238075
Now for this situation I would use df2[df2.a.str.startswith('t')]
which wasn't mentioned in the docs, so I wanted to benchmark the approaches.
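Before timing anything, it seems worth confirming that the three approaches really build the same Boolean mask. A minimal sketch of that check follows; the names mask_str, mask_map, mask_list and selected_index are just illustrative and do not come from the docs:

```python
import numpy as np
import pandas as pd

# Same small frame as in the documentation example.
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                    'c': np.random.randn(7)})

# Three ways of building the same Boolean mask.
mask_str = df2['a'].str.startswith('t')                 # vectorized string accessor
mask_map = df2['a'].map(lambda x: x.startswith('t'))    # Series.map with a lambda
mask_list = [x.startswith('t') for x in df2['a']]       # plain list comprehension

# All three should agree element by element.
assert mask_str.tolist() == mask_map.tolist() == mask_list

# Row labels selected by the mask (the 'two', 'three', 'two' rows).
selected_index = df2[mask_list].index.tolist()
print(selected_index)
```

If the masks agree, any timing difference comes from how each mask is built rather than from what it selects.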
Small Series benchmark
%timeit df2[df2.a.str.startswith('t')]
1000 loops, best of 3: 880 µs per loop
%timeit df2[df2['a'].map(lambda x: x.startswith('t'))]
1000 loops, best of 3: 783 µs per loop
%timeit df2[[x.startswith('t') for x in df2['a']]]
1000 loops, best of 3: 572 µs per loop
Surprisingly, the list comprehension approach seemed to be the fastest! So I tried making the DataFrame much larger and benchmarking again.
In[444]: df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six']*1000000,
'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x']*1000000,
'c' : np.random.randn(7*1000000)})
Big Series benchmark
%timeit df2[df2.a.str.startswith('t')]
1 loop, best of 3: 5.89 s per loop
%timeit df2[df2['a'].map(lambda x: x.startswith('t'))]
1 loop, best of 3: 5.73 s per loop
%timeit df2[[x.startswith('t') for x in df2['a']]]
1 loop, best of 3: 3.95 s per loop
Still the list comprehension seems to be the fastest, and the difference is even more noticeable with a large Series.
Is the map method of a Series generally faster than a list comprehension for Boolean Indexing? Why does the list comprehension approach seem to be faster here, contradicting the documentation section?
There may very well be an error in my benchmarking approach or testing, and I am well aware that for more complex criteria (and even simple ones) one of the first two approaches is much nicer to look at and implement, but my question here is just about performance.
Note: I am using Pandas version 0.18.1
Reputation: 210842
IMO it's not about boolean indexing - it's about working with string series:
In [204]: %timeit df.a.str.startswith('t')
10 loops, best of 3: 75.7 ms per loop
In [205]: %timeit df['a'].map(lambda x: x.startswith('t'))
10 loops, best of 3: 76.5 ms per loop
In [206]: %timeit [x.startswith('t') for x in df['a']]
10 loops, best of 3: 39.7 ms per loop
In [209]: %timeit [df.a.str[0] == 't']
10 loops, best of 3: 85.2 ms per loop
DF shape: 70,000 x 3
In [207]: df.shape
Out[207]: (70000, 3)
Often a list comprehension is faster when working with string series.
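To reproduce this kind of comparison outside IPython, here is a rough sketch using the standard library timeit module. It times only the construction of the mask, not the indexing step. The frame is a hypothetical stand-in with 70,000 rows and just the 'a' column, and the names candidates and timings are made up for illustration; absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical test frame: 70,000 rows, only the string column 'a' is needed here.
df = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'] * 10000})

# Each entry builds the Boolean mask only; no DataFrame indexing is involved.
candidates = {
    'str accessor': lambda: df['a'].str.startswith('t'),
    'Series.map': lambda: df['a'].map(lambda x: x.startswith('t')),
    'list comprehension': lambda: [x.startswith('t') for x in df['a']],
}

timings = {}
for name, build_mask in candidates.items():
    # Best of 3 repeats, 5 calls per repeat, reported as seconds per call.
    timings[name] = min(timeit.repeat(build_mask, number=5, repeat=3)) / 5
    print(f'{name}: {timings[name] * 1e3:.1f} ms per call')
```

Timing the mask construction separately helps show whether the gap comes from the string handling or from the Boolean indexing itself.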
PS: I was using Pandas version 0.19.2 for this example.