Performance of list comprehensions in boolean indexing

Question

I am currently trying to learn more about Pandas, and was looking at the Boolean Indexing section in the documentation.

In the example given, it is noted that for a DataFrame

In [439]: df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                              'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                              'c' : np.random.randn(7)})

In [440]: df2
Out[440]:
       a  b         c
0    one  x -0.858441
1    one  y  0.643366
2    two  y -0.862198
3  three  x -0.408981
4    two  y  1.137740
5    one  x  0.829057
6    six  x -1.251656

using the map method of a Series like

In [441]: criterion = df2['a'].map(lambda x: x.startswith('t'))

In [442]: df2[criterion]
Out[442]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

is faster than the equivalent list comprehension

In [443]: df2[[x.startswith('t') for x in df2['a']]]
Out[443]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

For this situation I would normally use df2[df2.a.str.startswith('t')], which the docs don't mention, so I wanted to benchmark all three approaches.
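As a sanity check before timing anything, here is a minimal sketch confirming that the three approaches build the same boolean mask and select the same rows. The names mask_str, mask_map and mask_list and the fixed seed are only illustrative and not part of the docs example.

```python
import numpy as np
import pandas as pd

# Same layout as the docs example; the seed is an arbitrary choice for repeatability.
rng = np.random.default_rng(0)
df2 = pd.DataFrame({
    'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
    'c': rng.standard_normal(7),
})

# Three ways to build the same boolean mask.
mask_str = df2['a'].str.startswith('t')                 # vectorized string accessor
mask_map = df2['a'].map(lambda x: x.startswith('t'))    # Series.map with a lambda
mask_list = [x.startswith('t') for x in df2['a']]       # plain list comprehension

# All three masks agree element by element.
assert mask_str.tolist() == mask_map.tolist() == mask_list

# Indexing with any of them selects rows 2, 3 and 4.
print(df2[mask_list].index.tolist())  # [2, 3, 4]
```

Since the results are identical, any difference in the benchmarks comes from how each mask is built, not from what it selects.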

Small Series benchmark

%timeit df2[df2.a.str.startswith('t')]
1000 loops, best of 3: 880 µs per loop

%timeit df2[df2['a'].map(lambda x: x.startswith('t'))]
1000 loops, best of 3: 783 µs per loop

%timeit df2[[x.startswith('t') for x in df2['a']]]
1000 loops, best of 3: 572 µs per loop

Surprisingly, the list comprehension approach seemed to be the fastest! So I tried making the DataFrame much larger and benchmarking again.

In [444]: df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six']*1000000,
                              'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x']*1000000,
                              'c' : np.random.randn(7*1000000)})

Big Series benchmark

%timeit df2[df2.a.str.startswith('t')]
1 loop, best of 3: 5.89 s per loop

%timeit df2[df2['a'].map(lambda x: x.startswith('t'))]
1 loop, best of 3: 5.73 s per loop

%timeit df2[[x.startswith('t') for x in df2['a']]]
1 loop, best of 3: 3.95 s per loop

Still the list comprehension seems to be the fastest, and the difference is even more noticeable with a large Series.

Question

Is the map method of a Series generally faster than a list comprehension for Boolean Indexing? Why does the list comprehension appear to be faster here, contradicting the documentation?

There may very well be an error in my benchmarking approach, and I am well aware that for more complex criteria (and even simple ones) either of the first two approaches is much nicer to read and write, but my question here is just about performance.

Note: I am using Pandas version 0.18.1

MaxU - stand with Ukraine · Accepted Answer

IMO this isn't about boolean indexing; it's about working with string Series. Timing only the construction of the mask, without the indexing step:

In [204]: %timeit df.a.str.startswith('t')
10 loops, best of 3: 75.7 ms per loop

In [205]: %timeit df['a'].map(lambda x: x.startswith('t'))
10 loops, best of 3: 76.5 ms per loop

In [206]: %timeit [x.startswith('t') for x in df['a']]
10 loops, best of 3: 39.7 ms per loop

In [209]: %timeit [df.a.str[0] == 't']
10 loops, best of 3: 85.2 ms per loop

DF shape: 70,000 x 3

In [207]: df.shape
Out[207]: (70000, 3)

A list comprehension is often faster than the pandas string methods when working with string Series.

PS: I was using Pandas version 0.19.2 for this example.
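To reproduce this kind of comparison outside IPython, here is a rough sketch using the standard library's timeit module. The frame is an assumption built to match the (70000, 3) shape above, and the candidates and timings dictionaries are just illustrative names. The absolute numbers, and possibly the ranking, will depend on the pandas version, the column dtype and the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test frame: 7 values repeated 10,000 times gives shape (70000, 3).
df = pd.DataFrame({
    'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'] * 10000,
    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'] * 10000,
    'c': np.random.randn(70000),
})

# Each candidate builds only the boolean mask; no DataFrame indexing is timed.
candidates = {
    'str.startswith': lambda: df['a'].str.startswith('t'),
    'map + lambda': lambda: df['a'].map(lambda x: x.startswith('t')),
    'list comprehension': lambda: [x.startswith('t') for x in df['a']],
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats of 5 calls each, reported as seconds per call.
    timings[name] = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f'{name:>20}: {timings[name] * 1e3:.1f} ms per call')
```

Timing the mask on its own keeps the cost of df[mask] out of the picture, which is the same for all three approaches.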
