bsdfish

Reputation: 2454

Pandas series operations very slow after upgrade

I am seeing a huge difference in performance between pandas 0.11 and pandas 0.13 on simple series operations.

In [7]: df = pandas.DataFrame({'a':np.arange(1000000), 'b':np.arange(1000000)})

In [8]: pandas.__version__                                
Out[8]: '0.13.0'

In [9]: %timeit df['a'].values+df['b'].values
100 loops, best of 3: 4.33 ms per loop

In [10]: %timeit df['a']+df['b']                      
10 loops, best of 3: 42.5 ms per loop

On version 0.11, however (on the same machine):

In [10]: pandas.__version__                               
Out[10]: '0.11.0'

In [11]: df = pandas.DataFrame({'a':np.arange(1000000), 'b':np.arange(1000000)})

In [12]: %timeit df['a'].values+df['b'].values
100 loops, best of 3: 2.22 ms per loop

In [13]: %timeit df['a']+df['b']     
100 loops, best of 3: 2.3 ms per loop

So on 0.13, it's about 20x slower. Profiling it, I see

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.047    0.047 <string>:1(<module>)
        1    0.000    0.000    0.047    0.047 ops.py:462(wrapper)
        3    0.000    0.000    0.044    0.015 series.py:134(__init__)
        1    0.000    0.000    0.044    0.044 series.py:2394(_sanitize_array)
        1    0.000    0.000    0.044    0.044 series.py:2407(_try_cast)
        1    0.000    0.000    0.044    0.044 common.py:1708(_possibly_cast_to_datetime)
        1    0.044    0.044    0.044    0.044 {pandas.lib.infer_dtype}
        1    0.000    0.000    0.003    0.003 ops.py:442(na_op)
        1    0.000    0.000    0.003    0.003 expressions.py:193(evaluate)
        1    0.000    0.000    0.003    0.003 expressions.py:93(_evaluate_numexpr)

So it's spending a huge amount of time in _possibly_cast_to_datetime and pandas.lib.infer_dtype.
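(For reference, a profile like the one above can be reproduced with IPython's %prun magic; the exact invocation isn't shown here, but something like the following works, where -l 10 just limits the output to the top 10 rows:)

%prun -l 10 df['a'] + df['b']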

Is this change expected? How can I get the old, faster performance back?

Note: It appears that the problem is that the output is of an integer type. If I make one of the columns a double, it goes back to being fast ...
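To make the workarounds implied above concrete, here is a minimal sketch (my own illustration based on the timings and the note, not an official fix): either add the raw numpy arrays and rebuild the Series, or cast one operand to float64 so the result is no longer of integer type.

import numpy as np
import pandas

df = pandas.DataFrame({'a': np.arange(1000000), 'b': np.arange(1000000)})

# Workaround 1: add the underlying numpy arrays and rebuild the Series,
# bypassing the slow dtype inference in the Series constructor.
fast = pandas.Series(df['a'].values + df['b'].values, index=df.index)

# Workaround 2: cast one operand to float64 so the result dtype is float,
# which avoids the slow integer code path noted above.
also_fast = df['a'].astype('float64') + df['b']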

Upvotes: 3

Views: 545

Answers (1)

Jeff

Reputation: 129018

This was a very odd bug having to do (I think) with a strange lookup going on in cython. For some reason

import numpy as np

_TYPE_MAP = { np.int64 : 'integer' }
np.int64 in _TYPE_MAP

was not evaluating correctly, ONLY for int64 (it worked just fine for all other dtypes). It's possible the hash of the np.dtype object was screwy for some reason. In any event, it's fixed here: https://github.com/pydata/pandas/pull/7342, by using name hashing instead.
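As a rough illustration of what "name hashing" means here (a sketch of the idea, not the actual pandas code), key the lookup table by the dtype's string name, so the result never depends on how the np.int64 type object itself hashes:

import numpy as np

# Key the map by the dtype *name* (a plain string) instead of the
# numpy scalar type itself.
_TYPE_MAP = {np.dtype(np.int64).name: 'integer'}   # i.e. {'int64': 'integer'}

arr = np.arange(5, dtype=np.int64)
print(_TYPE_MAP.get(arr.dtype.name, 'unknown'))    # -> 'integer'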

Here's the perf comparison:

master

In [1]: df = pandas.DataFrame({'a':np.arange(1000000), 'b':np.arange(1000000)})

In [2]: %timeit df['a'] + df['b']
100 loops, best of 3: 2.49 ms per loop

0.14.0

In [6]: df = pandas.DataFrame({'a':np.arange(1000000), 'b':np.arange(1000000)})

In [7]: %timeit df['a'] + df['b']
10 loops, best of 3: 35.1 ms per loop

Upvotes: 2
