Reputation: 2454
I am seeing a huge difference in performance between pandas 0.11 and pandas 0.13 on simple series operations.
In [7]: df = pandas.DataFrame({'a':np.arange(1000000), 'b':np.arange(1000000)})
In [8]: pandas.__version__
Out[8]: '0.13.0'
In [9]: %timeit df['a'].values+df['b'].values
100 loops, best of 3: 4.33 ms per loop
In [10]: %timeit df['a']+df['b']
10 loops, best of 3: 42.5 ms per loop
On version 0.11 however (on the same machine),
In [10]: pandas.__version__
Out[10]: '0.11.0'
In [11]: df = pandas.DataFrame({'a':np.arange(1000000), 'b':np.arange(1000000)})
In [12]: %timeit df['a'].values+df['b'].values
100 loops, best of 3: 2.22 ms per loop
In [13]: %timeit df['a']+df['b']
100 loops, best of 3: 2.3 ms per loop
So on 0.13, it's about 20x slower. Profiling it, I see
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.047 0.047 <string>:1(<module>)
1 0.000 0.000 0.047 0.047 ops.py:462(wrapper)
3 0.000 0.000 0.044 0.015 series.py:134(__init__)
1 0.000 0.000 0.044 0.044 series.py:2394(_sanitize_array)
1 0.000 0.000 0.044 0.044 series.py:2407(_try_cast)
1 0.000 0.000 0.044 0.044 common.py:1708(_possibly_cast_to_datetime)
1 0.044 0.044 0.044 0.044 {pandas.lib.infer_dtype}
1 0.000 0.000 0.003 0.003 ops.py:442(na_op)
1 0.000 0.000 0.003 0.003 expressions.py:193(evaluate)
1 0.000 0.000 0.003 0.003 expressions.py:93(_evaluate_numexpr)
So it's spending almost all of its time in _possibly_cast_to_datetime and pandas.lib.infer_dtype.
Is this change expected? How can I get the old, faster performance back?
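(For reference, a profile like the one above can be reproduced with IPython's %prun magic; the -l option just limits the output to the top entries:)
In [14]: %prun -l 10 df['a'] + df['b']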
Note: it appears the problem only occurs when the output is of an integer type. If I make one of the columns a double, it goes back to being fast.
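A minimal sketch of the two workarounds implied above, based only on the observations in this question: do the arithmetic on the raw NumPy arrays and rebuild the Series yourself, or cast one column to float64 so the result dtype is no longer integer:
In [15]: s = pandas.Series(df['a'].values + df['b'].values, index=df.index)  # fast path via .values
In [16]: df['c'] = df['a'].astype(np.float64)
In [17]: %timeit df['c'] + df['b']  # fast again, per the note above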
Upvotes: 3
Views: 545
Reputation: 129018
This was a very odd bug having to do (I think) with a strange lookup going on in Cython. For some reason
import numpy as np
_TYPE_MAP = {np.int64: 'integer'}
np.int64 in _TYPE_MAP
was not evaluating correctly, ONLY for int64
(but worked just fine for all other dtypes). It's possible the hash of the np.dtype
object was screwy for some reason. In any event, it is fixed here: https://github.com/pydata/pandas/pull/7342, so we use name hashing instead.
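To illustrate the idea of the fix (the names below are illustrative, not pandas' actual internals): key the map by the dtype's string name, which hashes reliably, rather than by the type object itself:
import numpy as np

_TYPE_MAP = {np.dtype(np.int64).name: 'integer'}  # keyed by the string 'int64'

def infer_basic_dtype(arr):
    # look up by arr.dtype.name rather than by the dtype/type object
    return _TYPE_MAP.get(arr.dtype.name, 'unknown')

infer_basic_dtype(np.arange(1000000))  # -> 'integer' on platforms where the default int is int64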
Here's the perf comparison:
master
In [1]: df = pandas.DataFrame({'a':np.arange(1000000), 'b':np.arange(1000000)})
In [2]: %timeit df['a'] + df['b']
100 loops, best of 3: 2.49 ms per loop
0.14.0
In [6]: df = pandas.DataFrame({'a':np.arange(1000000), 'b':np.arange(1000000)})
In [7]: %timeit df['a'] + df['b']
10 loops, best of 3: 35.1 ms per loop
Upvotes: 2