Reputation: 30424
Edit to add: This operation appears to be tremendously improved due to the GIL unlocking in version 0.17.0 of pandas (and other improvements since version 0.14.1 and earlier). See updated benchmarks at the bottom of this question.
This is a followup to this very useful Q/A: Faster way to transform group with mean value in Pandas
I just updated from 0.14.0 to 0.14.1 to see how much improvement there was in groupby/transform operations. In brief, the improvement is substantial, but transform is still much slower than the workaround and essentially unusable for the data I'm working with.
Here's an example with 100,000 obs with 3 obs per group:
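import numpy as np
import pandas as pd
# // keeps the ids integral on Python 3 too (plain / relied on Python 2 integer division)
df = pd.DataFrame( { 'id' : np.arange( 100000 ) // 3,
'val': np.random.randn( 100000) } )
grp=df.groupby('id')['val']
# a: repeat each group's mean once per row in that group
a = pd.Series(np.repeat(grp.mean().values, grp.count().values))
b = grp.transform(np.mean)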
"a" is the awesome workaround from Mr E and Jeff (see link above) for which I am very grateful and "b" is what I think is the standard approach for this case.
In [42]: (a==b).all()
Out[42]: True
In [43]: %timeit pd.Series(np.repeat(grp.mean().values, grp.count().values))
100 loops, best of 3: 3.34 ms per loop
In [44]: %timeit grp.transform(np.mean)
1 loops, best of 3: 4.61 s per loop
Note that's "ms" vs "s", so a difference of more than 1000x! I tried to be careful here and do a fair comparison. Please let me know if I screwed that up somehow. I don't understand numpy/pandas internals very well, but I assume they are both using the same underlying np.mean function?
More info:
In [61]: %timeit grp.transform('mean')
1 loops, best of 3: 4.59 s per loop
In [62]: pd.__version__
Out[62]: '0.14.1'
~/google drive/data>python -V
Python 2.7.8 :: Anaconda 2.0.1 (x86_64)
I've got a 13 inch MacBook Air (2012) and am using all Anaconda defaults except:
conda install pandas=0.14.1
Edit to add: Here are some updated benchmarks. I'm using a faster computer now so this will compare 0.16.2 and 0.17.0 on a macbook pro (15 inch, mid 2015).
version 0.16.2
%timeit pd.Series(np.repeat(grp.mean().values, grp.count().values))
100 loops, best of 3: 2.71 ms per loop
%timeit grp.transform(np.mean)
100 loops, best of 3: 18.9 ms per loop
version 0.17.0
%timeit pd.Series(np.repeat(grp.mean().values, grp.count().values))
100 loops, best of 3: 2.05 ms per loop
%timeit grp.transform(np.mean)
1000 loops, best of 3: 1.45 ms per loop
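For anyone who wants to re-run this comparison on their own pandas version outside IPython, here is a minimal sketch using the standard `timeit` module. The helper names `repeat_workaround` and `builtin_transform` are mine, not from pandas, and I pass `'mean'` by name rather than `np.mean`. Timings will of course vary by machine and version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of data as in the question: 100,000 rows, 3 rows per group.
# // is used so the ids stay integral on Python 3.
n = 100000
df = pd.DataFrame({"id": np.arange(n) // 3, "val": np.random.randn(n)})
grp = df.groupby("id")["val"]


def repeat_workaround():
    # Hypothetical helper name: repeat each group mean by its group size.
    return pd.Series(np.repeat(grp.mean().values, grp.count().values))


def builtin_transform():
    # Hypothetical helper name: the standard transform, passed by name.
    return grp.transform("mean")


# Sanity check that both approaches agree on this (sorted, contiguous) data.
same = np.allclose(repeat_workaround().values, builtin_transform().values)

# Best of 3 runs, 10 calls each, reported as seconds per call.
t_repeat = min(timeit.repeat(repeat_workaround, number=10, repeat=3)) / 10
t_transform = min(timeit.repeat(builtin_transform, number=10, repeat=3)) / 10

print("pandas", pd.__version__, "results match:", same)
print("repeat workaround: %.3f ms per call" % (t_repeat * 1e3))
print("transform('mean'): %.3f ms per call" % (t_transform * 1e3))
```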
Upvotes: 2
Views: 555
Reputation: 128948
The perf improvement in 0.14.1 (in this PR) didn't touch the case of a cythonized function being passed directly (or by name); rather, it improved the performance of generic functions (e.g. a passed lambda) by optimizing how the results were set. This PR addresses that case, using the above solution to provide a substantial perf improvement when a cythonized (internal) function is used, e.g. 'mean' in this case.
In the test example, the time goes from 3.6s to 100ms. Note this is not quite as good as your example above, because your example has an implicit optimization: the group ordering is monotonically increasing, i.e. your groups are already in sorted order and not interspersed with each other.
Pandas will handle both cases, but checking that this index is actually monotonic takes a small amount of time (hence the difference).
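To make the ordering point concrete, here is a small sketch on made-up toy data (the values and the variable names `repeated`, `transformed` and `mapped` are mine, purely for illustration). When the group ids are interspersed, the `np.repeat` workaround silently puts the means on the wrong rows, while `transform` aligns them to the original rows. Mapping the ids onto the group means is one order-independent alternative.

```python
import numpy as np
import pandas as pd

# Toy data (assumed for illustration): ids 0 and 1 are interleaved.
df = pd.DataFrame({"id": [0, 1, 0, 1, 2, 2],
                   "val": [1.0, 10.0, 3.0, 30.0, 5.0, 7.0]})
grp = df.groupby("id")["val"]

# The workaround lays the means out in sorted-group order,
# which only matches the rows if the groups are sorted and contiguous.
repeated = pd.Series(np.repeat(grp.mean().values, grp.count().values))

# transform aligns each group mean back to that group's original rows.
transformed = grp.transform("mean")

# Order-independent alternative: look up each row's id in the group means.
mapped = df["id"].map(grp.mean())

print((repeated == transformed).all())  # False for this interleaved data
print((mapped == transformed).all())    # True
```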
This is merged into master for 0.15.0 (release probably at the end of September), though you can simply clone from master, and Windows binaries are posted frequently.
Upvotes: 3