Alec

Reputation: 11

Slow performance of pandas groupby/apply

I am noticing very slow performance when calling groupby and apply on a pandas DataFrame (>100x slower than using pure Python). My data is a series of nested lists of varying lengths but fixed nesting depth, which I convert to a DataFrame by adding columns for the list indices:

import pandas as pd
from random import randint

# original data: 3 outer lists, each holding 500 lists of 1-5 inner lists of 1-3 random ints
data1   = [[[[randint(0, 10) for i in range(randint(1, 3))] for i in range(randint(1, 5))] for i in range(500)] for i in range(3)]
# as a DataFrame
data2   = pd.DataFrame(
    [
        (i1, i2, i3, i4, x4)
        for (i1, x1) in enumerate(data1)
        for (i2, x2) in enumerate(x1)
        for (i3, x3) in enumerate(x2)
        for (i4, x4) in enumerate(x3)
    ],
    columns = ['i1', 'i2', 'i3', 'i4', 'x']
)
# with a MultiIndex on the list indices
data3   = data2.set_index(['i1', 'i2', 'i3']).sort_index()

Example data:

>>> data3
           i4   x
i1 i2  i3        
0  0   0    0   8
       0    1   0
       0    2   4
       1    0   4
       2    0   7
       3    0   6
       4    0  10
       4    1   1
       4    2   8
   1   0    0   8
       0    1   9
       0    2   1
       1    0   5
       2    0   9
   2   0    0   1
       1    0   1
       1    1   4
       1    2   0
       2    0   6
       2    1  10
       2    2   8
       3    0   4
       3    1   5
       4    0   3
       4    1   6
   3   0    0   9
       0    1   8
       0    2   7
       1    0   2
       1    1   9
...        ..  ..
2  495 0    0   1
       0    1   6
       0    2   5
       1    0   1
       1    1   8
       1    2   6
   496 0    0   4
       0    1   8
       0    2   3
   497 0    0   3
       0    1  10
       1    0   9
       2    0   6
       2    1   1
       2    2   3
       3    0   0
       4    0  10
   498 0    0   9
       0    1   1
       1    0   2
       1    1  10
       2    0   2
       2    1   2
       2    2   2
       3    0   9
   499 0    0   0
       0    1   2
       1    0   2
       1    1   8
       2    0   6

[8901 rows x 2 columns]

I want to apply a function to the innermost lists. In the example below the function operates on each row individually, but my real code needs to use the group as a whole, so groupby/apply is needed (a toy example of such a group-level computation follows the timings).

%timeit result1 = [[[[i4*x4 for (i4, x4) in enumerate(x3)] for x3 in x2] for x2 in x1] for x1 in data1]
# 100 loops, best of 3: 7.52 ms per loop
%timeit result2 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 4.02 s per loop
%timeit result3 = data3.groupby(level = ['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 8.86 s per loop
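
A toy example of the kind of whole-group computation my real code needs, where a plain row-wise expression would not be enough (the per-group sum here is only a stand-in for the actual function):

# hypothetical stand-in for the real group-level function:
# scale each x by the sum of x within its (i1, i2, i3) group
result4 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['x'] / group['x'].sum())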

The code using pandas is orders of magnitude slower than working with the lists directly. Could someone point out what I am doing wrong? I am using pandas 0.18.1.

Upvotes: 1

Views: 2702

Answers (1)

Ted Petrou

Reputation: 61967

apply is a method of last resort and is very slow because it passes each group of the DataFrame, one at a time, to your custom Python function. In your specific case you have no need for apply, as you are simply multiplying two columns together; the grouping has no effect here. Prefer vectorized operations wherever you can, and when you do need to group, reach for agg or transform before apply.

You can simply do data2['i4'] * data2['x'] instead of your groupby and apply.

%timeit result1 = [[[[i4*x4 for (i4, x4) in enumerate(x3)] for x3 in x2] for x2 in x1] for x1 in data1]
# 100 loops, best of 3: 4.51 ms per loop
%timeit result2 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 1.69 s per loop
%timeit result3 = data3.groupby(level = ['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 3.31 s per loop
%timeit data2['i4'] * data2['x']
# 10000 loops, best of 3: 122 µs per loop
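
If your real computation genuinely needs the whole group, a vectorized groupby transform is usually still far faster than apply. A minimal sketch, again with the per-group sum standing in as a placeholder for whatever group-level function you actually need:

# scale each x by the sum of x within its (i1, i2, i3) group, without calling apply
result4 = data2['x'] / data2.groupby(['i1', 'i2', 'i3'])['x'].transform('sum')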

Upvotes: 1
