Alec

Reputation: 11

Slow performance of pandas groupby/apply

I am noticing very slow performance when calling groupby and apply on a pandas DataFrame (>100x slower than using pure Python). My data is a series of nested lists of varying lengths but fixed nesting depth, which I convert to a DataFrame by adding columns for the list indices:

import pandas as pd
from random import randint

# original data: 3 outer lists, each holding 500 lists of 1-5 inner lists of 1-3 random ints
data1   = [[[[randint(0, 10) for i in range(randint(1, 3))] for i in range(randint(1, 5))] for i in range(500)] for i in range(3)]
# as a DataFrame
data2   = pd.DataFrame(
    [
        (i1, i2, i3, i4, x4)
        for (i1, x1) in enumerate(data1)
        for (i2, x2) in enumerate(x1)
        for (i3, x3) in enumerate(x2)
        for (i4, x4) in enumerate(x3)
    ],
    columns = ['i1', 'i2', 'i3', 'i4', 'x']
)
# with a MultiIndex on the list indices
data3   = data2.set_index(['i1', 'i2', 'i3']).sort_index()

Example data:

>>> data3
           i4   x
i1 i2  i3        
0  0   0    0   8
       0    1   0
       0    2   4
       1    0   4
       2    0   7
       3    0   6
       4    0  10
       4    1   1
       4    2   8
   1   0    0   8
       0    1   9
       0    2   1
       1    0   5
       2    0   9
   2   0    0   1
       1    0   1
       1    1   4
       1    2   0
       2    0   6
       2    1  10
       2    2   8
       3    0   4
       3    1   5
       4    0   3
       4    1   6
   3   0    0   9
       0    1   8
       0    2   7
       1    0   2
       1    1   9
...        ..  ..
2  495 0    0   1
       0    1   6
       0    2   5
       1    0   1
       1    1   8
       1    2   6
   496 0    0   4
       0    1   8
       0    2   3
   497 0    0   3
       0    1  10
       1    0   9
       2    0   6
       2    1   1
       2    2   3
       3    0   0
       4    0  10
   498 0    0   9
       0    1   1
       1    0   2
       1    1  10
       2    0   2
       2    1   2
       2    2   2
       3    0   9
   499 0    0   0
       0    1   2
       1    0   2
       1    1   8
       2    0   6

[8901 rows x 2 columns]

I want to apply a function to the innermost lists. In the example below the function operates on each row individually, but my real code needs to use the group as a whole, so groupby/apply is needed (a toy example of such a group-level computation follows the timings).

%timeit result1 = [[[[i4*x4 for (i4, x4) in enumerate(x3)] for x3 in x2] for x2 in x1] for x1 in data1]
# 100 loops, best of 3: 7.52 ms per loop
%timeit result2 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 4.02 s per loop
%timeit result3 = data3.groupby(level = ['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 8.86 s per loop
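
A toy example of the kind of whole-group computation my real code needs, where a plain row-wise expression would not be enough (the per-group sum here is only a stand-in for the actual function):

# hypothetical stand-in for the real group-level function:
# scale each x by the sum of x within its (i1, i2, i3) group
result4 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['x'] / group['x'].sum())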

The code using pandas is orders of magnitude slower than working with the lists directly. Could someone point out what I am doing wrong? I am using pandas 0.18.1.

Upvotes: 1

Views: 2702

Answers (1)

Ted Petrou

Reputation: 61967

apply is a method of last resort and is very slow because it passes each group of the DataFrame, one at a time, to your custom Python function. In your specific case you have no need for apply, as you are simply multiplying two columns together; the grouping has no effect here. Prefer vectorized operations wherever you can, and when you do need to group, reach for agg or transform before apply.

You can simply do data2['i4'] * data2['x'] instead of your groupby and apply.

%timeit result1 = [[[[i4*x4 for (i4, x4) in enumerate(x3)] for x3 in x2] for x2 in x1] for x1 in data1]
# 100 loops, best of 3: 4.51 ms per loop
%timeit result2 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 1.69 s per loop
%timeit result3 = data3.groupby(level = ['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 3.31 s per loop
%timeit data2['i4'] * data2['x']
# 10000 loops, best of 3: 122 µs per loop
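
If your real computation genuinely needs the whole group, a vectorized groupby transform is usually still far faster than apply. A minimal sketch, again with the per-group sum standing in as a placeholder for whatever group-level function you actually need:

# scale each x by the sum of x within its (i1, i2, i3) group, without calling apply
result4 = data2['x'] / data2.groupby(['i1', 'i2', 'i3'])['x'].transform('sum')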

Upvotes: 1
