user3762279

Reputation: 41

How to improve the speed of groupby/transform?

I want to implement a groupmax function that finds the maximum value within each group and assigns it back to every row of that group. It seems groupby(name).transform(max) is what I need. For example:

In [1]: print(df)
  name     value
0    0  0.363030
1    0  0.324828
2    0  0.499279
3    1  0.799836
4    1  0.886653
5    1  0.335056

In [2]: print(df.groupby('name').transform(max))
      value
0  0.499279
1  0.499279
2  0.499279
3  0.886653
4  0.886653
5  0.886653

However, this approach is very slow when the data frame is large and contains many small groups. For example, the following code, which builds 1,000,000 groups of two rows each, appears to hang indefinitely:

import numpy as np
import pandas as pd
df = pd.DataFrame({'name': np.repeat([str(x) for x in range(0, 1000000)], 2), 'value': np.random.rand(2000000)})
print(df.groupby('name').transform(max))

Is there a faster solution to this problem?

Thanks a lot!

Upvotes: 2

Views: 306

Answers (1)

DSM

Reputation: 353059

You could try computing the max once per group and then looking up each row's group in the result, something like

>>> df = pd.DataFrame({'name': np.repeat(list(map(str,range(10**6))), 2), 'value': np.random.rand(2*10**6)})
>>> %timeit df.groupby("name").max().loc[df.name.values].reset_index(drop=True)
1 loops, best of 3: 2.12 s per loop

Still not great, but better.
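For anyone who wants to check the idea on a small frame first, here is a minimal sketch (not a benchmark). It uses 1,000 groups instead of 10**6, and the names n_groups, group_max, via_lookup and via_transform are just for illustration. It also compares against passing the aggregation as the string "max" rather than the builtin max; as far as I know, string aliases let pandas use its built-in grouped implementation instead of calling a Python function per group, so that may be worth timing on your own data and pandas version.

```python
import numpy as np
import pandas as pd

# Small frame shaped like the question's example: many groups of two rows.
n_groups = 1000  # illustrative size, far smaller than the 10**6 in the question
df = pd.DataFrame({
    "name": np.repeat([str(x) for x in range(n_groups)], 2),
    "value": np.random.rand(2 * n_groups),
})

# Approach from the answer: aggregate once per group, then look up
# each row's group label in the aggregated result.
group_max = df.groupby("name")["value"].max()
via_lookup = group_max.loc[df["name"].values].reset_index(drop=True)

# Alternative to compare: name the aggregation with a string alias.
via_transform = df.groupby("name")["value"].transform("max")

# Both should give the per-row group maximum.
assert np.allclose(via_lookup.values, via_transform.values)
```

Both results are Series aligned with df's default integer index, so either can be assigned back with df["groupmax"] = ... if that is what you need.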

Upvotes: 1
