Reputation: 41
I want to implement a groupmax function, which finds the max value within each group and assigns it back to the rows within each group. It seems groupby(name).transform(max) is what I need. E.g.
In [1]: print df
   name     value
0     0  0.363030
1     0  0.324828
2     0  0.499279
3     1  0.799836
4     1  0.886653
5     1  0.335056
In [2]: print df.groupby('name').transform(max)
      value
0  0.499279
1  0.499279
2  0.499279
3  0.886653
4  0.886653
5  0.886653
However, this approach is very slow when the data frame becomes large and there are many small groups. E.g. the following code, with 1,000,000 groups of 2 rows each, appears to hang forever:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name': np.repeat([str(x) for x in range(0, 1000000)], 2), 'value': np.random.rand(2000000)})
print(df.groupby('name').transform(max))
Is there a fast solution for this problem?
Thanks a lot!
Upvotes: 2
Views: 306
Reputation: 353059
You could try something like
>>> df = pd.DataFrame({'name': np.repeat(list(map(str,range(10**6))), 2), 'value': np.random.rand(2*10**6)})
>>> %timeit df.groupby("name").max().loc[df.name.values].reset_index(drop=True)
1 loops, best of 3: 2.12 s per loop
Still not great, but better.
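The same idea (compute one max per group, then look it up for every row) can be sketched as below. This is a minimal sketch, not a benchmarked solution: the helper names groupmax_lookup and groupmax_transform are made up for illustration, and the second variant assumes a reasonably recent pandas, where passing the string "max" to transform typically uses a built-in implementation instead of calling a Python function once per group.

```python
import numpy as np
import pandas as pd


def groupmax_lookup(df, key="name", col="value"):
    """Hypothetical helper: per-group max, broadcast back via a lookup."""
    # One row per group: index is the group key, value is the group max.
    per_group = df.groupby(key)[col].max()
    # Map each row's key to its group max; keeps the original row index.
    return df[key].map(per_group)


def groupmax_transform(df, key="name", col="value"):
    """Hypothetical helper: same result using transform with a string name."""
    # The string "max" (rather than the Python builtin max) lets pandas
    # pick its own grouped implementation.
    return df.groupby(key)[col].transform("max")


# Small frame mirroring the example in the question.
df = pd.DataFrame({
    "name": np.repeat(["0", "1"], 3),
    "value": [0.363030, 0.324828, 0.499279, 0.799836, 0.886653, 0.335056],
})

result_lookup = groupmax_lookup(df)
result_transform = groupmax_transform(df)
print(result_transform)
```

Both helpers return a Series aligned with the original rows, so the result can be assigned straight back, e.g. df["groupmax"] = groupmax_transform(df). Timings will depend on the pandas version and the data, so it is worth running %timeit on your own frame.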
Upvotes: 1