Rafal

Reputation: 27

Speed up groupby and aggregate in large datasets

Is there any way to speed up the use of groupby and aggregate on large datasets?

I have a dataframe like this:

User Category
A    Cat
B    Dog
C    Cat
A    Dog

I want to collect all categories for each user into an array, like this:

User Category
A    [Cat,Dog]
B    [Dog]
C    [Cat]

The code I'm using for this looks like this:

df = df.groupby('User')['Category'].aggregate(
    lambda x: x.unique().tolist()).reset_index()

But the processing time for large files is too long.

Upvotes: 0

Views: 104

Answers (1)

BENY

Reputation: 323226

Let us drop_duplicates before the groupby:

out = df.drop_duplicates().groupby('User')['Category'].agg(list)
Out[249]: 
User
A    [Cat, Dog]
B         [Dog]
C         [Cat]
Name: Category, dtype: object
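
If you need the result back as a DataFrame in the layout from the question rather than a Series, chaining reset_index at the end should work; a minimal sketch, assuming the same df as above:

out = (df.drop_duplicates()            # remove repeated (User, Category) pairs up front
         .groupby('User')['Category']
         .agg(list)                    # collect the remaining categories per user
         .reset_index())               # back to User / Category columns

Deduplicating first means the list aggregation runs over less data and there is no per-group unique() call, which is where the original lambda spends its time.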

Upvotes: 1
