user3605780

Reputation: 7072

pandas groupby sum takes too long, how can I optimize this?

I have a dataframe of around 2 million rows. If I do this groupby

 df = df.groupby(by=['country','os','device'], as_index=False)

It only takes a short time. But if I do:

 df = df.groupby(by=['country','os','device'], as_index=False).sum()

It takes forever and I have to kill the script.

This started when I updated from pandas 0.17 to 0.20.

Why is this happening and how can I rewrite it so it works fast again?

EDIT:

   nl,windows,c,awdo2323fa3rj90
   uk,mac,c,awdawdoj93di303
   nl,ios,m,aawd9efri403
   nl,ios,m,39fnsefwfpiw3r

[country,os,device,md5_id] output should be

   nl,windows,c
   uk,mac,c
   nl,ios,m

Like EdChum said, groupby returns a groupby object, so I added sum(). This worked in pandas 0.17, but I think it now causes a problem in 0.20 because there are no numeric columns.

Upvotes: 1

Views: 438

Answers (1)

EdChum

Reputation: 394409

To answer some of your queries: a groupby object is just metadata. It describes how to perform the grouping and only does work when you call an aggregation function on it. Since you have no numeric columns, I'm not sure what you're expecting by calling sum.
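A minimal sketch of that laziness, on made-up data shaped like your sample (the column values are placeholders, not your real rows):

```python
import pandas as pd

# Toy data mimicking the [country, os, device, md5_id] layout from the question.
df = pd.DataFrame({
    "country": ["nl", "uk", "nl", "nl"],
    "os": ["windows", "mac", "ios", "ios"],
    "device": ["c", "c", "m", "m"],
    "md5_id": ["id1", "id2", "id3", "id4"],
})

# Creating the groupby object is cheap: nothing is aggregated yet.
g = df.groupby(by=["country", "os", "device"], as_index=False)
print(type(g).__name__)  # DataFrameGroupBy
```

The expensive part only happens when you chain an aggregation like `.sum()` onto `g`.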

It looks like all you want is to drop_duplicates:

df.drop_duplicates(subset=['country','os','device'])

So what is left are the non-repeated rows based on the passed subset of columns.

Upvotes: 1
