Reputation: 7072
I have a dataframe of around 2 million rows. If I do this groupby:
df = df.groupby(by=['country','os','device'], as_index=False)
it only takes a short time. But if I do:
df = df.groupby(by=['country','os','device'], as_index=False).sum()
it takes forever and I have to kill the script.
This started when I updated from pandas 0.17 to 0.20.
Why is this happening, and how can I rewrite it so it runs fast again?
EDIT:
Sample data (columns: country, os, device, md5_id):
nl,windows,c,awdo2323fa3rj90
uk,mac,c,awdawdoj93di303
nl,ios,m,aawd9efri403
nl,ios,m,39fnsefwfpiw3r
The output should be:
nl,windows,c
uk,mac,c
nl,ios,m
As EdChum said, groupby returns a groupby object, so I added sum(). That worked in pandas 0.17, but I think it is now causing the problem in 0.20, because there are no numeric columns.
Upvotes: 1
Views: 438
Reputation: 394409
To answer some of your queries: a groupby
object is just metadata. It describes how to perform the grouping, and it only does real work when you call an aggregation function on it. As you have no numeric columns, I'm not sure what you're expecting by calling sum
.
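To illustrate the laziness point, here is a minimal sketch on a tiny toy frame (not the asker's real data): constructing the groupby object is cheap, and nothing is aggregated until a method like size is called.

```python
import pandas as pd

# Toy frame with the same grouping columns as in the question
df = pd.DataFrame({
    'country': ['nl', 'uk', 'nl', 'nl'],
    'os': ['windows', 'mac', 'ios', 'ios'],
    'device': ['c', 'c', 'm', 'm'],
})

# This returns immediately: the groupby object only records how to group
g = df.groupby(by=['country', 'os', 'device'], as_index=False)
print(type(g).__name__)

# Work happens only when an aggregation is invoked, e.g. counting group sizes
print(g.size())
```

Calling size here aggregates fine because it only counts rows, regardless of column dtypes.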
It looks like all you want is drop_duplicates:
df.drop_duplicates(subset=['country','os','device'])
so that only the first occurrence of each combination of the passed subset
of columns is kept.
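A quick sketch using the sample rows from the question's edit (and then selecting just the three grouping columns, to match the expected output):

```python
import pandas as pd

# The sample rows from the question
df = pd.DataFrame({
    'country': ['nl', 'uk', 'nl', 'nl'],
    'os': ['windows', 'mac', 'ios', 'ios'],
    'device': ['c', 'c', 'm', 'm'],
    'md5_id': ['awdo2323fa3rj90', 'awdawdoj93di303',
               'aawd9efri403', '39fnsefwfpiw3r'],
})

# Keep one row per (country, os, device) combination,
# then drop md5_id to match the desired output
unique = df.drop_duplicates(subset=['country', 'os', 'device'])
unique = unique[['country', 'os', 'device']]
print(unique)
```

This leaves the three rows nl/windows/c, uk/mac/c and nl/ios/m, without the cost of a full groupby aggregation over 2 million rows.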
Upvotes: 1