Reputation: 57
I am currently struggling with the behavior of the pandas groupby object. I have just switched from version 0.2.5 to 1.2.3 and my code no longer behaves the same way.
In version 0.2.5, when I grouped by multiple columns, all rows where the result was 0 were dropped. In the version I am using now, every combination of the unique values from each column is included in the result, which produces many rows with a count of 0.
Code example:
df.groupby(['ColumnA', 'ColumnB'])['ColumnC'].count()
Result in 0.2.5:
ColumnA | ColumnB | Result of Count
Result in 1.2.3:
Column A - Value 1 | Column B - Value 1 | 2
Column A - Value 1 | Column B - Value 2 | 0
Column A - Value 2 | Column B - Value 1 | 0
Column A - Value 2 | Column B - Value 2 | 0
This creates a lot of unnecessary rows which are basically useless. It becomes especially annoying when working with large datasets of millions of rows and thousands of unique values per column. How can I force the behaviour from my previous version? Otherwise I would have to redo a lot of functions that I have written. What did I miss in the transition between the two versions?
Upvotes: 3
Views: 785
Reputation: 862661
It seems you are working with Categoricals, so you need the parameter observed=True in DataFrame.groupby to avoid adding the missing (unobserved) categories to the result:
observed, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
df.groupby(['ColumnA', 'ColumnB'], observed=True)['ColumnC'].count()
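To see the difference, here is a minimal sketch. The column names are taken from the question, but the data and category labels ("a1", "b1", and so on) are made up for illustration, and it assumes both grouping columns have the categorical dtype. The observed argument is passed explicitly in both calls so the example does not rely on the default.

```python
import pandas as pd

# Hypothetical data: two rows, both with the same (ColumnA, ColumnB) pair.
# Each categorical column declares a second category that never occurs.
df = pd.DataFrame({
    "ColumnA": pd.Categorical(["a1", "a1"], categories=["a1", "a2"]),
    "ColumnB": pd.Categorical(["b1", "b1"], categories=["b1", "b2"]),
    "ColumnC": [10, 20],
})

# observed=False: one row per combination of declared categories,
# including combinations that never appear in the data.
all_combos = df.groupby(["ColumnA", "ColumnB"], observed=False)["ColumnC"].count()

# observed=True: only the combinations that actually appear in the data.
observed_only = df.groupby(["ColumnA", "ColumnB"], observed=True)["ColumnC"].count()

print(all_combos)
print(observed_only)
```

With this data, the first result has four rows (three of them with a count of 0) and the second has a single row. If the columns do not need to be categorical, converting them with astype(str) before grouping should also avoid the extra rows, at the cost of losing the categorical dtype.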
Upvotes: 2