Andi3579

Reputation: 57

Pandas - Groupby object behavior

I am currently struggling with the behavior of the pandas groupby object. I have just switched from version 0.2.5 to 1.2.3 and my code no longer behaves the same way.

In version 0.2.5, when I grouped by multiple columns, all rows where the result was 0 were dropped. In the version I am using now, every combination of the unique values of each column is grouped, which produces many rows with a result of 0.

Code example:

df.groupby(['ColumnA', 'ColumnB'])['ColumnC'].count()

Result in 0.2.5:

ColumnA | ColumnB | Result of Count

Result in 1.2.3:
Column A - Value 1 | Column B - Value 1 | 2
Column A - Value 1 | Column B - Value 2 | 0
Column A - Value 2 | Column B - Value 1 | 0
Column A - Value 2 | Column B - Value 2 | 0

This creates a lot of unnecessary rows which are basically useless. It becomes especially annoying when you work with large datasets of millions of rows and thousands of unique values per column. How can I force the behavior of my previous version? Otherwise I would have to redo a lot of functions that I have written. What did I miss in the transition between the versions?

Upvotes: 3

Views: 785

Answers (1)

jezrael

Reputation: 862661

It seems you are working with Categoricals. You need the parameter observed=True in DataFrame.groupby to avoid adding the missing (unobserved) categories to the result:

observed, default False

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

df.groupby(['ColumnA', 'ColumnB'], observed=True)['ColumnC'].count()
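As a minimal sketch of the difference, assuming both grouping columns have a categorical dtype (the question does not show the data, so the frame below is made up for illustration), the two settings can be compared side by side. observed is passed explicitly in both calls because its default may differ between pandas versions:

```python
import pandas as pd

# Hypothetical data: both grouping columns are Categoricals.
df = pd.DataFrame({
    "ColumnA": pd.Categorical(["a1", "a1", "a2"], categories=["a1", "a2"]),
    "ColumnB": pd.Categorical(["b1", "b1", "b2"], categories=["b1", "b2"]),
    "ColumnC": [10, 20, 30],
})

# observed=False: one row per combination of categories,
# including combinations that never occur in the data (count 0).
all_combos = df.groupby(["ColumnA", "ColumnB"], observed=False)["ColumnC"].count()

# observed=True: only combinations that actually occur in the data.
observed_only = df.groupby(["ColumnA", "ColumnB"], observed=True)["ColumnC"].count()

print(all_combos)
print(observed_only)
```

You can check df.dtypes to confirm whether the grouping columns are really of category dtype; if they are not, observed has no effect.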

Upvotes: 2
