Iterator516

Reputation: 287

Confused about meaning of groupby operation with multiple columns with Pandas in Python

Here are my dataframe and function:

import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y'.split(), 
    'C': [1, 2, 1, 2], 
    'D': [2, 2, 1, 1]})

def CD(df):
    return df['C'] * df['D']

Here is what my dataframe looks like:

   G  C  D
0  x  1  2
1  x  2  2
2  y  1  1
3  y  2  1

When I run

df.groupby('G').apply(CD)

I expected that it would sum over x and y to get

   G  C  D
0  x  3  4
1  y  3  2

Then, I expected it to multiply C and D to get

x   12
y   6

However, I got

G   
x  0    2
   1    4
y  2    1
   3    2

That new column of [2, 4, 1, 2] doesn't look any different than what I would have obtained if I simply ran

df['C'] * df['D']

Clearly, I am confused about what groupby does. What is "df.groupby('G').apply(CD)" doing in my example?

Upvotes: 2

Views: 126

Answers (5)

EBDS

Reputation: 1734

Iterator516, I wanted to comment on your answer, but I can't yet: not enough "reputation".

I was also troubled by groupby, so I started learning about it along with pipe/apply/applymap. I really enjoy looking at the output and understanding how all of these work, as you are doing now.

Sometimes I find it easier to understand groupby simply by printing out the groups (groupby produces an object, so I can't directly see how it organises the data).

For example, using the five-row dataframe from Iterator516's answer:

df.groupby('G').apply(lambda x:x)

   G  C  D
0  x  1  2
1  x  2  2
2  y  1  1
3  y  2  1
4  x  3  2

or

df.groupby('G').apply(print)

   G  C  D
0  x  1  2
1  x  2  2
4  x  3  2
   G  C  D
2  y  1  1
3  y  2  1

I also add an indicator as below to "breakup" the group to see it better.

df.groupby('G').apply(lambda x: print("***\n",x))

***
    G  C  D
0  x  1  2
1  x  2  2
4  x  3  2
***
    G  C  D
2  y  1  1
3  y  2  1

Once I see this, I call .apply() (or pipe/applymap) and watch how the output changes, e.g. how value_counts, count, or sum transforms this "intermediate" output. After some practice (it actually took me quite some time), I got a better feel for how it works step by step.
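Another way to look at the groups, without going through apply at all, is to iterate over the groupby object directly. This is only a small sketch on the five-row dataframe from Iterator516's answer; the `groups` dict is a hypothetical helper I added to keep each piece around for inspection.

```python
import pandas as pd

# Five-row dataframe from Iterator516's answer.
df = pd.DataFrame({
    'G': 'x x y y x'.split(),
    'C': [1, 2, 1, 2, 3],
    'D': [2, 2, 1, 1, 2]})

# Iterating over a groupby object yields (group label, sub-dataframe) pairs.
groups = {}
for name, group in df.groupby('G'):
    print(f"*** group {name!r}")
    print(group)
    groups[name] = group  # keep each piece for later inspection
```

Each `group` here is an ordinary DataFrame holding only the rows of that group, with the original row labels preserved.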

Upvotes: 1

Iterator516

Reputation: 287

OK - thanks to everyone for the answers. I understand what

df.groupby('G').apply(CD)

does now.

It splits the rows by the groups in "G", so that all the x-rows are together and all the y-rows are together. Then it applies whatever operation "CD" does to each group separately. Finally, it stacks the per-group results, so the original columns "C" and "D" are gone and only the product that "CD" returns remains.

This is more visible if the original dataframe is

df = pd.DataFrame({
    'G': 'x x y y x'.split(), 
    'C': [1, 2, 1, 2, 3], 
    'D': [2, 2, 1, 1, 2]})

which looks like this:

   G  C  D
0  x  1  2
1  x  2  2
2  y  1  1
3  y  2  1
4  x  3  2

The first 2 rows are "x", and the 5th row is also "x". The groupby function gathers all 3 x-rows together, and then the multiplication function is applied to "C" and "D" within each group. The final result shows only the group label "G" and the original row label as index levels, followed by the resulting product.

G   
x  0    2
   1    4
   4    6
y  2    1
   3    2
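To convince myself of this, here is a rough sketch that builds the same result by hand: split the rows by "G", call CD on each piece, and stack the pieces with pd.concat. This is not how pandas implements apply internally; it is only an illustration, and the names `pieces` and `manual` are my own.

```python
import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y x'.split(),
    'C': [1, 2, 1, 2, 3],
    'D': [2, 2, 1, 1, 2]})

def CD(df):
    return df['C'] * df['D']

# Split: one sub-dataframe per group. Apply: call CD on each piece.
pieces = {name: CD(group) for name, group in df.groupby('G')}

# Combine: stack the pieces, using the group label as the outer index level.
manual = pd.concat(pieces, names=['G'])
print(manual)
```

The values are the same row-by-row products as df['C'] * df['D']; only the order and the index change, because the rows are arranged group by group.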

Upvotes: 0

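Use aggregate (agg) to apply multiple functions to the grouped columns:

import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y'.split(),
    'C': [1, 2, 1, 2],
    'D': [2, 2, 1, 1]})
# Select several columns with a list; the ['C','D'] tuple form no longer works in recent pandas.
grouped = df.groupby('G')[['C', 'D']].agg(['sum', 'mean'])
print(grouped)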

output:

    C        D     
  sum mean sum mean
G                  
x   3  1.5   4  2.0
y   3  1.5   2  1.0
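If the two-level column header is awkward to work with, named aggregation gives flat column names instead. A minimal sketch, assuming a pandas version that supports the keyword form of agg (0.25 or later); the output column names C_sum, D_sum and D_mean are my own choice.

```python
import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y'.split(),
    'C': [1, 2, 1, 2],
    'D': [2, 2, 1, 1]})

# Each keyword is an output column name mapped to (input column, function).
summary = df.groupby('G').agg(
    C_sum=('C', 'sum'),
    D_sum=('D', 'sum'),
    D_mean=('D', 'mean'))
print(summary)
```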

Upvotes: 1

EBDS

Reputation: 1734

Groupby does not do the sum by itself. Sum each group first and send the result to your function.

>> CD(df.groupby('G')[['C','D']].sum())

G
x    12
y     6
dtype: int64
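Broken into steps, so the intermediate table of sums from the question is visible. This is just a sketch of the same idea with the multiplication written out inline; the names `sums` and `product` are mine.

```python
import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y'.split(),
    'C': [1, 2, 1, 2],
    'D': [2, 2, 1, 1]})

# Step 1: aggregate. One row per group, holding the column sums.
sums = df.groupby('G')[['C', 'D']].sum()
print(sums)

# Step 2: multiply the summed columns, which is what CD does to its input.
product = sums['C'] * sums['D']
print(product)
```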

Upvotes: 2

jezrael

Reputation: 863531

First aggregate with sum, then pass the function to DataFrame.pipe:

out = df.groupby('G').sum().pipe(CD)
print (out)
G
x    12
y     6
dtype: int64

What is "df.groupby('G').apply(CD)" doing in my example?

No aggregate function is applied, so for each group a new Series is returned containing the product of the two columns.

You can check this by adding a print to the function:

def CD(df):
    print (df['C'] * df['D'])
    return df['C'] * df['D']

0    2
1    4
dtype: int64
2    1
3    2
dtype: int64

out = df.groupby('G').apply(CD)
print (out)
G   
x  0    2
   1    4
y  2    1
   3    2
dtype: int64
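By contrast, if the function itself reduces each group to a single number, apply returns one value per group. A small sketch; CD_of_sums is a hypothetical variant of CD that sums inside the function, and the columns are selected before apply so that the function only receives C and D.

```python
import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y'.split(),
    'C': [1, 2, 1, 2],
    'D': [2, 2, 1, 1]})

def CD_of_sums(group):
    # Reduce the group to one number: (sum of C) * (sum of D).
    return group['C'].sum() * group['D'].sum()

# One scalar per group, so the result is a Series indexed by G.
result = df.groupby('G')[['C', 'D']].apply(CD_of_sums)
print(result)
```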

Upvotes: 2
