Reputation: 287
Here are my dataframe and function:
import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y'.split(),
    'C': [1, 2, 1, 2],
    'D': [2, 2, 1, 1]})

def CD(df):
    return df['C'] * df['D']
Here is what my dataframe looks like:
G C D
0 x 1 2
1 x 2 2
2 y 1 1
3 y 2 1
When I run
df.groupby('G').apply(CD)
I expected it to sum C and D within each group (x and y) to get
G C D
0 x 3 4
1 y 3 2
Then, I expected it to multiply C and D to get
x 12
y 6
However, I got
G
x  0    2
   1    4
y  2    1
   3    2
That new column of [2, 4, 1, 2] looks no different from what I would have obtained if I had simply run
df['C'] * df['D']
Clearly, I am confused about what groupby does. What is "df.groupby('G').apply(CD)" doing in my example?
Upvotes: 2
Views: 126
Reputation: 1734
Iterator516, I wanted to comment on your answer, but I do not have enough reputation yet.
I was also puzzled by groupby, so I started learning about it along with pipe/apply/applymap. I really enjoy looking at the output and understanding how all of these work, as you are doing now.
Sometimes I find it easier to understand groupby by printing what it hands to the function, since groupby returns an object that does not directly show how it organises the data.
For example, with a fifth row (x, 3, 2) appended to the dataframe:
df.groupby('G').apply(lambda x:x)
G C D
0 x 1 2
1 x 2 2
2 y 1 1
3 y 2 1
4 x 3 2
or
df.groupby('G').apply(print)
G C D
0 x 1 2
1 x 2 2
4 x 3 2
G C D
2 y 1 1
3 y 2 1
I also add a marker, as below, to break up the groups and see them more clearly.
df.groupby('G').apply(lambda x: print("***\n",x))
***
G C D
0 x 1 2
1 x 2 2
4 x 3 2
***
G C D
2 y 1 1
3 y 2 1
Once I see this, I call .apply() (or pipe/applymap) and watch how it changes the output, e.g. how value_counts, count and sum change this "intermediate" output. After some practice (it actually took me quite some time), I got a better feel for how it works step by step.
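The same inspection can be done without apply, by iterating over the groupby object directly. This is a minimal sketch, assuming the five-row dataframe used in the printouts above; the `seen` dictionary is only there for illustration.

```python
import pandas as pd

# Five-row frame: rows 0, 1 and 4 belong to group "x", rows 2 and 3 to "y".
df = pd.DataFrame({
    'G': 'x x y y x'.split(),
    'C': [1, 2, 1, 2, 3],
    'D': [2, 2, 1, 1, 2]})

# Iterating over a GroupBy yields (group key, sub-DataFrame) pairs.
seen = {}
for key, group in df.groupby('G'):
    print("***", key)
    print(group)
    # Record which original row labels ended up in this group.
    seen[key] = group.index.tolist()
```

Each `group` is an ordinary DataFrame, so anything you would pass to apply can be tried on it by hand first.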
Upvotes: 1
Reputation: 287
OK, thanks to everyone for the answers. I understand what
df.groupby('G').apply(CD)
does now.
It splits the rows by the groups in "G", so that all the x-rows are together, and then all the y-rows are together. Then it applies whatever operation "CD" does to each group separately and stacks the results. Because "CD" returns only the product, the original columns "C" and "D" do not appear in the output.
This is more visible if the original dataframe is
df = pd.DataFrame({
'G': 'x x y y x'.split(),
'C': [1, 2, 1, 2, 3],
'D': [2, 2, 1, 1, 2]})
which looks like this:
G C D
0 x 1 2
1 x 2 2
2 y 1 1
3 y 2 1
4 x 3 2
The first 2 rows are "x", and the 5th row is also "x". The groupby function collects these rows so that all 3 x-rows are together, and then apply runs the multiplication of "C" and "D" on each group. The final result is a Series of products whose index has two levels: the group key "G" and the original row label.
G
x  0    2
   1    4
   4    6
y  2    1
   3    2
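The split, apply and combine steps described above can be written out by hand. This is only a sketch of the idea, not how pandas implements it internally; the names `pieces` and `manual` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y x'.split(),
    'C': [1, 2, 1, 2, 3],
    'D': [2, 2, 1, 1, 2]})

def CD(df):
    return df['C'] * df['D']

# Split: one sub-frame per value of G. Apply: call CD on each sub-frame.
pieces = {key: CD(group) for key, group in df.groupby('G')}

# Combine: stack the per-group Series, using the group key as an outer index level.
manual = pd.concat(pieces, names=['G'])
print(manual)
```

No summing happens anywhere in these three steps, which is why the values match a plain row-wise `df['C'] * df['D']`, only reordered by group.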
Upvotes: 0
Reputation: 4253
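Use agg to apply multiple functions to the grouped columns. Select the columns with a list (double brackets); the tuple form ['C','D'] is no longer accepted by current pandas.
import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y'.split(),
    'C': [1, 2, 1, 2],
    'D': [2, 2, 1, 1]})
grouped = df.groupby('G')[['C', 'D']].agg(['sum', 'mean'])
print(grouped)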
output:
    C        D
  sum mean sum mean
G
x   3  1.5   4  2.0
y   3  1.5   2  1.0
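If the two-level column header is awkward to work with, named aggregation is one alternative. The sketch below assumes pandas 0.25 or later; the output names `C_sum`, `D_sum` and `C_mean` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y'.split(),
    'C': [1, 2, 1, 2],
    'D': [2, 2, 1, 1]})

# Named aggregation: keyword=(column, function); the keyword becomes the column name.
flat = df.groupby('G').agg(
    C_sum=('C', 'sum'),
    D_sum=('D', 'sum'),
    C_mean=('C', 'mean'))
print(flat)

# The flat names make the "sum first, then multiply" step a one-liner.
product = flat['C_sum'] * flat['D_sum']
print(product)
```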
Upvotes: 1
Reputation: 1734
groupby does not do the sum by itself. Aggregate with sum() first and send the result to your function.
>>> CD(df.groupby('G')[['C','D']].sum())
G
x 12
y 6
dtype: int64
Upvotes: 2
Reputation: 863531
First aggregate with sum and then pass the function to DataFrame.pipe:
out = df.groupby('G').sum().pipe(CD)
print(out)
G
x 12
y 6
dtype: int64
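For comparison, the same sum-then-multiply order can be written without the helper function. This is a sketch assuming the four-row dataframe from the question; `sums` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'G': 'x x y y'.split(),
    'C': [1, 2, 1, 2],
    'D': [2, 2, 1, 1]})

# Step 1: one row per group, holding the per-group sums of C and D.
sums = df.groupby('G')[['C', 'D']].sum()

# Step 2: multiply across the columns of each row.
result = sums.prod(axis=1)
print(result)
```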
What is "df.groupby('G').apply(CD)" doing in my example?
No aggregate function is passed, so for each group a new Series is returned containing the row-wise product of the two columns.
You can check this by adding a print:
def CD(df):
    print(df['C'] * df['D'])
    return df['C'] * df['D']
0 2
1 4
dtype: int64
2 1
3 2
dtype: int64
out = df.groupby('G').apply(CD)
print(out)
G
x  0    2
   1    4
y  2    1
   3    2
dtype: int64
Upvotes: 2