Reputation: 96330
Consider a dataframe like the following :
A B C
0 foo b1 25
1 foo b2 400
2 foo b3 75
3 bar b1 100
4 bar b2 200
6 baz b2 100
7 baz b3 100
Say we want to build another dataframe that contains, for each group defined by A
, the fraction of C
each value of B
represents.
That is, using the example above, the result would be:
A B C
0 foo b1 0.050000
1 foo b2 0.800000
2 foo b3 0.150000
3 bar b1 0.333333
4 bar b2 0.666667
6 baz b2 0.500000
7 baz b3 0.500000
To see why, note that for row 0
, we have (25 + 400 + 75)/500 = 0.05
, etc.
Attempt 1:
df.groupby('A').transform(lambda x: x['C'].sum())
I end up with: TypeError: cannot concatenate a non-NDFrame object
Attempt 2: When I try with apply
I almost get what I need:
> df.groupby('A').apply(lambda x: x['C']/x['C'].sum()).reset_index()
A level_1 C
0 bar 3 0.333333
1 bar 4 0.666667
2 baz 6 0.500000
3 baz 7 0.500000
4 foo 0 0.050000
5 foo 1 0.800000
6 foo 2 0.150000
the only thing missing is that the column B
is "lost".
Attempt 3: If I to assign the result of apply
to a new column:
df['D'] = df.groupby('A').apply(lambda x: x['C']/x['C'].sum())
I get: TypeError: incompatible index of inserted column with frame index
.
What can I do to get exactly the output I need? This is all with Pandas 0.17 on Python 3.5.
Upvotes: 3
Views: 2224
Reputation: 109606
Extending the response of @AndyHayden, just assign the result to a new column:
df['C_pct'] = df.C / df.groupby("A")["C"].transform("sum")
>>> df
A B C C_pct
0 foo b1 25 0.050000
1 foo b2 400 0.800000
2 foo b3 75 0.150000
3 bar b1 100 0.333333
4 bar b2 200 0.666667
6 baz b2 100 0.500000
7 baz b3 100 0.500000
You could overwrite the original columns as well if you don't mind losing the original data:
df['C'] = ...
Upvotes: 1
Reputation: 375635
You can use transform, which "puts back" the aggregated results:
In [11]: df.groupby("A")["C"].transform("sum")
Out[11]:
0 500
1 500
2 500
3 300
4 300
6 200
7 200
dtype: int64
and divide (and assign this column back to C):
In [12]: df["C"] / df.groupby("A")["C"].transform("sum")
Out[12]:
0 0.050000
1 0.800000
2 0.150000
3 0.333333
4 0.666667
6 0.500000
7 0.500000
dtype: float64
Upvotes: 6