Amelio Vazquez-Reina
Amelio Vazquez-Reina

Reputation: 96330

Fraction of group by column

Consider a dataframe like the following :

     A     B    C 
0  foo    b1   25
1  foo    b2  400
2  foo    b3   75  
3  bar    b1  100
4  bar    b2  200
6  baz    b2  100
7  baz    b3  100

Say we want to build another dataframe that contains, for each group defined by A, the fraction of C each value of B represents.

That is, using the example above, the result would be:

     A     B          C 
0  foo    b1   0.050000
1  foo    b2   0.800000
2  foo    b3   0.150000 
3  bar    b1   0.333333
4  bar    b2   0.666667
6  baz    b2   0.500000
7  baz    b3   0.500000

To see why, note that for row 0, we have (25 + 400 + 75)/500 = 0.05, etc.


Attempts:

Attempt 1:

df.groupby('A').transform(lambda x: x['C'].sum())

I end up with: TypeError: cannot concatenate a non-NDFrame object

Attempt 2: When I try with apply I almost get what I need:

> df.groupby('A').apply(lambda x: x['C']/x['C'].sum()).reset_index()

     A  level_1         C
0  bar        3  0.333333
1  bar        4  0.666667
2  baz        6  0.500000
3  baz        7  0.500000
4  foo        0  0.050000
5  foo        1  0.800000
6  foo        2  0.150000

the only thing missing is that the column B is "lost".

Attempt 3: If I to assign the result of apply to a new column:

df['D'] = df.groupby('A').apply(lambda x: x['C']/x['C'].sum())

I get: TypeError: incompatible index of inserted column with frame index.


What can I do to get exactly the output I need? This is all with Pandas 0.17 on Python 3.5.

Upvotes: 3

Views: 2224

Answers (2)

Alexander
Alexander

Reputation: 109606

Extending the response of @AndyHayden, just assign the result to a new column:

df['C_pct'] = df.C / df.groupby("A")["C"].transform("sum")

>>> df
     A   B    C     C_pct
0  foo  b1   25  0.050000
1  foo  b2  400  0.800000
2  foo  b3   75  0.150000
3  bar  b1  100  0.333333
4  bar  b2  200  0.666667
6  baz  b2  100  0.500000
7  baz  b3  100  0.500000

You could overwrite the original columns as well if you don't mind losing the original data:

df['C'] = ...

Upvotes: 1

Andy Hayden
Andy Hayden

Reputation: 375635

You can use transform, which "puts back" the aggregated results:

In [11]: df.groupby("A")["C"].transform("sum")
Out[11]:
0    500
1    500
2    500
3    300
4    300
6    200
7    200
dtype: int64

and divide (and assign this column back to C):

In [12]: df["C"] / df.groupby("A")["C"].transform("sum")
Out[12]:
0    0.050000
1    0.800000
2    0.150000
3    0.333333
4    0.666667
6    0.500000
7    0.500000
dtype: float64

Upvotes: 6

Related Questions