Kev
Kev

Reputation: 25

Pandas groupby cut sum

I am trying to bin a column and take the sum for rows that satisfy an indexing condition.

Here is an example of the kind of DataFrame I'm working with:

df 

A    B    C    D
a    aa   0    1
a    aa   1    2
a    aa   2    10
a    bb   0    4
a    bb   1    5
a    bb   2    60
b    aa   0    7
b    aa   1    8
b    aa   2    90
b    bb   0    10
b    bb   1    11
b    bb   2    12

What I'd like to do is first bin the column of features C (ie 0,1 is f0 and 2 is f1) and sum the values in column D within each bin and still keep the respective A and B` index pairs intact. The result of which would look like:

A    B    C    D
a    aa   f0   3
a    aa   f1   10
a    bb   f0   9
a    bb   f1   60
b    aa   f0   15
b    aa   f1   90
b    bb   f0   21
b    bb   f1   12

I've done the binning using pandas.cut like so:

cut = pd.cut(df['C'], bins=[0,2,3], right=False, labels=['f0','f1'])

But I am having trouble keeping maintaining the index pairs (columns A,B) intact using df.groupby(cut) and I would like to not have to iterate through all possible (A,B) pairs and bin + sum them myself.

Any help would be appreciated.

Upvotes: 1

Views: 621

Answers (2)

Tom
Tom

Reputation: 8790

You can do a groupby on ['A','B','C'] after reassigning the values in C:

df['C'] = np.where(df['C'].isin([0,1]), 'f0', 'f1')
df = df.groupby(['A','B','C']).sum().reset_index()

Result:

   A   B   C   D
0  a  aa  f0   3
1  a  aa  f1  10
2  a  bb  f0   9
3  a  bb  f1  60
4  b  aa  f0  15
5  b  aa  f1  90
6  b  bb  f0  21
7  b  bb  f1  12

Upvotes: 1

Paul H
Paul H

Reputation: 68156

You need to add the additional columns that define your desired groups into your groupby statement:

from io import StringIO

import pandas

df = pandas.read_table(StringIO("""\
A    B    C    D
a    aa   0    1
a    aa   1    2
a    aa   2    10
a    bb   0    4
a    bb   1    5
a    bb   2    60
b    aa   0    7
b    aa   1    8
b    aa   2    90
b    bb   0    10
b    bb   1    11
b    bb   2    12"""), sep='\s+')

output = (
    df.assign(C=pandas.cut(df['C'], bins=[0,2,3], right=False, labels=['f0','f1']))
      .groupby(['A', 'B', 'C'], as_index=False)  
      .sum()  
)

 A   B   C   D
 a  aa  f0   3
 a  aa  f1  10
 a  bb  f0   9
 a  bb  f1  60
 b  aa  f0  15
 b  aa  f1  90
 b  bb  f0  21
 b  bb  f1  12

Upvotes: 1

Related Questions