Ivan
Ivan

Reputation: 20101

Pandas cumsum on groupby not behaving as expected

I have a dataframe like this:

df = pd.DataFrame({'prob':np.random.uniform(0,1,size), 'target':np.random.randint(0,2, size=size), 
              'pred':np.random.randint(0,2, size=size)})

That I want to compute cumsum of a groupby of a qcut:

df['box'] = pd.qcut(df['prob'], 10)

My expectation would be to calculate the cumulative function for each group, in order, but instead is calculating a sum for each element:

df['target_1'] = 1- df['target']
ch_curve = df.groupby('box').target.cumsum()/float(df.target.sum())
nch_curve = df.groupby('box').target_1.cumsum()/float(df.target_1.sum())

with the answer

0     0.000000
1     0.018182
2     0.018182
3     0.018182
4     0.000000
5     0.018182
6     0.018182
7     0.018182
8     0.036364
9     0.018182
10    0.000000
11    0.018182
12    0.018182
13    0.036364
14    0.000000
15    0.036364
16    0.036364
17    0.036364
18    0.054545
19    0.000000
20    0.000000
21    0.018182
22    0.018182
23    0.05454

instead of

'(0.0, 0.1)'    0.04
'(0.1, 0.2)'    0.12 #(0.08 + previous 0.04 )
'(0.2, 0.3)'    0.17 #(0.05 + previous 0.12 )

Upvotes: 0

Views: 454

Answers (1)

Alex
Alex

Reputation: 19104

You want to calculate the percentage for each group and then take the cumsum.

In your original code df.groupby('box').target.cumsum() will take the cumsum of each group - so you will have one element for each of the elements in the grouped DataFrame. Then the division will be broadcast across all of these elements.

Instead you want to get one summary statistic for each group and then take the cumsum across these statistics.

ch_curve = (df.groupby('box').target.sum() / df.target.sum()).cumsum()
nch_curve = (df.groupby('box').target_1.sum() / df.target_1.sum()).cumsum()

Upvotes: 4

Related Questions