Pandas cumsum on groupby not behaving as expected

Question

I have a dataframe like this:

df = pd.DataFrame({'prob':np.random.uniform(0,1,size), 'target':np.random.randint(0,2, size=size), 
              'pred':np.random.randint(0,2, size=size)})

That I want to compute cumsum of a groupby of a qcut:

df['box'] = pd.qcut(df['prob'], 10)

My expectation would be to calculate the cumulative function for each group, in order, but instead is calculating a sum for each element:

df['target_1'] = 1- df['target']
ch_curve = df.groupby('box').target.cumsum()/float(df.target.sum())
nch_curve = df.groupby('box').target_1.cumsum()/float(df.target_1.sum())

with the answer

0     0.000000
1     0.018182
2     0.018182
3     0.018182
4     0.000000
5     0.018182
6     0.018182
7     0.018182
8     0.036364
9     0.018182
10    0.000000
11    0.018182
12    0.018182
13    0.036364
14    0.000000
15    0.036364
16    0.036364
17    0.036364
18    0.054545
19    0.000000
20    0.000000
21    0.018182
22    0.018182
23    0.05454

instead of

'(0.0, 0.1)'    0.04
'(0.1, 0.2)'    0.12 #(0.08 + previous 0.04 )
'(0.2, 0.3)'    0.17 #(0.05 + previous 0.12 )

Alex · Accepted Answer

You want to calculate the percentage for each group and then take the cumsum.

In your original code df.groupby('box').target.cumsum() will take the cumsum of each group - so you will have one element for each of the elements in the grouped DataFrame. Then the division will be broadcast across all of these elements.

Instead you want to get one summary statistic for each group and then take the cumsum across these statistics.

ch_curve = (df.groupby('box').target.sum() / df.target.sum()).cumsum()
nch_curve = (df.groupby('box').target_1.sum() / df.target_1.sum()).cumsum()

Pandas cumsum on groupby not behaving as expected

Answers (1)

Related Questions