Reputation: 20101
I have a dataframe like this:
df = pd.DataFrame({'prob':np.random.uniform(0,1,size), 'target':np.random.randint(0,2, size=size),
'pred':np.random.randint(0,2, size=size)})
That I want to compute cumsum
of a groupby
of a qcut
:
df['box'] = pd.qcut(df['prob'], 10)
My expectation would be to calculate the cumulative function for each group, in order, but instead is calculating a sum for each element:
df['target_1'] = 1- df['target']
ch_curve = df.groupby('box').target.cumsum()/float(df.target.sum())
nch_curve = df.groupby('box').target_1.cumsum()/float(df.target_1.sum())
with the answer
0 0.000000
1 0.018182
2 0.018182
3 0.018182
4 0.000000
5 0.018182
6 0.018182
7 0.018182
8 0.036364
9 0.018182
10 0.000000
11 0.018182
12 0.018182
13 0.036364
14 0.000000
15 0.036364
16 0.036364
17 0.036364
18 0.054545
19 0.000000
20 0.000000
21 0.018182
22 0.018182
23 0.05454
instead of
'(0.0, 0.1)' 0.04
'(0.1, 0.2)' 0.12 #(0.08 + previous 0.04 )
'(0.2, 0.3)' 0.17 #(0.05 + previous 0.12 )
Upvotes: 0
Views: 454
Reputation: 19104
You want to calculate the percentage for each group and then take the cumsum.
In your original code df.groupby('box').target.cumsum()
will take the cumsum
of each group - so you will have one element for each of the elements in the grouped DataFrame. Then the division will be broadcast across all of these elements.
Instead you want to get one summary statistic for each group and then take the cumsum
across these statistics.
ch_curve = (df.groupby('box').target.sum() / df.target.sum()).cumsum()
nch_curve = (df.groupby('box').target_1.sum() / df.target_1.sum()).cumsum()
Upvotes: 4