D. Ross

Reputation: 143

Pandas sum above all possible thresholds

I have a dataset with two risk model scores and observations that have a certain amount of value. Something like this:

import pandas as pd
df = pd.DataFrame(data={'segment':['A','A','A','A','A','A','A','B','B','B','B','B'],
                      'model1':[9,4,5,2,9,7,7,8,8,5,6,3],
                      'model2':[9,8,2,4,6,8,8,7,7,7,4,4],
                      'dollars':[15,10,-5,-7,6,7,-2,5,7,3,-1,-3]},
                      columns=['segment','model1','model2','dollars'])
print(df)

   segment  model1  model2  dollars
0        A       9       9       15
1        A       4       8       10
2        A       5       2       -5
3        A       2       4       -7
4        A       9       6        6
5        A       7       8        7
6        A       7       8       -2
7        B       8       7        5
8        B       8       7        7
9        B       5       7        3
10       B       6       4       -1
11       B       3       4       -3

My goal is to determine the simultaneous risk model thresholds where value is maximized, i.e. a cutoff like (model1 >= X) & (model2 >= Y). The risk-models are both rank-ordered such that higher numbers are lower risk and generally higher value.

I was able to get the desired output using a loop approach:

df_sum = df.groupby(by=['segment','model1','model2'])['dollars'].agg(['sum']).rename(columns={'sum':'dollar_sum'}).reset_index()
df_sum.loc[:,'threshold_sum'] = 0

#this loop works but runs very slowly on my large dataframe
#calculate the sum of dollars for each combination of possible model score thresholds
for row in df_sum.itertuples():
    #subset the original df down to just the observations above the given model scores
    df_temp = df[(df['model1'] >= row.model1) &
                 (df['model2'] >= row.model2) &
                 (df['segment'] == row.segment)]
    #calculate the sum and add it back to the dataframe
    df_sum.loc[row.Index, 'threshold_sum'] = df_temp['dollars'].sum()

#show the max value for each segment
print(df_sum.loc[df_sum.groupby(by=['segment'])['threshold_sum'].idxmax()])

  segment  model1  model2  dollar_sum  threshold_sum
1       A       4       8          10             30
7       B       5       7           3             15

The loop runs incredibly slowly as the size of the dataframe increases. I'm sure there's a faster way to do this (maybe using cumsum() or numpy), but I'm stumped on what it is. Does anyone have a better way to do it? Ideally any code would be easily extendable to n-many risk models and would output all possible combinations of threshold_sum in case I add other optimization criteria down the road.

Upvotes: 2

Views: 1486

Answers (2)

D. Ross

Reputation: 143

Finally found a non-loop approach; it requires some reshaping and cumsum().

import numpy as np

# duplicate the dollars column: one copy stays as-is, the other will be
# overwritten with cumulative sums after pivoting
df['cumsum_dollars'] = df['dollars']
df2 = pd.pivot_table(df,index=['segment','model1','model2'],values=['dollars','cumsum_dollars'],fill_value=0,aggfunc=np.sum)
# descending sort ensures that the cumsum happens in the desired direction
df2 = df2.unstack(fill_value=0).sort_index(ascending=False,axis=0).sort_index(ascending=False,axis=1)
print(df2)

               dollars                  cumsum_dollars                 
model2               9   8   7  6  4  2              9   8   7  6  4  2
segment model1                                                         
B       8            0   0  12  0  0  0              0   0  12  0  0  0
        6            0   0   0  0 -1  0              0   0   0  0 -1  0
        5            0   0   3  0  0  0              0   0   3  0  0  0
        3            0   0   0  0 -3  0              0   0   0  0 -3  0
A       9           15   0   0  6  0  0             15   0   0  6  0  0
        7            0   5   0  0  0  0              0   5   0  0  0  0
        5            0   0   0  0  0 -5              0   0   0  0  0 -5
        4            0  10   0  0  0  0              0  10   0  0  0  0
        2            0   0   0  0 -7  0              0   0   0  0 -7  0

From here, take the cumulative sum in both the horizontal and vertical directions using the axis parameter of the cumsum() function.

df2['cumsum_dollars'] = df2['cumsum_dollars'].groupby(level='segment').cumsum(axis=0).cumsum(axis=1)
print(df2)

               dollars                  cumsum_dollars                    
model2               9   8   7  6  4  2              9   8   7   6   4   2
segment model1                                                            
B       8            0   0  12  0  0  0              0   0  12  12  12  12
        6            0   0   0  0 -1  0              0   0  12  12  11  11
        5            0   0   3  0  0  0              0   0  15  15  14  14
        3            0   0   0  0 -3  0              0   0  15  15  11  11
A       9           15   0   0  6  0  0             15  15  15  21  21  21
        7            0   5   0  0  0  0             15  20  20  26  26  26
        5            0   0   0  0  0 -5             15  20  20  26  26  21
        4            0  10   0  0  0  0             15  30  30  36  36  31
        2            0   0   0  0 -7  0             15  30  30  36  29  24

With the cumulative sums calculated, shape the dataframe back into how it originally looked and take the max of each group.

df3 = df2.stack().reset_index()
print(df3.loc[df3.groupby(by='segment')['cumsum_dollars'].idxmax()])

    segment  model1  model2  cumsum_dollars  dollars
43       A       4       4              36        0
14       B       5       6              15        0  

Interestingly, the winning thresholds fall where there are no observations at all: those cutoffs capture more value than any of the combinations that do have data. Note that idxmax() returns only the first occurrence of the maximum, which is sufficient for my purposes. If you need to break ties differently, additional filtering/sorting would be required before calling idxmax().
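As a sketch of one way to break such ties (using hypothetical tied data, but the same column names as df3 above): sort so the preferred row comes first within each segment, then take the first row per group.

```python
import pandas as pd

# hypothetical tied data: two threshold pairs in segment A share the max sum
df3 = pd.DataFrame({'segment': ['A', 'A', 'B'],
                    'model1':  [4, 7, 5],
                    'model2':  [4, 2, 6],
                    'cumsum_dollars': [36, 36, 15]})

# prefer the stricter (higher) model1/model2 thresholds among ties:
# sort descending so the preferred row comes first, then take the
# first row of each segment
best = (df3.sort_values(['cumsum_dollars', 'model1', 'model2'],
                        ascending=[False, False, False])
           .groupby('segment', as_index=False)
           .first())
print(best)
```

Here segment A resolves to the (7, 2) thresholds rather than (4, 4); any other tie-breaking preference just changes the sort keys.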

Upvotes: 1

andrew_reece

Reputation: 21264

You'll get some speedup with apply(), using the same approach, though I agree with your hunch that there's probably an even faster way.
Here's an apply() solution:

With df_sum as:

df_sum = (df.groupby(['segment','model1','model2'])
            .dollars
            .sum()
            .reset_index()
         )

print(df_sum)
  segment  model1  model2  dollars
0       A       2       4       -7
1       A       4       8       10
2       A       5       2       -5
3       A       7       8        5
4       A       9       6        6
5       A       9       9       15
6       B       3       4       -3
7       B       5       7        3
8       B       6       4       -1
9       B       8       7       12

apply can be combined with groupby:

def get_threshold_sum(row):
    return (df.loc[(df.segment == row.segment) & 
                   (df.model1 >= row.model1) & 
                   (df.model2 >= row.model2), 
                   ["segment","dollars"]]
              .groupby('segment')
              .sum()
              .dollars
           )

thresholds = df_sum.apply(get_threshold_sum, axis=1)
mask = thresholds.idxmax()

df_sum.loc[mask]
  segment  model1  model2  dollars
1       A       4       8       10
7       B       5       7        3

To see all possible threshold sums, just print the thresholds DataFrame.
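Since the question also asks for extendability to n-many risk models, here's a rough sketch of how the row-wise filter generalizes: build the mask in a loop over a `model_cols` list (a name introduced here; add 'model3', etc. as needed).

```python
import pandas as pd

df = pd.DataFrame({'segment': ['A','A','A','A','A','A','A','B','B','B','B','B'],
                   'model1':  [9,4,5,2,9,7,7,8,8,5,6,3],
                   'model2':  [9,8,2,4,6,8,8,7,7,7,4,4],
                   'dollars': [15,10,-5,-7,6,7,-2,5,7,3,-1,-3]})

model_cols = ['model1', 'model2']  # extend with 'model3', ... as needed

# one candidate threshold combination per observed score tuple, per segment
df_sum = df.groupby(['segment'] + model_cols).dollars.sum().reset_index()

def get_threshold_sum(row):
    # keep rows in the same segment that clear every model threshold
    mask = df['segment'] == row['segment']
    for col in model_cols:
        mask &= df[col] >= row[col]
    return df.loc[mask, 'dollars'].sum()

df_sum['threshold_sum'] = df_sum.apply(get_threshold_sum, axis=1)

# segment A: model1>=4 & model2>=8 (sum 30); segment B: model1>=5 & model2>=7 (sum 15)
print(df_sum.loc[df_sum.groupby('segment').threshold_sum.idxmax()])
```

This is still O(rows × candidate thresholds) like the original loop, so it trades speed for generality; the cumsum-based answer above is the faster route for two models.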

Upvotes: 1
