Reputation: 143
I have a dataset with two risk model scores and observations that have a certain amount of value. Something like this:
import pandas as pd
df = pd.DataFrame(data={'segment': ['A','A','A','A','A','A','A','B','B','B','B','B'],
                        'model1': [9,4,5,2,9,7,7,8,8,5,6,3],
                        'model2': [9,8,2,4,6,8,8,7,7,7,4,4],
                        'dollars': [15,10,-5,-7,6,7,-2,5,7,3,-1,-3]},
                  columns=['segment','model1','model2','dollars'])
print(df)
segment model1 model2 dollars
0 A 9 9 15
1 A 4 8 10
2 A 5 2 -5
3 A 2 4 -7
4 A 9 6 6
5 A 7 8 7
6 A 7 8 -2
7 B 8 7 5
8 B 8 7 7
9 B 5 7 3
10 B 6 4 -1
11 B 3 4 -3
My goal is to determine the simultaneous risk model thresholds that maximize value, i.e. a cutoff like (model1 >= X) & (model2 >= Y). The risk models are both rank-ordered such that higher numbers mean lower risk and generally higher value.
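For concreteness, a single candidate cutoff can be scored with a boolean mask. Here is a minimal sketch using the dataframe above, evaluating the hypothetical pair (X, Y) = (4, 8) on segment A:

```python
import pandas as pd

df = pd.DataFrame({'segment': ['A','A','A','A','A','A','A','B','B','B','B','B'],
                   'model1':  [9, 4, 5, 2, 9, 7, 7, 8, 8, 5, 6, 3],
                   'model2':  [9, 8, 2, 4, 6, 8, 8, 7, 7, 7, 4, 4],
                   'dollars': [15, 10, -5, -7, 6, 7, -2, 5, 7, 3, -1, -3]})

# total value captured in segment A by the cutoff (model1 >= 4) & (model2 >= 8)
mask = (df['segment'] == 'A') & (df['model1'] >= 4) & (df['model2'] >= 8)
print(df.loc[mask, 'dollars'].sum())  # 30
```

The loop below does exactly this once per observed score combination, which is where the cost comes from.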
I was able to get the desired output using a loop approach:
df_sum = df.groupby(by=['segment','model1','model2'])['dollars'].agg(['sum']).rename(columns={'sum':'dollar_sum'}).reset_index()
df_sum.loc[:,'threshold_sum'] = 0
# this loop works but runs very slowly on my large dataframe
# calculate the sum of dollars for each combination of possible model score thresholds
for row in df_sum.itertuples():
    # subset the original df down to just the observations above the given model scores
    df_temp = df[(df['model1'] >= row.model1) & (df['model2'] >= row.model2) & (df['segment'] == row.segment)]
    # calculate the sum and add it back to the dataframe
    df_sum.loc[row.Index, 'threshold_sum'] = df_temp['dollars'].sum()
#show the max value for each segment
print(df_sum.loc[df_sum.groupby(by=['segment'])['threshold_sum'].idxmax()])
segment model1 model2 dollar_sum threshold_sum
1 A 4 8 10 30
7 B 5 7 3 15
The loop runs incredibly slowly as the size of the dataframe increases. I'm sure there's a faster way to do this (maybe using cumsum() or numpy), but I'm stumped on what it is. Does anyone have a better way? Ideally, any code would be easily extendable to n-many risk models and would output all possible combinations of threshold_sum in case I add other optimization criteria down the road.
Upvotes: 2
Views: 1486
Reputation: 143
Finally found a non-loop approach; it requires some reshaping and cumsum().
import numpy as np

# duplicate the dollars column; this copy will hold the cumulative sums
df['cumsum_dollars'] = df['dollars']
df2 = pd.pivot_table(df, index=['segment','model1','model2'], values=['dollars','cumsum_dollars'],
                     fill_value=0, aggfunc=np.sum)
# descending sort ensures that the cumsum happens in the desired direction
df2 = df2.unstack(fill_value=0).sort_index(ascending=False,axis=0).sort_index(ascending=False,axis=1)
print(df2)
dollars cumsum_dollars
model2 9 8 7 6 4 2 9 8 7 6 4 2
segment model1
B 8 0 0 12 0 0 0 0 0 12 0 0 0
6 0 0 0 0 -1 0 0 0 0 0 -1 0
5 0 0 3 0 0 0 0 0 3 0 0 0
3 0 0 0 0 -3 0 0 0 0 0 -3 0
A 9 15 0 0 6 0 0 15 0 0 6 0 0
7 0 5 0 0 0 0 0 5 0 0 0 0
5 0 0 0 0 0 -5 0 0 0 0 0 -5
4 0 10 0 0 0 0 0 10 0 0 0 0
2 0 0 0 0 -7 0 0 0 0 0 -7 0
From here, take the cumulative sum in both the vertical and horizontal directions using the axis parameter of cumsum().
# per-segment cumsum down the rows, then a plain cumsum across the columns
df2['cumsum_dollars'] = df2['cumsum_dollars'].groupby(level='segment').cumsum().cumsum(axis=1)
print(df2)
dollars cumsum_dollars
model2 9 8 7 6 4 2 9 8 7 6 4 2
segment model1
B 8 0 0 12 0 0 0 0 0 12 12 12 12
6 0 0 0 0 -1 0 0 0 12 12 11 11
5 0 0 3 0 0 0 0 0 15 15 14 14
3 0 0 0 0 -3 0 0 0 15 15 11 11
A 9 15 0 0 6 0 0 15 15 15 21 21 21
7 0 5 0 0 0 0 15 20 20 26 26 26
5 0 0 0 0 0 -5 15 20 20 26 26 21
4 0 10 0 0 0 0 15 30 30 36 36 31
2 0 0 0 0 -7 0 15 30 30 36 29 24
With the cumulative sums calculated, reshape the dataframe back to its original form and take the max of each group.
df3 = df2.stack().reset_index()
print(df3.loc[df3.groupby(by='segment')['cumsum_dollars'].idxmax()])
segment model1 model2 cumsum_dollars dollars
43 A 4 4 36 0
14 B 5 6 15 0
These thresholds, where there aren't any observations, are actually more valuable than any of the options that do have data. Note that idxmax() returns the first occurrence of the maximum, which is sufficient for my purposes. If you need to break ties, additional filtering/sorting would be required before calling idxmax().
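One way to break such ties is to sort before calling idxmax(), since it keeps the first occurrence. A small self-contained sketch, using made-up tied rows and assuming you prefer the highest thresholds among ties:

```python
import pandas as pd

# hypothetical df3-style result with tied maxima within each segment
df3 = pd.DataFrame({'segment': ['A', 'A', 'B', 'B'],
                    'model1':  [2, 4, 3, 5],
                    'model2':  [6, 4, 7, 6],
                    'cumsum_dollars': [36, 36, 15, 15]})

# sort so the preferred tied row (highest thresholds) comes first in each group;
# idxmax() then returns that row's index
df3 = df3.sort_values(['model1', 'model2'], ascending=False)
best = df3.loc[df3.groupby('segment')['cumsum_dollars'].idxmax()]
print(best[['segment', 'model1', 'model2', 'cumsum_dollars']])
```

Any other tie-breaking preference works the same way: encode it in the sort order, then let idxmax() pick the first row per group.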
Upvotes: 1
Reputation: 21264
You'll get some speedup with apply(), using your same approach, but I agree with your hunch: there's probably a faster way.
Here's an apply() solution, with df_sum as:
df_sum = (df.groupby(['segment','model1','model2'])
            .dollars
            .sum()
            .reset_index())
print(df_sum)
segment model1 model2 dollars
0 A 2 4 -7
1 A 4 8 10
2 A 5 2 -5
3 A 7 8 5
4 A 9 6 6
5 A 9 9 15
6 B 3 4 -3
7 B 5 7 3
8 B 6 4 -1
9 B 8 7 12
apply can be combined with groupby:
def get_threshold_sum(row):
    return (df.loc[(df.segment == row.segment) &
                   (df.model1 >= row.model1) &
                   (df.model2 >= row.model2),
                   ["segment", "dollars"]]
            .groupby('segment')
            .sum()
            .dollars)
thresholds = df_sum.apply(get_threshold_sum, axis=1)
mask = thresholds.idxmax()
print(df_sum.loc[mask])
  segment  model1  model2  dollars
1       A       4       8       10
7       B       5       7        3
To see all possible threshold sums, just print thresholds (one column per segment).
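Since the question also mentions numpy, here is a broadcasting sketch (my own, under the same data) that scores every candidate threshold pair per segment, including pairs that never co-occur in the observations, and extends to n models by adding one comparison per model. Memory grows as unique(model1) × unique(model2) × rows per segment, so it suits modest score cardinalities:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'segment': ['A','A','A','A','A','A','A','B','B','B','B','B'],
                   'model1':  [9, 4, 5, 2, 9, 7, 7, 8, 8, 5, 6, 3],
                   'model2':  [9, 8, 2, 4, 6, 8, 8, 7, 7, 7, 4, 4],
                   'dollars': [15, 10, -5, -7, 6, 7, -2, 5, 7, 3, -1, -3]})

best = []
for seg, g in df.groupby('segment'):
    m1 = np.sort(g['model1'].unique())
    m2 = np.sort(g['model2'].unique())
    # keep[i, j, k] is True when observation k passes thresholds (m1[i], m2[j])
    keep = ((g['model1'].to_numpy() >= m1[:, None, None]) &
            (g['model2'].to_numpy() >= m2[None, :, None]))
    # sum the dollars of the surviving observations for every threshold pair
    sums = (keep * g['dollars'].to_numpy()).sum(axis=-1)
    i, j = np.unravel_index(sums.argmax(), sums.shape)
    best.append((seg, int(m1[i]), int(m2[j]), int(sums[i, j])))

print(best)  # [('A', 2, 6, 36), ('B', 3, 7, 15)]
```

Note that argmax breaks ties toward the lowest thresholds: the maxima (36 and 15) agree with the cumsum approach, just reported at a different tied threshold pair.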
Upvotes: 1