Reputation: 5862
I am converting a function to pandas that loops over a collection and updates each value based on a conditional and a running total. The function looks like this
def calculate_value():
cumulative_amount = 0
for row in rows:
if row['amount'] < 0:
return 0
amount = 0
if row['kind'] == 'A':
amount = row['amount'] * row['input_amount']
elif row['kind'] == 'B':
amount = row['input_amount'] - cumulative_amount
elif row['kind'] == 'C':
amount = row['amount']
cumulative_amount += amount
row['result'] = amount
if row['kind'] == 'B':
break
return rows
Basically, loop over all the rows, and add a result
value. But this result
may depend on a cumulative running total. Further, if we hit a certain value (row['kind'] == 'B'
) we should break and stop processing new rows.
When converting this to pandas, it seems like I should be using apply
. So far, I have the below code, which almost works, but when I try to get cumulative_amount
with shift(-1)
, it's always coming back as nan
.
What's the best way to do this in pandas?
def calculate_value(row: Series):
if row['amount'] < 0 or row.shift(-1)['kind'] == 'B':
row['cumulative_amount'] = 0
row['result'] = 0
return row
amount = 0
if np.isnan(row.shift(-1)['cumulative_amount']):
cumulative_amount = 0
else:
cumulative_amount = row.shift(-1)['cumulative_amount']
if row['kind'] == 'A':
amount = row['amount'] * row['input_amount']
elif row['kind'] == 'B':
amount = row['input_amount'] - cumulative_amount
elif row['kind'] == 'C':
amount = row['amount']
row['cumulative_amount'] = amount + cumulative_amount
row['result'] = amount
return row
df['cumulative_amount'] = 0
new_df = df.apply(lambda x: calculate_value(x), axis=1)
An example of input and desired output are
df = pd.DataFrame({
'kind': {1: 'C', 2: 'E', 3: 'A', 4: 'A', 5: 'B', 6: 'C'},
'amount': {1: -800, 2: 100, 3: 0.5, 4: 0.5, 5: 0, 6: 200},
'input_amount': {1: 800, 2: 800, 3: 800, 4: 800, 5: 800, 6: 800}
})
amount input_amount kind cumulative_amount result
1 -800.0 800 C 0.0 0.0
2 100.0 800 E 0.0 0.0
3 0.5 800 A 400.0 400.0
4 0.5 800 A 800.0 400.0
5 0.0 800 B 800.0 0.0
6 200.0 800 C 800.0 0.0
Upvotes: 0
Views: 1107
Reputation: 3130
If I understand this correctly, only result
for kind 'B'
depends on other rows. So you can start by doing everything else first:
df['result'] = 0.
a = (df.kind == 'A') & (df.amount >= 0)
c = (df.kind == 'C') & (df.amount >= 0)
df.loc[a, 'result'] = df.loc[a, 'amount'] * df.loc[a, 'input_amount']
df.loc[c, 'result'] = df.loc[c, 'amount']
Do the cumsum:
df['cumulative_amount'] = df.result.cumsum()
Correct the value of 'cumulative_amount'
(for all occurrences of type 'B'
):
df.loc[(df.kind == 'B'), 'result'] = df.loc[(df.kind == 'B'), 'input_amount'].values - df.loc[(df.kind.shift(-1) == 'B'), 'cumulative_amount'].values
Correct the values of 'result'
and 'cumulative_amount'
after the first occurrence of 'B'
:
df.loc[(df.kind == 'B').cumsum().shift() > 0, 'result'] = 0
# (df.kind == 'B').cumsum().shift() is a running count of the number of B's encountered prior to the row index,
# so you want to 'stop' once this number is no longer zero
# You could of course do this more simply by figuring out which position in the index has the first B,
# then using .ix or .iloc, but it's actually longer to type out.
df['cumulative_amount'] = df.result.cumsum() # Once more, because we've changed the value of results below B.
Upvotes: 1