Reputation: 365
Take the following multi-indexed dataframe:
index_1 index_2 cum_value
0 2020-01 100.00
0 2020-02 50.00
0 2020-03 -50.00
0 2020-04 150.00
0 2020-05 200.00
1 2020-01 25.00
1 2020-02 50.00
1 2020-03 -100.00
1 2020-04 50.00
1 2020-05 200.00
I need to create a new_col
to compute the difference of last cum_value
for each index_1
, if this cum_value
has increased on that month considering the past max value inside that index_1
, on previous months.
The result should be like this:
index_1 index_2 cum_value new_col
0 2020-01 100.00 100.00 --> first positive value on index_1 [0]
0 2020-02 50.00 0.00
0 2020-03 -50.00 0.00
0 2020-04 150.00 50.00 --> (150 - 100)
0 2020-05 200.00 50.00 --> (200 - 150)
1 2020-01 25.00 25.00 --> first positive value on index_1 [1]
1 2020-02 50.00 25.00 --> (50 - 25)
1 2020-03 -100.00 0.00
1 2020-04 50.00 0.00
1 2020-05 200.00 150.00 --> (200 - 50)
The first row with positive value on new_col
must show this value. I don't need negative max values.
This is the rationale to calculate marginal values to pay some taxes.
Upvotes: 2
Views: 78
Reputation: 71689
c = df.groupby(level=0)['cum_value'].cummax()
m = df['cum_value'].ge(c) & df['cum_value'].ge(0)
df['new_col'] = df.loc[m, 'cum_value'].groupby(level=0).diff()
df['new_col'] = df['new_col'].fillna(df['cum_value']).mask(~m, 0)
Let us group
the dataframe on level=0
i.e. index_1
and transform the column cum_value
using cummax
to calculate the cumulative maximum value per level=0
group:
>>> c
index_1 index_2
0 2020-01 100.0
2020-02 100.0
2020-03 100.0
2020-04 150.0
2020-05 200.0
1 2020-01 25.0
2020-02 50.0
2020-03 50.0
2020-04 50.0
2020-05 200.0
Name: cum_value, dtype: float64
Now, compare the cum_value
column with the above calculated cumulative max values to create a boolean mask. Note we are considering only the positive values in cum_value
. The idea behind this boolean mask is that if the current month value is greater than or equal to the maximum value from the previous months, then the ouput of this mask will be True
otherwise False
.
>>> m
index_1 index_2
0 2020-01 True
2020-02 False
2020-03 False
2020-04 True
2020-05 True
1 2020-01 True
2020-02 True
2020-03 False
2020-04 True
2020-05 True
Name: cum_value, dtype: bool
As we are only interested in the values from the cum_value
column that satisfies the above condition, we can use boolean masking to filter such values.
>>> df.loc[m, 'cum_value']
index_1 index_2
0 2020-01 100.0
2020-04 150.0
2020-05 200.0
1 2020-01 25.0
2020-02 50.0
2020-04 50.0
2020-05 200.0
Name: cum_value, dtype: float64
Now group
the above filtered values on level=0
i.e index_1
and use diff
on the cum_value
column to calculate the differences between the current value and previous max value:
>>> df.loc[m, 'cum_value'].groupby(level=0).diff()
index_1 index_2
0 2020-01 NaN
2020-04 50.0
2020-05 50.0
1 2020-01 NaN
2020-02 25.0
2020-04 0.0
2020-05 150.0
Name: cum_value, dtype: float64
Finally, fill the NaN
values in the newly created new_col
and mask the values with 0
which do not satisfy the condition m
:
>>> df
cum_value new_col
index_1 index_2
0 2020-01 100.0 100.0
2020-02 50.0 0.0
2020-03 -50.0 0.0
2020-04 150.0 50.0
2020-05 200.0 50.0
1 2020-01 25.0 25.0
2020-02 50.0 25.0
2020-03 -100.0 0.0
2020-04 50.0 0.0
2020-05 200.0 150.0
Upvotes: 3