How to make a new column based on difference of max values by index?

Question

Take the following multi-indexed dataframe:

index_1   index_2   cum_value
0         2020-01      100.00
0         2020-02       50.00 
0         2020-03      -50.00
0         2020-04      150.00
0         2020-05      200.00    
1         2020-01       25.00
1         2020-02       50.00
1         2020-03     -100.00
1         2020-04       50.00
1         2020-05      200.00

I need to create a new_col to compute the difference of last cum_value for each index_1, if this cum_value has increased on that month considering the past max value inside that index_1, on previous months.

The result should be like this:

index_1   index_2   cum_value   new_col
0         2020-01      100.00    100.00 --> first positive value on index_1 [0]
0         2020-02       50.00      0.00
0         2020-03      -50.00      0.00
0         2020-04      150.00     50.00 --> (150 - 100)
0         2020-05      200.00     50.00 --> (200 - 150)
1         2020-01       25.00     25.00 --> first positive value on index_1 [1]
1         2020-02       50.00     25.00 --> (50 - 25)
1         2020-03     -100.00      0.00
1         2020-04       50.00      0.00
1         2020-05      200.00    150.00 --> (200 - 50)

The first row with positive value on new_col must show this value. I don't need negative max values.

This is the rationale to calculate marginal values to pay some taxes.

Shubham Sharma · Accepted Answer

Code

c = df.groupby(level=0)['cum_value'].cummax()
m = df['cum_value'].ge(c) & df['cum_value'].ge(0)
df['new_col'] = df.loc[m, 'cum_value'].groupby(level=0).diff()
df['new_col'] = df['new_col'].fillna(df['cum_value']).mask(~m, 0)

Explanations

Let us group the dataframe on level=0 i.e. index_1 and transform the column cum_value using cummax to calculate the cumulative maximum value per level=0 group:

>>> c

index_1  index_2
0        2020-01    100.0
         2020-02    100.0
         2020-03    100.0
         2020-04    150.0
         2020-05    200.0
1        2020-01     25.0
         2020-02     50.0
         2020-03     50.0
         2020-04     50.0
         2020-05    200.0
Name: cum_value, dtype: float64

Now, compare the cum_value column with the above calculated cumulative max values to create a boolean mask. Note we are considering only the positive values in cum_value. The idea behind this boolean mask is that if the current month value is greater than or equal to the maximum value from the previous months, then the ouput of this mask will be True otherwise False.

>>> m

index_1  index_2
0        2020-01     True
         2020-02    False
         2020-03    False
         2020-04     True
         2020-05     True
1        2020-01     True
         2020-02     True
         2020-03    False
         2020-04     True
         2020-05     True
Name: cum_value, dtype: bool

As we are only interested in the values from the cum_value column that satisfies the above condition, we can use boolean masking to filter such values.

>>> df.loc[m, 'cum_value']

index_1  index_2
0        2020-01    100.0
         2020-04    150.0
         2020-05    200.0
1        2020-01     25.0
         2020-02     50.0
         2020-04     50.0
         2020-05    200.0
Name: cum_value, dtype: float64

Now group the above filtered values on level=0 i.e index_1 and use diff on the cum_value column to calculate the differences between the current value and previous max value:

>>> df.loc[m, 'cum_value'].groupby(level=0).diff()

index_1  index_2
0        2020-01      NaN
         2020-04     50.0
         2020-05     50.0
1        2020-01      NaN
         2020-02     25.0
         2020-04      0.0
         2020-05    150.0
Name: cum_value, dtype: float64

Finally, fill the NaN values in the newly created new_col and mask the values with 0 which do not satisfy the condition m:

>>> df
                 cum_value  new_col
index_1 index_2                    
0       2020-01      100.0    100.0
        2020-02       50.0      0.0
        2020-03      -50.0      0.0
        2020-04      150.0     50.0
        2020-05      200.0     50.0
1       2020-01       25.0     25.0
        2020-02       50.0     25.0
        2020-03     -100.0      0.0
        2020-04       50.0      0.0
        2020-05      200.0    150.0

How to make a new column based on difference of max values by index?

Answers (1)

Code

Explanations

Related Questions