Reputation: 121
I have a dataframe of economic series whose values can get revised every month, each revision adding a new `value` for a given `date` and indexing it by `realtime_start` (see the dataframe below). `realtime_start` indicates the date at which the `value` for `date` becomes valid. This value expires as soon as another one takes its place.
date | realtime_start | value |
---|---|---|
2020-11-01 | 2020-12-04 | 142629.0 |
2020-11-01 | 2021-01-08 | 142764.0 |
2020-11-01 | 2021-02-05 | 142809.0 |
2020-12-01 | 2021-01-08 | 142624.0 |
2020-12-01 | 2021-02-05 | 142582.0 |
2020-12-01 | 2021-03-05 | 142503.0 |
2021-01-01 | 2021-02-05 | 142631.0 |
2021-01-01 | 2021-03-05 | 142669.0 |
2021-01-01 | 2021-04-02 | 142736.0 |
2021-02-01 | 2021-03-05 | 143048.0 |
2021-02-01 | 2021-04-02 | 143204.0 |
2021-03-01 | 2021-04-02 | 144120.0 |
I would like an easy way to calculate the month-over-month change in `value` based on the last known entry at each `date`.

Calculation method: take the first release from month n (based on `realtime_start`) and subtract the relevant release from month n-1, where the relevant release is the most recent one whose `realtime_start` does not exceed the `realtime_start` of month n's first release.

See the desired output below:
date | MoM change |
---|---|
2020-11-01 | NaN |
2020-12-01 | -140 |
2021-01-01 | 49 |
2021-02-01 | 379 |
2021-03-01 | 916 |
For `2021-03-01`, the MoM change value is 144120.0 - 143204.0 = 916.0.
For `2021-02-01`, the MoM change value is 143048.0 - 142669.0 = 379.0.
For `2021-01-01`, the MoM change value is 142631.0 - 142582.0 = 49.0.
Similarly, I would like to calculate the year-over-year change based on the last known values at each `date` (the actual dataframe extends further into the past). I would also like to calculate the 3-month rolling average of the month-over-month change based on the last known values at each `date`.
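For anyone who wants to experiment, the sample dataframe above can be reconstructed like this (a minimal sketch; I'm assuming `date` and `realtime_start` should be datetime dtype):

```python
import pandas as pd

# Rebuild the revision table shown above
df = pd.DataFrame(
    {
        "date": ["2020-11-01"] * 3 + ["2020-12-01"] * 3
                + ["2021-01-01"] * 3 + ["2021-02-01"] * 2 + ["2021-03-01"],
        "realtime_start": [
            "2020-12-04", "2021-01-08", "2021-02-05",
            "2021-01-08", "2021-02-05", "2021-03-05",
            "2021-02-05", "2021-03-05", "2021-04-02",
            "2021-03-05", "2021-04-02",
            "2021-04-02",
        ],
        "value": [
            142629.0, 142764.0, 142809.0,
            142624.0, 142582.0, 142503.0,
            142631.0, 142669.0, 142736.0,
            143048.0, 143204.0,
            144120.0,
        ],
    }
)
df["date"] = pd.to_datetime(df["date"])
df["realtime_start"] = pd.to_datetime(df["realtime_start"])
```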
Upvotes: 3
Views: 2275
Reputation: 71689
df = df.set_index('date')
# first release for each month
first = df.groupby(level=0).first()
# releases published no later than the next month's first release
m = df['realtime_start'].le(first['realtime_start'].shift(-1))
# last qualifying value per month, shifted to align with the following month
last_val = df['value'].mask(~m).groupby(level=0).last().shift()
mom_change = (first['value'] - last_val).reset_index(name='MoM change')
Set the index of the dataframe to the column `date`, then group the dataframe on `level=0` and aggregate using `first` to select the first row for each unique `date`:
>>> first
realtime_start value
date
2020-11-01 2020-12-04 142629.0
2020-12-01 2021-01-08 142624.0
2021-01-01 2021-02-05 142631.0
2021-02-01 2021-03-05 143048.0
2021-03-01 2021-04-02 144120.0
Shift the `realtime_start` column of the `first` dataframe up by one row (`shift(-1)`, so each date is paired with the next month's first release date), then compare it with the `realtime_start` column in `df` to create a boolean mask `m`:
>>> m
date
2020-11-01 True
2020-11-01 True
2020-11-01 False
2020-12-01 True
2020-12-01 True
2020-12-01 False
2021-01-01 True
2021-01-01 True
2021-01-01 False
2021-02-01 True
2021-02-01 True
2021-03-01 False
Name: realtime_start, dtype: bool
Now mask the values in the `value` column using the above boolean mask, group the masked column on `level=0` and aggregate using `last` to select the last valid value for each unique `date`, then `shift` down one row so each month is aligned with the previous month's last known value:
>>> last_val
date
2020-11-01 NaN
2020-12-01 142764.0
2021-01-01 142582.0
2021-02-01 142669.0
2021-03-01 143204.0
Name: value, dtype: float64
Subtract the calculated `last_val` series from the `value` column of the `first` dataframe to get the MoM change:
>>> mom_change
date MoM change
0 2020-11-01 NaN
1 2020-12-01 -140.0
2 2021-01-01 49.0
3 2021-02-01 379.0
4 2021-03-01 916.0
PS: The dataframe must be sorted on the `date` column (and by `realtime_start` within each date) in order for this solution to work properly.
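Putting the steps together, here is a self-contained sketch that reconstructs the sample data inline and also tacks on the 3-month rolling average the question asked for (the rolling-average line is my extension, not part of the walkthrough above):

```python
import pandas as pd

# Reconstruct the sample data, sorted by date then realtime_start
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-11-01"] * 3 + ["2020-12-01"] * 3
            + ["2021-01-01"] * 3 + ["2021-02-01"] * 2 + ["2021-03-01"]
        ),
        "realtime_start": pd.to_datetime(
            ["2020-12-04", "2021-01-08", "2021-02-05",
             "2021-01-08", "2021-02-05", "2021-03-05",
             "2021-02-05", "2021-03-05", "2021-04-02",
             "2021-03-05", "2021-04-02",
             "2021-04-02"]
        ),
        "value": [142629.0, 142764.0, 142809.0,
                  142624.0, 142582.0, 142503.0,
                  142631.0, 142669.0, 142736.0,
                  143048.0, 143204.0,
                  144120.0],
    }
).sort_values(["date", "realtime_start"]).set_index("date")

first = df.groupby(level=0).first()
m = df["realtime_start"].le(first["realtime_start"].shift(-1))
last_val = df["value"].mask(~m).groupby(level=0).last().shift()
mom_change = (first["value"] - last_val).reset_index(name="MoM change")
# mom_change["MoM change"]: [NaN, -140.0, 49.0, 379.0, 916.0]

# 3-month rolling average of the MoM change (my extension)
mom_change["MoM 3m avg"] = mom_change["MoM change"].rolling(3).mean()
print(mom_change)
```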
Upvotes: 2