Christian Adib

Reputation: 121

Calculate month over month and year over year change for vintage data

I have a dataframe of economic series whose values can be revised every month. Each revision adds a new value for a given date, indexed by realtime_start (see the dataframe below). realtime_start is the date from which the value for date becomes valid. That value expires as soon as another one takes its place.

date realtime_start value
2020-11-01 2020-12-04 142629.0
2020-11-01 2021-01-08 142764.0
2020-11-01 2021-02-05 142809.0
2020-12-01 2021-01-08 142624.0
2020-12-01 2021-02-05 142582.0
2020-12-01 2021-03-05 142503.0
2021-01-01 2021-02-05 142631.0
2021-01-01 2021-03-05 142669.0
2021-01-01 2021-04-02 142736.0
2021-02-01 2021-03-05 143048.0
2021-02-01 2021-04-02 143204.0
2021-03-01 2021-04-02 144120.0

I would like an easy way to calculate the month-over-month change in value based on the last known entry at date.

Calculation method: take the first release for month n (based on realtime_start) and subtract the relevant release for month n-1. The relevant release is the most recent release for month n-1 whose realtime_start does not exceed the realtime_start of month n's first release.

See the desired output below:

date MoM change
2020-11-01 NaN
2020-12-01 -140
2021-01-01 49
2021-02-01 379
2021-03-01 916

For 2021-03-01, the MoM change value is 144120.0 - 143204.0 = 916.0
For 2021-02-01, the MoM change value is 143048.0 - 142669.0 = 379.0
For 2021-01-01, the MoM change value is 142631.0 - 142582.0 = 49.0

Similarly, I would like to calculate the year-over-year change based on the last known values at date (actual data frame extends further into the past). I would also like to calculate the 3-month (rolling) average of month-over-month change based on last known values at date.


Upvotes: 3

Views: 2275

Answers (1)

Shubham Sharma

Reputation: 71689

Solution

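# Assumes df is sorted by date and, within each date, by realtime_start
df = df.set_index('date')

# First release of each date
first = df.groupby(level=0).first()
# True where a release was already out when the next date's first release was published
m = df['realtime_start'].le(first['realtime_start'].shift(-1))
# Latest such release per date, shifted down one row to line up with the following date
last_val = df['value'].mask(~m).groupby(level=0).last().shift()

# First release of month n minus the relevant release of month n-1
mom_change = (first['value'] - last_val).reset_index(name='MoM change')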
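
To try the snippet above end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question. The datetime conversion and the explicit sort_values call are assumptions added here (the solution itself expects an already sorted frame); everything after that is the same code.

```python
import pandas as pd

# Sample vintage data copied from the question.
rows = [
    ("2020-11-01", "2020-12-04", 142629.0),
    ("2020-11-01", "2021-01-08", 142764.0),
    ("2020-11-01", "2021-02-05", 142809.0),
    ("2020-12-01", "2021-01-08", 142624.0),
    ("2020-12-01", "2021-02-05", 142582.0),
    ("2020-12-01", "2021-03-05", 142503.0),
    ("2021-01-01", "2021-02-05", 142631.0),
    ("2021-01-01", "2021-03-05", 142669.0),
    ("2021-01-01", "2021-04-02", 142736.0),
    ("2021-02-01", "2021-03-05", 143048.0),
    ("2021-02-01", "2021-04-02", 143204.0),
    ("2021-03-01", "2021-04-02", 144120.0),
]
df = pd.DataFrame(rows, columns=["date", "realtime_start", "value"])
df["date"] = pd.to_datetime(df["date"])
df["realtime_start"] = pd.to_datetime(df["realtime_start"])

# Assumption: sort explicitly, because first()/last() depend on row order.
df = df.sort_values(["date", "realtime_start"]).set_index("date")

first = df.groupby(level=0).first()
m = df["realtime_start"].le(first["realtime_start"].shift(-1))
last_val = df["value"].mask(~m).groupby(level=0).last().shift()
mom_change = (first["value"] - last_val).reset_index(name="MoM change")

print(mom_change)
```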

Explanations

Set the index of the dataframe to the date column, then group on level=0 and aggregate with first to select the first row (the first release) for each unique date:

>>> first
           realtime_start     value
date                               
2020-11-01     2020-12-04  142629.0
2020-12-01     2021-01-08  142624.0
2021-01-01     2021-02-05  142631.0
2021-02-01     2021-03-05  143048.0
2021-03-01     2021-04-02  144120.0

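Shift the realtime_start column of the first dataframe up by one row (shift(-1)), so that each date is paired with the first release date of the following month. Then compare it with the realtime_start column in df to create a boolean mask m. The mask is True for releases that were already published when the next month's first release came out: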

>>> m

date
2020-11-01     True
2020-11-01     True
2020-11-01    False
2020-12-01     True
2020-12-01     True
2020-12-01    False
2021-01-01     True
2021-01-01     True
2021-01-01    False
2021-02-01     True
2021-02-01     True
2021-03-01    False
Name: realtime_start, dtype: bool

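Now mask the values in the value column using the above boolean mask, then group this masked column on level=0 and aggregate using last to select the last non-null value for each unique date. Finally, shift the result down by one row so that each date lines up with the relevant release of the previous month:

>>> last_val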

date
2020-11-01         NaN
2020-12-01    142764.0
2021-01-01    142582.0
2021-02-01    142669.0
2021-03-01    143204.0
Name: value, dtype: float64
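
This step works because groupby(...).last() skips NaN by default, so masking the too-late releases leaves the latest remaining one in each group. A toy sketch of that behaviour (the series and labels here are made up for illustration and are not part of the question's data):

```python
import pandas as pd

# Hypothetical toy data: three entries for group "a", one for group "b".
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "a", "a", "b"])
keep = pd.Series([True, True, False, False], index=s.index)

# Entries where keep is False become NaN.
masked = s.mask(~keep)

# last() returns the last non-null entry per group; an all-NaN group stays NaN.
last_valid = masked.groupby(level=0).last()
print(last_valid)
```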

Subtract the calculated last_val column from the value column of the first dataframe to get the MoM change:

>>> mom_change

        date  MoM change
0 2020-11-01         NaN
1 2020-12-01      -140.0
2 2021-01-01        49.0
3 2021-02-01       379.0
4 2021-03-01       916.0

PS: The dataframe must be sorted by date (and by realtime_start within each date) for this solution to work properly, because first and last depend on row order.
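
The question also asks about year-over-year change and a 3-month rolling average, which the solution above does not cover. One possible way to generalise the same rule to an arbitrary lag is pd.merge_asof. This is only a sketch under assumptions: vintage_change, prev_date and prev_value are hypothetical names introduced here, both date columns are assumed to be datetime64, the frame is assumed to hold exactly the three columns shown, and the rolling mean assumes one row per consecutive month with no gaps.

```python
import pandas as pd

def vintage_change(df, months=1):
    """First release of each date minus the latest value of the observation
    `months` earlier that was already published at that first release."""
    # First release per observation date.
    first = df.sort_values(["date", "realtime_start"]).drop_duplicates("date").copy()
    # Observation date to compare against (hypothetical helper column).
    first["prev_date"] = first["date"] - pd.DateOffset(months=months)

    # Every vintage, relabelled so it can be matched on prev_date.
    vintages = df[["date", "realtime_start", "value"]].rename(
        columns={"date": "prev_date", "value": "prev_value"}
    )

    # merge_asof needs both sides sorted on the "on" key; direction="backward"
    # picks the latest vintage with realtime_start <= the first release date.
    merged = pd.merge_asof(
        first.sort_values("realtime_start"),
        vintages.sort_values("realtime_start"),
        on="realtime_start",
        by="prev_date",
        direction="backward",
    ).sort_values("date", ignore_index=True)

    merged["change"] = merged["value"] - merged["prev_value"]
    return merged[["date", "change"]]

# Sample vintage data copied from the question.
rows = [
    ("2020-11-01", "2020-12-04", 142629.0), ("2020-11-01", "2021-01-08", 142764.0),
    ("2020-11-01", "2021-02-05", 142809.0), ("2020-12-01", "2021-01-08", 142624.0),
    ("2020-12-01", "2021-02-05", 142582.0), ("2020-12-01", "2021-03-05", 142503.0),
    ("2021-01-01", "2021-02-05", 142631.0), ("2021-01-01", "2021-03-05", 142669.0),
    ("2021-01-01", "2021-04-02", 142736.0), ("2021-02-01", "2021-03-05", 143048.0),
    ("2021-02-01", "2021-04-02", 143204.0), ("2021-03-01", "2021-04-02", 144120.0),
]
df = pd.DataFrame(rows, columns=["date", "realtime_start", "value"])
df["date"] = pd.to_datetime(df["date"])
df["realtime_start"] = pd.to_datetime(df["realtime_start"])

mom = vintage_change(df, months=1)    # month over month
yoy = vintage_change(df, months=12)   # year over year
mom_3m = mom["change"].rolling(3).mean()  # 3-month average of MoM change
```

On the sample data, mom reproduces the MoM figures above, while yoy is all NaN because the sample does not reach 12 months back. Whether this matches the intended definition of "last known values at date" for the yearly comparison is an interpretation, so check it against a few hand-computed rows of the full series.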

Upvotes: 2
