TiTo
TiTo

Reputation: 865

Shifting column in multiindex dataframe with missting dates

I'd like to shift a column in a multiindex dataframe in order to calculate a regression model with a lagged independent variable. As my time-series has missing values I only want to have the values shifted for known previous days. The df looks like that:

                cost
ID  day
1   31.01.2020  0
1   03.02.2020  0
1   04.02.2020  0.12
1   05.02.2020  0
1   06.02.2020  0
1   07.02.2020  0.08
1   10.02.2020  0
1   11.02.2020  0
1   12.02.2020  0.03
1   13.02.2020  0.1
1   14.02.2020  0

The desired output would like that:

                cost   cost_lag
ID  day
1   31.01.2020  0      NaN
1   03.02.2020  0      NaN
1   04.02.2020  0.12   0
1   05.02.2020  0      0.12
1   06.02.2020  0      0
1   07.02.2020  0.08   0
1   10.02.2020  0      NaN
1   11.02.2020  0      0
1   12.02.2020  0.03   0
1   13.02.2020  0.1    0.03
1   14.02.2020  0      0.1 

Based on this answer to a similar question I've tried the following:

df['cost_lag'] = df.groupby(['id'])['cost'].shift(1)[df.reset_index().day == df.reset_index().day.shift(1) + datetime.timedelta(days=1)]

But that results in an error message I don't understand:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match

I've also tried to fill the missing dates following an approach suggested here:

ams_spend_ranking_df = ams_spend_ranking_df.index.get_level_values(1).apply(lambda x: datetime.datetime(x, 1, 1))

again resulting in an error message which does not enlighten me:

AttributeError: 'DatetimeIndex' object has no attribute 'apply'

Long story short: how can I shift the cost column by 1 day and add NaNs if I don't have data on the previous day?

Upvotes: 2

Views: 285

Answers (1)

jezrael
jezrael

Reputation: 862641

You can add all missing datetimes by DataFrameGroupBy.resample with Resampler.asfreq:

df1 = df.reset_index(level=0).groupby(['ID'])['cost'].resample('d').asfreq()
print (df1)
ID  day       
1   2020-01-31    0.00
    2020-02-01     NaN
    2020-02-02     NaN
    2020-02-03    0.00
    2020-02-04    0.12
    2020-02-05    0.00
    2020-02-06    0.00
    2020-02-07    0.08
    2020-02-08     NaN
    2020-02-09     NaN
    2020-02-10    0.00
    2020-02-11    0.00
    2020-02-12    0.03
    2020-02-13    0.10
    2020-02-14    0.00
Name: cost, dtype: float64

So then if use your solution with DataFrameGroupBy.shift it working like need:

df['cost_lag'] = df1.groupby('ID').shift(1)
print (df)
               cost  cost_lag
ID day                       
1  2020-01-31  0.00       NaN
   2020-02-03  0.00       NaN
   2020-02-04  0.12      0.00
   2020-02-05  0.00      0.12
   2020-02-06  0.00      0.00
   2020-02-07  0.08      0.00
   2020-02-10  0.00       NaN
   2020-02-11  0.00      0.00
   2020-02-12  0.03      0.00
   2020-02-13  0.10      0.03
   2020-02-14  0.00      0.10

Upvotes: 2

Related Questions