Reputation: 285
I have a daily time series DataFrame spanning many years. I want to create a naive 'this day last year' forecast for the next year. However, I cannot just use something like df.shift(365), because I want to keep working days aligned with working days and weekends with weekends. So I want to select the day from the previous year that matches the weekday and week number, then use this as a forecast going forward (or something equally good).
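A quick illustration of the weekday drift (an arbitrary example date, for illustration only): since 365 % 7 == 1, the value 365 days earlier always comes from a different weekday.

```python
import pandas as pd

# Why df.shift(365) cannot keep working days aligned: 365 % 7 == 1,
# so the date 365 days earlier always falls on a different weekday.
d = pd.Timestamp('2013-04-01')
print(d.day_name())                              # Monday
print((d - pd.Timedelta(days=365)).day_name())   # Sunday
```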
I was debating whether to create two columns for weekday and week number, then create a pivot_table by year, then somehow shift the years along. But I went for this approach below (which does not really work, and is really slow).
def forecast(row):
    # set the year lookup to the previous year; means no values for the first year
    year = row.name.year - 1
    # match the weekday, as it is not great to compare a working day to a weekend
    weekday = row.name.weekday()
    # match the week number too, as it is not great comparing e.g. Feb to May
    week = row.name.isocalendar().week
    return df.loc[(df.index.year == year) &
                  (df.index.weekday == weekday) &
                  (df.index.isocalendar().week == week)]

df['prediction'] = df.apply(lambda row: forecast(row), axis=1)
Below is a clip of my DataFrame. I would want a new column called 'Prediction' that is NaN for the first year, but from then on equal to the previous year's value for the date with the same week number and weekday.
Day Value
2012-04-01 9.56
2012-04-02 9.37
2012-04-03 9.72
2012-04-04 11.27
2012-04-05 12.56
...
2022-03-27 10.52
2022-03-28 10.32
2022-03-29 11.19
2022-03-30 12.66
2022-03-31 15.34
How can I do this? In something like Excel, you could create columns for each parameter, then simply select the year, week number and working day you are after and drag down the formula. I am guessing there is a similar way using Pandas?
Upvotes: 0
Views: 956
Reputation: 14184
Reviewing this question 2 years later, prompted by the comment of @wanna_be_quant, it occurs to me that one can use df.merge for this.
Data sample
import pandas as pd
idx = pd.DatetimeIndex(['2019-12-30', '2019-12-31', '2021-01-04', '2021-01-05'],
                       name='Day')
df = pd.DataFrame({'Value': range(4)}, index=idx)
Value
Day
2019-12-30 0 # 2020 isoweek 1, day 1
2019-12-31 1
2021-01-04 2 # 2021 isoweek 1, day 1
2021-01-05 3
Code
cols = ['week', 'day']
df['Prediction'] = (
    df.join(df.index.isocalendar())
      .assign(prev_year=lambda x: x.year - 1)
      .pipe(lambda x: x.reset_index()
                       .merge(x, left_on=['prev_year'] + cols,
                              right_on=['year'] + cols,
                              suffixes=('', '_prev'))
      ).set_index('Day')['Value_prev']
)
Output:
Value Prediction
Day
2019-12-30 0 NaN
2019-12-31 1 NaN
2021-01-04 2 0.0
2021-01-05 3 1.0
Explanation / intermediate
Use DatetimeIndex.isocalendar with df.join to add the ISO year, week and day columns, then df.assign to add prev_year:
# df.join(...).assign(...)
Value year week day prev_year
Day
2019-12-30 0 2020 1 1 2019
2019-12-31 1 2020 1 2 2019
2021-01-04 2 2021 1 1 2020
2021-01-05 3 2021 1 2 2020
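Spelled out on the sample data above, the pipe step merges this frame with itself; a sketch of the intermediate:

```python
import pandas as pd

# Sketch of the merge step: merge the frame (index reset) with itself,
# pairing each row's prev_year/week/day with another row's year/week/day.
# The inner merge keeps only rows that have a previous-year match, which
# is why the first year ends up NaN after the final assignment.
idx = pd.DatetimeIndex(['2019-12-30', '2019-12-31', '2021-01-04', '2021-01-05'],
                       name='Day')
df = pd.DataFrame({'Value': range(4)}, index=idx)

cols = ['week', 'day']
x = df.join(df.index.isocalendar()).assign(prev_year=lambda x: x.year - 1)
merged = x.reset_index().merge(x, left_on=['prev_year'] + cols,
                               right_on=['year'] + cols,
                               suffixes=('', '_prev'))
print(merged[['Day', 'Value', 'Value_prev']])
# only the 2021 rows survive, carrying last year's values 0 and 1
```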
Then use df.pipe to continue with that result: apply df.reset_index and merge it with itself (without the index reset), joining left on 'prev_year' and right on 'year', with ['week', 'day'] added on both sides. Finally df.set_index('Day'), select "Value_prev", and assign it as df['Prediction'].
Addendum: hourly data
To answer the additional question of @wanna_be_quant: "can this be done for hourly data?"
Add hour=df.index.hour to the df.assign call, and add 'hour' to cols:
cols = ['week', 'day', 'hour']
df['Prediction'] = (
    df.join(df.index.isocalendar())
      .assign(prev_year=lambda x: x.year - 1,
              hour=df.index.hour)
      .pipe(
          # as above
      )
)
Similarly, with second-level resolution: seconds=(df.index - df.index.normalize()).total_seconds() (see DatetimeIndex.normalize and TimedeltaIndex.total_seconds).
Performance comparison
Using df.merge will be much faster than the previously suggested df.apply:
Test:
import numpy as np
import pandas as pd
from datetime import datetime

date_rng = pd.date_range('2000-01-01', '2025-12-31', freq='D')
data = np.random.default_rng(0).uniform(low=1, high=10, size=len(date_rng))
df = pd.DataFrame(data, index=pd.Index(date_rng, name='Day'), columns=['Value'])
def offset_apply(df):
    df['Prediction'] = (
        df.index.isocalendar()
          .apply(lambda x: datetime.fromisocalendar(x.year - 1, x.week, x.day)
                 if x.week != 53 else np.nan, axis=1)
          .map(df.Value)
    )
    return df
def offset_merge(df):
    cols = ['week', 'day']
    df['Prediction'] = (
        df.join(df.index.isocalendar())
          .assign(prev_year=lambda x: x.year - 1)
          .pipe(lambda x: x.reset_index()
                           .merge(x, left_on=['prev_year'] + cols,
                                  right_on=['year'] + cols,
                                  suffixes=('', '_prev'))
          ).set_index('Day')['Value_prev']
    )
    return df
Results:
%timeit offset_apply(df.copy())
1.37 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit offset_merge(df.copy())
11.1 ms ± 319 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Equality check:
offset_apply(df.copy()).equals(offset_merge(df.copy()))
# True
Upvotes: 1
Reputation: 14184
Maybe as follows:
- datetime.date.isocalendar to get year, week, and day.
- datetime.date.fromisocalendar to build the matching date, but minus 1 year.
- pd.Series.map to retrieve the previous-year values.
Setup
import pandas as pd
import numpy as np
from datetime import datetime
date_rng = pd.date_range('2012-01-01','2014-12-31', freq='D')
data = np.random.uniform(low=1, high=10, size=(len(date_rng),))
df = pd.DataFrame(data,
                  index=pd.Index(date_rng, name='Day'),
                  columns=['Value'])
Code
df['Prediction'] = df.index.isocalendar().apply(
    lambda x: datetime.fromisocalendar(x.year - 1, x.week, x.day),
    axis=1).map(df.Value)
print(df[df.Prediction.notna()].head())
Value Prediction
Day
2012-12-30 8.103599 9.405936
2012-12-31 1.846680 1.559113
2013-01-01 9.155027 6.992394
2013-01-02 8.264289 3.495970
2013-01-03 1.571457 6.596947
It may seem odd that the notna data starts at 2012-12-30 rather than 2013-01-01, but this makes sense if you look at the isocalendar results:
df.index.isocalendar().loc['2012-12-30':'2012-12-31']
year week day
Day
2012-12-30 2012 52 7
2012-12-31 2013 1 1
df.index.isocalendar().loc['2012-01-01':'2012-01-02']
year week day
Day
2012-01-01 2011 52 7
2012-01-02 2012 1 1
# So that:
np.array_equal(df.loc['2012-01-01':'2012-01-02', 'Value'].to_numpy(),
df.loc['2012-12-30':'2012-12-31', 'Prediction'].to_numpy())
# True
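One caveat worth noting about this approach: an ISO year has 52 or 53 weeks, so mapping week 53 back to a 52-week previous year raises ValueError (the timing test earlier guards against this with `if x.week != 53 else np.nan`). A small demonstration:

```python
from datetime import datetime

# ISO year 2020 has 53 weeks, so week 53 is a valid lookup there...
print(datetime.fromisocalendar(2020, 53, 1))
# ...but ISO year 2019 has only 52 weeks, so the same lookup fails.
try:
    datetime.fromisocalendar(2019, 53, 1)
except ValueError as err:
    print('week 53 not valid for ISO year 2019:', err)
```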
Upvotes: 2