13sen1

Reputation: 285

Python, pandas DataFrame, use previous year, matching workday and week number from date as forecast

I have a daily time series DataFrame spanning many years, and I want to create a naive "same day last year" forecast for the next year. However, I cannot simply use something like df.shift(365), because I want working days to line up with working days and weekends with weekends. So I want to select the day in the previous year that matches the weekday and week number, then use that value as the forecast going forward (or something equally good).

I was debating whether to create two columns for weekday and week number, then create a pivot_table by year, then somehow shift the years along. But I went for this approach below (which does not really work, and is really slow).

def forecast(row):
    # set the year lookup to the previous year, means no values for the first year
    year = row.name.year - 1
    # match the weekday as not great to compare a working day to a weekend
    weekday = row.name.weekday()
    # match the week number too, as not great comparing e.g. Feb to May
    week = row.name.isocalendar().week

    return df.loc[(df.index.year == year) &
                  (df.index.weekday == weekday) &
                  (df.index.isocalendar().week == week)]


df['prediction'] = df.apply(lambda row: forecast(row), axis=1)

Below is a clip of my DataFrame. I would want a new column called 'Prediction' that is NaN for the first year, but from then on equals the value from the previous year's date with the same week number and weekday.

Day         Value
2012-04-01   9.56
2012-04-02   9.37
2012-04-03   9.72
2012-04-04  11.27
2012-04-05  12.56
           ...
2022-03-27  10.52
2022-03-28  10.32
2022-03-29  11.19
2022-03-30  12.66
2022-03-31  15.34

How can I do this? In something like Excel, you could create columns for each parameter, then simply select the year, week number and working day you are after and drag down the formula. I am guessing there is a similar way using Pandas?

Upvotes: 0

Views: 956

Answers (2)

ouroboros1

Reputation: 14184

Reviewing this question 2 years later, prompted by the comment of @wanna_be_quant, it occurs to me that one can use df.merge for this.

Data sample

import pandas as pd

idx = pd.DatetimeIndex(['2019-12-30', '2019-12-31', '2021-01-04', '2021-01-05'],
                       name='Day')
df = pd.DataFrame({'Value': range(4)}, index=idx)

            Value
Day              
2019-12-30      0 # 2020 isoweek 1, day 1
2019-12-31      1
2021-01-04      2 # 2021 isoweek 1, day 1
2021-01-05      3

Code

cols = ['week', 'day']

df['Prediction'] = (
    df.join(df.index.isocalendar())
    .assign(prev_year=lambda x: x.year - 1)
    .pipe(lambda x: x.reset_index()
                     .merge(x, left_on=['prev_year'] + cols, 
                            right_on=['year'] + cols,
                            suffixes=('', '_prev')
                            )
                     ).set_index('Day')['Value_prev']
    )

Output:

            Value  Prediction
Day                          
2019-12-30      0         NaN
2019-12-31      1         NaN
2021-01-04      2         0.0
2021-01-05      3         1.0

Explanation / intermediate

# df.join(...).assign(...)

            Value  year  week  day  prev_year
Day                                          
2019-12-30      0  2020     1    1       2019
2019-12-31      1  2020     1    2       2019
2021-01-04      2  2021     1    1       2020
2021-01-05      3  2021     1    2       2020
  • Use df.pipe to continue with the result: apply df.reset_index and merge the frame with itself (without the index reset), joining on 'prev_year' for the left side and 'year' for the right side, plus ['week', 'day'] on both sides.
  • Finally, restore the index ('Day') with df.set_index, select "Value_prev", and assign as df['Prediction'].
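To make the merge step concrete, here is a standalone sketch of the intermediate right after reset_index/merge, using the same four-row sample as above. Note that merge defaults to an inner join, so the 2019 rows drop out at this point; the NaNs only reappear when the result is index-aligned back to df during the column assignment.

```python
import pandas as pd

idx = pd.DatetimeIndex(['2019-12-30', '2019-12-31', '2021-01-04', '2021-01-05'],
                       name='Day')
df = pd.DataFrame({'Value': range(4)}, index=idx)

cols = ['week', 'day']

# isocalendar columns + previous ISO year, as in the answer above
x = (df.join(df.index.isocalendar())
       .assign(prev_year=lambda x: x.year - 1))

# default how='inner': only rows whose previous ISO year exists in the data survive
merged = x.reset_index().merge(x, left_on=['prev_year'] + cols,
                               right_on=['year'] + cols,
                               suffixes=('', '_prev'))
print(merged[['Day', 'Value', 'Value_prev']])
```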

Addendum: hourly data

To answer the additional question of @wanna_be_quant: "can this be done for hourly data?"

Add hour=df.index.hour inside df.assign and add 'hour' to cols:

cols = ['week', 'day', 'hour']

df['Prediction'] = (
    df.join(df.index.isocalendar())
    .assign(prev_year=lambda x: x.year - 1,
            hour=df.index.hour)
    .pipe(
        # as above
    )
)

Similarly, for second-level resolution: seconds=(df.index - df.index.normalize()).total_seconds() (see DatetimeIndex.normalize + TimedeltaIndex.total_seconds).
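Put together, the hourly variant might look like the following self-contained sketch (the hourly sample below is hypothetical; the only changes versus the daily version are hour=df.index.hour in assign and the extra 'hour' key in cols):

```python
import numpy as np
import pandas as pd

# hypothetical hourly sample: two 48-hour stretches whose ISO week/day align
# (2019-12-30 and 2021-01-04 both start ISO week 1, day 1)
idx = (pd.date_range('2019-12-30', periods=48, freq='h', name='Day')
         .append(pd.date_range('2021-01-04', periods=48, freq='h', name='Day')))
df = pd.DataFrame({'Value': np.arange(len(idx))}, index=idx)

cols = ['week', 'day', 'hour']

df['Prediction'] = (
    df.join(df.index.isocalendar())
    .assign(prev_year=lambda x: x.year - 1,
            hour=df.index.hour)
    .pipe(lambda x: x.reset_index()
                     .merge(x, left_on=['prev_year'] + cols,
                            right_on=['year'] + cols,
                            suffixes=('', '_prev'))
                     ).set_index('Day')['Value_prev']
    )
```

Each hour of 2021-01-04/05 picks up the value from the matching ISO week, day, and hour of 2019-12-30/31, while the first stretch (which has no previous year in the data) stays NaN.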


Performance comparison

Using df.merge will be much faster than the previously suggested df.apply:

Test:

import numpy as np
import pandas as pd
from datetime import datetime

date_rng = pd.date_range('2000-01-01','2025-12-31', freq='D')
data = np.random.default_rng(0).uniform(low=1, high=10, size=len(date_rng))

df = pd.DataFrame(data, index=pd.Index(date_rng, name='Day'), columns=['Value'])

def offset_apply(df):
    df['Prediction'] = (
        df.index.isocalendar().apply(
            lambda x: datetime.fromisocalendar(x.year-1, x.week, x.day)
            if x.week != 53 else np.nan, axis=1).map(df.Value)
        )
    return df

def offset_merge(df):
    cols = ['week', 'day']
    
    df['Prediction'] = (
        df.join(df.index.isocalendar())
        .assign(prev_year=lambda x: x.year - 1)
        .pipe(lambda x: x.reset_index()
              .merge(x, left_on=['prev_year'] + cols, 
                     right_on=['year'] + cols,
                     suffixes=('', '_prev')
                     )
              ).set_index('Day')['Value_prev']
        )
    return df

Results:

%timeit offset_apply(df.copy())
1.37 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit offset_merge(df.copy())
11.1 ms ± 319 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Equality check:

offset_apply(df.copy()).equals(offset_merge(df.copy()))
# True

Upvotes: 1

ouroboros1
ouroboros1

Reputation: 14184

Maybe as follows.

Setup

import pandas as pd
import numpy as np
from datetime import datetime

date_rng = pd.date_range('2012-01-01','2014-12-31', freq='D')
data = np.random.uniform(low=1, high=10, size=(len(date_rng),))

df = pd.DataFrame(data, 
                  index=pd.Index(date_rng, name='Day'), 
                  columns=['Value'])

Code

df['Prediction'] = df.index.isocalendar().apply(
    lambda x: datetime.fromisocalendar(x.year-1, x.week, x.day), 
    axis=1).map(df.Value)

print(df[df.Prediction.notna()].head())

               Value  Prediction
Day                             
2012-12-30  8.103599    9.405936
2012-12-31  1.846680    1.559113
2013-01-01  9.155027    6.992394
2013-01-02  8.264289    3.495970
2013-01-03  1.571457    6.596947

It may seem odd that the non-NaN data starts at 2012-12-30 rather than 2013-01-01, but this makes sense if you look at the isocalendar results:

df.index.isocalendar().loc['2012-12-30':'2012-12-31']

            year  week  day
Day                        
2012-12-30  2012    52    7
2012-12-31  2013     1    1

df.index.isocalendar().loc['2012-01-01':'2012-01-02']

            year  week  day
Day                        
2012-01-01  2011    52    7
2012-01-02  2012     1    1

# So that:
np.array_equal(df.loc['2012-01-01':'2012-01-02', 'Value'].to_numpy(), 
               df.loc['2012-12-30':'2012-12-31', 'Prediction'].to_numpy())
# True
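One caveat with this approach: datetime.fromisocalendar raises a ValueError for dates in ISO week 53 whenever the previous ISO year has only 52 weeks, so longer date ranges need the week-53 guard used in the offset_apply timing function of the other answer. A minimal sketch of the failure case and the guard:

```python
from datetime import datetime

import numpy as np
import pandas as pd

# every day of 2020-12-28 .. 2021-01-03 falls in ISO week 53 of 2020,
# but ISO year 2019 has only 52 weeks, so the unguarded lookup would raise
date_rng = pd.date_range('2020-12-28', '2021-01-03', freq='D')
df = pd.DataFrame({'Value': range(len(date_rng))},
                  index=pd.Index(date_rng, name='Day'))

df['Prediction'] = df.index.isocalendar().apply(
    lambda x: datetime.fromisocalendar(x.year - 1, x.week, x.day)
    if x.week != 53 else np.nan, axis=1).map(df.Value)
# week-53 days simply get no prediction instead of a ValueError
```

The trade-off is that week-53 days end up with NaN predictions; the merge-based answer sidesteps this entirely, since unmatched keys just drop out of the inner join.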

Upvotes: 2
