Reputation: 285
I have been trying to wrap my head around a, in theory, simple task, but am having real difficulty coding it up. It is a kind of code test / brain teaser!
On page 14 of this document there is a holiday code ruleset that I am trying to translate from English to Python. I essentially want a column that states which rank (1-16, I am ignoring beyond that) that the particular date corresponds to: https://www.gasgovernance.co.uk/sites/default/files/ggf/2020-02/3.0%20Modelling%20Approach%202020_Final.pdf
If one was to have a dataset of many years length, for example from 2012 to 2022, how would I code up in Python this rule set? I don't want to do it manually as I might want to do the same thing for another country of something, so want to have the actual logic.
Below is my attempt so far. I start off with getting the holiday calendar for Great Britain. I then create a 'rank' column as a placeholder. Then I am trying to reference off these holidays, but the logic is quite difficult (for me).
Wondering if there is a better way or how exactly to do this?
import holidays
import numpy as np
import pandas as pd
dates = pd.date_range(start='2012-01-01', end='2022-12-31', freq='D')
df = pd.DataFrame({'Value': np.random.rand(len(dates))}, index=dates)
def get_british_holidays(df):
gb_holidays = holidays.UnitedKingdom()
holiday_dates = pd.Series(index=df.index)
for single_date in holiday_dates.index:
if single_date in gb_holidays:
holiday = gb_holidays[single_date]
holiday_parts = [part.strip() for part in holiday.split(',')]
holiday_parts = [part for part in holiday_parts if '[Northern Ireland]' not in part]
holiday_dates.loc[single_date] = ', '.join(holiday_parts)
holiday_dates = holiday_dates.replace('', np.nan).fillna(value=np.nan)
holiday_dates = holiday_dates.to_frame(name='holiday')
return pd.merge(df, holiday_dates, left_index=True, right_index=True, how='left')
df = get_british_holidays(df)
df['rank'] = 0
# first attempt
for date, rank in df.loc[df.index.month.isin([12, 1])].iterrows():
if df.loc[date, 'holiday'] == 'Christmas Day':
if date.dayofweek <= 3:
holiday_period_start = pd.Timestamp(date.year, date.month, date.day - 3 - date.dayofweek)
else:
holiday_period_start = pd.Timestamp(date.year, date.month, date.day - date.dayofweek)
if df.loc[date, 'holiday'] == 'New Year Holiday [Scotland]':
holiday_period_end = pd.Timestamp(date.year, date.month, date.day + date.dayofweek)
# second attempt
for date, rank in df.loc[df.index.month.isin([12, 1])].iterrows():
if date in pd.date_range(pd.Timestamp(date.year, 12, 24), pd.Timestamp(date.year + 1, 1, 2)):
df.loc[date, 'rank'] = 5
elif date == pd.Timestamp(date.year, 12, 25):
df.loc[date, 'rank'] = 1
elif date in pd.date_range(pd.Timestamp(date.year, 12, 26), pd.Timestamp(date.year, 12, 27)) or \
date in pd.date_range(pd.Timestamp(date.year, 1, 1), pd.Timestamp(date.year, 1, 2)):
df.loc[date, 'rank'] = 2
elif date.dayofweek < 5 and date in pd.date_range(pd.Timestamp(date.year, 12, 24),
pd.Timestamp(date.year, 1, 1)):
if df.loc[date, 'rank'] == 0:
df.loc[date, 'rank'] = 3
Upvotes: 1
Views: 234
Reputation: 332
I would break it down first by the obvious, the groups: "Christmas/New Year", Easter, "First Bank Holiday in May", etc. Taking care of one group at a time, try to find what the group main condition is. For example, let's work Christmas/New Year:
I would apply similar approach to the other groups. Each point i make here, i would turn into a small function to keep things organized and easy to read.
I am not clear on your intent of use for this. If it were me, i would store all days that are holidays and their codes in a dictionary first. Then, when done, convert it to whatever you are trying to actually use. Basically skip panda until the end.
Upvotes: 0