Reputation: 1561
I have a large dataframe with a date:
df = pd.read_json('{"dt":{"0":"2016-05-12","1":"2016-06-01","2":"2016-05-02","3":"2016-05-11","4":"2016-05-30","5":"2016-05-19","6":"2016-05-25","7":"2016-05-09","8":"2016-05-03","9":"2016-05-18"}}')
df
+------------+
| dt |
+------------+
| 2016-05-12 |
| 2016-06-01 |
| 2016-05-02 |
| 2016-05-11 |
| 2016-05-30 |
| 2016-05-19 |
| 2016-05-25 |
| 2016-05-09 |
| 2016-05-03 |
| 2016-05-18 |
+------------+
I also have a reference table that contains specific holidays. In reality, these holidays span several years and so the name is not unique but the combination of date and name, is.
dfHolidays = pd.read_json('{"event_dt":{"0":"2015-04-05","1":"2016-03-27","2":"2015-01-01","3":"2015-02-17","4":"2015-02-18","5":"2015-04-03","6":"2015-04-05","7":"2015-04-21","8":"2015-05-01","9":"2015-06-04","10":"2015-09-07","11":"2015-10-12","12":"2015-11-02","13":"2015-11-15","14":"2015-12-25","15":"2016-01-01","16":"2016-02-09","17":"2016-02-10","18":"2016-03-25","19":"2016-03-27","20":"2016-04-21","21":"2016-05-01","22":"2016-05-26","23":"2016-09-07","24":"2016-10-12","25":"2016-11-02","26":"2016-11-15","27":"2016-12-25"},"event_name":{"0":"Easter","1":"Easter","2":"Ano novo","3":"Carnaval","4":"Quarta-feira de cinzas (In\\u00edcio da Quaresma)","5":"Sexta-feira Santa","6":"P\\u00e1scoa","7":"Tiradentes","8":"Dia Mundial do Trabalho","9":"Corpus Christi","10":"Independ\\u00eancia do Brasil","11":"Nossa Senhora Aparecida","12":"Finados","13":"Proclama\\u00e7\\u00e3o da Rep\\u00fablica","14":"Natal","15":"Ano novo","16":"Carnaval","17":"Quarta-feira de cinzas (In\\u00edcio da Quaresma)","18":"Sexta-feira Santa","19":"P\\u00e1scoa","20":"Tiradentes","21":"Dia Mundial do Trabalho","22":"Corpus Christi","23":"Independ\\u00eancia do Brasil","24":"Nossa Senhora Aparecida","25":"Finados","26":"Proclama\\u00e7\\u00e3o da Rep\\u00fablica","27":"Natal"}}')
dfHolidays
+------------+--------------+
| holiday_dt | holiday_name |
+------------+--------------+
| 2015-04-05 | Easter |
| 2016-03-27 | Easter |
| 2015-01-01 | Ano novo |
| 2015-02-17 | Carnaval |
+------------+--------------+
I'd like to add features to the main dataframe df
Also, not just next any holiday but each specific one listed
Upvotes: 1
Views: 49
Reputation: 1112
Make sure you are using pandas version 1.1.0 and Try this
dfHolidays = pd.read_json('{"event_dt":{"0":"2015-04-05","1":"2016-03-27","2":"2015-01-01","3":"2015-02-17","4":"2015-02-18","5":"2015-04-03","6":"2015-04-05","7":"2015-04-21","8":"2015-05-01","9":"2015-06-04","10":"2015-09-07","11":"2015-10-12","12":"2015-11-02","13":"2015-11-15","14":"2015-12-25","15":"2016-01-01","16":"2016-02-09","17":"2016-02-10","18":"2016-03-25","19":"2016-03-27","20":"2016-04-21","21":"2016-05-01","22":"2016-05-26","23":"2016-09-07","24":"2016-10-12","25":"2016-11-02","26":"2016-11-15","27":"2016-12-25"},"event_name":{"0":"Easter","1":"Easter","2":"Ano novo","3":"Carnaval","4":"Quarta-feira de cinzas (In\\u00edcio da Quaresma)","5":"Sexta-feira Santa","6":"P\\u00e1scoa","7":"Tiradentes","8":"Dia Mundial do Trabalho","9":"Corpus Christi","10":"Independ\\u00eancia do Brasil","11":"Nossa Senhora Aparecida","12":"Finados","13":"Proclama\\u00e7\\u00e3o da Rep\\u00fablica","14":"Natal","15":"Ano novo","16":"Carnaval","17":"Quarta-feira de cinzas (In\\u00edcio da Quaresma)","18":"Sexta-feira Santa","19":"P\\u00e1scoa","20":"Tiradentes","21":"Dia Mundial do Trabalho","22":"Corpus Christi","23":"Independ\\u00eancia do Brasil","24":"Nossa Senhora Aparecida","25":"Finados","26":"Proclama\\u00e7\\u00e3o da Rep\\u00fablica","27":"Natal"}}')
dfHolidays.event_dt = dfHolidays.event_dt.apply(lambda d: dt.datetime.strptime(d,'%Y-%m-%d'))
df = pd.read_json('{"dt":{"0":"2016-05-12","1":"2016-06-01","2":"2016-05-02","3":"2016-05-11","4":"2016-05-30","5":"2016-05-19","6":"2016-05-25","7":"2016-05-09","8":"2016-05-03","9":"2016-05-18"}}')
df.dt = df.dt.apply(lambda d: dt.datetime.strptime(d,'%Y-%m-%d'))
df[['days_until_next_holiday', 'days_since_last_holiday']] = None
df['days_until_next_' + dfHolidays.event_name.values]=None
df['days_since_last_' + dfHolidays.event_name.values]=None
for i, r in df.iterrows():
df.loc[i, 'days_until_next_' + dfHolidays.event_name] = (dfHolidays.event_dt - r['dt']).apply(lambda x: x.days if x.days>0 else None).to_list()
df.loc[i, 'days_since_last_' + dfHolidays.event_name] = (r['dt'] - dfHolidays.event_dt).apply(lambda x: x.days if x.days>0 else None).to_list()
df.loc[i, 'days_until_next_holiday'] = df.loc[i, 'days_until_next_' + dfHolidays.event_name].min()
df.loc[i, 'days_since_last_holiday'] = df.loc[i, 'days_since_last_' + dfHolidays.event_name].min()
result=pd.DataFrame()
for c in df.columns:
if not df[c].isna().all():
result[c] = df[c]
df = result
print(df)
Upvotes: 0
Reputation: 59519
You can use pd.merge_asof
to grab the closest date in the future or past to create all of these columns. Essentially you need to change the direction and/or the susbet of dfHolidays
you use in the merge:
df = df.sort_values('dt')
dfHolidays = dfHolidays.sort_values('event_dt')
df['dt'] = pd.to_datetime(df['dt'])
dfHolidays['event_dt'] = pd.to_datetime(dfHolidays['event_dt'])
def calc_days(df, dfHolidays, direction):
""" Returns Series of integer # of days between events. Absolute value """
s = pd.merge_asof(df, dfHolidays, left_on='dt', right_on='event_dt',
direction=direction)
s = (s['event_dt'] - s['dt']).dt.days.abs()
return s
df['until_next'] = calc_days(df, dfHolidays, 'forward')
df['since_prev'] = calc_days(df, dfHolidays, 'backward')
df['until_next_Ano'] = calc_days(df, dfHolidays[dfHolidays['event_name'].eq('Ano novo')], 'forward')
df['since_prev_Ano'] = calc_days(df, dfHolidays[dfHolidays['event_name'].eq('Ano novo')], 'backward')
#...
print(df)
dt until_next since_prev until_next_Ano since_prev_Ano
2 2016-05-02 17 8 NaN 129
8 2016-05-03 100 4 NaN 150
7 2016-05-09 1 24 NaN 145
3 2016-05-11 15 10 NaN 131
0 2016-05-12 24 1 NaN 122
9 2016-05-18 98 6 NaN 152
5 2016-05-19 8 17 NaN 138
6 2016-05-25 7 18 NaN 139
4 2016-05-30 14 11 NaN 132
1 2016-06-01 23 2 NaN 123
As your provided dfHoliday had no future Ano novo dates compared with the dates in df
, the future gets set to NaN, as nothing merges:
dfHolidays[dfHolidays['event_name'].eq('Ano novo')]
# event_dt event_name
#2 2015-01-01 Ano novo
#15 2016-01-01 Ano novo
Upvotes: 2