Josh
Josh

Reputation: 1561

Adding Time Until Holiday, referencing another dataframe

I have a large dataframe with a date:

df = pd.read_json('{"dt":{"0":"2016-05-12","1":"2016-06-01","2":"2016-05-02","3":"2016-05-11","4":"2016-05-30","5":"2016-05-19","6":"2016-05-25","7":"2016-05-09","8":"2016-05-03","9":"2016-05-18"}}')
df
+------------+
|     dt     |
+------------+
| 2016-05-12 |
| 2016-06-01 |
| 2016-05-02 |
| 2016-05-11 |
| 2016-05-30 |
| 2016-05-19 |
| 2016-05-25 |
| 2016-05-09 |
| 2016-05-03 |
| 2016-05-18 |
+------------+

I also have a reference table that contains specific holidays. In reality, these holidays span several years and so the name is not unique but the combination of date and name, is.

dfHolidays = pd.read_json('{"event_dt":{"0":"2015-04-05","1":"2016-03-27","2":"2015-01-01","3":"2015-02-17","4":"2015-02-18","5":"2015-04-03","6":"2015-04-05","7":"2015-04-21","8":"2015-05-01","9":"2015-06-04","10":"2015-09-07","11":"2015-10-12","12":"2015-11-02","13":"2015-11-15","14":"2015-12-25","15":"2016-01-01","16":"2016-02-09","17":"2016-02-10","18":"2016-03-25","19":"2016-03-27","20":"2016-04-21","21":"2016-05-01","22":"2016-05-26","23":"2016-09-07","24":"2016-10-12","25":"2016-11-02","26":"2016-11-15","27":"2016-12-25"},"event_name":{"0":"Easter","1":"Easter","2":"Ano novo","3":"Carnaval","4":"Quarta-feira de cinzas (In\\u00edcio da Quaresma)","5":"Sexta-feira Santa","6":"P\\u00e1scoa","7":"Tiradentes","8":"Dia Mundial do Trabalho","9":"Corpus Christi","10":"Independ\\u00eancia do Brasil","11":"Nossa Senhora Aparecida","12":"Finados","13":"Proclama\\u00e7\\u00e3o da Rep\\u00fablica","14":"Natal","15":"Ano novo","16":"Carnaval","17":"Quarta-feira de cinzas (In\\u00edcio da Quaresma)","18":"Sexta-feira Santa","19":"P\\u00e1scoa","20":"Tiradentes","21":"Dia Mundial do Trabalho","22":"Corpus Christi","23":"Independ\\u00eancia do Brasil","24":"Nossa Senhora Aparecida","25":"Finados","26":"Proclama\\u00e7\\u00e3o da Rep\\u00fablica","27":"Natal"}}')
dfHolidays
+------------+--------------+
| holiday_dt | holiday_name |
+------------+--------------+
| 2015-04-05 | Easter       |
| 2016-03-27 | Easter       |
| 2015-01-01 | Ano novo     |
| 2015-02-17 | Carnaval     |
+------------+--------------+

I'd like to add features to the main dataframe df

Also, not just next any holiday but each specific one listed

Upvotes: 1

Views: 49

Answers (2)

Kuldip Chaudhari
Kuldip Chaudhari

Reputation: 1112

Make sure you are using pandas version 1.1.0 and Try this

dfHolidays = pd.read_json('{"event_dt":{"0":"2015-04-05","1":"2016-03-27","2":"2015-01-01","3":"2015-02-17","4":"2015-02-18","5":"2015-04-03","6":"2015-04-05","7":"2015-04-21","8":"2015-05-01","9":"2015-06-04","10":"2015-09-07","11":"2015-10-12","12":"2015-11-02","13":"2015-11-15","14":"2015-12-25","15":"2016-01-01","16":"2016-02-09","17":"2016-02-10","18":"2016-03-25","19":"2016-03-27","20":"2016-04-21","21":"2016-05-01","22":"2016-05-26","23":"2016-09-07","24":"2016-10-12","25":"2016-11-02","26":"2016-11-15","27":"2016-12-25"},"event_name":{"0":"Easter","1":"Easter","2":"Ano novo","3":"Carnaval","4":"Quarta-feira de cinzas (In\\u00edcio da Quaresma)","5":"Sexta-feira Santa","6":"P\\u00e1scoa","7":"Tiradentes","8":"Dia Mundial do Trabalho","9":"Corpus Christi","10":"Independ\\u00eancia do Brasil","11":"Nossa Senhora Aparecida","12":"Finados","13":"Proclama\\u00e7\\u00e3o da Rep\\u00fablica","14":"Natal","15":"Ano novo","16":"Carnaval","17":"Quarta-feira de cinzas (In\\u00edcio da Quaresma)","18":"Sexta-feira Santa","19":"P\\u00e1scoa","20":"Tiradentes","21":"Dia Mundial do Trabalho","22":"Corpus Christi","23":"Independ\\u00eancia do Brasil","24":"Nossa Senhora Aparecida","25":"Finados","26":"Proclama\\u00e7\\u00e3o da Rep\\u00fablica","27":"Natal"}}')
dfHolidays.event_dt = dfHolidays.event_dt.apply(lambda d: dt.datetime.strptime(d,'%Y-%m-%d'))

df = pd.read_json('{"dt":{"0":"2016-05-12","1":"2016-06-01","2":"2016-05-02","3":"2016-05-11","4":"2016-05-30","5":"2016-05-19","6":"2016-05-25","7":"2016-05-09","8":"2016-05-03","9":"2016-05-18"}}')
df.dt = df.dt.apply(lambda d: dt.datetime.strptime(d,'%Y-%m-%d'))

df[['days_until_next_holiday', 'days_since_last_holiday']] = None
df['days_until_next_' + dfHolidays.event_name.values]=None
df['days_since_last_' + dfHolidays.event_name.values]=None

for i, r in df.iterrows():
    df.loc[i, 'days_until_next_' + dfHolidays.event_name] = (dfHolidays.event_dt - r['dt']).apply(lambda x: x.days if x.days>0 else None).to_list()
    df.loc[i, 'days_since_last_' + dfHolidays.event_name] = (r['dt'] - dfHolidays.event_dt).apply(lambda x: x.days if x.days>0 else None).to_list()
    df.loc[i, 'days_until_next_holiday'] = df.loc[i, 'days_until_next_' + dfHolidays.event_name].min()
    df.loc[i, 'days_since_last_holiday'] = df.loc[i, 'days_since_last_' + dfHolidays.event_name].min()

result=pd.DataFrame()
for c in df.columns:
    if not df[c].isna().all():
        result[c] = df[c]
df = result

print(df)

Upvotes: 0

ALollz
ALollz

Reputation: 59519

You can use pd.merge_asof to grab the closest date in the future or past to create all of these columns. Essentially you need to change the direction and/or the susbet of dfHolidays you use in the merge:

Prepare Data

df = df.sort_values('dt')
dfHolidays = dfHolidays.sort_values('event_dt')

df['dt'] = pd.to_datetime(df['dt'])
dfHolidays['event_dt'] = pd.to_datetime(dfHolidays['event_dt'])

merge_asof

def calc_days(df, dfHolidays, direction):
    """ Returns Series of integer # of days between events. Absolute value """
    s = pd.merge_asof(df, dfHolidays, left_on='dt', right_on='event_dt',
                      direction=direction)
    s = (s['event_dt'] - s['dt']).dt.days.abs()
    return s

df['until_next'] = calc_days(df, dfHolidays, 'forward')
df['since_prev'] = calc_days(df, dfHolidays, 'backward')
df['until_next_Ano'] = calc_days(df, dfHolidays[dfHolidays['event_name'].eq('Ano novo')], 'forward')
df['since_prev_Ano'] = calc_days(df, dfHolidays[dfHolidays['event_name'].eq('Ano novo')], 'backward')
#...

print(df)
          dt  until_next  since_prev  until_next_Ano  since_prev_Ano
2 2016-05-02          17           8             NaN             129
8 2016-05-03         100           4             NaN             150
7 2016-05-09           1          24             NaN             145
3 2016-05-11          15          10             NaN             131
0 2016-05-12          24           1             NaN             122
9 2016-05-18          98           6             NaN             152
5 2016-05-19           8          17             NaN             138
6 2016-05-25           7          18             NaN             139
4 2016-05-30          14          11             NaN             132
1 2016-06-01          23           2             NaN             123

As your provided dfHoliday had no future Ano novo dates compared with the dates in df, the future gets set to NaN, as nothing merges:

dfHolidays[dfHolidays['event_name'].eq('Ano novo')]
#     event_dt event_name
#2  2015-01-01   Ano novo
#15 2016-01-01   Ano novo

Upvotes: 2

Related Questions