Find if there is any holidays between two dates in a large dataset?

Question

I am working on a dataset that has some 26 million rows and 13 columns including two datetime columns arr_date and dep_date. I am trying to create a new boolean column to check if there is any US holidays between these dates. I am using apply function to the entire dataframe but the execution time is too slow. The code has been running for more than 48 hours now on Goolge Cloud Platform (24GB ram, 4 core). Is there a faster way to do this?

The dataset looks like this: Sample data

The code I am using is -

import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
    df.apply(lambda x: True if (len(cal.holidays(start=x['dep_date'], end=x['arr_date']))>0 and x['num_days']<20) else False, axis=1)
    return df

df = mark_holiday(df)

Alexander · Accepted Answer

This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.

The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).

from bisect import bisect_left
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000  # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
                   'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.TimedeltaIndex(df['interval'], unit='d')
df = df.drop('interval', axis=1)

# Get a sorted list of holidays since the fist start date.
hols = calendar().holidays(df['start_date'].min())

# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
    df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))

>>> df.head(6)
  start_date   end_date  holiday_in_range
0 2015-07-14 2015-07-31             False
1 2010-12-18 2010-12-30              True  # 2010-12-24
2 2013-04-06 2013-04-16             False
3 2013-09-12 2013-09-24             False
4 2017-10-28 2017-10-31             False
5 2013-12-14 2013-12-29              True  # 2013-12-25

So, for a given start_date timestamp (e.g. 2013-12-14), bisect_right(hols, '2013-12-14') would yield 39, and hols[39] results in 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date, and holiday_in_range is thus True if the end_date is greater than or equal to this holiday value, otherwise the holiday must fall after this end_date.

Find if there is any holidays between two dates in a large dataset?

Answers (2)

Related Questions