dark_shadow
dark_shadow

Reputation: 43

Find if there is any holidays between two dates in a large dataset?

I am working on a dataset that has some 26 million rows and 13 columns including two datetime columns arr_date and dep_date. I am trying to create a new boolean column to check if there is any US holidays between these dates. I am using apply function to the entire dataframe but the execution time is too slow. The code has been running for more than 48 hours now on Goolge Cloud Platform (24GB ram, 4 core). Is there a faster way to do this?

The dataset looks like this: Sample data

The code I am using is -

import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
    df.apply(lambda x: True if (len(cal.holidays(start=x['dep_date'], end=x['arr_date']))>0 and x['num_days']<20) else False, axis=1)
    return df

df = mark_holiday(df)

Upvotes: 1

Views: 2034

Answers (2)

jottbe
jottbe

Reputation: 4521

Have you already considered using pandas.merge_asof for this?

I could imagine that map and apply with lambda functions cannot be executed that efficiently.

UPDATE: Ah sorry, I just read, that you only need a boolean if there are any holidays inbetween, this makes it much easier. If thats enough you just need to perform steps 1-5 then group the DataFrame that is the result of step5 by start/end date and use count as the aggregate function to have the number of holidays in the ranges. This result you can join to your original dataset similar to step 8 described below. Then fill the rest of the values with fillna(0). Do something like joined_df['includes_holiday']= joined_df['joined_count_column']>0. After that, you can delete the joined_count_column again from your DataFrame, if you like.

If you use pandas_merge_asof you could work through these steps (step 6 and 7 are only necessary if you need to have all the holidays inbetween start and end in your result DataFrame as well, not just the booleans):

  1. Load your holiday records in a DataFrame and index it on the date. The holidays should be one date per line (storing ranges like for christmas from 24th-26th in one row, would make it much more complex).
  2. Create a copy of your dataframe with just the start, end date columns. UPDATE: every start, end date should only occur once in it. E.g. by using groupby.
  3. Use merge_asof with a reasonable tolerance value (if you join over the start of the period, use direction='forward', if you use the end date, use direction='backward' and how='inner'.
  4. As a result you have a merged DataFrame with your start, end columns and the date column from your holiday dataframe. You get only records, for which a holiday was found with the given tolerance, but later you can merge this data back with your original DataFrame. You will probably now have duplicates of your original records.
  5. Then check the joined holiday for your records with indexers by comparing them with the start and end column and remove the holidays, which are not inbetween.
  6. Sort the dataframe you obtained form step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True). Now you should insert a number column that numbers the holidays between your periods (the ones you obtained after step 5) form 1 to ... (for each period starting from 1). This is necesary to use unstack in the next step to get the holidays in columns.
  7. Add an index on your dataframe based on period start date, period end date and the count column you inserted in step 6. Use df.unstack(level=-1) on the DataFrame you prepared in steps 1-7. What you now have, is a condensed DataFrame with your original periods with the holidays arranged columnwise.
  8. Now you only have to merge this DataFrame back to your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left')

The result of this is a file with your original data containing the date ranges and for each date range the holidays that lie inbetween the period are stored in a separte columns each behind the data. Loosely speaking the numbering in step 6 assigns the holidays to the columns and has the effect, that the holidays are always assigned from right to left to the columns (you wouldn't have a holiday in column 3 if column 1 is empty).

Step 6. is probably also a bit tricky, but you can do that for example by adding a series filled with a range and then fixing it, so the numbering starts by 0 or 1 in each group by using shift or grouping by start, end with aggregate({'idcol':'min') and joining the result back to subtract it from the value assigned by the range-sequence.

In all, I think it sounds more complicated, than it is and it should be performed quite efficient. Especially if your periods are not that large, because then after step 5, your result set should be much smaller than your original dataframe, but even if that is not the case, it should still be quite efficient, since it can use compiled code.

Upvotes: 1

Alexander
Alexander

Reputation: 109726

This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.

The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).

from bisect import bisect_left
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000  # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
                   'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.TimedeltaIndex(df['interval'], unit='d')
df = df.drop('interval', axis=1)

# Get a sorted list of holidays since the fist start date.
hols = calendar().holidays(df['start_date'].min())

# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
    df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))

>>> df.head(6)
  start_date   end_date  holiday_in_range
0 2015-07-14 2015-07-31             False
1 2010-12-18 2010-12-30              True  # 2010-12-24
2 2013-04-06 2013-04-16             False
3 2013-09-12 2013-09-24             False
4 2017-10-28 2017-10-31             False
5 2013-12-14 2013-12-29              True  # 2013-12-25

So, for a given start_date timestamp (e.g. 2013-12-14), bisect_right(hols, '2013-12-14') would yield 39, and hols[39] results in 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date, and holiday_in_range is thus True if the end_date is greater than or equal to this holiday value, otherwise the holiday must fall after this end_date.

Upvotes: 1

Related Questions