Gustavo
Gustavo

Reputation: 77

Python Pandas filter rows by days of difference in two columns with weekend and holidays

I have a dataframe with two dates, among other things. I need to filter out rows that have more than two working days difference between these two dates. I must take into consideration weekends and holidays.

*Assuming 10/17/2023 is a holiday...

Example df:

NAME     DATE1         DATE2
CASE1    10/12/2023    10/13/2023  <--- one day difference
CASE2    10/12/2023    10/16/2023  <--- two days difference (weekend)
CASE3    10/12/2023    10/18/2023  <--- three days difference (weekends and holidays)
...
CASEX    10/12/2023    10/19/2023  <--- four days difference (weekends and holidays)

I need to save CASE3 and CASEX (which has more than two days difference) in another dataframe and delete it from this one.

My approach:

date1 = "10/12/2023"
date2 = "10/19/2023"
date1 = pd.to_datetime(date1, format="%m/%d/%Y").date()
date2 = pd.to_datetime(date2, format="%m/%d/%Y").date()
holidays = [pd.to_datetime("10/17/2023",format="%m/%d/%Y").date()]
days = np.busday_count(date1, date2, holidays=holidays)

In "days" I have the correct number. But I don't get to implement it in dataframe to filter it and extract the rows.

Upvotes: 4

Views: 78

Answers (2)

Panda Kim
Panda Kim

Reputation: 13257

Code

You can use the apply() function to count the number of bdate_range values in each row.

cond = df.apply(lambda x: len(pd.bdate_range(x['DATE1'], x['DATE2'], holidays=['2023-10-17'], freq='C')), axis=1).gt(3)
df[cond]

output:

    NAME    DATE1       DATE2
2   CASE3   10/12/2023  10/18/2023
3   CASE4   10/12/2023  10/19/2023

Example

import pandas as pd
data = {'NAME': ['CASE1', 'CASE2', 'CASE3', 'CASE4'], 
        'DATE1': ['10/12/2023', '10/12/2023', '10/12/2023', '10/12/2023'], 
        'DATE2': ['10/13/2023', '10/16/2023', '10/18/2023', '10/19/2023']}
df = pd.DataFrame(data)

df

    NAME    DATE1       DATE2
0   CASE1   10/12/2023  10/13/2023
1   CASE2   10/12/2023  10/16/2023
2   CASE3   10/12/2023  10/18/2023
3   CASE4   10/12/2023  10/19/2023

Upvotes: 2

Shubham Sharma
Shubham Sharma

Reputation: 71707

Code

Define a list of holidays

holidays = np.array([pd.to_datetime("10/17/2023", format="%m/%d/%Y")], dtype='datetime64[D]')

Parse the strings in date columns to datetime type

df['DATE1'] = pd.to_datetime(df['DATE1'], format="%m/%d/%Y")
df['DATE2'] = pd.to_datetime(df['DATE2'], format="%m/%d/%Y")

#     NAME      DATE1      DATE2
# 0  CASE1 2023-10-12 2023-10-13
# 1  CASE2 2023-10-12 2023-10-16
# 2  CASE3 2023-10-12 2023-10-18
# 3  CASEX 2023-10-12 2023-10-19

Cast the dates to datetime64[D] types then use busy_day count to get the diff

days = np.busday_count(df['DATE1'].values.astype("datetime64[D]"), 
                       df['DATE2'].values.astype("datetime64[D]"), 
                       holidays=holidays)

# array([1, 2, 3, 4])

Use boolean indexing to filter the rows

valid_rows = df[days <= 2]
invalid_rows = df[days > 2]

# valid_rows 
#     NAME      DATE1      DATE2
# 0  CASE1 2023-10-12 2023-10-13
# 1  CASE2 2023-10-12 2023-10-16

# invalid_rows 
#     NAME      DATE1      DATE2
# 2  CASE3 2023-10-12 2023-10-18
# 3  CASEX 2023-10-12 2023-10-19

Upvotes: 2

Related Questions