Reputation: 366
Given the following Python pandas DataFrame:
| num_ID | start_date | end_date | time |
| ------ | ----------- | ---------- | ----------------- |
| 1 | 2022-02-10 | 2022-02-11 | 0 days 09:23:00 |
| 1 | 2022-02-12 | 2022-02-15 | 2 days 12:23:00 |
| 2 | 2022-02-12 | 2022-02-15 | 2 days 10:23:00 |
| 2 | 2022-02-05 | 2022-02-27 | 22 days 02:35:00 |
| 3 | 2022-02-04 | 2022-02-06 | 1 days 19:55:00 |
| 3 | 2022-02-12 | 2022-02-15 | 2 days 05:21:00 |
| 3 | 2022-02-12 | 2022-02-15 | 2 days 05:15:00 |
And given the following DataFrame containing consecutive dates, each flagged as a holiday or not in the `is_holiday` column:
| date | is_holiday | name | other |
| ---------- | ---------- | ---- | ----- |
| 2022-01-01 | True | ABC | red |
| 2022-01-02 | False | CNA | blue |
...
# we assume in this case that the omitted rows have the value False in the is_holiday column
| 2022-02-15 | True | OOO | red |
| 2022-02-16 | True | POO | red |
| 2022-02-17 | False | KTY | blue |
...
| 2023-12-30 | False | TTE | white |
| 2023-12-31 | True | VVV | red |
I want to add a new column `total_days` to the initial DataFrame that counts, for each row, how many of the holidays marked `True` in the second DataFrame fall between the two dates (`start_date` and `end_date`).
Output result example:
| num_ID | start_date | end_date | time | total_days |
| ------ | ----------- | ---------- | ----------------- | -------------- |
| 1 | 2022-02-10 | 2022-02-11 | 0 days 09:23:00 | 0 |
| 1 | 2022-02-12 | 2022-02-15 | 2 days 12:23:00 | 1 |
| 2 | 2022-02-12 | 2022-02-15 | 2 days 10:23:00 | 1 |
| 2 | 2022-02-05 | 2022-02-27 | 22 days 02:35:00 | 2 |
| 3 | 2022-02-04 | 2022-02-06 | 1 days 19:55:00 | 0 |
| 3 | 2022-02-12 | 2022-02-15 | 2 days 05:21:00 | 1 |
| 3 | 2022-02-12 | 2022-02-15 | 2 days 05:15:00 | 1 |
Edit: The solution offered by @jezrael adds extra days by grouping over previous intervals. Wrong result:
| num_ID | start_date | end_date | time | total_days |
| ------ | ----------- | ---------- | ----------------- | -------------- |
| 1 | 2022-02-10 | 2022-02-11 | 0 days 09:23:00 | 0 |
| 1 | 2022-02-12 | 2022-02-15 | 2 days 12:23:00 | 3 |
| 2 | 2022-02-12 | 2022-02-15 | 2 days 10:23:00 | 3 |
| 2 | 2022-02-05 | 2022-02-27 | 22 days 02:35:00 | 2 |
| 3 | 2022-02-04 | 2022-02-06 | 1 days 19:55:00 | 0 |
| 3 | 2022-02-12 | 2022-02-15 | 2 days 05:21:00 | 3 |
New Edit: The new solution offered by @jezrael produces a different error:
| num_ID | start_date | end_date | time | total_days |
| ------ | ----------- | ---------- | ----------------- | -------------- |
| 1 | 2022-02-10 | 2022-02-11 | 0 days 09:23:00 | 0 |
| 1 | 2022-02-12 | 2022-02-15 | 2 days 12:23:00 | 1 |
| 2 | 2022-02-12 | 2022-02-15 | 2 days 10:23:00 | 1 |
| 2 | 2022-02-05 | 2022-02-27 | 22 days 02:35:00 | 2 |
| 3 | 2022-02-04 | 2022-02-06 | 1 days 19:55:00 | 0 |
| 3 | 2022-02-12 | 2022-02-15 | 2 days 05:21:00 | 2 |
| 3 | 2022-02-12 | 2022-02-15 | 2 days 05:15:00 | 2 |
Upvotes: 2
Views: 381
Reputation: 13582
Option 1
There are various ways to achieve OP's goal. One is to use `.apply()` with a custom lambda function, as follows:
df['total_days'] = df.apply(
    lambda x: df2[
        (df2['date'] >= x['start_date'])
        & (df2['date'] <= x['end_date'])
        & (df2['is_holiday'] == True)
    ].shape[0],
    axis=1,
)
[Out]:
num_ID start_date end_date time total_days
0 1 2022-02-10 2022-02-11 0 days 09:23:00 0
1 1 2022-02-12 2022-02-15 2 days 12:23:00 1
2 2 2022-02-12 2022-02-15 2 days 10:23:00 1
3 2 2022-02-05 2022-02-27 22 days 02:35:00 2
4 3 2022-02-04 2022-02-06 1 days 19:55:00 0
5 3 2022-02-12 2022-02-15 2 days 05:21:00 1
6 3 2022-02-12 2022-02-15 2 days 05:15:00 1
Option 2
Assuming one doesn't want to use `.apply()` (see the last note below), one can use a list comprehension with `.query()` and `.itertuples()`:
df['total_days'] = [
    len(df2.query('date >= @x.start_date and date <= @x.end_date and is_holiday == True'))
    for x in df.itertuples()
]
[Out]:
num_ID start_date end_date time total_days
0 1 2022-02-10 2022-02-11 0 days 09:23:00 0
1 1 2022-02-12 2022-02-15 2 days 12:23:00 1
2 2 2022-02-12 2022-02-15 2 days 10:23:00 1
3 2 2022-02-05 2022-02-27 22 days 02:35:00 2
4 3 2022-02-04 2022-02-06 1 days 19:55:00 0
5 3 2022-02-12 2022-02-15 2 days 05:21:00 1
6 3 2022-02-12 2022-02-15 2 days 05:15:00 1
Option 3
Another option would be `.apply()` with `numpy.sum`, as follows:
import numpy as np
df['total_days'] = df.apply(
    lambda x: np.sum(
        (df2['date'] >= x['start_date'])
        & (df2['date'] <= x['end_date'])
        & (df2['is_holiday'] == True)
    ),
    axis=1,
)
[Out]:
num_ID start_date end_date time total_days
0 1 2022-02-10 2022-02-11 0 days 09:23:00 0
1 1 2022-02-12 2022-02-15 2 days 12:23:00 1
2 2 2022-02-12 2022-02-15 2 days 10:23:00 1
3 2 2022-02-05 2022-02-27 22 days 02:35:00 2
4 3 2022-02-04 2022-02-06 1 days 19:55:00 0
5 3 2022-02-12 2022-02-15 2 days 05:21:00 1
6 3 2022-02-12 2022-02-15 2 days 05:15:00 1
Notes:
> that each row passes between the two dates

For that, in Options 1 and 3, one is doing the following:
- `(df2['date'] >= x['start_date'])` indicates that the `date` in `df2` is greater than or equal to the `start_date` in `df`.
- `(df2['date'] <= x['end_date'])` indicates that the `date` in `df2` is less than or equal to the `end_date` in `df`.

> holidays marked True in second DataFrame

And for that, in Options 1 and 3, one is using `(df2['is_holiday'] == True)`.

`pandas.DataFrame.shape` returns a tuple representing the dimensionality of a given DataFrame. For this specific case one wants only its first element (the number of rows), so one uses `.shape[0]`.
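As a small sanity check (with hypothetical mini data, not the frames above): counting the rows of the filtered frame via `.shape[0]` (Option 1) and summing the boolean mask directly (Option 3) give the same number, since each `True` counts as 1:

```python
import pandas as pd

# hypothetical mini holiday table for illustration
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2022-02-14", "2022-02-15", "2022-02-16"]),
    "is_holiday": [False, True, True],
})

mask = (
    (df2["date"] >= pd.Timestamp("2022-02-12"))
    & (df2["date"] <= pd.Timestamp("2022-02-15"))
    & (df2["is_holiday"] == True)
)

n_rows = df2[mask].shape[0]  # Option 1: filter, then count rows
n_sum = mask.sum()           # Option 3: each True counts as 1
print(n_rows, n_sum)  # 1 1
```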
There are strong opinions on the usage of `.apply()`. Therefore, one is advised to read this: When should I (not) want to use pandas apply() in my code?
Upvotes: 0
Reputation: 763
Let me simplify the example and drop uninvolved columns:
df = pd.DataFrame({
'num_ID': [1,1,2,2,3,3,3],
'start_date': ['2022-02-10', '2022-02-12', '2022-02-12', '2022-02-05', '2022-02-04', '2022-02-12', '2022-02-12'],
'end_date': ['2022-02-11', '2022-02-15', '2022-02-15', '2022-02-27', '2022-02-06', '2022-02-15', '2022-02-15']
})
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
df_dates = pd.DataFrame({
'date': pd.date_range('2022-01-01', '2023-12-31', freq='D'),
'is_holiday': [False] * 730
})
df_dates.loc[
df_dates['date'].isin(pd.to_datetime(['2022-01-01', '2022-02-15', '2022-02-16', '2023-12-31'])),
'is_holiday'
] = True
Now we have `df`, which is the initial DataFrame, and `df_dates`, the DataFrame with information about the holidays.
The first step is to set the date as the index of the second DataFrame:
df_dates.set_index('date', inplace=True)
df_dates.sort_index(inplace=True) # just in case the dates are not ordered
Once that's done, we can get the desired metric by applying a simple function to each row:
def get_n_holidays(row):
# The function is pretty simple, feel free to replace it with lambda
return df_dates.loc[row['start_date']:row['end_date'], 'is_holiday'].sum()
df['total_days'] = df.apply(get_n_holidays, axis=1)
The result:
num_ID start_date end_date total_days
0 1 2022-02-10 2022-02-11 0
1 1 2022-02-12 2022-02-15 1
2 2 2022-02-12 2022-02-15 1
3 2 2022-02-05 2022-02-27 2
4 3 2022-02-04 2022-02-06 0
5 3 2022-02-12 2022-02-15 1
6 3 2022-02-12 2022-02-15 1
NOTE: the solution assumes that all dates within the given date ranges are present in `df_dates`.
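If that assumption may not hold, one way to guard against gaps (a sketch, using a hypothetical `df_dates` with a missing day) is to reindex onto a complete daily `date_range`, filling absent days with `False`, before slicing:

```python
import pandas as pd

# hypothetical df_dates with a gap: 2022-02-14 is missing
df_dates = pd.DataFrame({
    "date": pd.to_datetime(["2022-02-12", "2022-02-13", "2022-02-15"]),
    "is_holiday": [False, False, True],
}).set_index("date")

# reindex onto a complete daily range; absent days become non-holidays
full_idx = pd.date_range(df_dates.index.min(), df_dates.index.max(), freq="D")
df_dates = df_dates.reindex(full_idx, fill_value=False)

# label-based slicing now covers every day of the interval
n = df_dates.loc["2022-02-12":"2022-02-15", "is_holiday"].sum()
print(n)  # 1
```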
Upvotes: 0
Reputation: 862641
EDIT: Because the count is needed per row, create a `date_range` for each row and count the matching holiday `date`s with `Index.isin` and `sum`:
L = df1.loc[df1['is_holiday'], 'date'].tolist()
df['total_holidays'] = [pd.date_range(s, e).isin(L).sum()
for s, e in zip(df['start_date'], df['end_date'])]
print (df)
num_ID start_date end_date time total_holidays
0 1 2022-02-10 2022-02-11 0 days 09:23:00 0
1 1 2022-02-12 2022-02-15 2 days 12:23:00 1
2 2 2022-02-12 2022-02-15 2 days 10:23:00 1
3 2 2022-02-05 2022-02-27 22 days 02:35:00 2
4 3 2022-02-04 2022-02-06 1 days 19:55:00 0
5 3 2022-02-12 2022-02-15 2 days 05:21:00 1
6 3 2022-02-12 2022-02-15 2 days 05:15:00 1
Another idea is to use the length of the indices after `Index.intersection`:
L = df1.loc[df1['is_holiday'], 'date'].tolist()
df['total_holidays'] = [len(pd.date_range(s, e).intersection(L))
for s, e in zip(df['start_date'], df['end_date'])]
print (df)
num_ID start_date end_date time total_holidays
0 1 2022-02-10 2022-02-11 0 days 09:23:00 0
1 1 2022-02-12 2022-02-15 2 days 12:23:00 1
2 2 2022-02-12 2022-02-15 2 days 10:23:00 1
3 2 2022-02-05 2022-02-27 22 days 02:35:00 2
4 3 2022-02-04 2022-02-06 1 days 19:55:00 0
5 3 2022-02-12 2022-02-15 2 days 05:21:00 1
6 3 2022-02-12 2022-02-15 2 days 05:15:00 1
Or use an intersection of sets:
sets = set(df1.loc[df1['is_holiday'], 'date'])
df['total_holidays'] = [len(set(pd.date_range(s, e)) & sets)
for s, e in zip(df['start_date'], df['end_date'])]
print (df)
num_ID start_date end_date time total_holidays
0 1 2022-02-10 2022-02-11 0 days 09:23:00 0
1 1 2022-02-12 2022-02-15 2 days 12:23:00 1
2 2 2022-02-12 2022-02-15 2 days 10:23:00 1
3 2 2022-02-05 2022-02-27 22 days 02:35:00 2
4 3 2022-02-04 2022-02-06 1 days 19:55:00 0
5 3 2022-02-12 2022-02-15 2 days 05:21:00 1
6 3 2022-02-12 2022-02-15 2 days 05:15:00 1
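If performance matters, the per-row `date_range` construction can be avoided entirely. A vectorized sketch (an addition, not part of the answer above): with the holiday dates sorted, `numpy.searchsorted` counts how many fall inside each `[start, end]` interval in a single pass:

```python
import numpy as np
import pandas as pd

# hypothetical sorted holiday dates, i.e. what the list L above would contain
holidays = np.sort(pd.to_datetime(["2022-01-01", "2022-02-15", "2022-02-16"]).values)

starts = pd.to_datetime(["2022-02-10", "2022-02-12", "2022-02-05"]).values
ends = pd.to_datetime(["2022-02-11", "2022-02-15", "2022-02-27"]).values

# holidays strictly before each start vs. holidays up to and including each end
lo = np.searchsorted(holidays, starts, side="left")
hi = np.searchsorted(holidays, ends, side="right")
counts = hi - lo
print(counts)  # [0 1 2]
```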
Upvotes: 1
Reputation: 28644
If your data is small, a cartesian join is fine; as your data grows it becomes inefficient, since every row in one dataframe is compared with every row in the other. A better way is to use some form of binary search to get the matches - `conditional_join` from pyjanitor offers an efficient way to do non-equi joins:
# pip install pyjanitor
# you can install the dev version for latest improvements
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)
df2.date = pd.to_datetime(df2.date)
# relevant columns
cols = [*df.columns, 'is_holiday']
out = (df
.conditional_join(
df2.loc[df2.is_holiday],  # is_holiday is already boolean
('start_date', 'date', '<='),
('end_date', 'date', '>='),
how = 'inner')
.loc(axis = 1)[cols]
.groupby(cols[:-1])
.size()
.rename('total_days')
)
Merge back to the original dataframe to get the final output
(df
.merge(out, how = 'left', on = cols[:-1])
# fillna is faster on a Series
.assign(total_days = lambda df: df.total_days.fillna(0, downcast = 'infer'))
)
num_ID start_date end_date time total_days
0 1 2022-02-14 2022-02-15 0 days 09:23:00 1
1 2 2022-02-12 2022-02-15 2 days 10:23:00 1
2 2 2022-02-05 2022-02-27 22 days 02:35:00 2
3 3 2022-02-04 2022-02-06 1 days 19:55:00 0
With the dev version, you could preselect columns and also possibly avoid the merge back to the original dataframe. At any rate, for performance, if you can, avoid a cross join.
Upvotes: 1