Reputation: 6132
I have the following problem. I've got a dataframe with start and end dates for each group. There might be more than one start and end date per group, like this:
group start_date end_date
1 2020-01-03 2020-03-03
1 2020-05-03 2020-06-03
2 2020-02-03 2020-06-03
And another dataframe with one row per date, per group, like this:
group date
1 2020-01-03
1 2020-02-03
1 2020-03-03
1 2020-04-03
1 2020-05-03
1 2020-06-03
2 2020-02-03
3 2020-03-03
4 2020-04-03
.
.
So I want to create a column is_between in an efficient way, ideally avoiding loops, so that I get the following dataframe:
group date is_between
1 2020-01-03 1
1 2020-02-03 1
1 2020-03-03 1
1 2020-04-03 0
1 2020-05-03 1
1 2020-06-03 1
2 2020-02-03 1
3 2020-03-03 1
4 2020-04-03 1
.
.
So it gets a 1 when a group's date falls within one of that group's start/end ranges in the first dataframe. I'm guessing some combination of groupby, where, between and maybe map might do it, but I'm not finding the right one. Any ideas?
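For reference, a minimal construction of the two sample frames (the names df1 and df2 are assumptions that match the first answer below):

import pandas as pd

# Ranges frame: one or more (start_date, end_date) rows per group
df1 = pd.DataFrame({
    'group': [1, 1, 2],
    'start_date': pd.to_datetime(['2020-01-03', '2020-05-03', '2020-02-03']),
    'end_date': pd.to_datetime(['2020-03-03', '2020-06-03', '2020-06-03']),
})

# Per-date frame: one row per (group, date)
df2 = pd.DataFrame({
    'group': [1, 1, 1, 1, 1, 1, 2, 3, 4],
    'date': pd.to_datetime(['2020-01-03', '2020-02-03', '2020-03-03',
                            '2020-04-03', '2020-05-03', '2020-06-03',
                            '2020-02-03', '2020-03-03', '2020-04-03']),
})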
Upvotes: 2
Views: 1513
Reputation: 29635
You could try merge_asof, with by= the group and on= the date and start_date; then check where the date is less than or equal to end_date, and finally assign the result back to the original df2.
import pandas as pd

# merge_asof matches each date to the most recent start_date within the group;
# rows with no match get NaT for end_date, so the comparison below yields False
ser = (pd.merge_asof(df2.reset_index()        # keep the index for later alignment
                        .sort_values('date'),
                     df1.sort_values('start_date'),
                     by='group',
                     left_on='date', right_on='start_date',
                     direction='backward')
         .assign(is_between=lambda x: x.date <= x.end_date)
         .set_index('index')['is_between'])

df2['is_between'] = ser.astype(int)
print(df2)
group date is_between
0 1 2020-01-03 1
1 1 2020-02-03 1
2 1 2020-03-03 1
3 1 2020-04-03 0
4 1 2020-05-03 1
5 1 2020-06-03 1
6 2 2020-02-03 1
7 3 2020-03-03 0
8 4 2020-04-03 0
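One caveat: pd.merge_asof requires the left_on/right_on columns to be sorted (handled by the sort_values calls above) and of a comparable dtype such as datetime64, so if the dates are still plain strings they need converting first. A minimal sketch, assuming the column names from the question:

import pandas as pd

# merge_asof cannot join on object-dtype (string) keys, so convert up front
df1[['start_date', 'end_date']] = df1[['start_date', 'end_date']].apply(pd.to_datetime)
df2['date'] = pd.to_datetime(df2['date'])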
Upvotes: 1
Reputation: 6132
Based on @YOBEN_S's and @Quang Hoang's advice, this did the trick:
import numpy as np
import pandas as pd

# Left-join the ranges, flag rows inside a range, then keep one row per (group, date)
df = df.merge(dic_dates, how='left')
df['is_between'] = np.where(df.date.between(pd.to_datetime(df.start_date),
                                            pd.to_datetime(df.end_date)), 1, 0)
df = (df.sort_values(by=['group', 'date', 'is_between'])
        .drop_duplicates(subset=['group', 'date'], keep='last'))
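For what it's worth, the sort_values/drop_duplicates pair keeps the is_between = 1 row whenever at least one range matched a (group, date) pair. An equivalent way to collapse the duplicates from the left join is a groupby max; a minimal sketch under the same assumptions (per-date frame df before the merge above, ranges in dic_dates, date columns already datetime):

# One row per (group, date): 1 if the date falls inside any of the group's ranges
result = (df.merge(dic_dates, how='left')
            .assign(is_between=lambda x: x.date.between(x.start_date,
                                                        x.end_date).astype(int))
            .groupby(['group', 'date'], as_index=False)['is_between'].max())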
Upvotes: 2