Reputation: 175
I have a dataframe called df that looks similar to this (except the number of 'Visit' columns goes up to Visit_74
and there are several hundred clients - I have simplified it here).
Client Visit_1 Visit_2 Visit_3 Visit_4 Visit_5
Client_1 2016-05-10 2016-05-25 2016-06-10 2016-06-25 2016-07-10
Client_2 2017-05-10 2017-05-25 2017-06-10 2017-06-25 2017-07-10
Client_3 2018-09-10 2018-09-26 2018-10-10 2018-10-26 2018-11-10
Client_4 2018-10-10 2018-10-26 2018-11-10 2018-11-26 2018-12-10
I want to create a new column called Four_Visits with two values, 0 and 1. I want to set Four_Visits to 1 if, for a given client, at least four of the dates in columns Visit_1 to Visit_5 fall between 2018-10-15 and 2018-12-15. The resulting dataframe should look like this:
Client Visit_1 Visit_2 Visit_3 Visit_4 Visit_5 Four_Visits
Client_1 2016-05-10 2016-05-25 2016-06-10 2016-06-25 2016-07-10 0
Client_2 2017-05-10 2017-05-25 2017-06-10 2017-06-25 2017-07-10 0
Client_3 2018-09-10 2018-09-26 2018-10-10 2018-10-26 2018-11-10 0
Client_4 2018-10-10 2018-10-26 2018-11-10 2018-11-26 2018-12-10 1
Upvotes: 1
Views: 55
Reputation: 59549
Convert the visit columns to datetime if they are not already, then use filter with >= and <= to check, for each row, whether at least 4 visit columns fall between the two dates (inclusive):
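import pandas as pd
# Uncomment if the Visit columns are not already datetime:
# df = df.set_index('Client').apply(pd.to_datetime).reset_index()
# Select the Visit columns once, then flag dates inside the window (NaT compares as False)
visits = df.filter(like='Visit')
in_window = visits.ge(pd.to_datetime('2018-10-15')) & visits.le(pd.to_datetime('2018-12-15'))
# Count in-window dates per row; 1 if there are at least 4, else 0
df['Four_Visits'] = in_window.sum(axis=1).ge(4).astype('int')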
Client Visit_1 Visit_2 Visit_3 Visit_4 Visit_5 Four_Visits
0 Client_1 2016-05-10 2016-05-25 2016-06-10 2016-06-25 2016-07-10 0
1 Client_2 2017-05-10 2017-05-25 2017-06-10 2017-06-25 2017-07-10 0
2 Client_3 2018-09-10 2018-09-26 2018-10-10 2018-10-26 2018-11-10 0
3 Client_4 2018-10-10 2018-10-26 2018-11-10 2018-11-26 2018-12-10 1
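For anyone who wants to try this end to end, here is a minimal self-contained sketch built on the sample data from the question. The helper name flag_min_visits and its parameters are my own, not part of pandas. It also assumes the visit columns are named Visit_1, Visit_2, and so on, and selects them with an anchored regex instead of like='Visit', because the substring 'Visit' also matches the new column name Four_Visits, which could pull that column into the comparison if the code is run a second time.

```python
import pandas as pd

# Sample data copied from the question (dates stored as strings here).
df = pd.DataFrame({
    'Client': ['Client_1', 'Client_2', 'Client_3', 'Client_4'],
    'Visit_1': ['2016-05-10', '2017-05-10', '2018-09-10', '2018-10-10'],
    'Visit_2': ['2016-05-25', '2017-05-25', '2018-09-26', '2018-10-26'],
    'Visit_3': ['2016-06-10', '2017-06-10', '2018-10-10', '2018-11-10'],
    'Visit_4': ['2016-06-25', '2017-06-25', '2018-10-26', '2018-11-26'],
    'Visit_5': ['2016-07-10', '2017-07-10', '2018-11-10', '2018-12-10'],
})


def flag_min_visits(frame, start, end, min_visits=4):
    """Hypothetical helper: 1 where a row has at least `min_visits`
    visit dates in the inclusive window [start, end], else 0."""
    # Anchored regex: matches Visit_1, Visit_2, ... but not Four_Visits,
    # so the helper can be re-run after the flag column exists.
    visits = frame.filter(regex=r'^Visit_\d+$').apply(pd.to_datetime)
    # Missing visits become NaT, which compares as False and is not counted.
    in_window = visits.ge(pd.Timestamp(start)) & visits.le(pd.Timestamp(end))
    return in_window.sum(axis=1).ge(min_visits).astype(int)


df['Four_Visits'] = flag_min_visits(df, '2018-10-15', '2018-12-15')
print(df[['Client', 'Four_Visits']])
```

Because the columns are picked by pattern rather than listed by name, the same call should carry over to the full frame with columns up to Visit_74, as long as they follow the same naming scheme.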
Upvotes: 1