Reputation: 786
I have a dataframe with some IDs and a few start dates and end dates corresponding to that id, that looks like,
df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,2],
'start_date': ['2016-07-27 16:07:00','2016-10-20 08:10:00','2016-12-08 10:12:00','2017-07-16 11:07:00','2017-07-16 16:07:00','2016-07-27 16:07:00','2016-10-20 08:10:00','2016-12-08 10:12:00','2017-07-16 11:07:00','2017-07-16 16:07:00'],
'end_date': ['2016-07-29 15:07:00','2017-08-10 07:04:00','2017-03-07 12:03:00','2017-07-18 11:07:00','2017-09-20 12:09:00','2016-07-29 15:07:00','2017-08-10 07:04:00','2017-03-07 12:03:00','2017-07-18 11:07:00','2017-09-20 12:09:00']})
id start_date end_date
1 2016-07-27 16:07:00 2016-07-29 15:07:00
1 2016-10-20 08:10:00 2017-08-10 07:04:00
1 2016-12-08 10:12:00 2017-03-07 12:03:00
1 2017-07-16 11:07:00 2017-07-18 11:07:00
1 2017-07-16 16:07:00 2017-09-20 12:09:00
2 2016-07-27 16:07:00 2016-07-29 15:07:00
2 2016-10-20 08:10:00 2017-08-10 07:04:00
2 2016-12-08 10:12:00 2017-03-07 12:03:00
2 2017-07-16 11:07:00 2017-07-18 11:07:00
2 2017-07-16 16:07:00 2017-09-20 12:09:00
I want to see if the start_date of a row is between any another start and end date for that same id and with how many records the start_date is in between the start and end date. The resulting output should look like,
id start_date end_date count_col
1 2016-07-27 16:07:00 2016-07-29 15:07:00 0
1 2016-10-20 08:10:00 2017-08-10 07:04:00 0
1 2016-12-08 10:12:00 2017-03-07 12:03:00 1
1 2017-07-16 11:07:00 2017-07-18 11:07:00 1
1 2017-07-16 16:07:00 2017-09-20 12:09:00 2
2 2016-07-27 16:07:00 2016-07-29 15:07:00 0
2 2016-10-20 08:10:00 2017-08-10 07:04:00 0
2 2016-12-08 10:12:00 2017-03-07 12:03:00 1
2 2017-07-16 11:07:00 2017-07-18 11:07:00 1
2 2017-07-16 16:07:00 2017-09-20 12:09:00 2
I tried something like,
def start_date_compare(start_date):
date_within = df.apply(lambda x: ((x['start_date']<=start_date) & (x['end_date']>=start_date)), axis=1)
return date_within.sum(axis=0)
df['count_col'] = df.groupby(['id'])['start_date'].apply(lambda x: start_date_compare(x))
But this checks a row with itself too, and also doesn't compare with the same id alone.
Upvotes: 2
Views: 208
Reputation: 862441
There is always added original row, so only subtract 1
, also lambda function here is not necessary:
EDIT:
For test values per groups use:
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)
def start_date_compare(subdf):
date_within = subdf.apply(lambda x: ((x['start_date']<=subdf['start_date']) &
(x['end_date']>=subdf['start_date'])), axis=1)
subdf['count_col'] = date_within.sum(axis=0) - 1
return subdf
df = df.groupby('id').apply(start_date_compare)
print (df)
id start_date end_date count_col
0 1 2016-07-27 16:07:00 2016-07-29 15:07:00 0
1 1 2016-10-20 08:10:00 2017-08-10 07:04:00 0
2 1 2016-12-08 10:12:00 2017-03-07 12:03:00 1
3 1 2017-07-16 11:07:00 2017-07-18 11:07:00 1
4 1 2017-07-16 16:07:00 2017-09-20 12:09:00 2
5 2 2016-07-27 16:07:00 2016-07-29 15:07:00 0
6 2 2016-10-20 08:10:00 2017-08-10 07:04:00 0
7 2 2016-12-08 10:12:00 2017-03-07 12:03:00 1
8 2 2017-07-16 11:07:00 2017-07-18 11:07:00 1
9 2 2017-07-16 16:07:00 2017-09-20 12:09:00 2
Upvotes: 3