Reputation: 2022
I wanna do a custom function over a groupby, so for example if my data has the following format.
personid jobid start_date end_date
1 1 2015-01-01 2016-01-30
1 2 2016-01-01 2017-01-01
I wanna compute the overlap between the two dates of the two different jobs for the same person. Would it be wise to use
df.groupby(personid).agg(x)
But then how would i reference both start date and end date for different records in the function x.
The output of code would be something like
personid overlap
1 30
Upvotes: 0
Views: 70
Reputation: 863301
I think you need groupby
with custom function where select first and last value of start
and end
datetime, get date_range
and then find length
of intersection by numpy.intersect1d
:
def f(x):
a = pd.date_range(x['start_date'].iat[0], x['end_date'].iat[0], unit='d')
b = pd.date_range(x['start_date'].iat[-1], x['end_date'].iat[-1], unit='d')
return pd.Series(len(np.intersect1d(a,b)), index=['overlap'])
df = df.groupby('personid').apply(f).reset_index()
print (df)
personid overlap
0 1 366
1 2 6
Sample:
df = pd.DataFrame({'start_date': [pd.Timestamp('2015-01-01 00:00:00'), pd.Timestamp('2015-01-01 00:00:00'), pd.Timestamp('2015-01-01 00:00:00'), pd.Timestamp('2015-01-05 00:00:00')], 'personid': [1, 1, 2, 2], 'end_date': [pd.Timestamp('2016-01-30 00:00:00'), pd.Timestamp('2016-01-01 00:00:00'), pd.Timestamp('2015-01-25 00:00:00'), pd.Timestamp('2015-01-10 00:00:00')], 'jobid': [1, 2, 1, 2]})
print (df)
end_date jobid personid start_date
0 2016-01-30 1 1 2015-01-01
1 2016-01-01 2 1 2015-01-01
2 2015-01-25 1 2 2015-01-01
3 2015-01-10 2 2 2015-01-05
Upvotes: 2