Iyad Al aqel
Iyad Al aqel

Reputation: 2022

pass multiple params to a groupby function in pandas

I wanna do a custom function over a groupby, so for example if my data has the following format.

personid  jobid  start_date  end_date
1           1    2015-01-01  2016-01-30 
1           2    2016-01-01  2017-01-01

I wanna compute the overlap between the two dates of the two different jobs for the same person. Would it be wise to use

df.groupby(personid).agg(x)

But then how would i reference both start date and end date for different records in the function x.

The output of code would be something like

personid  overlap
1         30

Upvotes: 0

Views: 70

Answers (1)

jezrael
jezrael

Reputation: 863301

I think you need groupby with custom function where select first and last value of start and end datetime, get date_range and then find length of intersection by numpy.intersect1d:

def f(x):
    a = pd.date_range(x['start_date'].iat[0], x['end_date'].iat[0], unit='d')
    b = pd.date_range(x['start_date'].iat[-1], x['end_date'].iat[-1], unit='d')
    return pd.Series(len(np.intersect1d(a,b)), index=['overlap'])

df = df.groupby('personid').apply(f).reset_index()
print (df)
   personid  overlap
0         1      366
1         2        6

Sample:

df = pd.DataFrame({'start_date': [pd.Timestamp('2015-01-01 00:00:00'), pd.Timestamp('2015-01-01 00:00:00'), pd.Timestamp('2015-01-01 00:00:00'), pd.Timestamp('2015-01-05 00:00:00')], 'personid': [1, 1, 2, 2], 'end_date': [pd.Timestamp('2016-01-30 00:00:00'), pd.Timestamp('2016-01-01 00:00:00'), pd.Timestamp('2015-01-25 00:00:00'), pd.Timestamp('2015-01-10 00:00:00')], 'jobid': [1, 2, 1, 2]})
print (df)
    end_date  jobid  personid start_date
0 2016-01-30      1         1 2015-01-01
1 2016-01-01      2         1 2015-01-01
2 2015-01-25      1         2 2015-01-01
3 2015-01-10      2         2 2015-01-05

Upvotes: 2

Related Questions