Reputation: 157
I have the following dataset,
D_DATE BIN Number Disposition Unit Assigned
2018-01-04 10005 SWO Issued PLUMBING DIVISION
2016-06-23 10005 SWO Issued SCAFFOLD UNIT
2016-06-23 10005 SWO Rescinded SCAFFOLD UNIT
2018-01-17 10005 SWO Rescinded PLUMBING DIVISION
2019-01-04 10006 SWO Rescinded BEST SQUAD
2018-12-21 10006 SWO Issued BEST SQUAD
I want to create a new field name "duration" which will capture the time difference between SWO issued and rescinded for each BIN number. Note that, each BIN number can show up multiple times based on Date and different Unit. So, each unit can issue SWO on the same BIN number.
I am trying to get a output like this which will calculate the date difference in days on the BIN number by Unit and based on SWO Issued and Rescinded.
Output:
D_DATE BIN Number Disposition Unit Assigned Duration
2018-01-04 10005 SWO Issued PLUMBING DIVISION 13 Days
2016-06-23 10005 SWO Issued SCAFFOLD UNIT 0 days
2016-06-23 10005 SWO Rescinded SCAFFOLD UNIT 0 days
2018-01-17 10005 SWO Rescinded PLUMBING DIVISION 13 days
2019-01-04 10006 SWO Rescinded BEST SQUAD 14 days
2018-12-21 10006 SWO Issued BEST SQUAD 14 days
I used this but it groupby on Unit rather than giving me individual BIN Number by each unit and their dispositions status,
df2 = df2.groupby('BIN Number')['D_DATE'].agg(['max','min'])
df2["Duration"] = df2['max'].sub(df2['min']).dt.days
print(df2)
Appreciate any help
Upvotes: 1
Views: 72
Reputation: 35205
Sort them with Disposition
and use shift() to extract the difference. It then extracts even-numbered lines and updates odd-numbered lines.
import pandas as pd
import io
data = '''
D_DATE "BIN Number" Disposition "Unit Assigned"
2018-01-04 10005 "SWO Issued" "PLUMBING DIVISION"
2016-06-23 10005 "SWO Issued" "SCAFFOLD UNIT"
2016-06-23 10005 "SWO Rescinded" "SCAFFOLD UNIT"
2018-01-17 10005 "SWO Rescinded" "PLUMBING DIVISION"
2019-01-04 10006 "SWO Rescinded" "BEST SQUAD"
2018-12-21 10006 "SWO Issued" "BEST SQUAD"
'''
df = pd.read_csv(io.StringIO(data), sep=' ')
df['D_DATE'] = pd.to_datetime(df['D_DATE'])
# upper code append
df.sort_values(['D_DATE', 'BIN Number', 'Disposition'], inplace=True)
df.reset_index(inplace=True)
df['Duration'] = df['D_DATE'] - df['D_DATE'].shift(1)
df['Duration'][::2] = df['Duration'][1::2]
df.sort_values('index', inplace=True)
df.set_index('index', drop=True)
D_DATE BIN Number Disposition Unit Assigned Duration
index
0 2018-01-04 10005 SWO Issued PLUMBING DIVISION 13 days
1 2016-06-23 10005 SWO Issued SCAFFOLD UNIT 0 days
2 2016-06-23 10005 SWO Rescinded SCAFFOLD UNIT 0 days
3 2018-01-17 10005 SWO Rescinded PLUMBING DIVISION 13 days
4 2019-01-04 10006 SWO Rescinded BEST SQUAD 14 days
5 2018-12-21 10006 SWO Issued BEST SQUAD 14 days
Upvotes: 1
Reputation: 23099
IIUC,
you need to groupby and use diff
to calculate the delta in time values,
you can then merge onto your original df.
m = df.groupby(['BIN Number','Disposition','Unit Assigned']).agg(date_min=('D_DATE','min'),
date_max=('D_DATE','max')).groupby(level=[0,2])\
.diff().reset_index().dropna().drop('Disposition',axis=1)
df2 = pd.merge(df,m,on=['BIN Number','Unit Assigned'],how='left')
print(df2)
D_DATE BIN Number Disposition Unit Assigned date_min date_max
0 2018-01-04 10005 SWO Issued PLUMBING DIVISION 13 days 13 days
1 2016-06-23 10005 SWO Issued SCAFFOLD UNIT 0 days 0 days
2 2016-06-23 10005 SWO Rescinded SCAFFOLD UNIT 0 days 0 days
3 2018-01-17 10005 SWO Rescinded PLUMBING DIVISION 13 days 13 days
4 2019-01-04 10006 SWO Rescinded BEST SQUAD 14 days 14 days
5 2018-12-21 10006 SWO Issued BEST SQUAD 14 days 14 days
Upvotes: 0