Reputation: 351
I am using pandas dataframe for the following:
I am trying to find the best way to determine the total time spent by a ship at a particular berth taking into account overlap in the duration of the visit. Here is what the data looks like:
IN OUT BERTH
2015-01-14 13:57:00 2015-01-15 17:15:00 01
2015-01-14 14:30:00 2015-01-15 02:50:00 01
2015-01-14 14:30:00 2015-01-16 06:10:00 01
2015-01-25 02:15:00 2015-01-26 13:41:00 01
what exactly i want is to find out is the total time a particular berth was used. So looking at the data there are overlaps and so i cant simply add the times for each record.
looking at the above data we can see that the 2nd ship timing is completely within the first so the time recorded for that would be 0, and that the 3rd ship comes before the first but stays till after the 1st one leaves so here the time would be = (out of 3rd ship - in of 1st) and then we would move to the next one as there is no overlap there and simply add [out of 4 - in of 4] to the total time spent on berth 1, and continue this till the end producing a something like this :
BERTH HOURS WORKED
01 7.750
02 10.275
03 5.585
08 31.980
Upvotes: 1
Views: 408
Reputation: 57033
Here's a solution for one berth. I hope you can expand it to multiple berths.
Split the dataframe into arrivals and departures:
dfIN = df[['IN']]
dfOUT = df[['OUT']]
dfIN['direction'] = 1
dfOUT['diretcion'] = -1
As of now, IN and OUT are just timestamps:
dfIN.columns = ('TS', 'direction')
dfOUT.columns = ('TS', 'direction')
Combine the two parts into one tall sorted dataframe of traffic:
traffic = pd.concat([dfIN, dfOUT]).sort_values('TS')
# TS direction
#0 2015-01-14 13:57:00 1
#1 2015-01-14 14:30:00 1
#2 2015-01-14 14:30:00 1
#1 2015-01-15 02:50:00 -1
#0 2015-01-15 17:15:00 -1
#2 2015-01-16 06:10:00 -1
#3 2015-01-25 02:15:00 1
#3 2015-01-26 13:41:00 -1
Calculate the number of ships at the berth, as they arrive and depart:
traffic['ships'] = traffic['direction'].cumsum()
Identify time periods when the berth is empty. Then calculate the sequence number of each "busy period."
traffic['empty'] = (traffic['ships'] == 0).shift().fillna(0).astype(int)
traffic['busy_id'] = traffic['empty'].cumsum()
# TS direction ships empty busy_id
#0 2015-01-14 13:57:00 1 1 0 0
#1 2015-01-14 14:30:00 1 2 0 0
#2 2015-01-14 14:30:00 1 3 0 0
#1 2015-01-15 02:50:00 -1 2 0 0
#0 2015-01-15 17:15:00 -1 1 0 0
#2 2015-01-16 06:10:00 -1 0 0 0
#3 2015-01-25 02:15:00 1 1 1 1
#3 2015-01-26 13:41:00 -1 0 0 1
Calculate the start and end of each "busy period":
busy_data = traffic.groupby('busy_id')['TS'].agg([min, max])
# min max
#busy_id
#0 2015-01-14 13:57:00 2015-01-16 06:10:00
#1 2015-01-25 02:15:00 2015-01-26 13:41:00
Calculate the total length of all "busy periods":
(busy_data['max'] - busy_data['min']).sum()
#Timedelta('3 days 03:39:00')
Upvotes: 1