Reputation: 49
I have a pandas dataframe with over 100 timestamps that defines the non-working-time of a machine:
>>> off_time
date (index) start end
2020-07-04 18:00:00 23:50:00
2020-08-24 00:00:00 08:00:00
2020-08-24 14:00:00 16:00:00
2020-09-04 00:00:00 23:59:59
2020-10-05 18:00:00 22:00:00
I also have a second dataframe (called data) with over 1000 timestamps defining the duration of some processes:
>>> data
process-name start-time end-time duration
name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 day 14:00:00
name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 14:00:00
name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 12:00:00
name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 02:50:00
name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 day 09:00:00
In order to get the effective working time for each process in data, I now have to subtract the non-working time from the duration. For example, I have to subtract the time between 18 and 20 for the process "Name 5", since this time is planned as non-working time.
I wrote a code with many if-else conditions, which I see as a potential source of errors! Is there a clean way to calculate effective time without using too many if-else? Any help would be greatly appreciated.
Upvotes: 1
Views: 93
Reputation: 880
Set up sample data (I added a couple of rows to your samples to include some edge cases):
######### OFF TIMES
off = pd.DataFrame([
["2020-07-04", dt.time(18), dt.time(23,50)],
["2020-08-24", dt.time(0), dt.time(8)],
["2020-08-24", dt.time(14), dt.time(16)],
["2020-09-04", dt.time(0), dt.time(23,59,59)],
["2020-10-04", dt.time(15), dt.time(18)],
["2020-10-05", dt.time(18), dt.time(22)]], columns= ["date", "start", "end"])
off["date"] = pd.to_datetime(off["date"])
off = off.set_index("date")
### Convert start and end times to datetimes in UTC timezone, since that is much
### easier to handle and fits the other data
off["start"] = pd.to_datetime(off.index.astype("string") + " " + off.start.astype("string")+"+00:00")
off["end"] = pd.to_datetime(off.index.astype("string") + " " + off.end.astype("string")+"+00:00")
off
>>
start end
date
2020-07-04 2020-07-04 18:00:00+00:00 2020-07-04 23:50:00+00:00
2020-08-24 2020-08-24 00:00:00+00:00 2020-08-24 08:00:00+00:00
2020-08-24 2020-08-24 14:00:00+00:00 2020-08-24 16:00:00+00:00
2020-09-04 2020-09-04 00:00:00+00:00 2020-09-04 23:59:59+00:00
2020-10-04 2020-10-04 15:00:00+00:00 2020-10-04 18:00:00+00:00
2020-10-05 2020-10-05 18:00:00+00:00 2020-10-05 22:00:00+00:00
######### PROCESS TIMES
data = pd.DataFrame([
["name1","2020-07-17 08:00:00+00:00","2020-07-18 22:00:00+00:00"],
["name2","2020-08-24 01:00:00+00:00","2020-08-24 12:00:00+00:00"],
["name3","2020-09-20 07:00:00+00:00","2020-09-20 19:00:00+00:00"],
["name4","2020-09-04 16:00:00+00:00","2020-09-04 18:50:00+00:00"],
["name5","2020-10-04 11:00:00+00:00","2020-10-05 20:00:00+00:00"],
["name6","2020-09-03 10:00:00+00:00","2020-09-06 05:00:00+00:00"]
], columns = ["process", "start", "end"])
data["start"] = pd.to_datetime(data["start"])
data["end"] = pd.to_datetime(data["end"])
data["duration"] = data.end -data.start
data
>>
process start end duration
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00
As you can see, I added a row to off
on 2020-10-04, so that name5 has 2 off times, which could happen in your data and would need to be handled correctly. (this means that in the example in your question, you need to subtract 5 hours instead of 2)
I also added the process name6 which is multiple days long.
This is my solution, which will be applied to each row in data
def get_relevant_off(pr):
relevant = off[off.end.gt(pr["start"]) & off.start.lt(pr["end"])].copy()
if not relevant.empty:
relevant.loc[relevant["start"].lt(pr["start"]), "start"] = pr["start"]
relevant.loc[relevant["end"].gt(pr["end"]), "end"] = pr["end"]
to_subtract = (relevant.end - relevant.start).sum()
return pr["duration"] - to_subtract
else: return pr.duration
Explanation:
off
, based on the row pr
data["effective"] = data.apply(get_relevant_off, axis= 1)
data
>>
process start end duration effective
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00 0 days 04:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00 0 days 00:00:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00 1 days 04:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00 1 days 19:00:01
Caveat: I am assuming that off times never overlap. Also, I liked this problem, but don't have any more time to spend on testing this, so let me know if I overlooked some edge cases that break it and I will try to find the time to fix it.
Upvotes: 1