Atefe Nematolahi
Atefe Nematolahi

Reputation: 49

calculate effective time of a process by subtracting non-working-time

I have a pandas dataframe with over 100 timestamps that defines the non-working-time of a machine:

>>> off_time

date (index)   start    end
2020-07-04    18:00:00  23:50:00    
2020-08-24    00:00:00  08:00:00
2020-08-24    14:00:00  16:00:00
2020-09-04    00:00:00  23:59:59
2020-10-05    18:00:00  22:00:00

I also have a second dataframe (called data) with over 1000 timestamps defining the duration of some processes:

>>> data

process-name          start-time                        end-time             duration
   name1     2020-07-17 08:00:00+00:00        2020-07-18 22:00:00+00:00     1 day 14:00:00
   name2     2020-08-24 01:00:00+00:00        2020-08-24 12:00:00+00:00     14:00:00
   name3     2020-09-20 07:00:00+00:00        2020-09-20 19:00:00+00:00     12:00:00
   name4     2020-09-04 16:00:00+00:00        2020-09-04 18:50:00+00:00     02:50:00
   name5     2020-10-04 11:00:00+00:00        2020-10-05 20:00:00+00:00     1 day 09:00:00

In order to get the effective working time for each process in data, I now have to subtract the non-working time from the duration. For example, I have to subtract the time between 18 and 20 for the process "Name 5", since this time is planned as non-working time.

I wrote a code with many if-else conditions, which I see as a potential source of errors! Is there a clean way to calculate effective time without using too many if-else? Any help would be greatly appreciated.

Upvotes: 1

Views: 93

Answers (1)

Stryder
Stryder

Reputation: 880

Set up sample data (I added a couple of rows to your samples to include some edge cases):

######### OFF TIMES
off = pd.DataFrame([
    ["2020-07-04",    dt.time(18),  dt.time(23,50)],
    ["2020-08-24",    dt.time(0),  dt.time(8)],
    ["2020-08-24",    dt.time(14),  dt.time(16)],
    ["2020-09-04",    dt.time(0),  dt.time(23,59,59)],
    ["2020-10-04", dt.time(15), dt.time(18)],
    ["2020-10-05",    dt.time(18),  dt.time(22)]], columns= ["date", "start", "end"])
off["date"] = pd.to_datetime(off["date"])
off = off.set_index("date")

### Convert start and end times to datetimes in UTC timezone, since that is much 
### easier to handle and fits the other data
off["start"] = pd.to_datetime(off.index.astype("string") + " " + off.start.astype("string")+"+00:00")
off["end"] = pd.to_datetime(off.index.astype("string") + " " + off.end.astype("string")+"+00:00")

off
>>
                               start                       end
date                                                          
2020-07-04 2020-07-04 18:00:00+00:00 2020-07-04 23:50:00+00:00
2020-08-24 2020-08-24 00:00:00+00:00 2020-08-24 08:00:00+00:00
2020-08-24 2020-08-24 14:00:00+00:00 2020-08-24 16:00:00+00:00
2020-09-04 2020-09-04 00:00:00+00:00 2020-09-04 23:59:59+00:00
2020-10-04 2020-10-04 15:00:00+00:00 2020-10-04 18:00:00+00:00
2020-10-05 2020-10-05 18:00:00+00:00 2020-10-05 22:00:00+00:00

######### PROCESS TIMES
data = pd.DataFrame([
   ["name1","2020-07-17 08:00:00+00:00","2020-07-18 22:00:00+00:00"],
   ["name2","2020-08-24 01:00:00+00:00","2020-08-24 12:00:00+00:00"],
   ["name3","2020-09-20 07:00:00+00:00","2020-09-20 19:00:00+00:00"],
   ["name4","2020-09-04 16:00:00+00:00","2020-09-04 18:50:00+00:00"],
   ["name5","2020-10-04 11:00:00+00:00","2020-10-05 20:00:00+00:00"],
   ["name6","2020-09-03 10:00:00+00:00","2020-09-06 05:00:00+00:00"]
], columns = ["process", "start", "end"])

data["start"] = pd.to_datetime(data["start"])
data["end"] = pd.to_datetime(data["end"])

data["duration"] = data.end -data.start
data
>>
  process                     start                       end        duration
0   name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00
1   name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00
2   name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00
3   name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00
4   name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00
5   name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00

As you can see, I added a row to off on 2020-10-04, so that name5 has 2 off times, which could happen in your data and would need to be handled correctly. (this means that in the example in your question, you need to subtract 5 hours instead of 2) I also added the process name6 which is multiple days long.

This is my solution, which will be applied to each row in data

def get_relevant_off(pr):
    relevant = off[off.end.gt(pr["start"]) & off.start.lt(pr["end"])].copy()
    if not relevant.empty:
        relevant.loc[relevant["start"].lt(pr["start"]), "start"] = pr["start"]
        relevant.loc[relevant["end"].gt(pr["end"]), "end"] = pr["end"]
        to_subtract = (relevant.end - relevant.start).sum()
        return pr["duration"] - to_subtract
    else: return pr.duration

Explanation:

  • first row in the function subsets the relevant rows of off, based on the row pr
  • replace off starts that are lower than process starts with process starts and do the same with ends, since we don't want to sum the whole off time, but only what is actually at the same time as the process.
  • get the duration of off times by subtracting off starts from off ends and sum those
  • then subtract that from the total duration.
data["effective"] = data.apply(get_relevant_off, axis= 1)
data
>>
  process   start                       end                         duration            effective
0   name1   2020-07-17 08:00:00+00:00   2020-07-18 22:00:00+00:00   1 days 14:00:00     1 days 14:00:00
1   name2   2020-08-24 01:00:00+00:00   2020-08-24 12:00:00+00:00   0 days 11:00:00     0 days 04:00:00
2   name3   2020-09-20 07:00:00+00:00   2020-09-20 19:00:00+00:00   0 days 12:00:00     0 days 12:00:00
3   name4   2020-09-04 16:00:00+00:00   2020-09-04 18:50:00+00:00   0 days 02:50:00     0 days 00:00:00
4   name5   2020-10-04 11:00:00+00:00   2020-10-05 20:00:00+00:00   1 days 09:00:00     1 days 04:00:00
5   name6   2020-09-03 10:00:00+00:00   2020-09-06 05:00:00+00:00   2 days 19:00:00     1 days 19:00:01

Caveat: I am assuming that off times never overlap. Also, I liked this problem, but don't have any more time to spend on testing this, so let me know if I overlooked some edge cases that break it and I will try to find the time to fix it.

Upvotes: 1

Related Questions