RitaM
RitaM

Reputation: 143

Count how many times date ranges overlap for each category (machine)

Can someone help me with this problem? I need to create a column Numb_times_with_overlap that counts the number of times that each machine is working at the same time. For that, I need to compare the time intervals worked for each machine and analyze if there are overlaps. The idea would be to obtain a table similar to this

see expected result

Code in R:

datateste <- data.frame(id = 1:9,
                        machine_id = c("A","B","C","A","A","B","C","B","A"),
                        start_time = c("01/12/2021  00:00:00","01/12/2021  04:15:10","01/12/2021  00:15:00","01/12/2021  00:05:07","01/12/2021  00:35:00","01/12/2021  04:00:00","01/12/2021  04:07:00","01/12/2021  04:44:34","01/12/2021  00:15:00"),
                        end_time = c("01/12/2021  00:10:10","01/12/2021  04:45:03","01/12/2021  00:30:53","01/12/2021  00:30:02","01/12/2021  00:39:00","01/12/2021  04:12:45","01/12/2021  04:34:00","01/12/2021  05:06:34","01/12/2021  00:35:00"))

datateste

Upvotes: 1

Views: 1003

Answers (1)

Marcus
Marcus

Reputation: 3636

First I would recommend using lubridate since this involves time intervals and purrr since you are essentially needing iterate through the individual intervals to compare against the list.

Two notes based on your desired outcome. First, lubridate views intervals as inclusive of their boundaries, i.e. if two intervals share a boundary they overlap. This means, for example, 5A and 9A overlap because one starts at 00:35:00 and the other ends at 00:35:00. Second, while it's not clear from your problem description, it appears you only want to compare across the same machine. For example, it doesn't count that 4A is running at the same time as 3C.

library(dplyr)
library(lubridate)
library(purrr)

datateste %>% 
  mutate(
    run_interval = interval(dmy_hms(start_time), dmy_hms(end_time)),
    numb_times_with_overlap = imap_int(
      run_interval, 
      ~sum(
        int_overlaps(.x, run_interval) & 
          (machine_id == machine_id[.y])
        ) - 1L
    )
  ) %>% 
  select(-run_interval)

#>   id machine_id          start_time            end_time numb_times_with_overlap
#> 1  1          A 01/12/2021 00:00:00 01/12/2021 00:10:10                       1
#> 2  2          B 01/12/2021 04:15:10 01/12/2021 04:45:03                       1
#> 3  3          C 01/12/2021 00:15:00 01/12/2021 00:30:53                       0
#> 4  4          A 01/12/2021 00:05:07 01/12/2021 00:30:02                       2
#> 5  5          A 01/12/2021 00:35:00 01/12/2021 00:39:00                       1
#> 6  6          B 01/12/2021 04:00:00 01/12/2021 04:12:45                       0
#> 7  7          C 01/12/2021 04:07:00 01/12/2021 04:34:00                       0
#> 8  8          B 01/12/2021 04:44:34 01/12/2021 05:06:34                       1
#> 9  9          A 01/12/2021 00:15:00 01/12/2021 00:35:00                       2

What's going on here is first, the we create a column of intervals (run_interval). Then we can then iterate through this column using a map function, in this case imap_int. This compares the row to the entire column to using int_overlap, then includes only rows where machine_id matches that row (the .y argument is the index of the current row). Summing a logical vector gives you a count and you need to subtract 1 to account for self-matching.

The counts are different from your expected results because of the previously mentioned way lubridate determines overlaps. I suppose a work around would be to add a microsecond to each start time if you don't want this definition of an interval.

Upvotes: 1

Related Questions