Reputation: 143
Can someone help me with this problem? I need to create a column Numb_times_with_overlap that counts the number of times that each machine is working at the same time. For that, I need to compare the time intervals worked for each machine and analyze if there are overlaps. The idea would be to obtain a table similar to this
Code in R:
datateste <- data.frame(id = 1:9,
machine_id = c("A","B","C","A","A","B","C","B","A"),
start_time = c("01/12/2021 00:00:00","01/12/2021 04:15:10","01/12/2021 00:15:00","01/12/2021 00:05:07","01/12/2021 00:35:00","01/12/2021 04:00:00","01/12/2021 04:07:00","01/12/2021 04:44:34","01/12/2021 00:15:00"),
end_time = c("01/12/2021 00:10:10","01/12/2021 04:45:03","01/12/2021 00:30:53","01/12/2021 00:30:02","01/12/2021 00:39:00","01/12/2021 04:12:45","01/12/2021 04:34:00","01/12/2021 05:06:34","01/12/2021 00:35:00"))
datateste
Upvotes: 1
Views: 1003
Reputation: 3636
First I would recommend using lubridate
since this involves time intervals and purrr
since you are essentially needing iterate through the individual intervals to compare against the list.
Two notes based on your desired outcome. First, lubridate
views intervals as inclusive of their boundaries, i.e. if two intervals share a boundary they overlap. This means, for example, 5A and 9A overlap because one starts at 00:35:00 and the other ends at 00:35:00. Second, while it's not clear from your problem description, it appears you only want to compare across the same machine. For example, it doesn't count that 4A is running at the same time as 3C.
library(dplyr)
library(lubridate)
library(purrr)
datateste %>%
mutate(
run_interval = interval(dmy_hms(start_time), dmy_hms(end_time)),
numb_times_with_overlap = imap_int(
run_interval,
~sum(
int_overlaps(.x, run_interval) &
(machine_id == machine_id[.y])
) - 1L
)
) %>%
select(-run_interval)
#> id machine_id start_time end_time numb_times_with_overlap
#> 1 1 A 01/12/2021 00:00:00 01/12/2021 00:10:10 1
#> 2 2 B 01/12/2021 04:15:10 01/12/2021 04:45:03 1
#> 3 3 C 01/12/2021 00:15:00 01/12/2021 00:30:53 0
#> 4 4 A 01/12/2021 00:05:07 01/12/2021 00:30:02 2
#> 5 5 A 01/12/2021 00:35:00 01/12/2021 00:39:00 1
#> 6 6 B 01/12/2021 04:00:00 01/12/2021 04:12:45 0
#> 7 7 C 01/12/2021 04:07:00 01/12/2021 04:34:00 0
#> 8 8 B 01/12/2021 04:44:34 01/12/2021 05:06:34 1
#> 9 9 A 01/12/2021 00:15:00 01/12/2021 00:35:00 2
What's going on here is first, the we create a column of intervals (run_interval
). Then we can then iterate through this column using a map function, in this case imap_int
. This compares the row to the entire column to using int_overlap
, then includes only rows where machine_id
matches that row (the .y
argument is the index of the current row). Summing a logical vector gives you a count and you need to subtract 1 to account for self-matching.
The counts are different from your expected results because of the previously mentioned way lubridate
determines overlaps. I suppose a work around would be to add a microsecond to each start time if you don't want this definition of an interval.
Upvotes: 1