st4co4
st4co4

Reputation: 477

How to remove overlaps from intervals?

I have two datasets:

  1. one consisting time periods when persons worked
  2. second consisting time periods when persons were on vacation

I need to remove these vacation periods from working periods as shown on the illustration

enter image description here

I find ivs package promising, I was able to merge intervals but couldn't remove overlaps. How to do it?

library(tidyverse)
library(ivs)

worked = read.csv("https://www.dropbox.com/scl/fi/ctu6d8nb193ub0t5wawdf/worked.csv?rlkey=3zsbe8gh483hb0daazrrwl9ko&dl=1")
vacation = read.csv("https://www.dropbox.com/scl/fi/xn76ojwmewpcz71tcivgs/vacation.csv?rlkey=kmllrgvgr6ix0z2wq6roxin8f&dl=1")

This is how the data looks for one person

enter image description here

Upvotes: 1

Views: 94

Answers (2)

margusl
margusl

Reputation: 17544

With ivs we can use iv_set_difference() to find non-overlapping intervals; first we could create list columns with intervals for each id in both datasets and join, then iterate through worked-vacation interval set pairs with purrr::map2() to get set differences for every id. And finally unnest and extract start & end values.

library(tidyverse)
library(ivs)

left_join(
  summarise(worked,   wrk_iv = iv(start, end) |> list(), .by = id),
  summarise(vacation, vac_iv = iv(start, end) |> list(), .by = id),
) |> 
  # check join
  # print() |>  
  #> Joining with `by = join_by(id)`
  #> # A tibble: 4 × 3
  #>   id       wrk_iv         vac_iv        
  #>   <chr>    <list>         <list>        
  #> 1 ID100045 <iv<date> [2]> <iv<date> [3]>
  #> 2 ID100082 <iv<date> [1]> <iv<date> [1]>
  #> 3 ID100089 <iv<date> [1]> <NULL>        
  #> 4 ID100093 <iv<date> [2]> <iv<date> [1]>
  #> # A tibble: 9 × 4

  mutate(diff = map2(wrk_iv, vac_iv, iv_set_difference)) |> 
  select(id, diff) |> 
  unnest(diff) |> 
  mutate(start = iv_start(diff), end = iv_end(diff))

Result:

#>   id                           diff start      end       
#>   <chr>                  <iv<date>> <date>     <date>    
#> 1 ID100045 [2019-06-04, 2020-07-08) 2019-06-04 2020-07-08
#> 2 ID100045 [2020-11-23, 2020-11-30) 2020-11-23 2020-11-30
#> 3 ID100045 [2020-12-29, 2021-02-03) 2020-12-29 2021-02-03
#> 4 ID100045 [2021-04-25, 2022-04-10) 2021-04-25 2022-04-10
#> 5 ID100045 [2022-05-10, 9999-10-31) 2022-05-10 9999-10-31
#> 6 ID100082 [2007-11-22, 2021-11-11) 2007-11-22 2021-11-11
#> 7 ID100089 [2019-09-02, 9999-10-31) 2019-09-02 9999-10-31
#> 8 ID100093 [2018-05-01, 2020-12-08) 2018-05-01 2020-12-08
#> 9 ID100093 [2021-11-26, 9999-10-31) 2021-11-26 9999-10-31

Example data:

worked <- 
read_csv("id,start,end
ID100045,2019-06-04,2020-11-30
ID100045,2020-12-29,9999-10-31
ID100082,2007-11-22,2021-11-11
ID100089,2019-09-02,9999-10-31
ID100093,2018-05-01,2020-12-08
ID100093,2021-06-29,9999-10-31", show_col_types = FALSE)

vacation <- 
read_csv("id,start,end
ID100045,2020-07-08,2020-11-23
ID100045,2021-02-03,2021-04-25
ID100045,2022-04-10,2022-05-10
ID100082,2021-12-12,2022-07-12
ID100093,2021-01-12,2021-11-26", show_col_types = FALSE)

Created on 2024-10-16 with reprex v2.1.1

Upvotes: 2

Z&#233; Loff
Z&#233; Loff

Reputation: 1712

Caveat: This works with your example data, but it might warrant some further testing on some odd cases. Specifically, it assumes that all vacation periods are contained within the working periods, and I have not tested what happens when that is not true.

First, create the data.frames, since we don't have your CSV files. Note that you need to convert the date columns from character to Date.

worked <- data.frame(
  ID = c("ID100045", "ID100045"),
  start = as.Date(c("2019-06-04", "2020-12-29")),
  end = as.Date(c("2020-11-30", "9999-10-31")))

vacation <- data.frame(
  ID = c("ID100045", "ID100045", "ID100045"),
  start = as.Date(c("2020-07-08", "2021-02-03", "2022-04-10")),
  end = as.Date(c("2020-11-23", "2021-04-25", "2022-05-10")))

Build a "timeline" with all dates in sequence. We'll add a column called at_work indicating when the person started or stopped working on that date:

library(tidyr)
library(dplyr)

timeline <-
  bind_rows(
    pivot_longer(worked, -ID) %>%
    mutate(at_work = name == "start"),
    pivot_longer(vacation, -ID) %>%
    mutate(at_work = name == "end")
  ) %>%
arrange(ID, value)
> timeline
# A tibble: 10 × 4
   ID       name  value      at_work
   <chr>    <chr> <date>     <lgl>
 1 ID100045 start 2019-06-04 TRUE
 2 ID100045 start 2020-07-08 FALSE
 3 ID100045 end   2020-11-23 TRUE
 4 ID100045 end   2020-11-30 FALSE
 5 ID100045 start 2020-12-29 TRUE
 6 ID100045 start 2021-02-03 FALSE
 7 ID100045 end   2021-04-25 TRUE
 8 ID100045 start 2022-04-10 FALSE
 9 ID100045 end   2022-05-10 TRUE
10 ID100045 end   2024-10-16 FALSE

Approach 1

Check if the person started working and was not working on the previous date (as a safeguard against spurious consecutive at_work == TRUE rows), and calculate the cumulative sum of that. This will produce a column indicating each separate "working period", numbering them sequentially. Then, take each of those periods and get the start and the end of each.

timeline %>%
  group_by(ID) %>%
  mutate(period = cumsum(at_work & !lead(at_work))) %>%
  group_by(ID, period) %>%
  summarise(start = min(value), end = max(value))
# A tibble: 5 × 4
# Groups:   ID [1]
  ID       period start      end
  <chr>     <int> <date>     <date>
1 ID100045      1 2019-06-04 2020-07-08
2 ID100045      2 2020-11-23 2020-11-30
3 ID100045      3 2020-12-29 2021-02-03
4 ID100045      4 2021-04-25 2022-04-10
5 ID100045      5 2022-05-10 2024-10-16

Approach 2 (unsafer)

If you are entirely sure that all vacation periods start and end within a working period, this can be done by simply using pivot_wider and unnest:

timeline %>%
  select(-name) %>%
  pivot_wider(names_from = at_work, values_from = value) %>%
  unnest(cols = c(`TRUE`, `FALSE`))
# A tibble: 5 × 3
  ID       `TRUE`     `FALSE`
  <chr>    <date>     <date>
1 ID100045 2019-06-04 2020-07-08
2 ID100045 2020-11-23 2020-11-30
3 ID100045 2020-12-29 2021-02-03
4 ID100045 2021-04-25 2022-04-10
5 ID100045 2022-05-10 2024-10-16

Upvotes: 1

Related Questions