Reputation: 477
I have two datasets:
I need to remove these vacation periods from working periods as shown on the illustration
I find ivs package promising, I was able to merge intervals but couldn't remove overlaps. How to do it?
library(tidyverse)
library(ivs)
worked = read.csv("https://www.dropbox.com/scl/fi/ctu6d8nb193ub0t5wawdf/worked.csv?rlkey=3zsbe8gh483hb0daazrrwl9ko&dl=1")
vacation = read.csv("https://www.dropbox.com/scl/fi/xn76ojwmewpcz71tcivgs/vacation.csv?rlkey=kmllrgvgr6ix0z2wq6roxin8f&dl=1")
This is how the data looks for one person
Upvotes: 1
Views: 94
Reputation: 17544
With ivs
we can use iv_set_difference()
to find non-overlapping intervals; first we could create list columns with intervals for each id in both datasets and join, then iterate through worked-vacation interval set pairs with purrr::map2()
to get set differences for every id. And finally unnest and extract start & end values.
library(tidyverse)
library(ivs)
left_join(
summarise(worked, wrk_iv = iv(start, end) |> list(), .by = id),
summarise(vacation, vac_iv = iv(start, end) |> list(), .by = id),
) |>
# check join
# print() |>
#> Joining with `by = join_by(id)`
#> # A tibble: 4 × 3
#> id wrk_iv vac_iv
#> <chr> <list> <list>
#> 1 ID100045 <iv<date> [2]> <iv<date> [3]>
#> 2 ID100082 <iv<date> [1]> <iv<date> [1]>
#> 3 ID100089 <iv<date> [1]> <NULL>
#> 4 ID100093 <iv<date> [2]> <iv<date> [1]>
#> # A tibble: 9 × 4
mutate(diff = map2(wrk_iv, vac_iv, iv_set_difference)) |>
select(id, diff) |>
unnest(diff) |>
mutate(start = iv_start(diff), end = iv_end(diff))
Result:
#> id diff start end
#> <chr> <iv<date>> <date> <date>
#> 1 ID100045 [2019-06-04, 2020-07-08) 2019-06-04 2020-07-08
#> 2 ID100045 [2020-11-23, 2020-11-30) 2020-11-23 2020-11-30
#> 3 ID100045 [2020-12-29, 2021-02-03) 2020-12-29 2021-02-03
#> 4 ID100045 [2021-04-25, 2022-04-10) 2021-04-25 2022-04-10
#> 5 ID100045 [2022-05-10, 9999-10-31) 2022-05-10 9999-10-31
#> 6 ID100082 [2007-11-22, 2021-11-11) 2007-11-22 2021-11-11
#> 7 ID100089 [2019-09-02, 9999-10-31) 2019-09-02 9999-10-31
#> 8 ID100093 [2018-05-01, 2020-12-08) 2018-05-01 2020-12-08
#> 9 ID100093 [2021-11-26, 9999-10-31) 2021-11-26 9999-10-31
Example data:
worked <-
read_csv("id,start,end
ID100045,2019-06-04,2020-11-30
ID100045,2020-12-29,9999-10-31
ID100082,2007-11-22,2021-11-11
ID100089,2019-09-02,9999-10-31
ID100093,2018-05-01,2020-12-08
ID100093,2021-06-29,9999-10-31", show_col_types = FALSE)
vacation <-
read_csv("id,start,end
ID100045,2020-07-08,2020-11-23
ID100045,2021-02-03,2021-04-25
ID100045,2022-04-10,2022-05-10
ID100082,2021-12-12,2022-07-12
ID100093,2021-01-12,2021-11-26", show_col_types = FALSE)
Created on 2024-10-16 with reprex v2.1.1
Upvotes: 2
Reputation: 1712
Caveat: This works with your example data, but it might warrant some further testing on some odd cases. Specifically, it assumes that all vacation periods are contained within the working periods, and I have not tested what happens when that is not true.
First, create the data.frame
s, since we don't have your CSV files. Note that you need to convert the date columns from character
to Date
.
worked <- data.frame(
ID = c("ID100045", "ID100045"),
start = as.Date(c("2019-06-04", "2020-12-29")),
end = as.Date(c("2020-11-30", "9999-10-31")))
vacation <- data.frame(
ID = c("ID100045", "ID100045", "ID100045"),
start = as.Date(c("2020-07-08", "2021-02-03", "2022-04-10")),
end = as.Date(c("2020-11-23", "2021-04-25", "2022-05-10")))
Build a "timeline" with all dates in sequence. We'll add a column called at_work
indicating when the person started or stopped working on that date:
library(tidyr)
library(dplyr)
timeline <-
bind_rows(
pivot_longer(worked, -ID) %>%
mutate(at_work = name == "start"),
pivot_longer(vacation, -ID) %>%
mutate(at_work = name == "end")
) %>%
arrange(ID, value)
> timeline
# A tibble: 10 × 4
ID name value at_work
<chr> <chr> <date> <lgl>
1 ID100045 start 2019-06-04 TRUE
2 ID100045 start 2020-07-08 FALSE
3 ID100045 end 2020-11-23 TRUE
4 ID100045 end 2020-11-30 FALSE
5 ID100045 start 2020-12-29 TRUE
6 ID100045 start 2021-02-03 FALSE
7 ID100045 end 2021-04-25 TRUE
8 ID100045 start 2022-04-10 FALSE
9 ID100045 end 2022-05-10 TRUE
10 ID100045 end 2024-10-16 FALSE
Check if the person started working and was not working on the previous date (as a safeguard against spurious consecutive at_work == TRUE
rows), and calculate the cumulative sum of that. This will produce a column indicating each separate "working period", numbering them sequentially.
Then, take each of those periods and get the start and the end of each.
timeline %>%
group_by(ID) %>%
mutate(period = cumsum(at_work & !lead(at_work))) %>%
group_by(ID, period) %>%
summarise(start = min(value), end = max(value))
# A tibble: 5 × 4
# Groups: ID [1]
ID period start end
<chr> <int> <date> <date>
1 ID100045 1 2019-06-04 2020-07-08
2 ID100045 2 2020-11-23 2020-11-30
3 ID100045 3 2020-12-29 2021-02-03
4 ID100045 4 2021-04-25 2022-04-10
5 ID100045 5 2022-05-10 2024-10-16
If you are entirely sure that all vacation periods start and end within a working period, this can be done by simply using pivot_wider
and unnest
:
timeline %>%
select(-name) %>%
pivot_wider(names_from = at_work, values_from = value) %>%
unnest(cols = c(`TRUE`, `FALSE`))
# A tibble: 5 × 3
ID `TRUE` `FALSE`
<chr> <date> <date>
1 ID100045 2019-06-04 2020-07-08
2 ID100045 2020-11-23 2020-11-30
3 ID100045 2020-12-29 2021-02-03
4 ID100045 2021-04-25 2022-04-10
5 ID100045 2022-05-10 2024-10-16
Upvotes: 1