Puckz
Puckz

Reputation: 43

Generate new variable based on start and stop date in dplyr

I am in need of assistance. I have some data that looks similar to this.

   Machine   Start      Stop           ServiceType 
1       XX 2014-12-04       <NA>          AA
2       XX 2013-09-05 2013-11-05          BB
3       XX 2013-11-21 2014-09-25          BB
4       XX 2013-10-11 2014-11-18          BB
5       XX 2021-12-03       <NA>          AA
6       XX 2020-08-06 2022-09-15          AA
7       XX 2021-06-10       <NA>          BB
8       YY 2020-01-17       <NA>          BB
9       YY 2015-11-04 2018-04-30          BB
10      YY 2016-05-28 2019-03-21          BB
11      YY 2019-09-27       <NA>          BB
12      YY 2018-01-05       <NA>          AA

So what I would like to do is generate a new variable, say Maintenance or something, that's either AA or BB if only one service types is active or CC if both AA and BB are overlapping. Like,

   Machine   Date          Maintenance        
1       XX 2013-09-05          BB
2       XX 2013-11-21          BB
3       XX 2013-10-11          AA
4       XX 2014-12-04          CC   
5       XX 2021-12-03          AA
6       YY 2015-11-04          BB
7       YY 2016-05-28          CC
8       YY 2020-01-17          BB

I have been working with dplyr and lubridate but I am a bit unsure how to perform this task, any help would be kind.

ps. NA's in this case can be considered a machine is on that service forever.

Upvotes: 4

Views: 450

Answers (1)

Ma&#235;l
Ma&#235;l

Reputation: 51994

There are some inconsistencies in your question (some variables do overlap but they are considered separate in your expected output), here's a way to group by overlapping values and get somewhat your expected output. This solution uses the ivs, tidyverse, and lubridate libraries:

library(ivs)
library(tidyverse)
library(lubridate)

df %>% 
  mutate(Stop = ifelse(Stop == "<NA>", Start, Stop),
         across(c(Start, Stop), ymd),
         Stop = if_else(Stop == Start, Stop + days(1), Stop),
         ivs = iv(Start, Stop)) %>% 
  group_by(Machine, gp = iv_identify_group(ivs)) %>% 
  summarise(ServiceType = toString(unique(ServiceType)),) %>% 
  ungroup() %>% 
  mutate(gp = iv_start(gp),
         ServiceType = ifelse(ServiceType %in% c("BB, AA", "AA, BB"), "CC", ServiceType))

# A tibble: 6 × 3
  gp         Machine ServiceType
  <date>     <chr>   <chr>      
1 2013-09-05 XX      BB         
2 2014-12-04 XX      AA         
3 2020-08-06 XX      CC         
4 2015-11-04 YY      CC         
5 2019-09-27 YY      BB         
6 2020-01-17 YY      BB         

data

df <- read.table(header = T, text = "   Machine   Start      Stop           ServiceType 
1       XX 2014-12-04       NA          AA
2       XX 2013-09-05 2013-11-05          BB
3       XX 2013-11-21 2014-09-25          BB
4       XX 2013-10-11 2014-11-18          BB
5       XX 2021-12-03       <NA>          AA
6       XX 2020-08-06 2022-09-15          AA
7       XX 2021-06-10       <NA>          BB
8       YY 2020-01-17       <NA>          BB
9       YY 2015-11-04 2018-04-30          BB
10      YY 2016-05-28 2019-03-21          BB
11      YY 2019-09-27       <NA>          BB
12      YY 2018-01-05       <NA>          AA
")

Upvotes: 4

Related Questions