Add a Conditionally-Advancing Counter to a Data Frame in R

Question

I have a data frame where each record represents a service provided to a customer over some interval of time. The data frame is sorted first by customer number, then by time interval.

I need to add a new column to the data frame that assigns a group number to the records for each client. The first record for a given client should always be in Group 1. The group number should advance for subsequent records only when a certain condition is satisfied by the record immediately preceding it.

Here is a minimal working example:

I want to add a column to this data frame so that it looks like this.

Here's the code to create the data frame in this minimal example:

library(lubridate)
df <- data.frame(rbind(c(1,"2010-01-01","2010-04-05",FALSE),
                   c(1,"2010-04-05","2010-08-09",FALSE),
                   c(1,"2010-08-09","2010-11-13",TRUE),
                   c(1,"2010-11-13","2011-03-14",FALSE),
                   c(1,"2011-03-14","2011-06-29",FALSE),
                   c(1,"2011-06-29","2011-09-30",TRUE),
                   c(1,"2011-09-30","2012-03-01",FALSE),
                   c(2,"2012-04-12","2012-05-31",FALSE),
                   c(2,"2012-05-31","2012-10-19",FALSE),
                   c(2,"2012-10-19","2013-01-16",FALSE),
                   c(3,"2009-06-11","2009-10-14",TRUE),
                   c(3,"2009-10-14","2009-12-03",FALSE)))
colnames(df) <- c("Client","StartDate","StopDate","Condition")
df$StartDate <- as.Date(df$StartDate,format="%Y-%m-%d")
df$StopDate <- as.Date(df$StopDate,format="%Y-%m-%d")
df$Condition <- as.logical(df$Condition)

I can easily do what I want with a loop,

df[1,"Group"] <- 1
for(i in 2:nrow(df)){
  df[i,"Group"] <- if(df[i,"Client"]!=df[i-1,"Client"]){1} else
                   if(df[i-1,"Condition"]){df[i-1,"Group"]+1} else{df[i-1,"Group"]}
}

but my data frame has millions of records, so I'd like to find a vectorized solution that is faster.

atiretoo · Accepted Answer

What you want is close to a grouped application of cumsum(). This works, but I don't know how fast it would be:

library(tidyverse)
df %>%
  group_by(Client) %>%
  mutate(group = cumsum(Condition) + 1,
         group = case_when(
           Condition ~ group - 1,
            !Condition ~ group))
#> # A tibble: 12 x 6
#> # Groups:   Client [3]
#>    Client StartDate  StopDate   Condition Group group
#>                      
#>  1 1      2010-01-01 2010-04-05 F          1.00  1.00
#>  2 1      2010-04-05 2010-08-09 F          1.00  1.00
#>  3 1      2010-08-09 2010-11-13 T          1.00  1.00
#>  4 1      2010-11-13 2011-03-14 F          2.00  2.00
#>  5 1      2011-03-14 2011-06-29 F          2.00  2.00
#>  6 1      2011-06-29 2011-09-30 T          2.00  2.00
#>  7 1      2011-09-30 2012-03-01 F          3.00  3.00
#>  8 2      2012-04-12 2012-05-31 F          1.00  1.00
#>  9 2      2012-05-31 2012-10-19 F          1.00  1.00
#> 10 2      2012-10-19 2013-01-16 F          1.00  1.00
#> 11 3      2009-06-11 2009-10-14 T          1.00  1.00
#> 12 3      2009-10-14 2009-12-03 F          2.00  2.00

Add a Conditionally-Advancing Counter to a Data Frame in R

Answers (2)

Related Questions