Jay
Jay

Reputation: 35

Add a Conditionally-Advancing Counter to a Data Frame in R

I have a data frame where each record represents a service provided to a customer over some interval of time. The data frame is sorted first by customer number, then by time interval.

I need to add a new column to the data frame that assigns a group number to the records for each client. The first record for a given client should always be in Group 1. The group number should advance for subsequent records only when a certain condition is satisfied by the record immediately preceding it.

Here is a minimal working example:

I want to add a column to this data frame so that it looks like this.

Here's the code to create the data frame in this minimal example:

library(lubridate)
df <- data.frame(rbind(c(1,"2010-01-01","2010-04-05",FALSE),
                   c(1,"2010-04-05","2010-08-09",FALSE),
                   c(1,"2010-08-09","2010-11-13",TRUE),
                   c(1,"2010-11-13","2011-03-14",FALSE),
                   c(1,"2011-03-14","2011-06-29",FALSE),
                   c(1,"2011-06-29","2011-09-30",TRUE),
                   c(1,"2011-09-30","2012-03-01",FALSE),
                   c(2,"2012-04-12","2012-05-31",FALSE),
                   c(2,"2012-05-31","2012-10-19",FALSE),
                   c(2,"2012-10-19","2013-01-16",FALSE),
                   c(3,"2009-06-11","2009-10-14",TRUE),
                   c(3,"2009-10-14","2009-12-03",FALSE)))
colnames(df) <- c("Client","StartDate","StopDate","Condition")
df$StartDate <- as.Date(df$StartDate,format="%Y-%m-%d")
df$StopDate <- as.Date(df$StopDate,format="%Y-%m-%d")
df$Condition <- as.logical(df$Condition)

I can easily do what I want with a loop,

df[1,"Group"] <- 1
for(i in 2:nrow(df)){
  df[i,"Group"] <- if(df[i,"Client"]!=df[i-1,"Client"]){1} else
                   if(df[i-1,"Condition"]){df[i-1,"Group"]+1} else{df[i-1,"Group"]}
}

but my data frame has millions of records, so I'd like to find a vectorized solution that is faster.

Upvotes: 2

Views: 33

Answers (2)

MKR
MKR

Reputation: 20085

A fast solution could be achieved using data.table. The requirement seems to be count TRUE condition till previous row for the same Client. by=Client will ensure calculations are limited to a Client and then subtract 1 if Condition is TRUE for current row will provide desired value.

The steps would be:

  library(data.table)
  setDT(df)

  df[,Group := cumsum(as.numeric(Condition))+1 - as.numeric(Condition) , by=Client]

df
#Result
#    Client  StartDate   StopDate Condition Group
# 1:      1 2010-01-01 2010-04-05     FALSE     1
# 2:      1 2010-04-05 2010-08-09     FALSE     1
# 3:      1 2010-08-09 2010-11-13      TRUE     1
# 4:      1 2010-11-13 2011-03-14     FALSE     2
# 5:      1 2011-03-14 2011-06-29     FALSE     2
# 6:      1 2011-06-29 2011-09-30      TRUE     2
# 7:      1 2011-09-30 2012-03-01     FALSE     3
# 8:      2 2012-04-12 2012-05-31     FALSE     1
# 9:      2 2012-05-31 2012-10-19     FALSE     1
#10:      2 2012-10-19 2013-01-16     FALSE     1
#11:      3 2009-06-11 2009-10-14      TRUE     1
#12:      3 2009-10-14 2009-12-03     FALSE     2

Data

  library(lubridate)
  df <- data.frame(rbind(c(1,"2010-01-01","2010-04-05",FALSE),
                         c(1,"2010-04-05","2010-08-09",FALSE),
                         c(1,"2010-08-09","2010-11-13",TRUE),
                         c(1,"2010-11-13","2011-03-14",FALSE),
                         c(1,"2011-03-14","2011-06-29",FALSE),
                         c(1,"2011-06-29","2011-09-30",TRUE),
                         c(1,"2011-09-30","2012-03-01",FALSE),
                         c(2,"2012-04-12","2012-05-31",FALSE),
                         c(2,"2012-05-31","2012-10-19",FALSE),
                         c(2,"2012-10-19","2013-01-16",FALSE),
                         c(3,"2009-06-11","2009-10-14",TRUE),
                         c(3,"2009-10-14","2009-12-03",FALSE)))

  colnames(df) <- c("Client","StartDate","StopDate","Condition")

  df$StartDate <- as.Date(df$StartDate,format="%Y-%m-%d")
  df$StopDate <- as.Date(df$StopDate,format="%Y-%m-%d")
  df$Condition <- as.logical(df$Condition)

Upvotes: 0

atiretoo
atiretoo

Reputation: 1902

What you want is close to a grouped application of cumsum(). This works, but I don't know how fast it would be:


library(tidyverse)
df %>%
  group_by(Client) %>%
  mutate(group = cumsum(Condition) + 1,
         group = case_when(
           Condition ~ group - 1,
            !Condition ~ group))
#> # A tibble: 12 x 6
#> # Groups:   Client [3]
#>    Client StartDate  StopDate   Condition Group group
#>    <fct>  <date>     <date>     <lgl>     <dbl> <dbl>
#>  1 1      2010-01-01 2010-04-05 F          1.00  1.00
#>  2 1      2010-04-05 2010-08-09 F          1.00  1.00
#>  3 1      2010-08-09 2010-11-13 T          1.00  1.00
#>  4 1      2010-11-13 2011-03-14 F          2.00  2.00
#>  5 1      2011-03-14 2011-06-29 F          2.00  2.00
#>  6 1      2011-06-29 2011-09-30 T          2.00  2.00
#>  7 1      2011-09-30 2012-03-01 F          3.00  3.00
#>  8 2      2012-04-12 2012-05-31 F          1.00  1.00
#>  9 2      2012-05-31 2012-10-19 F          1.00  1.00
#> 10 2      2012-10-19 2013-01-16 F          1.00  1.00
#> 11 3      2009-06-11 2009-10-14 T          1.00  1.00
#> 12 3      2009-10-14 2009-12-03 F          2.00  2.00

Upvotes: 1

Related Questions