Reputation: 35
I have a data frame where each record represents a service provided to a customer over some interval of time. The data frame is sorted first by customer number, then by time interval.
I need to add a new column to the data frame that assigns a group number to the records for each client. The first record for a given client should always be in Group 1. The group number should advance for subsequent records only when a certain condition is satisfied by the record immediately preceding it.
Here is a minimal working example:
I want to add a column to this data frame so that it looks like this.
Here's the code to create the data frame in this minimal example:
library(lubridate)
df <- data.frame(rbind(c(1,"2010-01-01","2010-04-05",FALSE),
c(1,"2010-04-05","2010-08-09",FALSE),
c(1,"2010-08-09","2010-11-13",TRUE),
c(1,"2010-11-13","2011-03-14",FALSE),
c(1,"2011-03-14","2011-06-29",FALSE),
c(1,"2011-06-29","2011-09-30",TRUE),
c(1,"2011-09-30","2012-03-01",FALSE),
c(2,"2012-04-12","2012-05-31",FALSE),
c(2,"2012-05-31","2012-10-19",FALSE),
c(2,"2012-10-19","2013-01-16",FALSE),
c(3,"2009-06-11","2009-10-14",TRUE),
c(3,"2009-10-14","2009-12-03",FALSE)))
colnames(df) <- c("Client","StartDate","StopDate","Condition")
df$StartDate <- as.Date(df$StartDate,format="%Y-%m-%d")
df$StopDate <- as.Date(df$StopDate,format="%Y-%m-%d")
df$Condition <- as.logical(df$Condition)
I can easily do what I want with a loop,
df[1,"Group"] <- 1
for(i in 2:nrow(df)){
df[i,"Group"] <- if(df[i,"Client"]!=df[i-1,"Client"]){1} else
if(df[i-1,"Condition"]){df[i-1,"Group"]+1} else{df[i-1,"Group"]}
}
but my data frame has millions of records, so I'd like to find a vectorized solution that is faster.
Upvotes: 2
Views: 33
Reputation: 20085
A fast solution could be achieved using data.table
. The requirement seems to be count TRUE
condition till previous row for the same Client
. by=Client
will ensure calculations are limited to a Client
and then subtract 1
if Condition
is TRUE
for current row will provide desired value.
The steps would be:
library(data.table)
setDT(df)
df[,Group := cumsum(as.numeric(Condition))+1 - as.numeric(Condition) , by=Client]
df
#Result
# Client StartDate StopDate Condition Group
# 1: 1 2010-01-01 2010-04-05 FALSE 1
# 2: 1 2010-04-05 2010-08-09 FALSE 1
# 3: 1 2010-08-09 2010-11-13 TRUE 1
# 4: 1 2010-11-13 2011-03-14 FALSE 2
# 5: 1 2011-03-14 2011-06-29 FALSE 2
# 6: 1 2011-06-29 2011-09-30 TRUE 2
# 7: 1 2011-09-30 2012-03-01 FALSE 3
# 8: 2 2012-04-12 2012-05-31 FALSE 1
# 9: 2 2012-05-31 2012-10-19 FALSE 1
#10: 2 2012-10-19 2013-01-16 FALSE 1
#11: 3 2009-06-11 2009-10-14 TRUE 1
#12: 3 2009-10-14 2009-12-03 FALSE 2
Data
library(lubridate)
df <- data.frame(rbind(c(1,"2010-01-01","2010-04-05",FALSE),
c(1,"2010-04-05","2010-08-09",FALSE),
c(1,"2010-08-09","2010-11-13",TRUE),
c(1,"2010-11-13","2011-03-14",FALSE),
c(1,"2011-03-14","2011-06-29",FALSE),
c(1,"2011-06-29","2011-09-30",TRUE),
c(1,"2011-09-30","2012-03-01",FALSE),
c(2,"2012-04-12","2012-05-31",FALSE),
c(2,"2012-05-31","2012-10-19",FALSE),
c(2,"2012-10-19","2013-01-16",FALSE),
c(3,"2009-06-11","2009-10-14",TRUE),
c(3,"2009-10-14","2009-12-03",FALSE)))
colnames(df) <- c("Client","StartDate","StopDate","Condition")
df$StartDate <- as.Date(df$StartDate,format="%Y-%m-%d")
df$StopDate <- as.Date(df$StopDate,format="%Y-%m-%d")
df$Condition <- as.logical(df$Condition)
Upvotes: 0
Reputation: 1902
What you want is close to a grouped application of cumsum()
. This works, but I don't know how fast it would be:
library(tidyverse)
df %>%
group_by(Client) %>%
mutate(group = cumsum(Condition) + 1,
group = case_when(
Condition ~ group - 1,
!Condition ~ group))
#> # A tibble: 12 x 6
#> # Groups: Client [3]
#> Client StartDate StopDate Condition Group group
#> <fct> <date> <date> <lgl> <dbl> <dbl>
#> 1 1 2010-01-01 2010-04-05 F 1.00 1.00
#> 2 1 2010-04-05 2010-08-09 F 1.00 1.00
#> 3 1 2010-08-09 2010-11-13 T 1.00 1.00
#> 4 1 2010-11-13 2011-03-14 F 2.00 2.00
#> 5 1 2011-03-14 2011-06-29 F 2.00 2.00
#> 6 1 2011-06-29 2011-09-30 T 2.00 2.00
#> 7 1 2011-09-30 2012-03-01 F 3.00 3.00
#> 8 2 2012-04-12 2012-05-31 F 1.00 1.00
#> 9 2 2012-05-31 2012-10-19 F 1.00 1.00
#> 10 2 2012-10-19 2013-01-16 F 1.00 1.00
#> 11 3 2009-06-11 2009-10-14 T 1.00 1.00
#> 12 3 2009-10-14 2009-12-03 F 2.00 2.00
Upvotes: 1