user319487
user319487

Reputation:

How can I conditionally collapse groups of records in R data.table?

I have a dataset that list multiple records per ID that have duration (start & end ) associated with them.
It is in the form:

   ID      start        end treatment
1:  1 2009-12-01 2009-12-14         A
2:  1 2009-12-15 2009-12-31         A
3:  2 2009-12-01 2009-12-15         B
4:  2 2009-12-15       <NA>         B
5:  3 2009-12-01 2009-12-15         A
6:  3 2009-12-10 2009-12-31         B

Or:

library(data.table)
data <- data.table(ID = c(1, 1, 2, 2, 3, 3), 
                   start = anytime::anydate(c("2009-12-01", "2009-12-15", 
                             "2009-12-01", "2009-12-15", "2009-12-01", "2009-12-10")), 
                   end = anytime::anydate(c("2009-12-14", "2009-12-31", "2009-12-15", NA, "2009-12-15", "2009-12-31")), 
                   treatment = c("A", "A", "B", "B", "A","B"))

I'm struggling to coerce it to the form where multiple records with the same ID and same or next day start & end dates are collapsed and appropriate dates are retained. In case of example end product would look like:

   ID      start        end treatment
1:  1 2009-12-01 2009-12-31         A
2:  2 2009-12-01       <NA>         B
3:  3 2009-12-01 2009-12-15         A
4:  3 2009-12-10 2009-12-31         B

I tried solving that with some intermediate columns, groupuing and retaining unique rows but somehow have a feeling there is a better solution here

setorder(data, ID, start, end, na.last=FALSE)

data[, start2 := min(start), by = list(ID, treatment)]
data[, end2 := max(end), by = list(ID, treatment)]

unique(data, by = c("ID", "start2", "end2", "treatment"))

Any suggestions?

Upvotes: 1

Views: 337

Answers (2)

LocoGris
LocoGris

Reputation: 4480

How about this?

data[, .(start = min(start), 
         end = max(end)), by=.(ID,treatment)][,.(ID,start,end,treatment)]
 #      ID      start        end treatment
 #   1:  1 2009-12-01 2009-12-31         A
 #   2:  2 2009-12-01       <NA>         B
 #   3:  3 2009-12-01 2009-12-15         A
 #   4:  3 2009-12-10 2009-12-31         B

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389345

Group by rleid of treatment values get min and max of start and end columns.

library(data.table)

data[, .(start = min(start), 
         end = max(end), 
         treatment = first(treatment), 
         ID = first(ID)), rleid(treatment)]

#   rleid      start        end treatment ID
#1:     1 2009-12-01 2009-12-31         A  1
#2:     2 2009-12-01       <NA>         B  2
#3:     3 2009-12-01 2009-12-15         A  3
#4:     4 2009-12-10 2009-12-31         B  3

Instead of min and max you can also use first and last if your data is ordered.

Upvotes: 2

Related Questions