threeisles

Reputation: 331

Do you know a more elegant way to calculate the number of events on preceding days?

I'd like your thoughts on a more elegant way to work out the number of events that occurred on preceding days. My code (below) works, but it is not very nice, or scalable. I'm trying to get to the table at the bottom (desired_table). Any thoughts?

I would like to calculate the sum of events on the preceding days in a more elegant manner than this.

require(data.table)

# simulating an example data.table
date <- c("2000-01-01", "2000-01-04", "2000-01-05", "2000-01-06",
          "2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04",
          "2000-01-05", "2000-01-06", "2000-01-01", "2000-01-04",
          "2000-01-05", "2000-01-06", "2000-01-01", "2000-01-02",
          "2000-01-03", "2000-01-04", "2000-01-05", "2000-01-06")

cohort <- rep(c("a", "b", "c"), length.out = length(date))  # recycle to match the 20 dates

zz <- data.table(DATE = date, COHORT = cohort)
zz$DATE <- as.Date(zz$DATE)  # making sure the date is in the correct format

# adding on some other date fields so we can summarise by these days as well
zz$d1 <- zz$DATE +1  # will become "yesterday" when joined
zz$d2 <- zz$DATE +2  # will become "day before yesterday", when joined 

# summarising the data for the first date
summary1 <- zz[, list(events_today = .N), by = c("DATE", "COHORT")]

# summarising the data for the previous day
summary2 <- zz[, list(events_yesterday = .N), by = c("d1", "COHORT")]

# summarising the data for the day before yesterday
summary3 <- zz[, list(events_day_before_yesterday = .N), by = c("d2", "COHORT")]

# merging the tables together
summary1.2 <- merge(summary1, summary2, by.x = c("DATE", "COHORT"), by.y = c("d1", "COHORT"), all = TRUE)

# merging the tables together to join on third summary table.
desired_table <- merge(summary1.2, summary3, by.x = c("DATE", "COHORT"), by.y = c("d2", "COHORT"), all = TRUE)

print(desired_table)

There has to be a more elegant way to do this.
My example here is trivial; in practice I may want to do this for many thousands of records, and many more time periods.

Upvotes: 2

Views: 93

Answers (1)

kath

Reputation: 7724

I think a more elegant way would be

long_zz <- melt(zz, id.vars = "COHORT")
new_zz <- dcast(long_zz, COHORT + value ~ variable, fun.aggregate = length, drop = FALSE, fill = NA)
new_zz
#     COHORT      value DATE d1 d2
#  1:      a 2000-01-01    1 NA NA
#  2:      a 2000-01-02    1  1 NA
#  3:      a 2000-01-03    1  1  1
#  4:      a 2000-01-04   NA  1  1
#  5:      a 2000-01-05    2 NA  1
#  6:      a 2000-01-06    2  2 NA
#  7:      a 2000-01-07   NA  2  2
#  8:      a 2000-01-08   NA NA  2
#  9:      b 2000-01-01    2 NA NA
# 10:      b 2000-01-02   NA  2 NA
# 11:      b 2000-01-03    1 NA  2
# 12:      b 2000-01-04    2  1 NA
# 13:      b 2000-01-05   NA  2  1
# 14:      b 2000-01-06    2 NA  2
# 15:      b 2000-01-07   NA  2 NA
# 16:      b 2000-01-08   NA NA  2
# 17:      c 2000-01-01    1 NA NA
# 18:      c 2000-01-02    1  1 NA
# 19:      c 2000-01-03   NA  1  1
# 20:      c 2000-01-04    2 NA  1
# 21:      c 2000-01-05    2  2 NA
# 22:      c 2000-01-06   NA  2  2
# 23:      c 2000-01-07   NA NA  2
# 24:      c 2000-01-08   NA NA NA
#     COHORT      value DATE d1 d2

Here, I first transform the data from wide format to long format and then use dcast to split the variables (DATE, d1, d2) back into columns, while counting the number of rows in each COHORT, value group for each variable. Without drop = FALSE, we would miss row 24, where no events happened for COHORT c.

You can set your names with

setnames(new_zz, c("value", "DATE", "d1", "d2"), c("DATE", "events_today","events_yesterday","events_day_before_yesterday")) 
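Since you mention wanting many time periods: the same melt/dcast pattern scales if you build the lag columns programmatically instead of by hand. A sketch (the add_lags helper and k are my own illustration, not part of the original code):

```r
library(data.table)

# hypothetical helper: add k lag columns d1 .. dk, so the same
# melt/dcast pattern covers any number of preceding days
add_lags <- function(dt, k) {
  for (i in seq_len(k)) {
    set(dt, j = paste0("d", i), value = dt$DATE + i)
  }
  dt[]
}

zz <- data.table(DATE = as.Date(c("2000-01-01", "2000-01-03")),
                 COHORT = c("a", "a"))
zz <- add_lags(zz, 3)                     # columns: DATE, d1, d2, d3

long_zz <- melt(zz, id.vars = "COHORT")
new_zz <- dcast(long_zz, COHORT + value ~ variable,
                fun.aggregate = length, drop = FALSE, fill = NA)
```

With k = 3 this yields one count column per lag (DATE, d1, d2, d3), which you can rename with setnames as above.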

microbenchmark results of your approach (merge) vs. mine (long_wide):

Unit: milliseconds
      expr       min        lq     mean    median        uq      max neval
     merge 14.740983 18.534281 31.88430 21.223305 31.830966 353.3662   100
 long_wide  5.102077  6.411999 10.82941  7.130821  8.884161 117.7351   100
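The exact benchmark code isn't shown above; a sketch of how such a comparison could be run with the microbenchmark package, using the question's column names on simulated data:

```r
library(data.table)
library(microbenchmark)

# simulated data in the shape of the question's example
zz <- data.table(DATE = as.Date("2000-01-01") + sample(0:5, 20, replace = TRUE),
                 COHORT = rep(c("a", "b", "c"), length.out = 20))
zz[, `:=`(d1 = DATE + 1, d2 = DATE + 2)]

microbenchmark(
  merge = {
    s1 <- zz[, .(events_today = .N), by = .(DATE, COHORT)]
    s2 <- zz[, .(events_yesterday = .N), by = .(d1, COHORT)]
    s3 <- zz[, .(events_day_before_yesterday = .N), by = .(d2, COHORT)]
    s12 <- merge(s1, s2, by.x = c("DATE", "COHORT"),
                 by.y = c("d1", "COHORT"), all = TRUE)
    merge(s12, s3, by.x = c("DATE", "COHORT"),
          by.y = c("d2", "COHORT"), all = TRUE)
  },
  long_wide = {
    long_zz <- melt(zz, id.vars = "COHORT")
    dcast(long_zz, COHORT + value ~ variable,
          fun.aggregate = length, drop = FALSE, fill = NA)
  },
  times = 100
)
```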

Upvotes: 5
