Reputation: 331
I'd like your thoughts on a more elegant way to work out the number of events that occurred on preceding days. My code (below) works, but it is neither very nice nor scalable. I'm trying to get to the table at the bottom (desired_table). Any thoughts?
I would like to calculate the sum of events on the preceding days in a more elegant manner than this.
require(data.table)
# simulating an example data.table
date <- c("2000-01-01", "2000-01-04", "2000-01-05", "2000-01-06",
          "2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04",
          "2000-01-05", "2000-01-06", "2000-01-01", "2000-01-04",
          "2000-01-05", "2000-01-06", "2000-01-01", "2000-01-02",
          "2000-01-03", "2000-01-04", "2000-01-05", "2000-01-06")
cohort <- rep(c("a", "b", "c"), length.out = length(date)) # recycle explicitly; newer data.table() versions refuse to recycle a length-3 vector into 20 rows
zz <- data.table(DATE = date, COHORT = cohort)
zz$DATE <- as.Date(zz$DATE) # making sure the date is in the correct format
# adding on some other date fields so we can summarise by these days as well
zz$d1 <- zz$DATE + 1 # will become "yesterday" when joined
zz$d2 <- zz$DATE + 2 # will become "day before yesterday" when joined
# summarising the data for the first date
summary1 <- zz[,list(events_today = .N ), by= c("DATE", "COHORT")]
# summarising the data for the previous day
summary2 <- zz[,list(events_yesterday = .N ), by= c("d1", "COHORT")]
# summarising the data for the day before yesterday
summary3 <- zz[,list(events_day_before_yesterday = .N ), by= c("d2", "COHORT")]
# merging the tables together
summary1.2 <- merge(summary1, summary2, by.x = c("DATE", "COHORT"), by.y = c("d1", "COHORT"), all = TRUE)
# merging the tables together to join on third summary table.
desired_table <- merge(summary1.2, summary3, by.x = c("DATE", "COHORT"), by.y = c("d2", "COHORT"), all = TRUE)
print(desired_table)
There has to be a more elegant way to do this?
My example here is trivial - in practice I may want to do this for many thousands of records, and many time periods.
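(Editor's note: one way to generalise the summarise-and-merge approach above to an arbitrary number of lag days is to count events once per DATE/COHORT and then merge shifted copies of that single summary. This is a sketch on toy data, not the accepted answer's method; the column names events_lag0, events_lag1, ... are invented for illustration.)

```r
library(data.table)

# toy data standing in for the question's zz
zz <- data.table(
  DATE   = as.Date(c("2000-01-01", "2000-01-02", "2000-01-02", "2000-01-03")),
  COHORT = c("a", "a", "b", "a")
)
lags <- 0:2  # 0 = today, 1 = yesterday, 2 = day before yesterday

# count events once per DATE/COHORT
daily <- zz[, .(n = .N), by = .(DATE, COHORT)]

# one shifted copy per lag: moving DATE forward by k days makes day t's row
# line up with the events that happened k days earlier
shifted <- lapply(lags, function(k) {
  out <- copy(daily)[, DATE := DATE + k]
  setnames(out, "n", paste0("events_lag", k))
  out
})

# full outer merge of all the shifted summaries
desired <- Reduce(function(x, y) merge(x, y, by = c("DATE", "COHORT"), all = TRUE),
                  shifted)
```

Adding more periods then only means changing lags, rather than writing a new summary table and merge by hand for each one.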
Upvotes: 2
Views: 93
Reputation: 7724
I think a more elegant way would be
long_zz <- melt(zz, id.vars = "COHORT")
new_zz <- dcast(long_zz, COHORT + value ~ variable, fun.aggregate = length, drop = FALSE, fill = NA)
new_zz
# COHORT value DATE d1 d2
# 1: a 2000-01-01 1 NA NA
# 2: a 2000-01-02 1 1 NA
# 3: a 2000-01-03 1 1 1
# 4: a 2000-01-04 NA 1 1
# 5: a 2000-01-05 2 NA 1
# 6: a 2000-01-06 2 2 NA
# 7: a 2000-01-07 NA 2 2
# 8: a 2000-01-08 NA NA 2
# 9: b 2000-01-01 2 NA NA
# 10: b 2000-01-02 NA 2 NA
# 11: b 2000-01-03 1 NA 2
# 12: b 2000-01-04 2 1 NA
# 13: b 2000-01-05 NA 2 1
# 14: b 2000-01-06 2 NA 2
# 15: b 2000-01-07 NA 2 NA
# 16: b 2000-01-08 NA NA 2
# 17: c 2000-01-01 1 NA NA
# 18: c 2000-01-02 1 1 NA
# 19: c 2000-01-03 NA 1 1
# 20: c 2000-01-04 2 NA 1
# 21: c 2000-01-05 2 2 NA
# 22: c 2000-01-06 NA 2 2
# 23: c 2000-01-07 NA NA 2
# 24: c 2000-01-08 NA NA NA
# COHORT value DATE d1 d2
Here, I first transform the data from wide format to long and then use dcast to split the variables (DATE, d1, d2) into columns again, counting the number of rows in each (COHORT, value) group for each variable.
Without drop = FALSE, we would miss row 24, where no events happened for COHORT c.
You can set your names with
setnames(new_zz, c("value", "DATE", "d1", "d2"), c("DATE", "events_today","events_yesterday","events_day_before_yesterday"))
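(Editor's note: put together, the whole melt/dcast pipeline looks like the sketch below, with the lag columns built in a loop so that extending to more periods only means changing n_lags. cohort is recycled explicitly here, since newer data.table() versions may refuse to recycle a length-3 vector into 20 rows.)

```r
library(data.table)

# rebuild the question's data
date <- c("2000-01-01", "2000-01-04", "2000-01-05", "2000-01-06",
          "2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04",
          "2000-01-05", "2000-01-06", "2000-01-01", "2000-01-04",
          "2000-01-05", "2000-01-06", "2000-01-01", "2000-01-02",
          "2000-01-03", "2000-01-04", "2000-01-05", "2000-01-06")
cohort <- rep(c("a", "b", "c"), length.out = length(date))
zz <- data.table(DATE = as.Date(date), COHORT = cohort)

# lag columns built in a loop: d1 = yesterday, d2 = day before yesterday, ...
n_lags <- 2
for (k in seq_len(n_lags)) set(zz, j = paste0("d", k), value = zz$DATE + k)

# wide -> long -> wide, counting rows per (COHORT, date, variable)
long_zz <- melt(zz, id.vars = "COHORT")
new_zz  <- dcast(long_zz, COHORT + value ~ variable,
                 fun.aggregate = length, drop = FALSE, fill = NA)
setnames(new_zz, c("value", "DATE", "d1", "d2"),
         c("DATE", "events_today", "events_yesterday",
           "events_day_before_yesterday"))
```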
microbenchmark results of your approach (merge) vs. mine (long_wide):
Unit: milliseconds
expr min lq mean median uq max neval
merge 14.740983 18.534281 31.88430 21.223305 31.830966 353.3662 100
long_wide 5.102077 6.411999 10.82941 7.130821 8.884161 117.7351 100
Upvotes: 5