Reputation: 97
Suppose I have two data.tables:
summary <- data.table(period = c("A","B","C","D"),
from_date = ymd(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
to_date = ymd(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01"))
)
log <- data.table(date = ymd(c("2017-01-03","2017-01-20","2017-02-01","2017-03-03",
"2017-03-15","2017-03-28","2017-04-03","2017-04-23")),
event1 = c(4,8,8,4,3,4,7,3), event2 = c(1,8,7,3,8,4,6,3))
which look like this:
> summary
period from_date to_date
1: A 2017-01-01 2017-01-31
2: B 2017-01-03 2017-04-01
3: C 2017-02-08 2017-03-08
4: D 2017-03-07 2017-05-01
> log
date event1 event2
1: 2017-01-03 4 1
2: 2017-01-20 8 8
3: 2017-02-01 8 7
4: 2017-03-03 4 3
5: 2017-03-15 3 8
6: 2017-03-28 4 4
7: 2017-04-03 7 6
8: 2017-04-23 3 3
I would like to get the sum of event1 and event2 for each time period in the table summary.
I know I can do this:
summary[, c("event1","event2") := .(sum(log[date>=from_date & date<=to_date, event1]),
sum(log[date>=from_date & date<=to_date, event2]))
, by=period][]
to get the desired result:
period from_date to_date event1 event2
1: A 2017-01-01 2017-01-31 12 9
2: B 2017-01-03 2017-04-01 31 31
3: C 2017-02-08 2017-03-08 4 3
4: D 2017-03-07 2017-05-01 17 21
Now, in my real-life problem, I have about 30 columns to be summed, which I may want to change later, and summary has ~35,000 rows, log has ~40,000,000 rows. Is there an efficient way to achieve this?
Note: This is my first post here. I hope my question is clear and specific enough, please do make suggestions if there is anything I should do to improve the question. Thanks!
Upvotes: 5
Views: 1032
Reputation: 42544
With data.table
, it is possible to aggregate during a non-equi join using by = .EACHI
.
log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L,
lapply(.SD, sum), by = .EACHI]
date date event1 event2 1: 2017-01-01 2017-01-31 12 9 2: 2017-01-03 2017-04-01 31 31 3: 2017-02-08 2017-03-08 4 3 4: 2017-03-07 2017-05-01 17 21
With some additional clean-up:
log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L,
c(period = period, lapply(.SD, sum)), by = .EACHI][
, setnames(.SD, 1:2, c("from_date", "to_date"))]
from_date to_date period event1 event2 1: 2017-01-01 2017-01-31 A 12 9 2: 2017-01-03 2017-04-01 B 31 31 3: 2017-02-08 2017-03-08 C 4 3 4: 2017-03-07 2017-05-01 D 17 21
Upvotes: 4
Reputation: 16090
Yes, you can perform a non-equi join.
(Note I've changed log
and summary
to Log
and Summary
as the originals are already functions in R.)
Log[Summary,
on = c("date>=from_date", "date<=to_date"),
nomatch=0L,
allow.cartesian = TRUE][, .(from_date = date[1],
to_date = date.1[1],
event1 = sum(event1),
event2 = sum(event2)),
keyby = "period"]
To sum over a pattern of columns, use lapply
with .SD
:
joined_result <-
Log[Summary,
on = c("date>=from_date", "date<=to_date"),
nomatch = 0L,
allow.cartesian = TRUE]
cols <- grep("event[a-z]?[0-9]", names(joined_result), value = TRUE)
joined_result[, lapply(.SD, sum),
.SDcols = cols,
keyby = .(period,
from_date = date,
to_date = date.1)]
Upvotes: 5