Reputation: 164
I'm trying to aggregate two data frames (df1
and df2
).
The first contains 3 variables: ID
, Date1
and Date2
.
df1
ID Date1 Date2
1 2016-03-01 2016-04-01
1 2016-04-01 2016-05-01
2 2016-03-14 2016-04-15
2 2016-04-15 2016-05-17
3 2016-05-01 2016-06-10
3 2016-06-10 2016-07-15
The second also contains 3 variables: ID
, Date3
and Value
.
df2
ID Date3 Value
1 2016-03-15 5
1 2016-04-04 7
1 2016-04-28 7
2 2016-03-18 3
2 2016-03-27 5
2 2016-04-08 9
2 2016-04-20 2
3 2016-05-05 6
3 2016-05-25 8
3 2016-06-13 3
The idea is to get, for each df1
row, the sum of df2$Value
that have the same ID
and for which Date3
is between Date1
and Date2
:
ID Date1 Date2 SumValue
1 2016-03-01 2016-04-01 5
1 2016-04-01 2016-05-01 14
2 2016-03-14 2016-04-15 17
2 2016-04-15 2016-05-17 2
3 2016-05-01 2016-06-10 14
3 2016-06-10 2016-07-15 3
I know how to make a loop on this, but the data frames are huge! Does someone has an efficient solution? Exploring data.table
, plyr
and dplyr
but could not find a solution.
Upvotes: 6
Views: 496
Reputation: 118799
With the recently implemented non-equi
joins feature in the current development version of data.table, v1.9.7
, this can be done as follows:
dt2[dt1, .(sum = sum(Value)), on=.(ID, Date3>=Date1, Date3<=Date2), by=.EACHI]
# ID Date3 Date3 sum
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
The column names needs some fixing.. will work on it later.
Upvotes: 4
Reputation: 26258
A couple of data.table
solutions that should scale well (and a good stop-gap until non-equi joins are implemented):
Do the comparison in J using by=EACHI
.
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df1[ df2,
{
idx = Date1 <= i.Date3 & i.Date3 <= Date2
.(Date1 = Date1[idx],
Date2 = Date2[idx],
Date3 = i.Date3,
Value = i.Value)
},
on=c("ID"),
by=.EACHI][, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
foverlap
join (as suggested in the comments)
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df2[, Date4 := Date3]
setkey(df1, ID, Date1, Date2)
foverlaps(df2,
df1,
by.x=c("ID", "Date3", "Date4"),
type="within")[, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
Further reading
Rolling join on data.table with duplicate keys
Upvotes: 5
Reputation: 35314
Here's a base R solution using sapply()
:
df1 <- data.frame(ID=c(1L,1L,2L,2L,3L,3L),Date1=as.Date(c('2016-03-01','2016-04-01','2016-03-14','2016-04-15','2016-05-01','2016-06-01')),Date2=as.Date(c('2016-04-01','2016-05-01','2016-04-15','2016-05-17','2016-06-15','2016-07-15')));
df2 <- data.frame(ID=c(1L,1L,1L,2L,2L,2L,2L,3L,3L,3L),Date3=as.Date(c('2016-03-15','2016-04-04','2016-04-28','2016-03-18','2016-03-27','2016-04-08','2016-04-20','2016-05-05','2016-05-25','2016-06-13')),Value=c(5L,7L,7L,3L,5L,9L,2L,6L,8L,3L));
cbind(df1,SumValue=sapply(seq_len(nrow(df1)),function(ri) sum(df2$Value[df1$ID[ri]==df2$ID & df1$Date1[ri]<=df2$Date3 & df1$Date2[ri]>df2$Date3])));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3
Note that your df1
and expected output have slightly different dates in some cases; I used the df1
dates.
Here's another approach that attempts to be more vectorized: Precompute a cartesian product of indexes into the two frames, then perform a single vectorized conditional expression using the index vectors to get matching pairs of indexes, and finally use the matching indexes to aggregate the desired result:
cbind(df1,SumValue=with(expand.grid(i1=seq_len(nrow(df1)),i2=seq_len(nrow(df2))),{
x <- df1$ID[i1]==df2$ID[i2] & df1$Date1[i1]<=df2$Date3[i2] & df1$Date2[i1]>df2$Date3[i2];
tapply(df2$Value[i2[x]],i1[x],sum);
}));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3
Upvotes: 1