Reputation: 14604
I have two data tables: a
and b
a = structure(list(id = c(86246, 86252, 12262064), brand = c(3718L,
13474L, 17286L), offerdate = structure(c(15454, 15791, 15883), class = "Date")), .Names = c("id",
"brand", "offerdate"), row.names = c(NA, -3L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x041c24a0>)
b = structure(list(id = c(86246, 86246, 86246), brand = c(3718, 3718,
875), date = structure(c(15408, 15430, 15434), class = "Date")), .Names = c("id",
"brand", "date"), row.names = c(NA, -3L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x041c24a0>)
> a
id brand offerdate
1: 86246 3718 2012-04-24
2: 86252 13474 2013-03-27
3: 12262064 17286 2013-06-27
> b
id brand date
1: 86246 3718 2012-03-09
2: 86246 3718 2012-03-31
3: 86246 875 2012-04-04
Now I would like, for each id in a, to count the number of rows in b for the same id and brand, with a date less than 30 days before the a.offerdate.
The outcome I wish to have is an updated a:
> a
id brand offerdate nbTrans_last_30_days
1: 86246 3718 2013-04-24 1
2: 86252 13474 2013-03-27 0
3: 12262064 17286 2013-06-27 0
I can do the job with subset, but I am looking for a fast solution. The subset version would be to do (for each line of a):
subset(b, (id == 86246) & (brand == 3718) & (date > as.Date("2012-03-24")) )
with the date depending on a.offerdate
.
I manage to count the total rows in b:
> setkey(a,id, brand)
> setkey(b,id, brand)
> a = a[b[a, .N]]
> setnames(a, "N", "nbTrans")
> a
id brand offerdate nbTrans
1: 86246 3718 2012-04-24 2
2: 86252 13474 2013-03-27 0
3: 12262064 17286 2013-06-27 0
but I do not know how to handle the comparison of dates between the two tables.
The answer below works for the original small data set, but somehow did not work for my real data. I tried to reproduce the problem with two new variables: a2 and b2
a2=structure(list(id = c(86246, 86252, 12262064), brand = structure(c(3L,
+ 9L, 12L), .Label = c("875", "1322", "3718", "4294", "5072", "6732",
+ "6926", "7668", "13474", "13791", "15889", "17286", "17311",
+ "26189", "26456", "28840", "64486", "93904", "102504"), class = "factor"),
+ offerdate = structure(c(15819, 15791, 15883), class = "Date")), .Names = c("id",
+ "brand", "offerdate"), row.names = c(NA, -3L), class = c("data.table",
+ "data.frame"))
b2=structure(list(id = c(86246, 86246, 86246, 86246, 86246, 86246,
+ 86246, 86246), brand = c(3718L, 3718L, 3718L, 3718L, 3718L, 3718L,
+ 3718L, 3718L), date = structure(c(15423, 15724, 15752, 15767,
+ 15782, 15786, 15788, 15811), class = "Date")), .Names = c("id",
+ "brand", "date"), sorted = c("id", "brand"), class = c("data.table",
+ "data.frame"))
> setkey(a2,id,brand)
> setkey(b2,id,brand)
> merge(a2, b2, all.x = TRUE, allow.cartesian = TRUE)
id brand offerdate date
1: 86246 3718 2013-04-24 <NA>
2: 86252 13474 2013-03-27 <NA>
3: 12262064 17286 2013-06-27 <NA>
The problem is that the merge does not keep the b2.date information.
Upvotes: 1
Views: 907
Reputation: 5536
The trick is to use allow.cartesian
argument in merge
:
setkey(a, id, brand)
setkey(b, id, brand)
c <- merge(a, b, all.x = T, allow.cartesian = T)
c[, Trans := (offerdate - date) <= 30]
c[, list(nbTrans_last_30_days = sum(Trans, na.rm = T)),
keyby = list(id, brand, offerdate)]
Upvotes: 2