jangorecki
jangorecki

Reputation: 16697

Add column by reference on rolling join

The main problem is described in the question title. Going straight to the example below.

I have two datasets:

library(data.table)
dt1 <- data.table(date = as.Date("2015-06-28")+c(0L,3L,5L,7L),
                  key="date")
dt2 <- data.table(date = as.Date("2015-06-30")+c(0:1,4L),
                  val = letters[7:9],
                  dummy = rep(NA,3),
                  key="date")

I want to have val column from dt2 added to dt1 using rolling join.
The following statement will produce similar output to the expected one:

dt2[dt1, roll=TRUE]
#          date val dummy
# 1: 2015-06-28  NA    NA
# 2: 2015-07-01   h    NA
# 3: 2015-07-03   h    NA
# 4: 2015-07-05   i    NA

The are two problems with that statement:
1. I didn't want to have dummy column
2. I want to do this by reference:

address(dt1)
# [1] "0x3b57540"
address(dt2[dt1, roll=TRUE])
# [1] "0x3b4e1f0"

So, I'm looking for rolling join and add column by reference for my dt1 and dt2, expected output:

#          date val
# 1: 2015-06-28  NA
# 2: 2015-07-01   h
# 3: 2015-07-03   h
# 4: 2015-07-05   i

And of course address(dt1) should match to address of the magic statement.

Upvotes: 5

Views: 160

Answers (2)

Clayton Stanley
Clayton Stanley

Reputation: 7784

Here's a method that should scale.

address(dt1)
# [1] "0x265a060"
ix = dt2[dt1, roll=TRUE, which=TRUE]
dt1[, val := dt2[ix, val]]
dt1
#          date val
# 1: 2015-06-28  NA
# 2: 2015-07-01   h
# 3: 2015-07-03   h
# 4: 2015-07-05   i
address(dt1)
# [1] "0x265a060"

Or without creating the index, directly as:

dt1[, val := dt2[dt1, val, roll = TRUE]] ## (1)

Note that this is more efficient than doing:

dt1[, val := dt2[dt1, roll = TRUE]$val] ## (2)

(2) performs the entire join (materialises all the columns) and then extracts val, where as (1) directly just extracts column val.

Upvotes: 4

MichaelChirico
MichaelChirico

Reputation: 34753

Here's a workaround; not perfect because it still involves creating extra memory, though there may be a way to avoid this that's not coming to me right away (anyway the extra memory may be minimal):

> address(dt1)
[1] "0x57b5230"

rng<-range(dt1[,range(date)],dt2[,range(date)])

x<-data.table(date=seq(from=rng[1],to=rng[2],by="day"),
              key="date")

> address(x)
[1] "0x6aa2df0"

x[dt2,setdiff(names(dt2),"date"):=mget(setdiff(names(dt2),"date"))
  ][,val:=zoo::na.locf(val,na.rm=F)]

> address(x)
[1] "0x6aa2df0"

> dt1[x,val:=i.val][]
         date val
1: 2015-06-28  NA
2: 2015-07-01   h
3: 2015-07-03   h
4: 2015-07-05   i

> address(dt1)
[1] "0x57b5230"

Upvotes: 1

Related Questions