R data.table update join by reference the, but updating the RIGHT table

Question

Context: The last section of the amazing new Joins in data.table vignette explains how to use the merge sintax ( x[i,on=.(id),...]) to update the LEFT table by reference (also described in this SO question). Here is a simplified example:

x = data.table(id = 1:5, newvar1=LETTERS[1:5])
               #In practice x would have more vars: newvar2, ..., newvarN
 i = data.table(id = 1:7, var1 = c('bla','ble','bli','blo','blu','blA','blS') )

#Updating the LEFT table by reference
x[i,on = .(id),
  j = `:=`(id=id,var1=var1,newvar1=newvar1)]

#Result: the column var1 (from i) is added to x by reference
print(x)
> print(x)
id newvar1   var1
  
1:     1      A    bla
2:     2      B    ble
3:     3      C    bli
4:     4      D    blo
5:     5      E    blu

I need to do the opposite: update i by reference.

x = data.table(id = 1:5, newvar1=LETTERS[1:5])
               #In practice x would have more vars: newvar2, ..., newvarN 
i = data.table(id = 1:7,var1 = c('bla','ble','bli','blo','blu','blA','blS'))

#Right join then overwrite the content of i
i <- x[i,on = .(id)]

Is there a way to do that by reference?

In my application (and I assume for several DT heavy users), i is the main dataset, with millions of observations and dozens of columns. That is, the dataset I want to keep adding new variables to, such as newvar1,newvar2,...,newvarN coming from x. In doing that, I need to preserve the columns and rows (cardinality) of i, not x.

Overwriting i, AFAIK, involves making a copy of i, which causes a memory spikes (~ doubles) which may exceed the available ram and crash R.

My current workaround is this:

i_id <- i[,.(id)]     # create an auxiliary data.table with just the 'id' var
i_id <- x[i_id,on=id] #the still making the copy, but with just the 'id' columns + columns from d (instead of hundreds of columns from i) 
i[,newvar1:=i_id$newvar1]
i[,newvar2:=i_id$newvar2]
...
i[,newvarN:=i_id$newvarN]   #This works, but the code is quite large (if N is large), verbose and error prone

Update1: I added a related issue on data.table's github

s_baldur · Accepted Answer

Would this work?

cols <- c("newvar1", "newvar2") # or setdiff(names(x), "id")
i[, (cols) := x[.SD, on = "id", .SD, .SDcols = cols]]
# or to avoid the double .SD.
# i[, (cols) := x[i, on = "id", .SD, .SDcols = cols]]
# i[, (cols) := x[.SD, on = "id", mget(cols)]] 

i
#       id   var1 newvar1 newvar2
#         
# 1:     1    bla       A       F
# 2:     2    ble       B       G
# 3:     3    bli       C       H
# 4:     4    blo       D       I
# 5:     5    blu       E       J
# 6:     6    blA        
# 7:     7    blS

Using

x <- data.table(id = 1:5, newvar1 = LETTERS[1:5], newvar2 = LETTERS[6:10])

R data.table update join by reference the, but updating the RIGHT table

Answers (1)

Related Questions