Reputation: 2491
Context: The last section of the amazing new Joins in data.table vignette explains how to use the merge sintax ( x[i,on=.(id),...]
) to update the LEFT table by reference (also described in this SO question). Here is a simplified example:
x = data.table(id = 1:5, newvar1=LETTERS[1:5])
#In practice x would have more vars: newvar2, ..., newvarN
i = data.table(id = 1:7, var1 = c('bla','ble','bli','blo','blu','blA','blS') )
#Updating the LEFT table by reference
x[i,on = .(id),
j = `:=`(id=id,var1=var1,newvar1=newvar1)]
#Result: the column var1 (from i) is added to x by reference
print(x)
> print(x)
id newvar1 var1
<int> <char> <char>
1: 1 A bla
2: 2 B ble
3: 3 C bli
4: 4 D blo
5: 5 E blu
I need to do the opposite: update i
by reference.
x = data.table(id = 1:5, newvar1=LETTERS[1:5])
#In practice x would have more vars: newvar2, ..., newvarN
i = data.table(id = 1:7,var1 = c('bla','ble','bli','blo','blu','blA','blS'))
#Right join then overwrite the content of i
i <- x[i,on = .(id)]
Is there a way to do that by reference?
In my application (and I assume for several DT heavy users), i
is the main dataset, with millions of observations and dozens of columns. That is, the dataset I want to keep adding new variables to, such as newvar1
,newvar2
,...,newvarN
coming from x
. In doing that, I need to preserve the columns and rows (cardinality) of i
, not x
.
Overwriting i
, AFAIK, involves making a copy of i
, which causes a memory spikes (~ doubles) which may exceed the available ram and crash R.
My current workaround is this:
i_id <- i[,.(id)] # create an auxiliary data.table with just the 'id' var
i_id <- x[i_id,on=id] #the still making the copy, but with just the 'id' columns + columns from d (instead of hundreds of columns from i)
i[,newvar1:=i_id$newvar1]
i[,newvar2:=i_id$newvar2]
...
i[,newvarN:=i_id$newvarN] #This works, but the code is quite large (if N is large), verbose and error prone
Update1: I added a related issue on data.table's github
Upvotes: 0
Views: 65
Reputation: 33603
Would this work?
cols <- c("newvar1", "newvar2") # or setdiff(names(x), "id")
i[, (cols) := x[.SD, on = "id", .SD, .SDcols = cols]]
# or to avoid the double .SD.
# i[, (cols) := x[i, on = "id", .SD, .SDcols = cols]]
# i[, (cols) := x[.SD, on = "id", mget(cols)]]
i
# id var1 newvar1 newvar2
# <int> <char> <char> <char>
# 1: 1 bla A F
# 2: 2 ble B G
# 3: 3 bli C H
# 4: 4 blo D I
# 5: 5 blu E J
# 6: 6 blA <NA> <NA>
# 7: 7 blS <NA> <NA>
Using
x <- data.table(id = 1:5, newvar1 = LETTERS[1:5], newvar2 = LETTERS[6:10])
Upvotes: 4