djas
djas

Reputation: 1023

How to do update-by-reference-join?

I am trying to perform a cartesian join of two data tables. Because my tables are large (10s of Gbs), I need to do this in a memory-efficient and way. So I thought I would update one of the tables by reference in a data.table-centric way.

Here's what I am trying to achieve:

# fake data
A = data.table(id = c(1, 1, 1, 2, 2, 3),  
               X = 0:5, 
               Y = letters[1:6], 
               key = 'id')

B = data.table(id = rep(1:3, each = 3) ,  
               time = rep(1:3, 3), 
               Z = 8:0, 
               key = c('id', 'time'))

# the desired join: cartesian join with columns from both tables
B[A, on = 'id', allow.cartesian = T]

However, no tables get updated in this solution. And I want to avoid creating a 3rd table, for memory reasons.

How can I achieve this by updating one of the tables (A or B) by reference?

Partially inspired by the answers to this question, here's what I have tried without success:

# table B gets the new columns, but not the new rows :(
B[A, allow.cartesian = T, `:=` (X = i.X, Y = i.Y)]

As a bonus, it would be nice if I didn't have to spell out each and every column in the A table (as in := (X = i.X, Y = i.Y)), because I want to use all the columns in A.

Thoughts?

Upvotes: 5

Views: 422

Answers (1)

Billy34
Billy34

Reputation: 2214

There is something strange. According to your fake data that are assumed to be representative of your real data there are duplicate ids in both A and B.

Doing a cartesian merge of A and B is perfectly legit but will create a new data.table in memory.

If you want to do an update merge using := this can be done for one to many relations but not for many to many.

You can't say "update this line by putting multiple linked values in the same field !"

For many to many, you can update current line with aggregated values from linked values using by=.EACHI and aggregate functions such as sum, mean or .N

Upvotes: 3

Related Questions