Data.table join with variable column names

Question

I'm needing to iteratively perform joins between two data.tables where the column names are variables which I'm inputting from a function. I've been performing the joins using data.tables 'on' functionality, and am running into issues as the variable column names don't seem to be recognised.

For example, say we have two tables, Table_1 and Table_2, as follows:

require(data.table)
n <- 20
Table_1 <- data.table(A = seq_len(n) + 1,
               B = seq_len(n) + 3,
               C = seq_len(n) + 5)

m <- 15
Table_2 <- data.table(D = seq_len(m) + 7,
               E = seq_len(m) + 9,
               F = seq_len(m) + 12)

I can easily perform joins where I define the columns explicitly. e.g.

Table_2[Table_1,on = .(F = C),sum(D.na.rm = T)]

However, what I need to do is to perform multiple matches on various columns such as this:

require(purrr)    
pmap(.l = CJ(x = c("D","F"),y = c("A","B")),
     .f = function(x,y) Table_2[Table_1,on = .(x = y),sum(C,na.rm = T)])

I receive the following error:

Error in colnamesInt(x, names(on), check_dups = FALSE) : 
  argument specifying columns specify non existing column(s): cols[1]='x'

I've tried various things, such as:

Enclosing x and y with "eval()" or "noquote"
Putting the pmap function within the data.table, rather than outside as shown above.

Neither approaches work. Any assistance would be greatly appreciated as it will obviously be extremely inefficient to have to write out separate join statements!

Thanks, Phil

EDIT:

It was suggested below that I should consider using the "merge" function. In theory, this would work for the above example, however I didn't mention above that I actually need to use non-equi joins, meaning that, as far as I'm aware, I can't use "merge". In my real-world case, there will be combinations of equi and non-equi joins that i need to map column names to via a function.

I've provided a follow-up example with target output. The example only has two join statements, but I'd need the solution to be flexible enough to handle multiple:

I want the following expression:

pmap(.l = list(x1 = "D",x2 = "A",x3 = "E",x4 = "B"),
    .f = function(x1,x2,x3,x4) (Table_2[Table_1,on = .(x1 = x2,
                             x3 > x4),sum(C,na.rm = T)]))

To give the same output as this:

Table_2[Table_1,on = .(D = A,
                       E > B),sum(C,na.rm = T)]

i.e. 310 in this example.

Thanks again, Phil

Phil · Accepted Answer

I just figured out how to do this through trial and error:

  pmap(.l = list(x1 = "D",x2 = "A",x3 = "E",x4 = "B"),
       .f = function(x1,x2,x3,x4) (Table_2[Table_1,on = 
                            c(paste0(x1,"==",x2),paste0(x3,">",x4)),
                                           sum(C,na.rm = T)]))

Data.table join with variable column names

Answers (1)

Related Questions