Brian Stamper
Brian Stamper

Reputation: 2263

data.table replace data using values from another data.table, conditionally

This is similar to Update values in data.table with values from another data.table and R data.table replacing an index of values from another data.table, except in my situation the number of variables is very large so I do not want to list them explicitly.

What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of the first and it has only some of the variables of the first. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally - in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.

For a reproducible example, here are the data. In the real world the tables are much larger:

library(data.table)
set.seed(0)

## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)

for(i in 2015:2017) {
  varA <- paste0('varA_', i)
  varB <- paste0('varB_', i)
  varC <- paste0('varC_', i)

  dt_original[, (varA) := rnorm(20)]
  dt_original[, (varB) := rnorm(20)]
  dt_original[, (varC) := rnorm(20)]
}

## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)

newdata_vars <- sample(names(dt_original)[-1], 4)

for(var in newdata_vars) {
  dt_newdata[, (var) := rnorm(3)]
}

Here is a way of doing it using a loop and pmax, but there has to be a better way, right?

for(var in newdata_vars) {
  k <- pmax(dt_newdata[, (var), with = FALSE], dt_original[id %in% dt_newdata$id, (var), with = FALSE])
  dt_original[id %in% dt_newdata$id, (var) := k, with = FALSE]
}

It seems like there should be a way using join syntax, and maybe the prefix i. and/or .SD or something like that, but nothing I've tried comes close enough to warrant repeating here.

Upvotes: 1

Views: 4838

Answers (1)

lmo
lmo

Reputation: 38500

This code should work in the current format given your criteria.

dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]

It joins to the IDs that match between the data.tables and then performs an assignment using := Because we want to return a list, I use Map to run pmax through the columns of data.tables matching by the name of dt_newdata. Note that it is necessary that all names of dt_newdata are in dt_original data.

Following Frank's comment, you can remove the first column of the Map list items and the column names using [-1] because they are IDs, which don't need to be computed. Removing the first column from Map avoids one pass of pmax and also preserves the key on id. Thanks to @brian-stamper for pointing out the key preservation in the comments.

dt_original[dt_newdata,
            names(dt_newdata)[-1] := Map(pmax,
                                         mget(names(dt_newdata)[-1]),
                                         dt_newdata[, .SD, .SDcols=-1])]

Note that the use of [-1] assumes that the ID variable is located in the first position of new_data. If it is elsewhere, you could change the index manually or use grep.

Upvotes: 4

Related Questions