Airmoi
Airmoi

Reputation: 35

Relookup values in data.table based on lookup table

I'm trying to achieve a relookup on a very large data sets (~25M rows, ~3000 cols), based on a large lookuptable as well (~15M rows) I need to change only matching values and keep non matching unchanged

here is a data sample

Lookup table

source  target
A       1
B       2
C       3
D       4
...     ...

Source datas (I don't know cols count until my table is loaded)

col1    col2    col3    ...     coln
B       C       A       ...     ...
78      A       D       ...     ...
A       B       24      ...     ...
...     ...     ...     ...     ...

Expected result

col1    col2    col3    ...     coln
2       3       1       ...     ...
78      1       4       ...     ...
1       2       24      ...     ...
...     ...     ...     ...     ...

I've been able to achieve this using nested loops but :

  1. This is very slow
  2. I know R is smarter than that

I've found some posts with an equivalent problematic, but none of the solutions seems to works in my case.

Any suggestions ?

Thanks

(Tried different solutions as explain here and here without success)

Upvotes: 1

Views: 760

Answers (2)

Uwe
Uwe

Reputation: 42592

According to the OP, both data objects are quite large (25 M rows x 3000 cols and a lookup table of 15 M rows). Therefore, I suggest to avoid copying.

This can be achieved by using data.table's update on join which modifies only selected values in place, i.e., without copying the whole data object.

library(data.table)
options(datatable.print.class = TRUE)
address(data_set)
# loop over all columns
for (col in names(data_set))
  # update on join
  data_set[lookup, on = paste0(col, "==source"), (col) := target]
address(data_set)
data_set[]
     col1   col2   col3
   <char> <char> <char>
1:      2      3      1
2:     78      1      4
3:      1      2     24

The calls to address(data_set) are not required but to verify that data_set isn't copied. For performance reasons, it is important that the target column of lookup does have the same type as all of the columns in data_set ("character") to avoid repeated type conversions.

Data

library(data.table)
options(datatable.print.class = TRUE)
lookup <- fread("source  target
A       1
B       2
C       3
D       4", colClasses = "character")

Note that the parameter colClasses = "character" ensures that `target``is of type "character".

data_set <- fread("col1    col2    col3
B       C       A
78      A       D
A       B       24")

Upvotes: 2

www
www

Reputation: 39184

A solution using dplyr and tidyr. The idea is to reshape the data frame from wide format and then perform a join based on the values in the data frame and the lookup table, and then convert the format back.

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  mutate(ID = 1:n()) %>%
  gather(Column, Value, -ID) %>%
  left_join(dt_lookup, by = c("Value" = "source")) %>%
  mutate(target = as.numeric(ifelse(is.na(target), Value, target))) %>%
  select(-Value) %>%
  spread(Column, target) %>%
  select(-ID)
dt2
#   col1 col2 col3
# 1    2    3    1
# 2   78    1    4
# 3    1    2   24

Or you can also use the data.table approach with the same strategy. dt4 is the final output.

library(data.table)

# Convert the data frame to data table
setDT(dt, keep.rownames = TRUE) 
setDT(dt_lookup)

# Reshape the data table from wide to long
dt2 <- melt(dt, id.vars = "rn")
# Set the key for the join operation
setkey(dt2, value)
setkey(dt_lookup, source)
# Perform join and replace the values
dt3 <- dt_lookup[dt2][, target := as.numeric(ifelse(is.na(target), source, target))]
# Reshape the data table from long to wide
dt4 <- dcast(dt3, rn ~ variable, value.var = "target")[, rn := NULL]

DATA

dt_lookup <- read.table(text = "source  target
A       1
                        B       2
                        C       3
                        D       4",
                        header = TRUE, stringsAsFactors = FALSE)

dt <- read.table(text = "col1    col2    col3
B       C       A   
                 78      A       D 
                 A       B       24",
                 header = TRUE, stringsAsFactors = FALSE)

Upvotes: 2

Related Questions