Reputation: 35
I'm trying to achieve a relookup on a very large data sets (~25M rows, ~3000 cols), based on a large lookuptable as well (~15M rows) I need to change only matching values and keep non matching unchanged
here is a data sample
Lookup table
source target
A 1
B 2
C 3
D 4
... ...
Source datas (I don't know cols count until my table is loaded)
col1 col2 col3 ... coln
B C A ... ...
78 A D ... ...
A B 24 ... ...
... ... ... ... ...
Expected result
col1 col2 col3 ... coln
2 3 1 ... ...
78 1 4 ... ...
1 2 24 ... ...
... ... ... ... ...
I've been able to achieve this using nested loops but :
I've found some posts with an equivalent problematic, but none of the solutions seems to works in my case.
Any suggestions ?
Thanks
(Tried different solutions as explain here and here without success)
Upvotes: 1
Views: 760
Reputation: 42592
According to the OP, both data objects are quite large (25 M rows x 3000 cols and a lookup table of 15 M rows). Therefore, I suggest to avoid copying.
This can be achieved by using data.table
's update on join which modifies only selected values in place, i.e., without copying the whole data object.
library(data.table)
options(datatable.print.class = TRUE)
address(data_set)
# loop over all columns
for (col in names(data_set))
# update on join
data_set[lookup, on = paste0(col, "==source"), (col) := target]
address(data_set)
data_set[]
col1 col2 col3 <char> <char> <char> 1: 2 3 1 2: 78 1 4 3: 1 2 24
The calls to address(data_set)
are not required but to verify that data_set
isn't copied. For performance reasons, it is important that the target
column of lookup
does have the same type as all of the columns in data_set
("character") to avoid repeated type conversions.
library(data.table)
options(datatable.print.class = TRUE)
lookup <- fread("source target
A 1
B 2
C 3
D 4", colClasses = "character")
Note that the parameter colClasses = "character"
ensures that `target``is of type "character".
data_set <- fread("col1 col2 col3
B C A
78 A D
A B 24")
Upvotes: 2
Reputation: 39184
A solution using dplyr
and tidyr
. The idea is to reshape the data frame from wide format and then perform a join based on the values in the data frame and the lookup table, and then convert the format back.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
mutate(ID = 1:n()) %>%
gather(Column, Value, -ID) %>%
left_join(dt_lookup, by = c("Value" = "source")) %>%
mutate(target = as.numeric(ifelse(is.na(target), Value, target))) %>%
select(-Value) %>%
spread(Column, target) %>%
select(-ID)
dt2
# col1 col2 col3
# 1 2 3 1
# 2 78 1 4
# 3 1 2 24
Or you can also use the data.table
approach with the same strategy. dt4
is the final output.
library(data.table)
# Convert the data frame to data table
setDT(dt, keep.rownames = TRUE)
setDT(dt_lookup)
# Reshape the data table from wide to long
dt2 <- melt(dt, id.vars = "rn")
# Set the key for the join operation
setkey(dt2, value)
setkey(dt_lookup, source)
# Perform join and replace the values
dt3 <- dt_lookup[dt2][, target := as.numeric(ifelse(is.na(target), source, target))]
# Reshape the data table from long to wide
dt4 <- dcast(dt3, rn ~ variable, value.var = "target")[, rn := NULL]
DATA
dt_lookup <- read.table(text = "source target
A 1
B 2
C 3
D 4",
header = TRUE, stringsAsFactors = FALSE)
dt <- read.table(text = "col1 col2 col3
B C A
78 A D
A B 24",
header = TRUE, stringsAsFactors = FALSE)
Upvotes: 2