Reputation: 20409
I have two data tables with the following columns - ddate,fnumber,file,model
and fnumber,ddate,model,model_id,file
. I would like to update the first table with values from the second table matched by ddate
and fnumber
columns.
If I use merge
:
dtPT <- merge(dtPT, dtAT, by = c("fnumber", "ddate"), all.x = TRUE)
then I get the following error -
Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), : Join results in 8568291 rows; more than 8537179 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including
j
and droppingby
(by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
I tried to search for duplicated records in dtAT
:
setkeyv(dtAT, c("fnumber", "ddate"))
dtAT[duplicated(dtAT)]
but it returns 0 rows.
I also tried to use match
instead, but it doesn't work as I expected (sample data is used below):
tPT <- "ddate,fnumber,file,model
2014-05-26,S71149,ps1.csv,320
2014-09-26,,ps2.csv,
2014-10-16,,ps3.csv,"
tAT <-
"fnumber,ddate,model,model_id,file
S71149,2014-05-26,319,VU-BHP,as1.csv
S71149,2014-05-25,320,,as2.csv
S71149,2014-05-23,322,VU-BTP,as3.csv"
columnClasses <- c("POSIXct", "factor", "character", "factor")
dtPT <- read.csv(text=tPT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtPT <- as.data.table(dtPT)
columnClasses <- c("character", "POSIXct", "character", "factor", "factor")
dtAT <- read.csv(text=tAT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtAT <- as.data.table(dtAT)
dtPT$model_code <- dtAT[match(paste(dtAT$fnumber, dtAT$ddate), paste(dtPT$fnumber, dtPT$ddate)), dtAT$model]
The code above in result assigns model_code
values to all rows:
ddate fnumber file model model_code
1: 2014-05-26 S71149 ps1.csv 320 319
2: 2014-09-26 NA ps2.csv NA 320
3: 2014-10-16 NA ps3.csv NA 322
when 320
should assigned to the first row only:
ddate fnumber file model model_code
1: 2014-05-26 S71149 ps1.csv 320 319
2: 2014-09-26 NA ps2.csv NA NA
3: 2014-10-16 NA ps3.csv NA NA
Since for second and third rows there is no match by combination of ddate
and fnumber
.
(the merge
, of course, works well with the example below, but returns the error with Production data)
Upvotes: 2
Views: 2068
Reputation: 16697
I would like to update the first table with values from the second table
Instead of merge
function there is a update on join feature in data.table, it will be faster as it updates by reference and don't need to materialize bigger dataset after joins.
The general syntax is
DT1[DT2, value := i.lookup_value]
So going to your example
library(data.table)
tPT <- "ddate,fnumber,file,model
2014-05-26,S71149,ps1.csv,320
2014-09-26,,ps2.csv,
2014-10-16,,ps3.csv,"
tAT <- "fnumber,ddate,model,model_id,file
S71149,2014-05-26,319,VU-BHP,as1.csv
S71149,2014-05-25,320,,as2.csv
S71149,2014-05-23,322,VU-BTP,as3.csv"
columnClasses <- c("POSIXct", "factor", "character", "factor")
dtPT <- read.csv(text=tPT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtPT <- as.data.table(dtPT)
columnClasses <- c("character", "POSIXct", "character", "factor", "factor")
dtAT <- read.csv(text=tAT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtAT <- as.data.table(dtAT)
setkeyv(dtPT, c("fnumber","ddate"))
setkeyv(dtAT, c("fnumber","ddate"))
dtPT[dtAT, model_code := i.model]
# ddate fnumber file model model_code
# 1: 2014-09-26 NA ps2.csv NA NA
# 2: 2014-10-16 NA ps3.csv NA NA
# 3: 2014-05-26 S71149 ps1.csv 320 319
If you get mentioned error you should start by checking duplicates but not on full rows but on your composite key. The simple code below.
dtPT[,.N,c("fnumber","ddate")][N>1L]
dtAT[,.N,c("fnumber","ddate")][N>1L]
Upvotes: 1