LA_
LA_

Reputation: 20409

Merge based on 2 columns values - Check for duplicate key values error

I have two data tables with the following columns - ddate,fnumber,file,model and fnumber,ddate,model,model_id,file. I would like to update the first table with values from the second table matched by ddate and fnumber columns.

If I use merge:

dtPT <- merge(dtPT, dtAT, by = c("fnumber", "ddate"), all.x = TRUE)

then I get the following error -

Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), : Join results in 8568291 rows; more than 8537179 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

I tried to search for duplicated records in dtAT:

setkeyv(dtAT, c("fnumber", "ddate"))
dtAT[duplicated(dtAT)]

but it returns 0 rows.

I also tried to use match instead, but it doesn't work as I expected (sample data is used below):

tPT <- "ddate,fnumber,file,model
2014-05-26,S71149,ps1.csv,320
2014-09-26,,ps2.csv,
2014-10-16,,ps3.csv," 

tAT <- 
"fnumber,ddate,model,model_id,file
S71149,2014-05-26,319,VU-BHP,as1.csv
S71149,2014-05-25,320,,as2.csv
S71149,2014-05-23,322,VU-BTP,as3.csv"

columnClasses <- c("POSIXct", "factor", "character", "factor")
dtPT <- read.csv(text=tPT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtPT <- as.data.table(dtPT)

columnClasses <- c("character", "POSIXct", "character", "factor", "factor")
dtAT <- read.csv(text=tAT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtAT <- as.data.table(dtAT)

dtPT$model_code <- dtAT[match(paste(dtAT$fnumber, dtAT$ddate), paste(dtPT$fnumber, dtPT$ddate)), dtAT$model] 

The code above in result assigns model_code values to all rows:

        ddate fnumber    file model model_code
1: 2014-05-26  S71149 ps1.csv   320        319
2: 2014-09-26      NA ps2.csv    NA        320
3: 2014-10-16      NA ps3.csv    NA        322

when 320 should assigned to the first row only:

        ddate fnumber    file model model_code
1: 2014-05-26  S71149 ps1.csv   320        319
2: 2014-09-26      NA ps2.csv    NA         NA
3: 2014-10-16      NA ps3.csv    NA         NA

Since for second and third rows there is no match by combination of ddate and fnumber.

(the merge, of course, works well with the example below, but returns the error with Production data)

Upvotes: 2

Views: 2068

Answers (1)

jangorecki
jangorecki

Reputation: 16697

I would like to update the first table with values from the second table

Instead of merge function there is a update on join feature in data.table, it will be faster as it updates by reference and don't need to materialize bigger dataset after joins.
The general syntax is

DT1[DT2, value := i.lookup_value]

So going to your example

library(data.table)

tPT <- "ddate,fnumber,file,model
2014-05-26,S71149,ps1.csv,320
2014-09-26,,ps2.csv,
2014-10-16,,ps3.csv," 

tAT <- "fnumber,ddate,model,model_id,file
S71149,2014-05-26,319,VU-BHP,as1.csv
S71149,2014-05-25,320,,as2.csv
S71149,2014-05-23,322,VU-BTP,as3.csv"

columnClasses <- c("POSIXct", "factor", "character", "factor")
dtPT <- read.csv(text=tPT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtPT <- as.data.table(dtPT)

columnClasses <- c("character", "POSIXct", "character", "factor", "factor")
dtAT <- read.csv(text=tAT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtAT <- as.data.table(dtAT)

setkeyv(dtPT, c("fnumber","ddate"))
setkeyv(dtAT, c("fnumber","ddate"))

dtPT[dtAT, model_code := i.model]
#         ddate fnumber    file model model_code
# 1: 2014-09-26      NA ps2.csv    NA         NA
# 2: 2014-10-16      NA ps3.csv    NA         NA
# 3: 2014-05-26  S71149 ps1.csv   320        319

If you get mentioned error you should start by checking duplicates but not on full rows but on your composite key. The simple code below.

dtPT[,.N,c("fnumber","ddate")][N>1L]
dtAT[,.N,c("fnumber","ddate")][N>1L]

Upvotes: 1

Related Questions