Merge based on 2 columns values - Check for duplicate key values error

Question

I have two data tables with the following columns - ddate,fnumber,file,model and fnumber,ddate,model,model_id,file. I would like to update the first table with values from the second table matched by ddate and fnumber columns.

If I use merge:

dtPT <- merge(dtPT, dtAT, by = c("fnumber", "ddate"), all.x = TRUE)

then I get the following error -

Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), : Join results in 8568291 rows; more than 8537179 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

I tried to search for duplicated records in dtAT:

setkeyv(dtAT, c("fnumber", "ddate"))
dtAT[duplicated(dtAT)]

but it returns 0 rows.

I also tried to use match instead, but it doesn't work as I expected (sample data is used below):

tPT <- "ddate,fnumber,file,model
2014-05-26,S71149,ps1.csv,320
2014-09-26,,ps2.csv,
2014-10-16,,ps3.csv," 

tAT <- 
"fnumber,ddate,model,model_id,file
S71149,2014-05-26,319,VU-BHP,as1.csv
S71149,2014-05-25,320,,as2.csv
S71149,2014-05-23,322,VU-BTP,as3.csv"

columnClasses <- c("POSIXct", "factor", "character", "factor")
dtPT <- read.csv(text=tPT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtPT <- as.data.table(dtPT)

columnClasses <- c("character", "POSIXct", "character", "factor", "factor")
dtAT <- read.csv(text=tAT, header = TRUE, sep = ",", na.strings = c(""), colClasses = columnClasses)
dtAT <- as.data.table(dtAT)

dtPT$model_code <- dtAT[match(paste(dtAT$fnumber, dtAT$ddate), paste(dtPT$fnumber, dtPT$ddate)), dtAT$model]

The code above in result assigns model_code values to all rows:

        ddate fnumber    file model model_code
1: 2014-05-26  S71149 ps1.csv   320        319
2: 2014-09-26      NA ps2.csv    NA        320
3: 2014-10-16      NA ps3.csv    NA        322

when 320 should assigned to the first row only:

        ddate fnumber    file model model_code
1: 2014-05-26  S71149 ps1.csv   320        319
2: 2014-09-26      NA ps2.csv    NA         NA
3: 2014-10-16      NA ps3.csv    NA         NA

Since for second and third rows there is no match by combination of ddate and fnumber.

(the merge, of course, works well with the example below, but returns the error with Production data)

Merge based on 2 columns values - Check for duplicate key values error

Answers (1)

Related Questions