Update dataframe from another dataframe

Question

I have 2 tables, "transactions" with over 500M rows and "Customers" over 3M rows

data <- data.frame(Trans = c(1,2,3,4,5), Cust01 = c("A","B","C","D","F"),
                   Cust02 = c("S","E","","TE","F"), Cust03 = c("F","","D","","F"))

cust_type <-data.frame(Cust = c("A","B","C","D"), Type = c("1","2","3","4"))

dataresult <- data.frame(Trans = c(1,2,3,4,5),
                         Cust01 = c("A","B","C","D","F"), 
                         Cust01Type = c("1","2","3","4","5"),
                         Cust02 = c("S","E","","TE","F"), 
                         Cust02Type = c("","","","",""),
                         Cust03 = c("F","","D","","F"),
                         Cust03Type = c("","","4","",""))

I would like to add the customer type to the data in an efficient way. Normally with sql I will use multiple left join, I tried that with dplyr but takes forever. I also tried to use %in% with logic return and then a loop just to focus on the true values. Does someone know a better way to do this?

Jaap · Accepted Answer

When you want fast performance, nothing beats the data.table package (yet). As your transaction data are now in wide format, the first step to do is convert it to long format. This will make it easier to process.

library(data.table) #v1.9.5
trans_data <- melt(setDT(data), id.vars = "Trans",
                   variable.name = "Cust",               # set name variable column
                   variable.factor = TRUE,               # set as a factor variable instead of a character variable
                   value.name = "Cvalue")[!Cvalue==""]   # set name value column & remove empty cases

When you have done that, you can join the two datatables:

# set the keys by which you are joining
setDT(trans_data, key = "Cvalue")
setDT(cust_type, key = "Cust")

# join the customer type into the transaction data
trans_data[cust_type, Ctype:=Type]

this gives:

> trans_data
    Trans   Cust Cvalue Ctype
 1:     1 Cust01      A     1
 2:     2 Cust01      B     2
 3:     3 Cust01      C     3
 4:     4 Cust01      D     4
 5:     3 Cust03      D     4
 6:     2 Cust02      E    NA
 7:     5 Cust01      F    NA
 8:     5 Cust02      F    NA
 9:     1 Cust03      F    NA
10:     5 Cust03      F    NA
11:     1 Cust02      S    NA
12:     4 Cust02     TE    NA

If you want to change the order in the resulting data.table, you can do that with for example:

setorder(trans_data, Trans, Cust)

or all at once with:

trans_data <- trans_data[cust_type, Ctype:=Type][order(Trans,Cust)]

which gives:

> trans_data
    Trans   Cust Cvalue Ctype
 1:     1 Cust01      A     1
 2:     1 Cust02      S    NA
 3:     1 Cust03      F    NA
 4:     2 Cust01      B     2
 5:     2 Cust02      E    NA
 6:     3 Cust01      C     3
 7:     3 Cust03      D     4
 8:     4 Cust01      D     4
 9:     4 Cust02     TE    NA
10:     5 Cust01      F    NA
11:     5 Cust02      F    NA
12:     5 Cust03      F    NA

Update dataframe from another dataframe

Answers (1)

Related Questions